[00:12:59] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:12:59] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:22:43] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:43:24] 10Data-Engineering (Sprint 6), 10Patch-For-Review: [Iceberg Migration] Migrate aqs hourly tables to Iceberg - https://phabricator.wikimedia.org/T352669 (10CodeReviewBot) tchin opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/569 Draft: Add iceberg version of aqs_hourly...
[07:51:51] 10Data-Platform-SRE, 10Infrastructure-Foundations: Fix IPv6 service IP ranges for all Kubernetes clusters - https://phabricator.wikimedia.org/T353705 (10elukey) @Gehel @CDanis we should change the network subnets in Netbox and update puppet before closing, but I'd delegate this to specific teams to avoid messi...
[08:00:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:02:44] (SystemdUnitFailed) resolved: monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:26:56] 10Data-Platform-SRE: Decom an-coord100[1-2] - https://phabricator.wikimedia.org/T353774 (10Gehel)
[09:27:07] 10Data-Platform-SRE: Decom an-coord100[1-2] - https://phabricator.wikimedia.org/T353774 (10Gehel)
[09:27:09] 10Data-Platform-SRE: Refresh hadoop coordinators an-coord100[1-2] with an-coord[3-4] - https://phabricator.wikimedia.org/T332572 (10Gehel)
[09:28:08] 10Data-Platform-SRE: Decom an-master100[1-2] - https://phabricator.wikimedia.org/T353775 (10Gehel)
[09:28:35] 10Data-Platform-SRE: Decom an-master100[1-2] - https://phabricator.wikimedia.org/T353775 (10Gehel)
[09:28:37] 10Data-Platform-SRE (23/24 Q3 Milestone 1): Upgrade hadoop master to bullseye - https://phabricator.wikimedia.org/T332573 (10Gehel)
[09:29:27] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol)
[09:29:29] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Document how to browse the History server locally - https://phabricator.wikimedia.org/T353232 (10brouberol) 05Open→03Resolved I have created https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark_History, which contai...
[09:29:59] 10Data-Platform-SRE: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Gehel)
[09:30:49] 10Data-Platform-SRE: Decom an-master100[1-2] - https://phabricator.wikimedia.org/T353775 (10Gehel) p:05Triage→03High
[09:30:51] 10Data-Platform-SRE: Decom an-coord100[1-2] - https://phabricator.wikimedia.org/T353774 (10Gehel) p:05Triage→03High
[09:30:56] 10Data-Platform-SRE: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Gehel) p:05Triage→03High
[09:35:18] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol)
[09:36:18] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol) The server is available at https://yarn.wikimedia.org/history-server/. We will enable the collection of historical metrics for all spa...
[09:49:39] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10MoritzMuehlenhoff) 05Resolved→03Open krbkdc.logs have increased by a lot (used to be ~ 2.5G uncompressed per day, now at 25G since just 0:00 UTC), reopening
[09:58:02] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): conda-analytics package install ConnectTimeoutError on stat1009 - https://phabricator.wikimedia.org/T353745 (10Gehel)
[09:58:19] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): conda-analytics package install ConnectTimeoutError on stat1009 - https://phabricator.wikimedia.org/T353745 (10Gehel) a:03BTullis
[10:00:39] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): conda-analytics package install ConnectTimeoutError on stat1009 - https://phabricator.wikimedia.org/T353745 (10BTullis) p:05Triage→03High
[10:03:33] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10MoritzMuehlenhoff) >>! In T337906#9417847, @MoritzMuehlenhoff wrote: > krbkdc.logs have increased by a lot (used to be ~ 2.5G uncompressed per day...
[10:38:39] 10Data-Platform-SRE: Decommission an-tool1010 - https://phabricator.wikimedia.org/T353782 (10Gehel)
[10:38:41] 10Data-Platform-SRE: Decommission an-tool1010 - https://phabricator.wikimedia.org/T353782 (10Gehel)
[10:38:47] 10Data-Engineering, 10Data-Platform-SRE, 10Epic, 10Patch-For-Review: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (10Gehel)
[10:40:02] 10Data-Platform-SRE: Decommission an-tool1010 - https://phabricator.wikimedia.org/T353782 (10Gehel) p:05Triage→03High
[10:41:01] 10Data-Platform-SRE: Decommission an-worker10[78-95] & an-worker1116 - https://phabricator.wikimedia.org/T353784 (10Gehel)
[10:41:10] 10Data-Platform-SRE: Decommission an-worker10[78-95] & an-worker1116 - https://phabricator.wikimedia.org/T353784 (10Gehel)
[10:41:12] 10Data-Platform-SRE: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Gehel)
[10:41:28] 10Data-Platform-SRE: Decommission an-worker10[78-95] & an-worker1116 - https://phabricator.wikimedia.org/T353784 (10Gehel) p:05Triage→03High
[10:41:58] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): conda-analytics package install ConnectTimeoutError on stat1009 - https://phabricator.wikimedia.org/T353745 (10BTullis) I can't reproduce this. I just tried the following on stat1009. Create a new environment: ` btullis@stat1009:~$ conda-analytics-clone T353745-btull...
[10:43:18] 10Data-Platform-SRE: Decom EOL stats servers stat100[4-7] - https://phabricator.wikimedia.org/T353785 (10Gehel)
[10:43:47] 10Data-Platform-SRE: Decom EOL stats servers stat100[4-7] - https://phabricator.wikimedia.org/T353785 (10Gehel) p:05Triage→03High
[10:44:47] 10Data-Platform-SRE: Plan to decom an-launcher1002 - https://phabricator.wikimedia.org/T353786 (10Gehel)
[10:45:18] 10Data-Platform-SRE: Plan to decom an-launcher1002 - https://phabricator.wikimedia.org/T353786 (10Gehel) p:05Triage→03High
[10:45:34] 10Data-Platform-SRE: Decom dumpsdata100[1-2] - https://phabricator.wikimedia.org/T353787 (10Gehel)
[10:46:04] 10Data-Platform-SRE: Decom dumpsdata100[1-2] - https://phabricator.wikimedia.org/T353787 (10Gehel) p:05Triage→03High
[10:47:07] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10Gehel)
[10:48:01] 10Data-Platform-SRE: Add kafka-stretch100[1-2] to the hadoop cluster - https://phabricator.wikimedia.org/T353788 (10Gehel)
[10:48:45] 10Data-Platform-SRE: Add kafka-stretch100[1-2] to the hadoop cluster - https://phabricator.wikimedia.org/T353788 (10Gehel) p:05Triage→03Medium
[10:49:12] 10Data-Platform-SRE: Re-purpose kafka-stretch200[1-2] as DSE workers in codfw - https://phabricator.wikimedia.org/T353789 (10Gehel)
[10:49:40] 10Data-Platform-SRE: Re-purpose kafka-stretch200[1-2] as DSE workers in codfw - https://phabricator.wikimedia.org/T353789 (10Gehel) p:05Triage→03Low
[10:53:00] 10Data-Platform-SRE (23/24 Q3 Milestone 1), 10SRE-OnFire, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Update WDQS Runbook following update lag incident - https://phabricator.wikimedia.org/T336577 (10Gehel) a:05bking→03None
[10:53:08] 10Data-Platform-SRE: Create helmfile deployment files for superset and superset-next - https://phabricator.wikimedia.org/T353790 (10BTullis)
[10:53:14] 10Data-Platform-SRE (23/24 Q3 Milestone 1), 10SRE-OnFire, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Update WDQS Runbook following update lag incident - https://phabricator.wikimedia.org/T336577 (10Gehel)
[10:57:14] 10Data-Engineering, 10Data-Platform-SRE, 10Epic, 10Patch-For-Review: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (10BTullis)
[11:05:09] Hello btullis, the Airflow dashboard is now running: https://grafana-rw.wikimedia.org/d/5jPZXONIk/airflow
[11:05:09] When you have time, we could deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/984510 to make all Airflow instances send metrics :)
[11:05:09] Currently running pcc on it.
[11:25:28] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[11:28:56] 10Data-Platform-SRE: Configure OIDC Authentication for Superset on K8S - https://phabricator.wikimedia.org/T353794 (10BTullis)
[11:30:28] 10Data-Platform-SRE: Try to improve the LDAP integration for Superset user account creation - https://phabricator.wikimedia.org/T297120 (10BTullis)
[11:30:32] 10Data-Platform-SRE: Configure OIDC Authentication for Superset on K8S - https://phabricator.wikimedia.org/T353794 (10BTullis)
[11:30:35] 10Data-Engineering, 10Data-Platform-SRE, 10Epic, 10Patch-For-Review: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (10BTullis)
[11:42:36] 10Data-Platform-SRE: Try to improve the LDAP integration for Superset user account creation - https://phabricator.wikimedia.org/T297120 (10BTullis) There is a useful description of the problem in [[https://docs.google.com/document/d/1PT9cRVFtN23GlWfYo-_bTUzVcK12-dSSJcX-SV4rtqs/edit#heading=h.264n1btdzjgk|this do...
[11:44:46] 10Data-Platform-SRE: Assign users' real names and email addresses to their Superset user accounts automatically - https://phabricator.wikimedia.org/T297120 (10BTullis)
[12:07:31] aqu: This is excellent! I'll review that patch now.
[12:08:45] Am I right in thinking that the statsd_exporter isn't restarted when the airflow config file is changed? Maybe we can find a way to make this happen automatically.
[12:21:32] aqu: I created https://gerrit.wikimedia.org/r/c/operations/puppet/+/984520 as well, to start scraping these metrics.
[12:21:46] aqu: Ready when you are for the deploy.
[13:10:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:10:35] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10BTullis) @MoritzMuehlenhoff - Thanks for sorting out the more frequent log rotation as a workaround. At the moment, it seems that more than half of the log entries being...
[13:24:47] btullis I think only one file needs to trigger a statsd-exporter restart: /etc/prometheus/statsd_exporter.conf
[13:27:36] I'm ready for the deploy. As the opsweek etherpad https://etherpad.wikimedia.org/p/analytics-weekly-train is empty, I'm going to scap deploy on all instances, and it should be a no-op.
[13:27:36] The idea is to make sure we have the airflow-dags code version compatible with the metric configuration.
[13:28:48] Cool. Let's do the auto restart in a separate patch. I can do a manual restart after this puppet run.
[13:31:47] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10MoritzMuehlenhoff) So with the current hourly compression we're always safely in the realm where compression of the chunks completes, which should prevent the server...
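A minimal sketch of the manual restart agreed at 13:28, assuming the unit is named after the Debian package prometheus-statsd-exporter (the package named at 14:09 below); the config path comes from the 13:24 message:

```
# Restart the exporter so it re-reads /etc/prometheus/statsd_exporter.conf,
# then confirm the unit came back up cleanly.
sudo systemctl restart prometheus-statsd-exporter.service
systemctl status prometheus-statsd-exporter.service --no-pager
```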
[13:35:59] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE: Reduce Kerberos logs produced by Presto - https://phabricator.wikimedia.org/T353802 (10MoritzMuehlenhoff)
[13:36:08] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE: Reduce Kerberos logs produced by Presto - https://phabricator.wikimedia.org/T353802 (10MoritzMuehlenhoff) p:05Triage→03Medium
[13:36:35] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10MoritzMuehlenhoff) 05Open→03Resolved Closing the task since the immediate issue is resolved, I created https://phabricator.wikimedia.org/T353802 as a followup
[13:39:23] btullis It's ready to deploy
[13:39:40] OK, proceeding.
[13:39:48] Your patch goes first https://gerrit.wikimedia.org/r/c/operations/puppet/+/984520
[13:40:41] Does it? I would have thought that your patch goes first :-) What am I missing?
[13:42:49] I'm thinking again and I'm not sure. Are the statsd-exporters already running beside the airflow instances?
[13:43:37] If this patch only defines the scraping then you are right: your patch goes after. https://gerrit.wikimedia.org/r/c/operations/puppet/+/984520
[13:44:43] I've done them both. I'll run puppet on the airflow hosts first, then the prometheus host. Should be fine 🤞
[13:45:12] ok
[13:49:10] OK, it looks like the prometheus-statsd-exporters are not installed. Let's find out why...
[13:49:17] PROBLEM - Checks that the local airflow scheduler for airflow @platform_eng is working properly on an-airflow1004 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-platform_eng /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1004.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:49:17] PROBLEM - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:49:18] PROBLEM - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:50:27] PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:50:27] PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:50:47] (SystemdUnitCrashLoop) firing: (2) crashloop on an-airflow1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:50:51] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
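A sketch of the rollout order agreed at 13:44 (exporter config on the Airflow hosts first, then scraping on the Prometheus host); run-puppet-agent is WMF's standard puppet wrapper, while the cumin host aliases here are hypothetical placeholders:

```
# Apply the exporter configuration on the Airflow hosts first...
sudo cumin 'A:airflow' 'run-puppet-agent'
# ...then enable the Prometheus scrape jobs once the exporters exist.
sudo cumin 'A:prometheus' 'run-puppet-agent'
```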
[13:52:09] I'm checking to see whether these alerts are genuine. According to systemctl, the airflow-scheduler service is up and running on all of these hosts.
[13:55:47] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:56:07] I believe that this is a problem with the monitoring, rather than with airflow. I can run this.
[13:56:12] https://www.irccloud.com/pastebin/rmrLofhd/
[13:59:47] Can't connect to the statsd-exporter on an-launcher with `ssh -t -N -L9112:localhost:9112 an-launcher1002.e`
[13:59:53] I'm going to check the service...
[14:00:11] statsd exporter not installed, I'm writing a patch now.
[14:01:30] +1
[14:03:07] https://gerrit.wikimedia.org/r/c/operations/puppet/+/984607
[14:03:24] PCC running now.
[14:03:49] This doesn't explain the change in behaviour in the monitoring, though.
[14:04:46] Maybe there was something in the airflow-dags deploy that affected this, then the restart of the airflow-scheduler triggered it.
[14:06:47] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (10bking) Per conversation with @pfischer yesterday, [[ https://phabricator.wikimedia.org/P54487 | here ]] are the latest numbers. The mismatche...
[14:09:58] aqu: prometheus-statsd-exporter is now installed, I restarted the airflow-scheduler service on each instance.
[14:10:49] https://usercontent.irccloud-cdn.com/file/nkXwFsJV/image.png
[14:11:11] Cool. I'm beginning to see metrics arriving in the dashboard.
[14:11:22] Yes, same :)
[14:12:20] I will continue to adjust the dashboard for the new data. Thanks for all your help btullis!
[14:12:47] You're very welcome. I have to find a way to deal with these pesky alerts.
[14:12:50] https://usercontent.irccloud-cdn.com/file/1n2YNUCf/image.png
[14:17:34] 10Data-Engineering (Sprint 8), 10Data Products, 10serviceops-radar: Use config-master.wikimedia.org/mediawiki.yaml to automatically switch code that depends on active datacenter - https://phabricator.wikimedia.org/T338796 (10lbowmaker)
[14:17:48] 10Data-Engineering (Sprint 8): [Iceberg Migration] Implement mechanism for automatic Iceberg data deletion and optimization - https://phabricator.wikimedia.org/T338065 (10lbowmaker)
[14:25:03] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (10BTullis)
[14:25:33] aqu: `ERROR - No module named 'wmf_airflow_common'` is coming from the airflow scheduler check.
[14:26:16] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (10BTullis) p:05Triage→03High
[14:28:52] I'm investigating.
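A sketch of checking the exporter through the tunnel opened at 13:59; port 9112 comes from the tunnel command, and /metrics is the standard Prometheus exporter endpoint:

```
# With the ssh tunnel forwarding local port 9112 to the exporter,
# a healthy statsd exporter answers on its /metrics endpoint.
curl -s http://localhost:9112/metrics | head
```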
[14:29:01] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:29:02] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @platform_eng is working properly on an-airflow1004 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-platform_eng /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1004.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:29:03] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:29:06] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:29:08] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:29:10] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:31:22] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (10BTullis) a:03BTullis
[14:35:29] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Discovery-Search (Current work): Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (10Gehel) a:05bking→03None
[14:35:34] btullis I think it's because the airflow-analytics jobs check should be run like the other commands: with a custom PYTHONPATH=/path/to/root/of/airflow-dags/ as an env variable.
[14:35:35] To enable the airflow metrics, a reference to some custom code in the `wmf_airflow_common` dir has been added to the configuration. It's expected to be found when running the `jobs check` command.
[14:35:37] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (10Antoine_Quhen) I think it's because the airflow-analytics jobs check should be run like the other commands: with a custom PYTHONPA...
[14:35:45] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Discovery-Search (Current work): Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (10Gehel) a:03pfischer
[14:37:14] Like here: https://github.com/wikimedia/operations-puppet/blob/f7c3eb56a9417571792b7636367f3c13e850bc83/modules/profile/manifests/airflow.pp#L199
[14:37:27] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (10Antoine_Quhen) https://github.com/wikimedia/operations-puppet/blob/f7c3eb56a9417571792b7636367f3c13e850bc83/modules/profile/manife...
[14:38:19] Great, I thought it was probably PYTHONPATH-related. (See my email to d-e-alerts.) I'll try this now.
[14:42:39] Yes, success.
[14:42:44] https://www.irccloud.com/pastebin/WUHEfUp4/
[14:43:24] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (10BTullis) This works if we change the check to run like this: ` btullis@an-launcher1002:/srv/deployment/airflow-dags/analytics$ /us...
[14:49:38] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Discovery-Search (Current work): Load Wikidata split graphs into test servers - https://phabricator.wikimedia.org/T350465 (10Gehel) 05Open→03Resolved
[14:49:45] 10Data-Platform-SRE (23/24 Q3 Milestone 1), 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (10Gehel)
[14:52:32] https://gerrit.wikimedia.org/r/c/operations/puppet/+/984614/
[14:53:37] 10Data-Platform-SRE (23/24 Q3 Milestone 1): [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10Stevemunene) We now have a sample DAG on the WMDE instance which should meet the acceptance criteria, huge thanks to @mforns for the help with the sample DAG. {F41614666} The WMDE...
[14:53:49] 10Data-Platform-SRE (23/24 Q3 Milestone 1): [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10Stevemunene)
[14:55:20] 10Data-Platform-SRE: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (10bking) Related issue mentioned by Volans: > wdqs1024 has a unit that is failing and complains that the service is not here at all: wmf_auto_restart_prometheus-...
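Putting the diagnosis above together, a sketch contrasting the failing check with the corrected invocation, using the @analytics instance as the example; both command forms appear verbatim in the alert texts in this log:

```
# Without PYTHONPATH, the check cannot import the helper module that the
# new metrics configuration references:
/usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics \
  /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob \
  --hostname an-launcher1002.eqiad.wmnet
# => ERROR - No module named 'wmf_airflow_common'

# Pointing PYTHONPATH at the root of the deployed airflow-dags checkout
# makes wmf_airflow_common importable, and the check succeeds:
/usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics \
  AIRFLOW_HOME=/srv/airflow-analytics \
  /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob \
  --hostname an-launcher1002.eqiad.wmnet
```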
[14:59:33] RECOVERY - Checks that the local airflow scheduler for airflow @platform_eng is working properly on an-airflow1004 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/platform_eng AIRFLOW_HOME=/srv/airflow-platform_eng /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1004.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:00:45] RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:01:03] RECOVERY - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/wmde AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:01:03] RECOVERY - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/search AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:01:51] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:04:45] 10Data-Platform-SRE (23/24 Q3 Milestone 1): [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10AndrewTavis_WMDE) Hey @Stevemunene! Thanks so much for the efforts here! Further thanks to the others at WMF who have helped along the way :) This is such an important step for ana...
[15:08:51] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): conda-analytics package install ConnectTimeoutError on stat1009 - https://phabricator.wikimedia.org/T353745 (10mpopov) 05Open→03Resolved It works now – weird! Thanks for looking into it!
[15:14:54] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (10BTullis) They're all working now, except for the research instance: {F41614688,width=60%} The error message...
[15:15:04] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (10BTullis) p:05High→03Medium
[15:15:51] aqu: Did you deploy the latest airflow-dags to an-airflow1002? The research instance? It's the only alert that remains.
[15:28:50] 10Data-Platform-SRE: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (10bking) Per IRC conversation in #wikimedia-sre, this issue has affected a few servers in the past (see T199911 and T265323). As such, we've [[ https://gerrit.w...
[15:39:26] Let me check.
[15:40:31] 10Data-Engineering, 10MediaWiki-General, 10Event-Platform: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 (10Ottomata)
[15:41:10] Yes, I did.
[15:41:46] OK, thanks. Any clue why we are seeing this, then? https://phabricator.wikimedia.org/T353806#9418827
[15:44:00] :/ the code is not up to date. I did deploy, but my previous git pull failed with:
[15:44:00] aqu@deploy2002:/srv/deployment/airflow-dags/research$ git pull
[15:44:00] Your configuration specifies to merge with the ref 'refs/heads/knowledge_gap_outputs'
[15:44:19] I'm checking.
[15:54:50] They are deploying from a custom branch which I need to patch, but sometimes they need to reconcile with main.
[15:59:37] OK, sure thing. I think that fab might know most about this instance.
[16:18:20] The merge is not obvious. I'm asking Fab.
[17:05:59] (03CR) 10Ottomata: [C: 03+2] Migrate MediaWikiPingback schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/938814 (https://phabricator.wikimedia.org/T323828) (owner: 10Phuedx)
[17:06:43] (03Merged) 10jenkins-bot: Migrate MediaWikiPingback schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/938814 (https://phabricator.wikimedia.org/T323828) (owner: 10Phuedx)
[17:26:55] 10Data-Platform-SRE, 10Patch-For-Review: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (10ops-monitoring-bot) Host rebooted by bking@cumin2002 with reason: None
[17:27:11] 10Data-Platform-SRE, 10Patch-For-Review: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (10bking) >>! In T352878#9418736, @bking wrote: > Some loose theories with proposals how to test them: > > - Hosts have outdated firmware/u...
[17:33:23] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: [NEEDS GROOMING] schema services should be moved to k8s - https://phabricator.wikimedia.org/T347421 (10BTullis) >>! In T347421#9216784, @Ottomata wrote: > I think this will be harder than it sounds. I don't think there is a way to automate dynamic d...
[17:55:47] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:59:08] aqu: I have added downtime until Jan 2nd for the an-airflow1002 alert. I can remove it if we get the fix deployed, but given that the service is working I'd like to stop the alert noise first.
[18:01:01] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Restore service for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347284 (10bking)
[18:01:18] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (10bking) 05Open→03Resolved I've rolled out the Puppet patches and confirmed they are working as expected. Closing...
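A sketch of inspecting the research checkout's branch tracking after the failed pull at 15:44; the path comes from the log, while the final reset to main is a hypothetical remediation (the log only confirms airflow-dags/main was eventually deployed, at 21:07 below):

```
# Show which remote ref each local branch is configured to merge with.
cd /srv/deployment/airflow-dags/research
git branch -vv
git config branch.$(git rev-parse --abbrev-ref HEAD).merge
# Hypothetical: make the checkout track main again.
# git branch --set-upstream-to=origin/main
```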
[18:03:19] 10Data-Platform-SRE, 10Patch-For-Review: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (10RKemper)
[18:34:42] (SystemdUnitFailed) firing: wmf_auto_restart_airflow-kerberos@research.service Failed on an-airflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:36:05] PROBLEM - Check systemd state on an-airflow1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-kerberos@research.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:42:45] 10Data-Engineering, 10Observability-Logging, 10Traffic, 10Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (10Ottomata) To do this migration plan ^, we'd need Kafka jumbo to support 2x webrequest volume while we migrate. Let's check with Data Platform...
[18:43:59] !log re-ran Airflow DAG unique_devices_per_domain_monthly for 2023-11
[18:44:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:26:55] !log re-ran Airflow DAG unique_devices_per_domain_daily for 2023-11-08
[20:26:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:27:46] !log re-ran Airflow DAG unique_devices_per_project_family_daily for 2023-11-08
[20:27:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:56:34] !log re-ran Airflow DAG cassandra_load_unique_devices_daily for 2023-11-08
[20:56:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:01:58] RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/research AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[21:05:21] 10Data-Engineering (Sprint 6), 10Patch-For-Review: [Data Quality] Metrics Alerting - https://phabricator.wikimedia.org/T352685 (10gmodena) f/up to some conversations we had off phab. We will initially alert by email, and persist alerts in Iceberg to aid analysis and troubleshooting (via Superset). To start we'...
[21:07:40] btullis I've just deployed airflow-dags/main on the research instance after a go from fab. It should cause no more trouble now, and you can remove the alert downtime :)
[21:10:49] (SystemdUnitCrashLoop) resolved: (3) crashloop on an-airflow1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[21:15:29] 10Data-Engineering (Sprint 6), 10Patch-For-Review: [Data Quality] Metrics Alerting - https://phabricator.wikimedia.org/T352685 (10CodeReviewBot) gmodena updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/550 Draft: webrequest: add metrics generation and quality check ale...
[21:18:54] (03PS23) 10Gmodena: refinery-job: add WebrequestMetrics. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763)
[21:19:28] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[21:19:52] (03CR) 10Gmodena: refinery-job: add WebrequestMetrics. (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763) (owner: 10Gmodena)
[21:22:57] (03PS24) 10Gmodena: refinery-job: add WebrequestMetrics. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763)
[21:34:48] !log re-ran Airflow DAG cassandra_load_unique_devices_monthly for 2023-11
[21:34:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:04:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[22:05:21] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10RKemper)
[22:21:53] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10RKemper)
[22:22:58] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10RKemper) Decom cookbook ran: https://sal.toolforge.org/log/tSXrhYwBhuQtenzvzt4I
[22:23:07] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10RKemper)
[22:24:53] (03CR) 10Ottomata: "Thanks! left some comments." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763) (owner: 10Gmodena)
[22:25:44] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ryankemper@cumin1002 for hosts: `wdqs[1006-1008].eqiad.wmnet` - wdqs1006.eqiad.wmnet (**FAIL...
[22:28:42] !log re-ran Airflow DAG druid_load_unique_devices_per_domain_daily_aggregated_monthly for 2023-11
[22:28:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:34:58] (SystemdUnitFailed) firing: wmf_auto_restart_airflow-kerberos@research.service Failed on an-airflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:35:18] !log re-ran Airflow DAG druid_load_unique_devices_per_domain_monthly for 2023-11
[22:35:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:40:14] !log re-ran Airflow DAG druid_load_unique_devices_per_project_family_daily_aggregated_monthly for 2023-11
[22:40:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:45:44] !log re-ran Airflow DAG druid_load_unique_devices_per_project_family_monthly for 2023-11
[22:45:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:52:57] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10bking) a:05RKemper→03bking
[23:16:07] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10RKemper) After talking in the #wikimedia-sre IRC channel, I'll run the `sre.network.configure-switch-interfaces` myself, and then Volans will take care of the puppetdb/debmonitor...