[00:12:59] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:12:59] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:22:43] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:43:24] 10Data-Engineering (Sprint 6), 10Patch-For-Review: [Iceberg Migration] Migrate aqs hourly tables to Iceberg - https://phabricator.wikimedia.org/T352669 (10CodeReviewBot) tchin opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/569 Draft: Add iceberg version of aqs_hourly...
[07:51:51] 10Data-Platform-SRE, 10Infrastructure-Foundations: Fix IPv6 service IP ranges for all Kubernetes clusters - https://phabricator.wikimedia.org/T353705 (10elukey) @Gehel @CDanis we should change the network subnets in Netbox and update puppet before closing, but I'd delegate this to specific teams to avoid messi...
[08:00:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:02:44] (SystemdUnitFailed) resolved: monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:26:56] 10Data-Platform-SRE: Decom an-coord100[1-2] - https://phabricator.wikimedia.org/T353774 (10Gehel)
[09:27:07] 10Data-Platform-SRE: Decom an-coord100[1-2] - https://phabricator.wikimedia.org/T353774 (10Gehel)
[09:27:09] 10Data-Platform-SRE: Refresh hadoop coordinators an-coord100[1-2] with an-coord[3-4] - https://phabricator.wikimedia.org/T332572 (10Gehel)
[09:28:08] 10Data-Platform-SRE: Decom an-master100[1-2] - https://phabricator.wikimedia.org/T353775 (10Gehel)
[09:28:35] 10Data-Platform-SRE: Decom an-master100[1-2] - https://phabricator.wikimedia.org/T353775 (10Gehel)
[09:28:37] 10Data-Platform-SRE (23/24 Q3 Milestone 1): Upgrade hadoop master to bullseye - https://phabricator.wikimedia.org/T332573 (10Gehel)
[09:29:27] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol)
[09:29:29] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Document how to browse the History server locally - https://phabricator.wikimedia.org/T353232 (10brouberol) 05Open→03Resolved I have created https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark_History, which contai...
[09:29:59] 10Data-Platform-SRE: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Gehel)
[09:30:49] 10Data-Platform-SRE: Decom an-master100[1-2] - https://phabricator.wikimedia.org/T353775 (10Gehel) p:05Triage→03High
[09:30:51] 10Data-Platform-SRE: Decom an-coord100[1-2] - https://phabricator.wikimedia.org/T353774 (10Gehel) p:05Triage→03High
[09:30:56] 10Data-Platform-SRE: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Gehel) p:05Triage→03High
[09:35:18] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol)
[09:36:18] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol) The server is available at https://yarn.wikimedia.org/history-server/. We will enable the collection of historical metrics for all spa...
[09:49:39] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10MoritzMuehlenhoff) 05Resolved→03Open krbkdc.logs have increased by a lot (used to be ~ 2.5G uncompressed per day, now at 25G since just 0:00 UTC), reopening
[09:58:02] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): conda-analytics package install ConnectTimeoutError on stat1009 - https://phabricator.wikimedia.org/T353745 (10Gehel)
[09:58:19] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): conda-analytics package install ConnectTimeoutError on stat1009 - https://phabricator.wikimedia.org/T353745 (10Gehel) a:03BTullis
[10:00:39] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): conda-analytics package install ConnectTimeoutError on stat1009 - https://phabricator.wikimedia.org/T353745 (10BTullis) p:05Triage→03High
[10:03:33] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10MoritzMuehlenhoff) >>! In T337906#9417847, @MoritzMuehlenhoff wrote: > krbkdc.logs have increased by a lot (used to be ~ 2.5G uncompressed per day...
[10:38:39] 10Data-Platform-SRE: Decommission an-tool1010 - https://phabricator.wikimedia.org/T353782 (10Gehel)
[10:38:41] 10Data-Platform-SRE: Decommission an-tool1010 - https://phabricator.wikimedia.org/T353782 (10Gehel)
[10:38:47] 10Data-Engineering, 10Data-Platform-SRE, 10Epic, 10Patch-For-Review: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (10Gehel)
[10:40:02] 10Data-Platform-SRE: Decommission an-tool1010 - https://phabricator.wikimedia.org/T353782 (10Gehel) p:05Triage→03High
[10:41:01] 10Data-Platform-SRE: Decommission an-worker10[78-95] & an-worker1116 - https://phabricator.wikimedia.org/T353784 (10Gehel)
[10:41:10] 10Data-Platform-SRE: Decommission an-worker10[78-95] & an-worker1116 - https://phabricator.wikimedia.org/T353784 (10Gehel)
[10:41:12] 10Data-Platform-SRE: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Gehel)
[10:41:28] 10Data-Platform-SRE: Decommission an-worker10[78-95] & an-worker1116 - https://phabricator.wikimedia.org/T353784 (10Gehel) p:05Triage→03High
[10:41:58] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): conda-analytics package install ConnectTimeoutError on stat1009 - https://phabricator.wikimedia.org/T353745 (10BTullis) I can't reproduce this. I just tried the following on stat1009. Create a new environment: ` btullis@stat1009:~$ conda-analytics-clone T353745-btull...
[10:43:18] 10Data-Platform-SRE: Decom EOL stats servers stat100[4-7] - https://phabricator.wikimedia.org/T353785 (10Gehel)
[10:43:47] 10Data-Platform-SRE: Decom EOL stats servers stat100[4-7] - https://phabricator.wikimedia.org/T353785 (10Gehel) p:05Triage→03High
[10:44:47] 10Data-Platform-SRE: Plan to decom an-launcher1002 - https://phabricator.wikimedia.org/T353786 (10Gehel)
[10:45:18] 10Data-Platform-SRE: Plan to decom an-launcher1002 - https://phabricator.wikimedia.org/T353786 (10Gehel) p:05Triage→03High
[10:45:34] 10Data-Platform-SRE: Decom dumpsdata100[1-2] - https://phabricator.wikimedia.org/T353787 (10Gehel)
[10:46:04] 10Data-Platform-SRE: Decom dumpsdata100[1-2] - https://phabricator.wikimedia.org/T353787 (10Gehel) p:05Triage→03High
[10:47:07] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10Gehel)
[10:48:01] 10Data-Platform-SRE: Add kafka-stretch100[1-2] to the hadoop cluster - https://phabricator.wikimedia.org/T353788 (10Gehel)
[10:48:45] 10Data-Platform-SRE: Add kafka-stretch100[1-2] to the hadoop cluster - https://phabricator.wikimedia.org/T353788 (10Gehel) p:05Triage→03Medium
[10:49:12] 10Data-Platform-SRE: Re-purpose kafka-stretch200[1-2] as DSE workers in codfw - https://phabricator.wikimedia.org/T353789 (10Gehel)
[10:49:40] 10Data-Platform-SRE: Re-purpose kafka-stretch200[1-2] as DSE workers in codfw - https://phabricator.wikimedia.org/T353789 (10Gehel) p:05Triage→03Low
[10:53:00] 10Data-Platform-SRE (23/24 Q3 Milestone 1), 10SRE-OnFire, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Update WDQS Runbook following update lag incident - https://phabricator.wikimedia.org/T336577 (10Gehel) a:05bking→03None
[10:53:08] 10Data-Platform-SRE: Create helmfile deployment files for superset and superset-next - https://phabricator.wikimedia.org/T353790 (10BTullis)
[10:53:14] 10Data-Platform-SRE (23/24 Q3 Milestone 1), 10SRE-OnFire, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Update WDQS Runbook following update lag incident - https://phabricator.wikimedia.org/T336577 (10Gehel)
[10:57:14] 10Data-Engineering, 10Data-Platform-SRE, 10Epic, 10Patch-For-Review: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (10BTullis)
[11:05:09] Hello btullis, the Airflow dashboard is now running: https://grafana-rw.wikimedia.org/d/5jPZXONIk/airflow
[11:05:09] When you have time, we could deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/984510 to make all Airflow instances send metrics :)
[11:05:09] Currently running pcc on it.
[11:25:28] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[11:28:56] 10Data-Platform-SRE: Configure OIDC Authentication for Superset on K8S - https://phabricator.wikimedia.org/T353794 (10BTullis)
[11:30:28] 10Data-Platform-SRE: Try to improve the LDAP integration for Superset user account creation - https://phabricator.wikimedia.org/T297120 (10BTullis)
[11:30:32] 10Data-Platform-SRE: Configure OIDC Authentication for Superset on K8S - https://phabricator.wikimedia.org/T353794 (10BTullis)
[11:30:35] 10Data-Engineering, 10Data-Platform-SRE, 10Epic, 10Patch-For-Review: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (10BTullis)
[11:42:36] 10Data-Platform-SRE: Try to improve the LDAP integration for Superset user account creation - https://phabricator.wikimedia.org/T297120 (10BTullis) There is a useful description of the problem in [[https://docs.google.com/document/d/1PT9cRVFtN23GlWfYo-_bTUzVcK12-dSSJcX-SV4rtqs/edit#heading=h.264n1btdzjgk|this do...
[11:44:46] 10Data-Platform-SRE: Assign users' real names and email addresses to their Superset user accounts automatically - https://phabricator.wikimedia.org/T297120 (10BTullis)
[12:07:31] aqu: This is excellent! I'll review that patch now.
[12:08:45] Am I right in thinking that the statsd_exporter isn't restarted when the airflow config file is changed? Maybe we can find a way to make this happen automatically.
[12:21:32] aqu: I created https://gerrit.wikimedia.org/r/c/operations/puppet/+/984520 as well, to start scraping these metrics.
[12:21:46] aqu: Ready when you are for the deploy.
[13:10:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:10:35] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10BTullis) @MoritzMuehlenhoff - Thanks for sorting out the more frequent log rotation as a workaround. At the moment, it seems that more than half of the log entries being...
[13:24:47] btullis I think only one file needs to trigger a statsd-exporter restart: /etc/prometheus/statsd_exporter.conf
[13:27:36] I'm ready for the deploy. As the opsweek etherpad https://etherpad.wikimedia.org/p/analytics-weekly-train is empty, I'm going to scap deploy on all instances, and it should be a no-op.
[13:27:36] The idea is to make sure we have the airflow-dags code version compatible with the metric configuration.
[13:28:48] Cool. Let's do the auto restart in a separate patch. I can do a manual restart after this puppet run.
[13:31:47] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10MoritzMuehlenhoff) So with the current hourly compression we're always safely in the realm where compression of the chunks completes, which should prevent the server...
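A minimal sketch of the manual restart agreed at 13:28, assuming the unit is named after the Debian package prometheus-statsd-exporter (the package named at 14:09 below); the config path comes from the 13:24 message:

```
# Restart the exporter so it re-reads /etc/prometheus/statsd_exporter.conf,
# then confirm the unit came back up cleanly.
sudo systemctl restart prometheus-statsd-exporter.service
systemctl status prometheus-statsd-exporter.service --no-pager
```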
[13:35:59] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE: Reduce Kerberos logs produced by Presto - https://phabricator.wikimedia.org/T353802 (10MoritzMuehlenhoff)
[13:36:08] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE: Reduce Kerberos logs produced by Presto - https://phabricator.wikimedia.org/T353802 (10MoritzMuehlenhoff) p:05Triage→03Medium
[13:36:35] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10MoritzMuehlenhoff) 05Open→03Resolved Closing the task since the immediate issue is resolved, I created https://phabricator.wikimedia.org/T353802 as a followup
[13:39:23] btullis It's ready to deploy
[13:39:40] OK, proceeding.
[13:39:48] Your patch goes first https://gerrit.wikimedia.org/r/c/operations/puppet/+/984520
[13:40:41] Does it? I would have thought that your patch goes first :-) What am I missing?
[13:42:49] I'm thinking again and I'm not sure. Are the statsd-exporters already running beside the airflow instances?
[13:43:37] If this patch only defines the scraping then you are right: your patch goes after. https://gerrit.wikimedia.org/r/c/operations/puppet/+/984520
[13:44:43] I've done them both. I'll run puppet on the airflow hosts first, then the prometheus host. Should be fine 🤞
[13:45:12] ok
[13:49:10] OK, it looks like the prometheus-statsd-exporters are not installed. Let's find out why...
[13:49:17] PROBLEM - Checks that the local airflow scheduler for airflow @platform_eng is working properly on an-airflow1004 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-platform_eng /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1004.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:49:17] PROBLEM - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:49:18] PROBLEM - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:50:27] PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:50:27] PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:50:47] (SystemdUnitCrashLoop) firing: (2) crashloop on an-airflow1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:50:51] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
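A sketch of the rollout order agreed at 13:44 (exporter config on the Airflow hosts first, then scraping on the Prometheus host); run-puppet-agent is WMF's standard puppet wrapper, while the cumin host aliases here are hypothetical placeholders:

```
# Apply the exporter configuration on the Airflow hosts first...
sudo cumin 'A:airflow' 'run-puppet-agent'
# ...then enable the Prometheus scrape jobs once the exporters exist.
sudo cumin 'A:prometheus' 'run-puppet-agent'
```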
[13:52:09] I'm checking to see whether these alerts are genuine. According to systemctl, the airflow-scheduler service is up and running on all of these hosts.
[13:55:47] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:56:07] I believe that this is a problem with the monitoring, rather than with airflow. I can run this.
[13:56:12] https://www.irccloud.com/pastebin/rmrLofhd/
[13:59:47] Can't connect to the statsd-exporter on an-launcher with `ssh -t -N -L9112:localhost:9112 an-launcher1002.e`
[13:59:53] I'm going to check the service...
[14:00:11] statsd exporter not installed, I'm writing a patch now.
[14:01:30] +1
[14:03:07] https://gerrit.wikimedia.org/r/c/operations/puppet/+/984607
[14:03:24] PCC running now.
[14:03:49] This doesn't explain the change in behaviour in the monitoring, though.
[14:04:46] Maybe there was something in the airflow-dags deploy that affected this, then the restart of the airflow-scheduler triggered it.
[14:06:47] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (10bking) Per conversation with @pfischer yesterday, [[ https://phabricator.wikimedia.org/P54487 | here ]] are the latest numbers. The mismatche...
[14:09:58] aqu: prometheus-statsd-exporter is now installed, I restarted the airflow-scheduler service on each instance.
[14:10:49] https://usercontent.irccloud-cdn.com/file/nkXwFsJV/image.png
[14:11:11] Cool. I'm beginning to see metrics arriving in the dashboard.
[14:11:22] Yes, same :)
[14:12:20] I will continue to adjust the dashboard for the new data. Thanks for all your help btullis!
[14:12:47] You're very welcome. I have to find a way to deal with these pesky alerts.
[14:12:50] https://usercontent.irccloud-cdn.com/file/1n2YNUCf/image.png
[14:17:34] 10Data-Engineering (Sprint 8), 10Data Products, 10serviceops-radar: Use config-master.wikimedia.org/mediawiki.yaml to automatically switch code that depends on active datacenter - https://phabricator.wikimedia.org/T338796 (10lbowmaker)
[14:17:48] 10Data-Engineering (Sprint 8): [Iceberg Migration] Implement mechanism for automatic Iceberg data deletion and optimization - https://phabricator.wikimedia.org/T338065 (10lbowmaker)
[14:25:03] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (10BTullis)
[14:25:33] aqu: `ERROR - No module named 'wmf_airflow_common'` is coming from the airflow scheduler check.
[14:26:16] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (10BTullis) p:05Triage→03High
[14:28:52] I'm investigating.
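A sketch of checking the exporter through the tunnel opened at 13:59; port 9112 comes from the tunnel command, and /metrics is the standard Prometheus exporter endpoint:

```
# With the ssh tunnel forwarding local port 9112 to the exporter,
# a healthy statsd exporter answers on its /metrics endpoint.
curl -s http://localhost:9112/metrics | head
```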
[14:29:01] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:29:02] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @platform_eng is working properly on an-airflow1004 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-platform_eng /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1004.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:29:03] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:29:06] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:29:08] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:29:10] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:31:22] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (10BTullis) a:03BTullis
[14:35:29] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Discovery-Search (Current work): Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (10Gehel) a:05bking→03None
[14:35:34] btullis I think it's because the airflow-analytics jobs check should be run like the other commands: with a custom PYTHONPATH=/path/to/root/of/airflow-dags/ as an env variable.
[14:35:35] To enable the airflow metrics, a reference to some custom code in the `wmf_airflow_common` dir has been added to the configuration. It's expected to be found when running the `jobs check` command.
[14:35:37] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (10Antoine_Quhen) I think it's because the airflow-analytics jobs check should be run like the other commands: with a custom PYTHONPA...
[14:35:45] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Discovery-Search (Current work): Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (10Gehel) a:03pfischer
[14:37:14] Like here: https://github.com/wikimedia/operations-puppet/blob/f7c3eb56a9417571792b7636367f3c13e850bc83/modules/profile/manifests/airflow.pp#L199
[14:37:27] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (10Antoine_Quhen) https://github.com/wikimedia/operations-puppet/blob/f7c3eb56a9417571792b7636367f3c13e850bc83/modules/profile/manife...
[14:38:19] Great, I thought it was probably PYTHONPATH-related. (See my email to d-e-alerts.) I'll try this now.
[14:42:39] Yes, success.
[14:42:44] https://www.irccloud.com/pastebin/WUHEfUp4/
[14:43:24] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (10BTullis) This works if we change the check to run like this: ` btullis@an-launcher1002:/srv/deployment/airflow-dags/analytics$ /us...
[14:49:38] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Discovery-Search (Current work): Load Wikidata split graphs into test servers - https://phabricator.wikimedia.org/T350465 (10Gehel) 05Open→03Resolved
[14:49:45] 10Data-Platform-SRE (23/24 Q3 Milestone 1), 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (10Gehel)
[14:52:32] https://gerrit.wikimedia.org/r/c/operations/puppet/+/984614/
[14:53:37] 10Data-Platform-SRE (23/24 Q3 Milestone 1): [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10Stevemunene) We now have a sample DAG on the WMDE instance which should meet the acceptance criteria, huge thanks to @mforns for the help with the sample DAG. {F41614666} The WMDE...
[14:53:49] 10Data-Platform-SRE (23/24 Q3 Milestone 1): [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10Stevemunene)
[14:55:20] 10Data-Platform-SRE: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (10bking) Related issue mentioned by Volans: > wdqs1024 has a unit that is failing and complains that the service is not here at all: wmf_auto_restart_prometheus-...
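Putting the diagnosis above together, a sketch contrasting the failing check with the corrected invocation, using the @analytics instance as the example; both command forms appear verbatim in the alert texts in this log:

```
# Without PYTHONPATH, the check cannot import the helper module that the
# new metrics configuration references:
/usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics \
  /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob \
  --hostname an-launcher1002.eqiad.wmnet
# => ERROR - No module named 'wmf_airflow_common'

# Pointing PYTHONPATH at the root of the deployed airflow-dags checkout
# makes wmf_airflow_common importable, and the check succeeds:
/usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics \
  AIRFLOW_HOME=/srv/airflow-analytics \
  /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob \
  --hostname an-launcher1002.eqiad.wmnet
```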
[14:59:33] RECOVERY - Checks that the local airflow scheduler for airflow @platform_eng is working properly on an-airflow1004 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/platform_eng AIRFLOW_HOME=/srv/airflow-platform_eng /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1004.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:00:45] RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:01:03] RECOVERY - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/wmde AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:01:03] RECOVERY - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/search AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:01:51] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:04:45] 10Data-Platform-SRE (23/24 Q3 Milestone 1): [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10AndrewTavis_WMDE) Hey @Stevemunene! Thanks so much for the efforts here! Further thanks to the others at WMF who have helped along the way :) This is such an important step for ana...
[15:08:51] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): conda-analytics package install ConnectTimeoutError on stat1009 - https://phabricator.wikimedia.org/T353745 (10mpopov) 05Open→03Resolved It works now – weird! Thanks for looking into it!
[15:14:54] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (10BTullis) They're all working now, except for the research instance: {F41614688,width=60%} The error message...
[15:15:04] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (10BTullis) p:05High→03Medium
[15:15:51] aqu: Did you deploy the latest airflow-dags to an-airflow1002? The research instance? It's the only alert that remains.
[15:28:50] 10Data-Platform-SRE: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (10bking) Per IRC conversation in #wikimedia-sre, this issue has affected a few servers in the past (see T199911 and T265323). As such, we've [[ https://gerrit.w...
[15:39:26] Let me check.
[15:40:31] 10Data-Engineering, 10MediaWiki-General, 10Event-Platform: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 (10Ottomata)
[15:41:10] Yes, I did.
[15:41:46] OK, thanks. Any clue why we are seeing this, then? https://phabricator.wikimedia.org/T353806#9418827
[15:44:00] :/ the code is not up to date. I did deploy, but my previous git pull failed with:
[15:44:00] aqu@deploy2002:/srv/deployment/airflow-dags/research$ git pull
[15:44:00] Your configuration specifies to merge with the ref 'refs/heads/knowledge_gap_outputs'
[15:44:19] I'm checking.
[15:54:50] They are deploying from a custom branch which I need to patch, but sometimes they need to reconcile with main.
[15:59:37] OK, sure thing. I think that fab might know most about this instance.
[16:18:20] The merge is not obvious. I'm asking Fab.
[17:05:59] (03CR) 10Ottomata: [C: 03+2] Migrate MediaWikiPingback schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/938814 (https://phabricator.wikimedia.org/T323828) (owner: 10Phuedx)
[17:06:43] (03Merged) 10jenkins-bot: Migrate MediaWikiPingback schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/938814 (https://phabricator.wikimedia.org/T323828) (owner: 10Phuedx)
[17:26:55] 10Data-Platform-SRE, 10Patch-For-Review: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (10ops-monitoring-bot) Host rebooted by bking@cumin2002 with reason: None
[17:27:11] 10Data-Platform-SRE, 10Patch-For-Review: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (10bking) >>! In T352878#9418736, @bking wrote: > Some loose theories with proposals how to test them: > > - Hosts have outdated firmware/u...
[17:33:23] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: [NEEDS GROOMING] schema services should be moved to k8s - https://phabricator.wikimedia.org/T347421 (10BTullis) >>! In T347421#9216784, @Ottomata wrote: > I think this will be harder than it sounds. I don't think there is a way to automate dynamic d...
[17:55:47] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:59:08] aqu: I have added downtime until Jan 2nd for the an-airflow1002 alert. I can remove it if we get the fix deployed, but given that the service is working I'd like to stop the alert noise first.
[18:01:01] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Restore service for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347284 (10bking)
[18:01:18] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (10bking) 05Open→03Resolved I've rolled out the Puppet patches and confirmed they are working as expected. Closing...
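A sketch of inspecting the research checkout's branch tracking after the failed pull at 15:44; the path comes from the log, while the final reset to main is a hypothetical remediation (the log only confirms airflow-dags/main was eventually deployed, at 21:07 below):

```
# Show which remote ref each local branch is configured to merge with.
cd /srv/deployment/airflow-dags/research
git branch -vv
git config branch.$(git rev-parse --abbrev-ref HEAD).merge
# Hypothetical: make the checkout track main again.
# git branch --set-upstream-to=origin/main
```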
[18:03:19] 10Data-Platform-SRE, 10Patch-For-Review: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (10RKemper)
[18:34:42] (SystemdUnitFailed) firing: wmf_auto_restart_airflow-kerberos@research.service Failed on an-airflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:36:05] PROBLEM - Check systemd state on an-airflow1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-kerberos@research.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:42:45] 10Data-Engineering, 10Observability-Logging, 10Traffic, 10Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (10Ottomata) To do this migration plan ^, we'd need Kafka jumbo to support 2x webrequest volume while we migrate. Let's check with Data Platform...
[18:43:59] !log re-ran Airflow DAG unique_devices_per_domain_monthly for 2023-11
[18:44:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:26:55] !log re-ran Airflow DAG unique_devices_per_domain_daily for 2023-11-08
[20:26:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:27:46] !log re-ran Airflow DAG unique_devices_per_project_family_daily for 2023-11-08
[20:27:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:56:34] !log re-ran Airflow DAG cassandra_load_unique_devices_daily for 2023-11-08
[20:56:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:01:58] RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/research AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[21:05:21] 10Data-Engineering (Sprint 6), 10Patch-For-Review: [Data Quality] Metrics Alerting - https://phabricator.wikimedia.org/T352685 (10gmodena) f/up to some conversations we had off phab. We will initially alert by email, and persist alerts in Iceberg to aid analysis and troubleshooting (via Superset). To start we'...
[21:07:40] btullis I've just deployed airflow-dags/main on the research instance after a go from fab. It should cause no more trouble now, and you can remove the alert downtime :)
[21:10:49] (SystemdUnitCrashLoop) resolved: (3) crashloop on an-airflow1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[21:15:29] 10Data-Engineering (Sprint 6), 10Patch-For-Review: [Data Quality] Metrics Alerting - https://phabricator.wikimedia.org/T352685 (10CodeReviewBot) gmodena updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/550 Draft: webrequest: add metrics generation and quality check ale...
[21:18:54] (03PS23) 10Gmodena: refinery-job: add WebrequestMetrics. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763)
[21:19:28] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[21:19:52] (03CR) 10Gmodena: refinery-job: add WebrequestMetrics. (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763) (owner: 10Gmodena)
[21:22:57] (03PS24) 10Gmodena: refinery-job: add WebrequestMetrics. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763)
[21:34:48] !log re-ran Airflow DAG cassandra_load_unique_devices_monthly for 2023-11
[21:34:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:04:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[22:05:21] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10RKemper)
[22:21:53] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10RKemper)
[22:22:58] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10RKemper) Decom cookbook ran: https://sal.toolforge.org/log/tSXrhYwBhuQtenzvzt4I
[22:23:07] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10RKemper)
[22:24:53] (03CR) 10Ottomata: "Thanks! left some comments." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763) (owner: 10Gmodena)
[22:25:44] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ryankemper@cumin1002 for hosts: `wdqs[1006-1008].eqiad.wmnet` - wdqs1006.eqiad.wmnet (**FAIL...
[22:28:42] !log re-ran Airflow DAG druid_load_unique_devices_per_domain_daily_aggregated_monthly for 2023-11
[22:28:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:34:58] (SystemdUnitFailed) firing: wmf_auto_restart_airflow-kerberos@research.service Failed on an-airflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:35:18] !log re-ran Airflow DAG druid_load_unique_devices_per_domain_monthly for 2023-11
[22:35:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:40:14] !log re-ran Airflow DAG druid_load_unique_devices_per_project_family_daily_aggregated_monthly for 2023-11
[22:40:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:45:44] !log re-ran Airflow DAG druid_load_unique_devices_per_project_family_monthly for 2023-11
[22:45:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:52:57] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10bking) a:05RKemper→03bking
[23:16:07] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10RKemper) After talking in the #wikimedia-sre IRC channel, I'll run the `sre.network.configure-switch-interfaces` myself, and then Volans will take care of the puppetdb/debmonitor...