[00:59:30] (SystemdUnitFailed) firing: (15) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:11:56] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 1.278% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:19:30] (SystemdUnitFailed) firing: (15) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:36:32] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[04:35:58] Data-Engineering, Data Products, Data-Platform: Temporary Accounts Initiative (IP Masking) - Add user_is_temp to data tables - https://phabricator.wikimedia.org/T356701 (VirginiaPoundstone)
[05:11:56] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.9774% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:20:57] (SystemdUnitFailed) firing: (14) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:36:32] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[08:48:25] (SystemdUnitFailed) firing: (14) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:07:44] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Ensure necessary firewall rules are open between the DSE worker nodes and external services - https://phabricator.wikimedia.org/T356623 (brouberol) Superset -> dbstore1009: ✅ ` brouberol@dse-k8s-worker1001:~$ sudo nsen...
[09:11:57] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.8053% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:36:32] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[09:52:42] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Ensure necessary firewall rules are open between the DSE worker nodes and external services - https://phabricator.wikimedia.org/T356623 (brouberol) Hadoop masters and workers are now working (they were missing a network...
[10:18:20] (CR) Joal: "Some nits" [analytics/refinery] - https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628) (owner: Gmodena)
[10:21:05] (CR) Joal: [C: +1] "One comment idea." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/997485 (https://phabricator.wikimedia.org/T356628) (owner: Gmodena)
[10:49:10] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Products, Patch-For-Review: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (BTullis)
[10:49:24] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Products, Patch-For-Review: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (BTullis)
[10:49:50] Data-Engineering, Data Products, Data-Platform, Temporary accounts: Temporary Accounts Initiative (IP Masking) - Add user_is_temp to data tables - https://phabricator.wikimedia.org/T356701 (kostajh)
[11:03:08] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Products, Patch-For-Review: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-web1001.eqiad.wm...
[11:03:27] !log reimaging an-web1001 to bullseye for T349398
[11:03:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:03:30] T349398: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398
[11:08:35] btullis: before you merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/997797/ can I quickly switch these to Puppet 7 via hieradata/hosts? We did that previously for stat1009 already, so the setup is known to be compatible
[11:09:01] the only reason it's not applied on the role level is because there are still buster stat hosts
[11:09:10] Yes, please feel free.
[11:09:15] ok, on it
[11:14:45] btullis: turns out, I already did that back in December :-)
[11:15:15] Ah, cool. Nice one.
[11:18:25] (SystemdUnitFailed) firing: (21) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:31:03] Data-Engineering, Anti-Harassment, Data-Persistence, Temporary accounts, and 2 others: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Ladsgroup) The schema was deployed long time ago, we just started writing to the field but the default value was 0 and now we a...
[11:42:58] btullis: reimages are currently broken, which also affects your an-web1001 reimage. https://gerrit.wikimedia.org/r/c/operations/puppet/+/997804 will unblock this shortly, then you can retrigger the cookbook
[11:43:31] I told him in -ops earlier, he should know :D
[11:43:53] ah, missed that!
[11:44:05] better two pings than zero ;)
[11:45:17] !log add new an-workers to analytics_cluster hadoop worker role analytics_cluster::hadoop::worker T353776
[11:45:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:45:20] T353776: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776
[11:49:50] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Products: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-web1001.eqiad.wmnet with OS bullseye execu...
[11:53:25] (SystemdUnitFailed) firing: (23) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:55:45] Data-Engineering: [NEEDS GROOMING][SPIKE] Extract refine schema management into a dedicated tool - https://phabricator.wikimedia.org/T356762 (gmodena)
[11:56:16] (HdfsCapacityRemainingPercent) resolved: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[11:58:25] (SystemdUnitFailed) firing: (25) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:59:18] PROBLEM - Check systemd state on an-worker1164 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:20] RECOVERY - Check systemd state on an-worker1164 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:02:24] PROBLEM - Check systemd state on an-worker1168 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:03:25] (SystemdUnitFailed) firing: (28) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:03:32] --^ onboarding new workers an-worker1157-1175, we expect some hadoop-hdfs-datanode.service alerts, adding a brief silence to reduce the noise
[12:05:38] RECOVERY - Check systemd state on an-worker1168 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:05:42] (PS2) Gmodena: hql: add data quality DDL. [analytics/refinery] - https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628)
[12:06:37] Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (Stevemunene)
[12:08:25] (SystemdUnitFailed) firing: (39) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:10:01] (PS2) Gmodena: data-quality: rename source table column [analytics/refinery/source] - https://gerrit.wikimedia.org/r/997485 (https://phabricator.wikimedia.org/T356628)
[12:10:53] btullis: confirmed reimage should work again
[12:11:16] volans: Ack, many thanks.
[12:13:27] (SystemdUnitFailed) firing: (41) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:14:48] PROBLEM - HDFS topology check on an-master1003 is CRITICAL: CRITICAL: There is at least one node in the default rack. https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_topology_check
[12:16:34] PROBLEM - Check systemd state on an-worker1175 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:20] PROBLEM - Check systemd state on an-worker1167 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:58] PROBLEM - Check systemd state on an-worker1160 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:20] PROBLEM - Check systemd state on an-worker1173 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:27] (SystemdUnitFailed) firing: (47) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:18:36] RECOVERY - Check systemd state on an-worker1175 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:19:18] PROBLEM - Check systemd state on an-worker1171 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:20:51] Data-Engineering, Community-Tech, Multiblocks, Data Products (Data Products Sprint 09), Event-Platform: Investigate if the new 'Multiblocks' user blocks feature affects the mediawiki.user-blocks-change event stream - https://phabricator.wikimedia.org/T356597 (WDoranWMF) p:Triage→High
[12:21:30] RECOVERY - Check systemd state on an-worker1171 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:21:32] RECOVERY - Check systemd state on an-worker1173 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:22:28] PROBLEM - Hadoop DataNode on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[12:22:36] RECOVERY - Check systemd state on an-worker1167 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:23:26] (SystemdUnitFailed) firing: (44) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:23:28] RECOVERY - Hadoop DataNode on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[12:28:25] (SystemdUnitFailed) firing: (40) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:32:06] PROBLEM - Check systemd state on an-worker1169 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:36:15] PROBLEM - Hadoop DataNode on an-worker1169 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[12:40:59] PROBLEM - Check systemd state on an-worker1170 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:41:35] Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (Stevemunene) The hosts have been added to net_topology and assigned the right role. Hosts are also running OK without any RAID related alerts. However, some hosts are in the defaul...
[12:42:49] RECOVERY - Hadoop DataNode on an-worker1169 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[12:46:39] RECOVERY - Check systemd state on an-worker1169 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:48:26] (SystemdUnitFailed) firing: (16) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:48:45] RECOVERY - Check systemd state on an-worker1160 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:48:57] RECOVERY - Check systemd state on an-worker1170 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:59:59] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Products, Patch-For-Review: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-web1001.eqiad.wm...
[13:10:21] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Products, Patch-For-Review: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-web1001.eqiad.wmnet...
[13:11:54] Data-Engineering, Release-Engineering-Team, collaboration-services, GitLab (CI & Job Runners), Patch-For-Review: Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (CodeReviewBot) aqu merged https://gitlab.wikimedia.org/repos/data-...
[13:11:57] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.6867% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:15:43] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Products, Patch-For-Review: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-web1001.eqiad.wm...
[13:18:25] (SystemdUnitFailed) firing: (21) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:20:29] (PS1) Gehel: Cleanup dependencies in refinery-tools module. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/997823
[13:24:07] Data-Engineering (Sprint 6), Patch-For-Review: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (Antoine_Quhen)
[13:24:23] Data-Engineering, Release-Engineering-Team, collaboration-services, GitLab (CI & Job Runners), Patch-For-Review: Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (Antoine_Quhen) In progress→Resolved a:Antoine_Quhen
[13:24:33] Data-Engineering (Sprint 8): [Maintenance] Migrate Gitlab CI to blubber - https://phabricator.wikimedia.org/T356364 (Antoine_Quhen) Open→Resolved
[13:29:49] !log roll restart hadoop masters to pick up the right rack assignment for new hosts T353776
[13:29:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:29:53] T353776: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776
[13:34:09] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Ensure necessary firewall rules are open between the DSE worker nodes and external services - https://phabricator.wikimedia.org/T356623 (brouberol) Superset -> analytics druid ✅ ` brouberol@dse-k8s-worker1001:~$ sudo nse...
[13:39:10] !log add new TLS SANs to the superset/superset-next certificates in dse-k8s-eqiad - T356481
[13:39:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:39:13] T356481: Configure ingress internal DNS records - https://phabricator.wikimedia.org/T356481
[13:40:45] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Products: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (taavi) Doing an in-place reimage here means that analytics.wm.o and stats.wm.o are currently now. Is there a reason why this could not be do...
[13:44:54] (PS1) Joal: Fix unique devices druid loading [analytics/refinery] - https://gerrit.wikimedia.org/r/997849
[13:45:06] aqu: if you have a minute please --^
[13:48:25] (SystemdUnitFailed) firing: (21) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:53:34] Hi SRE folks - Have we removed/changed druid instances lately?
[13:54:18] Data-Engineering (Sprint 8): Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (lbowmaker) Open→Resolved a:lbowmaker Resolving this ticket. Retries has been set to 0 for the job, data quality work is being done in this ticket: https://pha...
[13:54:26] hm - looks like it
[13:54:35] druid1004.eqiad.wmnet is gone
[13:56:15] Hi joal , yes we have
[13:56:32] hm, I guess this was 2 days ago, right?
[13:56:55] nope was a while back https://phabricator.wikimedia.org/T336043
[13:57:50] ack stevemunene - thanks for letting me know :)
[14:00:21] (PS3) Gmodena: hql: add data quality DDL. [analytics/refinery] - https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628)
[14:00:46] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Products: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-web1001.eqiad.wmnet with OS bullseye compl...
[14:04:01] !log Rerun mediawiki-history-reduced druid indexation after airflow variable update
[14:04:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:07:37] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Products: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (BTullis) >>! In T349398#9517216, @taavi wrote: > Doing an in-place reimage here means that analytics.wm.o and stats.wm.o are currently now....
[14:08:23] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Products: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (BTullis) Open→Resolved
[14:08:25] Data-Platform-SRE, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (BTullis)
[14:08:58] Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (Stevemunene) Hosts are visible on the namenode UI and should rebalance with time {F41793107} Having an issue with the an-masters roll restart, Namenode failover from `an-master100...
[14:11:01] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Ensure necessary firewall rules are open between the DSE worker nodes and external services - https://phabricator.wikimedia.org/T356623 (brouberol) Open→Resolved Superset -> an-coord ✅ ` brouberol@dse-k8s-worker1...
[14:11:07] (CR) Aqu: [V: +1 C: +1] "Thanks for the catch and fix @Joal !" [analytics/refinery] - https://gerrit.wikimedia.org/r/997849 (owner: Joal)
[14:11:08] Data-Engineering, Data-Platform-SRE, Epic, Patch-For-Review: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (brouberol)
[14:11:21] Data-Engineering, Data-Platform-SRE, Epic, Patch-For-Review: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (brouberol)
[14:17:50] Data-Platform-SRE (2024.01.22 - 2024.02.11), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[14:17:58] Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (BTullis) >>! In T353776#9517317, @Stevemunene wrote: > `Run manual HDFS Namenode failover from an-master1003-eqiad-wmnet to an-master1004-eqiad-wmnet. > ----- OUTPUT of 'kerberos-r...
[14:26:53] Data-Platform-SRE (2024.01.22 - 2024.02.11), DC-Ops, SRE, ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (Jhancock.wm) @BTullis I can reseat the backplane to try and fix this. Is it safe for me to do so? or are you currently working...
[14:27:14] Data-Engineering, Data-Platform-SRE: Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (Gehel) a:brouberol→None
[14:27:40] Data-Engineering, Data-Platform-SRE: Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (Gehel) Moving back to our backlog until we know if we can deprecate hue or not.
[14:28:18] Data-Platform-SRE (2024.01.22 - 2024.02.11), DC-Ops, SRE, ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (BTullis) >>! In T355830#9517383, @Jhancock.wm wrote: > @BTullis I can reseat the backplane to try and fix this. Is it safe for...
[14:29:12] Data-Platform-SRE, CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (Gehel) Moving to our backlog board, to be picked up again after March 20th 2024
[14:31:50] Data-Platform-SRE, Discovery-Search, Patch-For-Review: Track and clean up object storage used by rdf-streaming-updater - https://phabricator.wikimedia.org/T348685 (bking)
[14:31:52] Data-Platform-SRE (2024.01.22 - 2024.02.11), Discovery-Search: Clean up object storage in response to latest alert - https://phabricator.wikimedia.org/T356283 (bking) Open→Invalid This duplicates T356313 . Closing....
[14:33:14] stevemunene: we're receiving an alert-email for the job hdfs_rsync_analytics_hadoop_published
[14:33:27] Have we changed something onto an-web1001?
[14:33:44] joal: Oh, that's probably me. I have recently reimaged an-web1001 to bullseye. I will investigate now.
[14:33:58] btullis: looks like we have a java something :)
[14:34:14] thanks for investigating btullis
[14:34:22] OK, thanks for the pointer. Looking now.
[14:37:20] joal: I believe that this is already fixed. It was caused by a failed reimage, leaving an-web1001 half-installed.
[14:37:28] I will respond to the emails.
[14:37:35] Thanks a lot btullis
[14:37:51] Latest run:
[14:37:55] https://www.irccloud.com/pastebin/zqUU2ayR/
[14:42:04] Data-Platform-SRE (2024.01.22 - 2024.02.11), DC-Ops, SRE, ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (Jhancock.wm) @BTullis looks like it worked. But since that backplane error occurred twice already, if it happens again lmk and...
[14:44:33] Data-Platform-SRE (2024.01.22 - 2024.02.11), DC-Ops, SRE, ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (BTullis) >>! In T355830#9517443, @Jhancock.wm wrote: > @BTullis looks like it worked. But since that backplane error occurred...
[14:49:27] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin2002 for hosts: `cloudelastic1009.wikimedia.org` - cloudelastic1009.wikim...
[14:50:47] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: [superset k8s] Update public domain DNS records to make them point to the DSE Kubernetes ingress - https://phabricator.wikimedia.org/T356482 (brouberol)
[14:51:13] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: [superset k8s] Add entries to the puppet service catalog - https://phabricator.wikimedia.org/T356483 (brouberol)
[14:51:30] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: [superset k8s] Update public domain DNS records to make them point to the DSE Kubernetes ingress - https://phabricator.wikimedia.org/T356482 (brouberol) a:brouberol
[14:51:33] Data-Engineering, Data-Platform-SRE, superset.wikimedia.org: Running into errors while adding Hive table to Superset dataset - https://phabricator.wikimedia.org/T284604 (BTullis) @cchen - Could you test this again now please? I believe that it has been fixed by our recent upgrade to Superset 3.1.0.
[14:51:40] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: [superset k8s] Add entries to the puppet service catalog - https://phabricator.wikimedia.org/T356483 (brouberol) a:brouberol
[14:55:44] Data-Engineering, Data-Platform-SRE: Monitor the availability of the superset deployments - https://phabricator.wikimedia.org/T356484 (Gehel) p:Triage→High
[14:55:53] Data-Engineering, Data-Platform-SRE: [superset k8s] Update the wikitech page with our production readiness checklist - https://phabricator.wikimedia.org/T356486 (Gehel) p:Triage→High
[14:56:11] Data-Engineering, Data-Platform-SRE: Create saved views for the superset deployment logs - https://phabricator.wikimedia.org/T356485 (Gehel) p:Triage→High
[14:56:54] Data-Engineering, Data-Platform-SRE: Find a solution for the requestctl-generator html page - https://phabricator.wikimedia.org/T356490 (Gehel) p:Triage→High
[14:57:04] !log roll-restarting the presto workers for T356382
[14:57:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:57:23] Data-Platform-SRE: Check home/HDFS leftovers of mhoutti - https://phabricator.wikimedia.org/T356641 (Gehel) p:Triage→Low
[14:57:28] Data-Platform-SRE, Discovery-Search (Current work): Rebuild and deploy textify plugin - https://phabricator.wikimedia.org/T356651 (Gehel) p:Triage→High
[14:58:58] (CR) Gmodena: hql: add data quality DDL. (6 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628) (owner: Gmodena)
[15:06:59] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8b38d8fa-15e7-4772-81bd-035e55b1e01f) set by cmooney@cumin1002 for 0:30:00 on 1 host...
[15:08:48] (PuppetFailure) firing: Puppet has failed on druid1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:13:48] (PuppetFailure) firing: (2) Puppet has failed on druid1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:18:48] (PuppetFailure) firing: (3) Puppet has failed on druid1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:23:49] (PuppetFailure) firing: (4) Puppet has failed on druid1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:28:49] (PuppetFailure) firing: (5) Puppet has failed on druid1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:40:22] (03CR) 10Gmodena: Fix unique devices druid loading (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997849 (owner: 10Joal) [15:41:23] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): DataHub search is throwing errors on search - https://phabricator.wikimedia.org/T356783 (10BTullis) [15:42:42] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): DataHub search is throwing errors on search - https://phabricator.wikimedia.org/T356783 (10BTullis) p:05Triage→03Unbreak! I'm trying a rolling restart of the pods in codfw with the following command: ` btullis@deploy2002:/srv/deployment-charts/helmfile.d/service... [15:44:09] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. 
- https://phabricator.wikimedia.org/T355830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host elastic2094.codfw.wmnet with OS... [15:46:29] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (10BTullis) [15:46:40] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10BTullis) 05Open→03Resolved a:03BTullis Thanks @Jhancock.wm - The reimage cookbook hung once at PXE boot, but I gave it a... [15:53:28] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin2002 for hosts: `cloudelastic1009.wikimedia.org` - cloudelastic1009.wikim... [15:55:23] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2024_01 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [16:00:00] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): DataHub search is throwing errors on search - https://phabricator.wikimedia.org/T356783 (10BTullis) p:05Unbreak!→03High The restart has worked and the search results now seem ok, but I'm not confident that this won't happen again. {F41794245,width=60%} I think w... [16:12:39] \o We (ML team) are seeing something odd when downloading from https://analytics.wikimedia.org/published/... 
URLs: no matter location (tried my laptop at home, my rootserver, both Linux) or OS (team members elsewhere in EMEA can reproduce): somewhere in the range of 1-2GB of a download, the server just stops, dropping the connection, or sometimes it just hangs forever. Using wget [16:12:41] --continue, one can often make progress and eventually complete the file, but that's obviously not ideal. On top of that, sometimes even with --continue, the download remains stuck forever, and only manually retrying for ten minutes or more yields a complete download. [16:13:02] I am not sure this is an analytics-server problem (might be CDN, too), but I thought I'd go to the source first [16:13:25] btullis: you strike me as the man to talk to (if only to tell me to talk to someone else) [16:13:35] Some examples: https://phabricator.wikimedia.org/P56346 [16:15:13] 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10superset.wikimedia.org: Running into errors while adding Hive table to Superset dataset - https://phabricator.wikimedia.org/T284604 (10BTullis) [16:15:24] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2024_01 on the druid_public Druid cluster. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [16:20:49] 10Data-Engineering, 10Data-Platform-SRE: [superset-k8s] Find a solution for the requestctl-generator html page - https://phabricator.wikimedia.org/T356490 (10BTullis) [16:21:31] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Stevemunene) Sure, I'll try the manual failover and restart of the services probably during our sync [16:21:34] 10Data-Platform-SRE, 10Discovery-Search (Current work): Rebuild and deploy textify plugin - https://phabricator.wikimedia.org/T356651 (10TJones) T332337 has been committed, so this is ready to go. [16:22:08] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (10bking) a:03bking [16:23:55] 10Data-Platform-SRE, 10Discovery-Search (Current work): Review wikitech:Search and write processes for k8s world - https://phabricator.wikimedia.org/T356303 (10bking) [16:30:48] (03CR) 10Joal: [C: 03+1] "One formatting nit, but good to go as is :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628) (owner: 10Gmodena) [16:34:28] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (10Gehel) a:05BTullis→03Stevemunene [16:35:17] The puppet errors on druid hosts should be resolved [16:35:46] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
bking@cumin2002 for host cloudelastic1009.eqiad.wmnet with OS bullseye [16:36:04] (03CR) 10Joal: "One question before merging" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/995042 (owner: 10Gehel) [16:37:10] (03CR) 10Joal: [C: 03+1] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/995258 (owner: 10Bearloga) [16:37:33] klausman: I have just seen this. Will look asap when out of meeting. [16:37:46] thanks! [16:38:49] (PuppetFailure) firing: (5) Puppet has failed on druid1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:43:39] (03CR) 10Joal: Cleanup dependencies in refinery-tools module. (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/997823 (owner: 10Gehel) [16:49:01] 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10superset.wikimedia.org: Running into errors while adding Hive table to Superset dataset - https://phabricator.wikimedia.org/T284604 (10mpopov) @BTullis: heads-up that @cchen won't be able to until {T356645} is resolved [16:52:29] 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10superset.wikimedia.org: Running into errors while adding Hive table to Superset dataset - https://phabricator.wikimedia.org/T284604 (10BTullis) 05Open→03Resolved a:03BTullis OK, thanks @mpopov - I understand. Well, I think that I can... [16:53:49] (PuppetFailure) firing: (5) Puppet has failed on druid1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:54:07] klausman: Looking now. I think that this is likely related to the upgrade of an-web1001 to bullseye, which happened today. I'll make a ticket now. [16:54:32] btullis: This may already have been happening last week, let me check my history [16:54:46] OK, thanks. 
[16:55:03] btullis: Yep, I think the first time it happened was 2024-01-31T17:22:01+0100 [16:55:26] (hooray for HISTTIMEFORMAT=%Y-%m-%dT%H:%M:%S%z) [16:55:59] Oh, OK. Thanks, so likely unrelated to today's upgrade. I will make a note, but it still warrants a new ticket. [16:56:19] ack, do you want me to add anything to the ticket? [16:58:34] The Jan 31 may be unrelated, but I can definitely say that it was happening on Feb 1 [16:58:49] (PuppetFailure) firing: (4) Puppet has failed on druid1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:00:58] klausman: in terms of priority, how badly is this affecting your work in the ML team? Bigly, I guess? [17:01:34] We have a sort of workaround (using --continue), and at worst, we can d/l on a statbox and scp over, but yeah, it's quite a speedbump [17:01:55] ah, right, another symptom: wget'ing on a statbox seems fine [17:02:17] Ah, thanks. I was just about to ask that. [17:02:38] I haven't tested it myself, but Kevin reported that that works for him [17:03:02] I also tried both v4 and v6, that didn't seem to make a diff [17:03:49] (PuppetFailure) resolved: (4) Puppet has failed on druid1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:07:27] klausman: One more thing, is your paste above missing line 1? I can see a wget on line 34 but just wondered if there were any other unexpected arguments to the test case. 
[17:08:21] the first one is a snippet out of a multi-file/recursive d/l, let me fetch the cmdline [17:08:35] `wget --no-host-directories --recursive --reject "index.html*" --accept-regex '(bert-base-multilingual-uncased|mbart-large-cc25)' --cut-dirs=3 --directory-prefix=article_descriptions/models https://analytics.wikimedia.org/published/wmf-ml-models/article-descriptions/` [17:08:50] the smaller files all worked, so I copied only the large one. [17:09:05] Gotcha, thanks. [17:11:57] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.5206% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [17:13:47] RECOVERY - MD RAID on aqs1013 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:20:14] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10BTullis) [17:20:38] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10BTullis) p:05Triage→03High [17:47:23] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Traffic: Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10BTullis) I have a feeling that this is more related to the CDN than to the webserver behind it. I tried the following test from stat1004, which... 
[17:48:10] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Traffic: Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10BTullis) I'm going to add the #traffic team to see if they have any insight as to why this might have started happening recently. [17:48:25] (SystemdUnitFailed) firing: (14) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:57:32] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Traffic: Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10Vgutierrez) I can reproduce via text@drmrs, I'll take a look ASAP :) [18:02:22] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Traffic: Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10Vgutierrez) @BTullis it's origin related: ` vgutierrez@bast6003:~$ curl -o /dev/null --resolve analytics.wikimedia.org:8443:10.64.21.14 --limit-... [18:10:54] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Traffic: Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10BTullis) @Vgutierrez - Thanks so much. Let me know if I can help with anything. [18:24:00] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Traffic: Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10Vgutierrez) https://github.com/wikimedia/operations-puppet/blob/1a6c9d13ee7a499ee7a28e47449774a6a6dcdccc/modules/envoyproxy/manifests/tls_termina... [18:26:42] Starting with deployment [18:27:09] (03PS4) 10Gmodena: hql: add data quality DDL. 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628) [18:28:25] (03CR) 10Gmodena: hql: add data quality DDL. (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628) (owner: 10Gmodena) [18:30:23] (03PS1) 10Joal: Bump changelog to v0.2.31 for release [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/997942 [18:30:44] gmodena: if you're nearby and wish --^ [18:36:55] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/997942 (owner: 10Joal) [18:37:54] Starting build #136 for job analytics-refinery-maven-release-docker [18:53:49] Project analytics-refinery-maven-release-docker build #136: 09SUCCESS in 15 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/136/ [19:14:12] Starting build #97 for job analytics-refinery-update-jars-docker [19:14:38] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.2.31 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997837 [19:14:39] Project analytics-refinery-update-jars-docker build #97: 09SUCCESS in 26 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/97/ [19:27:56] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997837 (owner: 10Maven-release-user) [19:29:33] (03CR) 10Joal: [C: 03+2] hql: add data quality DDL. [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628) (owner: 10Gmodena) [19:29:35] (03CR) 10Joal: [V: 03+2 C: 03+2] hql: add data quality DDL. 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628) (owner: 10Gmodena) [19:29:58] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/994697 (https://phabricator.wikimedia.org/T352672) (owner: 10Aqu) [19:30:21] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/995104 (https://phabricator.wikimedia.org/T356214) (owner: 10Bearloga) [19:30:43] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/995258 (owner: 10Bearloga) [19:32:16] (03PS2) 10Joal: Fix unique devices druid loading [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997849 (https://phabricator.wikimedia.org/T347879) [19:32:41] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997849 (https://phabricator.wikimedia.org/T347879) (owner: 10Joal) [19:34:35] !log Refinery-source v0.2.31 released to archiva [19:34:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:34:44] !log Deploying refinery using scap [19:34:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:39:56] (03CR) 10Gehel: Cleanup dependencies in refinery-tools module. (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/997823 (owner: 10Gehel) [19:42:36] 10Data-Engineering (Sprint 8), 10Patch-For-Review: [Maintenance] Migrate ReportUpdater browser queries to Airflow - https://phabricator.wikimedia.org/T354552 (10Snwachukwu) We added the following: - A spark-scala job to perform dynamic pivot because some reports need to be pivoted. - A UDF to get the star... [19:43:58] (03CR) 10Gehel: Cleanup of ISPDatabaseReader. (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/995042 (owner: 10Gehel) [19:45:51] (03CR) 10Joal: [C: 03+2] "LGTM!" 
[analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/995042 (owner: 10Gehel) [19:46:14] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Discovery-Search (Current work): Create dashboards for Search SLOs - https://phabricator.wikimedia.org/T338009 (10RKemper) Added new threshold markers at 95% for the 4 SLO graphs. We may want to revise the % SLO upwards, but let's stick with 95% for now until we... [19:47:36] joal: let me know when I start to be more annoying than constructive with my drive by commits. I'm using those as an excuse to record a few more videos, but I do recognize that I'm creating work to review / merge those, and that I'm not fixing real issues. [19:49:16] gehel: More love for our code is always welcome <3 [19:57:46] !log Deploy refinery onto HDFS [19:57:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:59:11] (03Merged) 10jenkins-bot: Cleanup of ISPDatabaseReader. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/995042 (owner: 10Gehel) [20:19:40] 10Data-Engineering (Sprint 8), 10Patch-For-Review: [Iceberg Migration] Migrate session length tables to Iceberg - https://phabricator.wikimedia.org/T352672 (10CodeReviewBot) joal merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/600 Migrate session length to Iceberg [20:56:27] 10Data-Platform-SRE, 10Discovery-Search (Current work): Develop recovery/reindex procedures for new Search Update Pipeline - https://phabricator.wikimedia.org/T356803 (10bking) [20:58:15] (03PS1) 10Joal: Fix unique devices druid loading [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997960 (https://phabricator.wikimedia.org/T347879) [21:02:51] 10Data-Platform-SRE, 10Discovery-Search (Current work): Develop recovery/reindex procedures for new Search Update Pipeline - https://phabricator.wikimedia.org/T356803 (10bking) [21:09:49] 10Data-Platform-SRE, 10Discovery-Search (Current work): Document review/refresh for 
https://wikitech.wikimedia.org/wiki/Search - https://phabricator.wikimedia.org/T356806 (10bking) [21:10:08] 10Data-Platform-SRE, 10Discovery-Search (Current work): Document review/refresh for https://wikitech.wikimedia.org/wiki/Search - https://phabricator.wikimedia.org/T356806 (10bking) [21:16:36] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.4517% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [21:26:00] 10Data-Platform-SRE, 10Discovery-Search (Current work): Develop recovery/reindex procedures for new Search Update Pipeline - https://phabricator.wikimedia.org/T356803 (10bking) Per today's 1x1 with @RKemper , @EBernhardson and myself, we talked about what a new outage recovery process might look like. Today'... [21:48:26] (SystemdUnitFailed) firing: (14) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:22:31] 10Data-Engineering, 10Data Products, 10Data-Platform, 10Movement-Insights, and 2 others: Temporary Accounts Initiative (IP Masking) - Add user_is_temp to data tables - https://phabricator.wikimedia.org/T356701 (10Mayakp.wiki) [22:51:32] 10Data-Platform-SRE, 10Discovery-Search (Current work): Develop recovery/reindex procedures for new Search Update Pipeline - https://phabricator.wikimedia.org/T356803 (10bking) Per IRC conversation with @RLazarus , he's done a lot of work towards this exact use case (see T341553 ). [23:32:22] 10Data-Engineering, 10Anti-Harassment, 10Data-Persistence, 10Temporary accounts, and 2 others: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (10Mayakp.wiki) thanks @Ladsgroup. 
yes I'm able to query the mariadb table for a few wikis and can see '0' in the user_is_temp col...