[00:59:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (15) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:11:56] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 1.278% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:19:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (15) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:36:32] <jinxer-wm>	 (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[04:35:58] <wikibugs>	 10Data-Engineering, 10Data Products, 10Data-Platform: Temporary Accounts Initiative (IP Masking) - Add user_is_temp to data tables - https://phabricator.wikimedia.org/T356701 (10VirginiaPoundstone)
[05:11:56] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.9774% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:20:57] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:36:32] <jinxer-wm>	 (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[08:48:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:07:44] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Ensure necessary firewall rules are open between the DSE worker nodes and external services - https://phabricator.wikimedia.org/T356623 (10brouberol) Superset -> dbstore1009: ✅   ` brouberol@dse-k8s-worker1001:~$ sudo nsen...
[09:11:57] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.8053% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:36:32] <jinxer-wm>	 (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[09:52:42] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Ensure necessary firewall rules are open between the DSE worker nodes and external services - https://phabricator.wikimedia.org/T356623 (10brouberol) Hadoop masters and workers are now working (they were missing a network...
[10:18:20] <wikibugs>	 (03CR) 10Joal: "Some nits" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628) (owner: 10Gmodena)
[10:21:05] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "One comment idea." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/997485 (https://phabricator.wikimedia.org/T356628) (owner: 10Gmodena)
[10:49:10] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Data Products, 10Patch-For-Review: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (10BTullis)
[10:49:24] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Data Products, 10Patch-For-Review: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (10BTullis)
[10:49:50] <wikibugs>	 10Data-Engineering, 10Data Products, 10Data-Platform, 10Temporary accounts: Temporary Accounts Initiative (IP Masking) - Add user_is_temp to data tables - https://phabricator.wikimedia.org/T356701 (10kostajh)
[11:03:08] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Data Products, 10Patch-For-Review: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-web1001.eqiad.wm...
[11:03:27] <btullis>	 !log reimaging an-web1001 to bullseye for T349398
[11:03:30] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:03:30] <stashbot>	 T349398: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398
[11:08:35] <moritzm>	 btullis: before you merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/997797/ can I quickly switch these to Puppet 7 via hieradata/hosts? We did that previously for stat1009 already, so the setup is known to be compatible
[11:09:01] <moritzm>	 the only reason it's not applied on the role level is because there are still buster stat hosts
[11:09:10] <btullis>	 Yes, please feel free.
[11:09:15] <moritzm>	 ok, on it
[11:14:45] <moritzm>	 btullis: turns out, I already did that back in December :-)
[11:15:15] <btullis>	 Ah, cool. Nice one.
[11:18:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (21) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:31:03] <wikibugs>	 10Data-Engineering, 10Anti-Harassment, 10Data-Persistence, 10Temporary accounts, and 2 others: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (10Ladsgroup) The schema was deployed a long time ago, we just started writing to the field but the default value was 0 and now we a...
[11:42:58] <moritzm>	 btullis: reimages are currently broken, which also affects your an-web1001 reimage. https://gerrit.wikimedia.org/r/c/operations/puppet/+/997804 will unblock this shortly, then you can retrigger the cookbook
[11:43:31] <volans>	 I told him in -ops earlier, he should know :D
[11:43:53] <moritzm>	 ah, missed that!
[11:44:05] <volans>	 better two pings than zero ;)
[11:45:17] <stevemunene>	 !log add new an-workers to analytics_cluster hadoop worker role analytics_cluster::hadoop::worker T353776
[11:45:19] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:45:20] <stashbot>	 T353776: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776
[11:49:50] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Data Products: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-web1001.eqiad.wmnet with OS bullseye execu...
[11:53:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (23) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:55:45] <wikibugs>	 10Data-Engineering: [NEEDS GROOMING][SPIKE] Extract refine schema management into a dedicated tool - https://phabricator.wikimedia.org/T356762 (10gmodena)
[11:56:16] <jinxer-wm>	 (HdfsCapacityRemainingPercent) resolved: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[11:58:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (25) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:59:18] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1164 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:20] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1164 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:02:24] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1168 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:03:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (28) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:03:32] <stevemunene>	 --^ onboarding new workers an-worker1157-1175, we expect some hadoop-hdfs-datanode.service alerts, adding a brief silence to reduce the noise
[12:05:38] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1168 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:05:42] <wikibugs>	 (03PS2) 10Gmodena: hql: add data quality DDL. [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628)
[12:06:37] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Stevemunene)
[12:08:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (39) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:10:01] <wikibugs>	 (03PS2) 10Gmodena: data-quality: rename source table column [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/997485 (https://phabricator.wikimedia.org/T356628)
[12:10:53] <volans>	 btullis: confirmed reimage should work again
[12:11:16] <btullis>	 volans: Ack, many thanks.
[12:13:27] <jinxer-wm>	 (SystemdUnitFailed) firing: (41) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:14:48] <icinga-wm>	 PROBLEM - HDFS topology check on an-master1003 is CRITICAL: CRITICAL: There is at least one node in the default rack. https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_topology_check
[12:16:34] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1175 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:20] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1167 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:58] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1160 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:20] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1173 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:27] <jinxer-wm>	 (SystemdUnitFailed) firing: (47) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:18:36] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1175 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:19:18] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1171 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:20:51] <wikibugs>	 10Data-Engineering, 10Community-Tech, 10Multiblocks, 10Data Products (Data Products Sprint 09), 10Event-Platform: Investigate if the new 'Multiblocks' user blocks feature affects the mediawiki.user-blocks-change event stream - https://phabricator.wikimedia.org/T356597 (10WDoranWMF) p:05Triage→03High
[12:21:30] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1171 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:21:32] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1173 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:22:28] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[12:22:36] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1167 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:23:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (44) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:23:28] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[12:28:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (40) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:32:06] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1169 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:36:15] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1169 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[12:40:59] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1170 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:41:35] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Stevemunene) The hosts have been added to net_topology and assigned the right role.  Hosts are also running OK without any RAID related alerts. However, some hosts are in the defaul...
[12:42:49] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1169 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[12:46:39] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1169 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:48:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (16) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:48:45] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1160 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:48:57] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1170 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:59:59] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Data Products, 10Patch-For-Review: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-web1001.eqiad.wm...
[13:10:21] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Data Products, 10Patch-For-Review: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-web1001.eqiad.wmnet...
[13:11:54] <wikibugs>	 10Data-Engineering, 10Release-Engineering-Team, 10collaboration-services, 10GitLab (CI & Job Runners), 10Patch-For-Review: Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (10CodeReviewBot) aqu merged https://gitlab.wikimedia.org/repos/data-...
[13:11:57] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.6867% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:15:43] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Data Products, 10Patch-For-Review: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-web1001.eqiad.wm...
[13:18:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (21) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:20:29] <wikibugs>	 (03PS1) 10Gehel: Cleanup dependencies in refinery-tools module. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/997823
[13:24:07] <wikibugs>	 10Data-Engineering (Sprint 6), 10Patch-For-Review: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (10Antoine_Quhen)
[13:24:23] <wikibugs>	 10Data-Engineering, 10Release-Engineering-Team, 10collaboration-services, 10GitLab (CI & Job Runners), 10Patch-For-Review: Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (10Antoine_Quhen) 05In progress→03Resolved a:03Antoine_Quhen
[13:24:33] <wikibugs>	 10Data-Engineering (Sprint 8): [Maintenance] Migrate Gitlab CI to blubber - https://phabricator.wikimedia.org/T356364 (10Antoine_Quhen) 05Open→03Resolved
[13:29:49] <stevemunene>	 !log roll restart hadoop masters to pick up the right rack assignment for new hosts T353776
[13:29:52] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:29:53] <stashbot>	 T353776: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776
[13:34:09] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Ensure necessary firewall rules are open between the DSE worker nodes and external services - https://phabricator.wikimedia.org/T356623 (10brouberol) Superset -> analytics druid ✅ ` brouberol@dse-k8s-worker1001:~$ sudo nse...
[13:39:10] <brouberol>	 !log add new TLS SANs to the superset/superset-next certificates in dse-k8s-eqiad - T356481
[13:39:13] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:39:13] <stashbot>	 T356481: Configure ingress internal DNS records - https://phabricator.wikimedia.org/T356481
[13:40:45] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Data Products: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (10taavi) Doing an in-place reimage here means that analytics.wm.o and stats.wm.o are currently now. Is there a reason why this could not be do...
[13:44:54] <wikibugs>	 (03PS1) 10Joal: Fix unique devices druid loading [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997849
[13:45:06] <joal>	 aqu: if you have a minute please --^
[13:48:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (21) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:53:34] <joal>	 Hi SRE folks - Have we removed/changed druid instances lately?
[13:54:18] <wikibugs>	 10Data-Engineering (Sprint 8): Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (10lbowmaker) 05Open→03Resolved a:03lbowmaker Resolving this ticket. Retries has been set to 0 for the job, data quality work is being done in this ticket: https://pha...
[13:54:26] <joal>	 hm - looks like it
[13:54:35] <joal>	 druid1004.eqiad.wmnet is gone 
[13:56:15] <stevemunene>	 Hi joal , yes we have
[13:56:32] <joal>	 hm, I guess this was 2 days ago, right?
[13:56:55] <stevemunene>	 nope was a while back https://phabricator.wikimedia.org/T336043
[13:57:50] <joal>	 ack stevemunene - thanks for letting me know :)
[14:00:21] <wikibugs>	 (03PS3) 10Gmodena: hql: add data quality DDL. [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628)
[14:00:46] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Data Products: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-web1001.eqiad.wmnet with OS bullseye compl...
[14:04:01] <joal>	 !log Rerun mediawiki-history-reduced druid indexation after airflow variable update
[14:04:03] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:07:37] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Data Products: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (10BTullis) >>! In T349398#9517216, @taavi wrote: > Doing an in-place reimage here means that analytics.wm.o and stats.wm.o are currently now....
[14:08:23] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Data Products: Migrate an-web1001 to Debian bullseye - https://phabricator.wikimedia.org/T349398 (10BTullis) 05Open→03Resolved
[14:08:25] <wikibugs>	 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis)
[14:08:58] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Stevemunene) Hosts are visible on the namenode UI and should rebalance with time {F41793107}  Having an issue with the an-masters roll restart, Namenode failover from `an-master100...
[14:11:01] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Ensure necessary firewall rules are open between the DSE worker nodes and external services - https://phabricator.wikimedia.org/T356623 (10brouberol) 05Open→03Resolved Superset -> an-coord ✅ ` brouberol@dse-k8s-worker1...
[14:11:07] <wikibugs>	 (03CR) 10Aqu: [V: 03+1 C: 03+1] "Thanks for the catch and fix @Joal !" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997849 (owner: 10Joal)
[14:11:08] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Epic, 10Patch-For-Review: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (10brouberol)
[14:11:21] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Epic, 10Patch-For-Review: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (10brouberol)
[14:17:50] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[14:17:58] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10BTullis) >>! In T353776#9517317, @Stevemunene wrote:  > `Run manual HDFS Namenode failover from an-master1003-eqiad-wmnet to an-master1004-eqiad-wmnet. > ----- OUTPUT of 'kerberos-r...
[14:26:53] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10Jhancock.wm) @BTullis I can reseat the backplane to try and fix this. Is it safe for me to do so? or are you currently working...
[14:27:14] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (10Gehel) a:05brouberol→03None
[14:27:40] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (10Gehel) Moving back to our backlog until we know if we can deprecate hue or not.
[14:28:18] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10BTullis) >>! In T355830#9517383, @Jhancock.wm wrote: > @BTullis I can reseat the backplane to try and fix this. Is it safe for...
[14:29:12] <wikibugs>	 10Data-Platform-SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (10Gehel) Moving to our backlog board, to be picked up again after March 20th 2024
[14:31:50] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search, 10Patch-For-Review: Track and clean up object storage used by rdf-streaming-updater - https://phabricator.wikimedia.org/T348685 (10bking)
[14:31:52] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Discovery-Search: Clean up object storage in response to latest alert - https://phabricator.wikimedia.org/T356283 (10bking) 05Open→03Invalid This duplicates T356313  . Closing....
[14:33:14] <joal>	 stevemunene: we're receiving an alert-email for the job hdfs_rsync_analytics_hadoop_published
[14:33:27] <joal>	 Have we changed something onto an-web1001?
[14:33:44] <btullis>	 joal: Oh, that's probably me. I have recently reimaged an-web1001 to bullseye. I will investigate now.
[14:33:58] <joal>	 btullis: looks like we have a java something :)
[14:34:14] <joal>	 thanks for investigating btullis 
[14:34:22] <btullis>	 OK, thanks for the pointer. Looking now.
[14:37:20] <btullis>	 joal: I believe that this is already fixed. It was caused by a failed reimage, leaving an-web1001 half-installed.
[14:37:28] <btullis>	 I will respond to the emails.
[14:37:35] <joal>	 Thanks a lot btullis 
[14:37:51] <btullis>	 Latest run:
[14:37:55] <btullis>	 https://www.irccloud.com/pastebin/zqUU2ayR/
[14:42:04] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10Jhancock.wm) @BTullis looks like it worked. But since that backplane error occurred twice already, if it happens again lmk and...
[14:44:33] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10BTullis) >>! In T355830#9517443, @Jhancock.wm wrote: > @BTullis looks like it worked. But since that backplane error occurred...
[14:49:27] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin2002 for hosts: `cloudelastic1009.wikimedia.org` - cloudelastic1009.wikim...
[14:50:47] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: [superset k8s]  Update public domain DNS records to make them point to the DSE Kubernetes ingress - https://phabricator.wikimedia.org/T356482 (10brouberol)
[14:51:13] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: [superset k8s] Add entries to the puppet service catalog - https://phabricator.wikimedia.org/T356483 (10brouberol)
[14:51:30] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: [superset k8s]  Update public domain DNS records to make them point to the DSE Kubernetes ingress - https://phabricator.wikimedia.org/T356482 (10brouberol) a:03brouberol
[14:51:33] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10superset.wikimedia.org: Running into errors while adding Hive table to Superset dataset - https://phabricator.wikimedia.org/T284604 (10BTullis) @cchen - Could you test this again now please? I believe that it has been fixed by our recent upgrade to Superset 3.1.0.
[14:51:40] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: [superset k8s] Add entries to the puppet service catalog - https://phabricator.wikimedia.org/T356483 (10brouberol) a:03brouberol
[14:55:44] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Monitor the availability of the superset deployments - https://phabricator.wikimedia.org/T356484 (10Gehel) p:05Triage→03High
[14:55:53] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: [superset k8s] Update the wikitech page with our production readiness checklist - https://phabricator.wikimedia.org/T356486 (10Gehel) p:05Triage→03High
[14:56:11] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Create saved views for the superset deployment logs - https://phabricator.wikimedia.org/T356485 (10Gehel) p:05Triage→03High
[14:56:54] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Find a solution for the requestctl-generator html page - https://phabricator.wikimedia.org/T356490 (10Gehel) p:05Triage→03High
[14:57:04] <btullis>	 !log roll-restarting the presto workers for T356382
[14:57:06] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:57:23] <wikibugs>	 10Data-Platform-SRE: Check home/HDFS leftovers of mhoutti - https://phabricator.wikimedia.org/T356641 (10Gehel) p:05Triage→03Low
[14:57:28] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work): Rebuild and deploy textify plugin - https://phabricator.wikimedia.org/T356651 (10Gehel) p:05Triage→03High
[14:58:58] <wikibugs>	 (03CR) 10Gmodena: hql: add data quality DDL. (036 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628) (owner: 10Gmodena)
[15:06:59] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8b38d8fa-15e7-4772-81bd-035e55b1e01f) set by cmooney@cumin1002 for 0:30:00 on 1 host...
[15:08:48] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on druid1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:13:48] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on druid1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:18:48] <jinxer-wm>	 (PuppetFailure) firing: (3) Puppet has failed on druid1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:23:49] <jinxer-wm>	 (PuppetFailure) firing: (4) Puppet has failed on druid1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:28:49] <jinxer-wm>	 (PuppetFailure) firing: (5) Puppet has failed on druid1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:40:22] <wikibugs>	 (03CR) 10Gmodena: Fix unique devices druid loading (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997849 (owner: 10Joal)
[15:41:23] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11): DataHub search is throwing errors on search - https://phabricator.wikimedia.org/T356783 (10BTullis)
[15:42:42] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11): DataHub search is throwing errors on search - https://phabricator.wikimedia.org/T356783 (10BTullis) p:05Triage→03Unbreak! I'm trying a rolling restart of the pods in codfw with the following command: ` btullis@deploy2002:/srv/deployment-charts/helmfile.d/service...
[15:44:09] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host elastic2094.codfw.wmnet with OS...
[15:46:29] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (10BTullis)
[15:46:40] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10BTullis) 05Open→03Resolved a:03BTullis Thanks @Jhancock.wm - The reimage cookbook hung once at PXE boot, but I gave it a...
[15:53:28] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin2002 for hosts: `cloudelastic1009.wikimedia.org` - cloudelastic1009.wikim...
[15:55:23] <jinxer-wm>	 (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2024_01 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[16:00:00] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11): DataHub search is throwing errors on search - https://phabricator.wikimedia.org/T356783 (10BTullis) p:05Unbreak!→03High The restart has worked and the search results now seem ok, but I'm not confident that this won't happen again. {F41794245,width=60%} I think w...
[16:12:39] <klausman>	 \o We (ML team) are seeing something odd when downloading from https://analytics.wikimedia.org/published/... URLs: no matter the location (tried my laptop at home, my rootserver, both Linux) or OS (team members elsewhere in EMEA can reproduce), somewhere in the range of 1-2GB into a download, the server just stops, dropping the connection, or sometimes it just hangs forever. Using wget
[16:12:41] <klausman>	 --continue, one can often make progress and eventually complete the file, but that's obviously not ideal. On top of that, sometimes even with --continue, the download remains stuck forever, and only manually retrying for ten minutes or more yields a complete download.
[16:13:02] <klausman>	 I am not sure this is an analytics-server problem (might be CDN, too), but I thought I'd go to the source first
[16:13:25] <klausman>	 btullis: you strike me as the man to talk to (if only to tell me to talk to someone else)
[16:13:35] <klausman>	 Some examples: https://phabricator.wikimedia.org/P56346
[16:15:13] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10superset.wikimedia.org: Running into errors while adding Hive table to Superset dataset - https://phabricator.wikimedia.org/T284604 (10BTullis)
[16:15:24] <jinxer-wm>	 (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2024_01 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[16:20:49] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: [superset-k8s] Find a solution for the requestctl-generator html page - https://phabricator.wikimedia.org/T356490 (10BTullis)
[16:21:31] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Stevemunene) Sure, I'll try the manual failover and restart of the services probably during our sync
[16:21:34] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work): Rebuild and deploy textify plugin - https://phabricator.wikimedia.org/T356651 (10TJones) T332337 has been committed, so this is ready to go.
[16:22:08] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (10bking) a:03bking
[16:23:55] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work): Review wikitech:Search and write processes for k8s world - https://phabricator.wikimedia.org/T356303 (10bking)
[16:30:48] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "One formatting nit, but good to go as is :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628) (owner: 10Gmodena)
[16:34:28] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (10Gehel) a:05BTullis→03Stevemunene
[16:35:17] <brouberol>	 The puppet errors on druid hosts should be resolved
[16:35:46] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1009.eqiad.wmnet with OS bullseye
[16:36:04] <wikibugs>	 (03CR) 10Joal: "One question before merging" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/995042 (owner: 10Gehel)
[16:37:10] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/995258 (owner: 10Bearloga)
[16:37:33] <btullis>	 klausman: I have just seen this. Will look asap when out of meeting.
[16:37:46] <klausman>	 thanks!
[16:38:49] <jinxer-wm>	 (PuppetFailure) firing: (5) Puppet has failed on druid1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:43:39] <wikibugs>	 (03CR) 10Joal: Cleanup dependencies in refinery-tools module. (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/997823 (owner: 10Gehel)
[16:49:01] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10superset.wikimedia.org: Running into errors while adding Hive table to Superset dataset - https://phabricator.wikimedia.org/T284604 (10mpopov) @BTullis: heads-up that @cchen won't be able to until {T356645} is resolved
[16:52:29] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10superset.wikimedia.org: Running into errors while adding Hive table to Superset dataset - https://phabricator.wikimedia.org/T284604 (10BTullis) 05Open→03Resolved a:03BTullis OK, thanks @mpopov - I understand. Well, I think that I can...
[16:53:49] <jinxer-wm>	 (PuppetFailure) firing: (5) Puppet has failed on druid1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:54:07] <btullis>	 klausman: Looking now. I think that this is likely related to the upgrade of an-web1001 to bullseye, which happened today. I'll make a ticket now.
[16:54:32] <klausman>	 btullis: This may already have been happening last week, let me check my history
[16:54:46] <btullis>	 OK, thanks.
[16:55:03] <klausman>	 btullis: Yep, I think the first time it happened was 2024-01-31T17:22:01+0100
[16:55:26] <klausman>	 (hooray for HISTTIMEFORMAT=%Y-%m-%dT%H:%M:%S%z)
[16:55:59] <btullis>	 Oh, OK. Thanks, so likely unrelated to today's upgrade. I will make a note, but it still warrants a new ticket.
[16:56:19] <klausman>	 ack, do you want me to add anything to the ticket?
[16:58:34] <klausman>	 The Jan 31 may be unrelated, but I can definitely say that it was happening on Feb 1
[16:58:49] <jinxer-wm>	 (PuppetFailure) firing: (4) Puppet has failed on druid1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:00:58] <btullis>	 klausman: in terms of priority, how badly is this affecting your work in the ML team? Bigly, I guess?
[17:01:34] <klausman>	 We have a sort-of workaround (using --continue), and at worst, we can d/l on a statbox and scp over, but yeah, it's quite a speedbump
[17:01:55] <klausman>	 ah, right, another symptom: wget'ing on a statbox seems fine
[17:02:17] <btullis>	 Ah, thanks. I was just about to ask that.
[17:02:38] <klausman>	 I haven't tested it myself, but Kevin reported that that works for him
[17:03:02] <klausman>	 I also tried both v4 and v6, that didn't seem to make a diff
[17:03:49] <jinxer-wm>	 (PuppetFailure) resolved: (4) Puppet has failed on druid1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:07:27] <btullis>	 klausman: One more thing, is your paste above missing line 1? I can see a wget on line 34 but just wondered if there were any other unexpected arguments to the test case.
[17:08:21] <klausman>	 the first one is a snippet out of a multi-file/recursive d/l, let me fetch the cmdline
[17:08:35] <klausman>	 `wget --no-host-directories --recursive --reject "index.html*" --accept-regex '(bert-base-multilingual-uncased|mbart-large-cc25)' --cut-dirs=3 --directory-prefix=article_descriptions/models https://analytics.wikimedia.org/published/wmf-ml-models/article-descriptions/`
[17:08:50] <klausman>	 the smaller files all worked, so I copied only the large one.
[17:09:05] <btullis>	 Gotcha, thanks.
[17:11:57] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.5206% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:13:47] <icinga-wm>	 RECOVERY - MD RAID on aqs1013 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[17:20:14] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10BTullis)
[17:20:38] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10BTullis) p:05Triage→03High
[17:47:23] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Traffic: Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10BTullis) I have a feeling that this is more related to the CDN than to the webserver behind it.  I tried the following test from stat1004, which...
[17:48:10] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Traffic: Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10BTullis) I'm going to add the #traffic team to see if they have any insight as to why this might have started happening recently.
[17:48:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:57:32] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Traffic: Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10Vgutierrez) I can reproduce via text@drmrs, I'll take a look ASAP :)
[18:02:22] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Traffic: Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10Vgutierrez) @BTullis it's origin related:  ` vgutierrez@bast6003:~$ curl -o /dev/null --resolve analytics.wikimedia.org:8443:10.64.21.14 --limit-...
[18:10:54] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Traffic: Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10BTullis) @Vgutierrez - Thanks so much. Let me know if I can help with anything.
[18:24:00] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Traffic: Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10Vgutierrez) https://github.com/wikimedia/operations-puppet/blob/1a6c9d13ee7a499ee7a28e47449774a6a6dcdccc/modules/envoyproxy/manifests/tls_termina...
[18:26:42] <joal>	 Starting with deployment
[18:27:09] <wikibugs>	 (03PS4) 10Gmodena: hql: add data quality DDL. [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628)
[18:28:25] <wikibugs>	 (03CR) 10Gmodena: hql: add data quality DDL. (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628) (owner: 10Gmodena)
[18:30:23] <wikibugs>	 (03PS1) 10Joal: Bump changelog to v0.2.31 for release [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/997942
[18:30:44] <joal>	 gmodena: if you're nearby and wish --^
[18:36:55] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/997942 (owner: 10Joal)
[18:37:54] <wmf-insecte>	 Starting build #136 for job analytics-refinery-maven-release-docker
[18:53:49] <wmf-insecte>	 Project analytics-refinery-maven-release-docker build #136: 09SUCCESS in 15 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/136/
[19:14:12] <wmf-insecte>	 Starting build #97 for job analytics-refinery-update-jars-docker
[19:14:38] <wikibugs>	 (03PS1) 10Maven-release-user: Add refinery-source jars for v0.2.31 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997837
[19:14:39] <wmf-insecte>	 Project analytics-refinery-update-jars-docker build #97: 09SUCCESS in 26 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/97/
[19:27:56] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997837 (owner: 10Maven-release-user)
[19:29:33] <wikibugs>	 (03CR) 10Joal: [C: 03+2] hql: add data quality DDL. [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628) (owner: 10Gmodena)
[19:29:35] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] hql: add data quality DDL. [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997496 (https://phabricator.wikimedia.org/T356628) (owner: 10Gmodena)
[19:29:58] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/994697 (https://phabricator.wikimedia.org/T352672) (owner: 10Aqu)
[19:30:21] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/995104 (https://phabricator.wikimedia.org/T356214) (owner: 10Bearloga)
[19:30:43] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/995258 (owner: 10Bearloga)
[19:32:16] <wikibugs>	 (03PS2) 10Joal: Fix unique devices druid loading [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997849 (https://phabricator.wikimedia.org/T347879)
[19:32:41] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997849 (https://phabricator.wikimedia.org/T347879) (owner: 10Joal)
[19:34:35] <joal>	 !log Refinery-source v0.2.31 released to archiva
[19:34:37] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:34:44] <joal>	 !log Deploying refinery using scap
[19:34:46] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:39:56] <wikibugs>	 (03CR) 10Gehel: Cleanup dependencies in refinery-tools module. (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/997823 (owner: 10Gehel)
[19:42:36] <wikibugs>	 10Data-Engineering (Sprint 8), 10Patch-For-Review: [Maintenance] Migrate ReportUpdater browser queries to Airflow - https://phabricator.wikimedia.org/T354552 (10Snwachukwu) We added the following:    - A spark-scala job to perform dynamic pivot because some reports need to be pivoted.   - A UDF to get the star...
[19:43:58] <wikibugs>	 (03CR) 10Gehel: Cleanup of ISPDatabaseReader. (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/995042 (owner: 10Gehel)
[19:45:51] <wikibugs>	 (03CR) 10Joal: [C: 03+2] "LGTM!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/995042 (owner: 10Gehel)
[19:46:14] <wikibugs>	 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Discovery-Search (Current work): Create dashboards for Search SLOs - https://phabricator.wikimedia.org/T338009 (10RKemper) Added new threshold markers at 95% for the 4 SLO graphs. We may want to revise the % SLO upwards, but let's stick with 95% for now until we...
[19:47:36] <gehel>	 joal: let me know when I start to be more annoying than constructive with my drive-by commits. I'm using those as an excuse to record a few more videos, but I do recognize that I'm creating work to review / merge those, and that I'm not fixing real issues.
[19:49:16] <joal>	 gehel: More love for our code is always welcome <3
[19:57:46] <joal>	 !log Deploy refinery onto HDFS
[19:57:47] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:59:11] <wikibugs>	 (03Merged) 10jenkins-bot: Cleanup of ISPDatabaseReader. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/995042 (owner: 10Gehel)
[20:19:40] <wikibugs>	 10Data-Engineering (Sprint 8), 10Patch-For-Review: [Iceberg Migration] Migrate session length tables to Iceberg - https://phabricator.wikimedia.org/T352672 (10CodeReviewBot) joal merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/600  Migrate session length to Iceberg
[20:56:27] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work): Develop recovery/reindex procedures for new Search Update Pipeline - https://phabricator.wikimedia.org/T356803 (10bking)
[20:58:15] <wikibugs>	 (03PS1) 10Joal: Fix unique devices druid loading [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997960 (https://phabricator.wikimedia.org/T347879)
[21:02:51] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work): Develop recovery/reindex procedures for new Search Update Pipeline - https://phabricator.wikimedia.org/T356803 (10bking)
[21:09:49] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work): Document review/refresh for https://wikitech.wikimedia.org/wiki/Search - https://phabricator.wikimedia.org/T356806 (10bking)
[21:10:08] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work): Document review/refresh for https://wikitech.wikimedia.org/wiki/Search - https://phabricator.wikimedia.org/T356806 (10bking)
[21:16:36] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.4517% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:26:00] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work): Develop recovery/reindex procedures for new Search Update Pipeline - https://phabricator.wikimedia.org/T356803 (10bking) Per today's 1x1 with @RKemper , @EBernhardson  and myself, we talked about what a new outage recovery process might look like.  Today'...
[21:48:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:22:31] <wikibugs>	 10Data-Engineering, 10Data Products, 10Data-Platform, 10Movement-Insights, and 2 others: Temporary Accounts Initiative (IP Masking) - Add user_is_temp to data tables - https://phabricator.wikimedia.org/T356701 (10Mayakp.wiki)
[22:51:32] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work): Develop recovery/reindex procedures for new Search Update Pipeline - https://phabricator.wikimedia.org/T356803 (10bking) Per IRC conversation with @RLazarus , he's done a lot of work towards this exact use case (see T341553 ).
[23:32:22] <wikibugs>	 10Data-Engineering, 10Anti-Harassment, 10Data-Persistence, 10Temporary accounts, and 2 others: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (10Mayakp.wiki) thanks @Ladsgroup. yes I'm able to query the mariadb table for a few wikis and can see '0' in the user_is_temp col...