[01:25:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[01:29:42] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:31:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:28] (DiskSpace) firing: Disk space archiva1002:9100:/ 1.854% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=archiva1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:25:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[05:29:57] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:30:28] (DiskSpace) firing: Disk space archiva1002:9100:/ 1.948% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=archiva1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:33:50] Hi aqu I'd like some help with the `monitor_refine_event.service` failure. Are you available?
[06:41:13] Hello stevemunene!
[06:42:23] o/
[07:18:31] * brouberol waves good morning
[08:43:11] Morning all.
[08:48:50] Morning btullis
[08:49:49] I'm checking out this archiva disk space issue first.
[08:52:56] I'm going to delete the directories: `/var/cache/archiva/temp*` - I think that they were left behind by failed index operations.
[08:53:28] !log `root@archiva1002:/var/cache/archiva# sudo rm -rf temp*`
[08:53:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:55:13] (DiskSpace) resolved: Disk space archiva1002:9100:/ 1.945% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=archiva1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:08:27] Data-Platform-SRE, Patch-For-Review: Write new partman recipe for cloudelastic - https://phabricator.wikimedia.org/T342463 (fgiunchedi) Please excuse the drive-by comment, I've worked with @Muehlenhoff on standardizing our partman recipes and I'm wondering if the standard raid0 recipes (i.e. raid1 for `/...
[09:08:52] RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[09:25:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[09:29:57] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:36:28] Data-Engineering, Data Engineering and Event Platform Team, MediaWiki-libs-HTTP, Beta-Cluster-reproducible, and 6 others: PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource - https://phabricator.wikimedia.org/T288624 (Lucas_Werkmeister_WMDE) The m...
[09:58:09] Data-Engineering, Anti-Harassment, Data-Persistence, IP Masking, and 2 others: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Ladsgroup)
[10:19:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:04:30] btullis, joal o/
[11:04:33] did you see https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&var-hadoop_cluster=analytics-hadoop&viewPanel=28&from=now-30d&to=now ?
[11:05:17] an old alert that I've set for total files vs hdfs namenode heap size fire
[11:05:20] *fired
[11:05:25] but the graph looks very weird
[11:17:00] Data-Engineering, Machine-Learning-Team, Wikimedia Enterprise, Data Engineering and Event Platform Team (Sprint 2), and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (achou) a:achou...
[11:20:38] Data-Engineering, Data Engineering and Event Platform Team (Sprint 2), Event-Platform: [SPIKE] Can we identify indicators to inform an SLO for event emission and intake? - https://phabricator.wikimedia.org/T345195 (gmodena) I shared an initial draft to collate systems and instrumentation info interna...
[11:36:02] Here is a dashboard about HDFS usage: https://superset.wikimedia.org/superset/dashboard/409/
[11:36:02] The source could be identified by computing the diff of size for each node over the past month.
[11:37:38] Oh, interesting, I cannot see it because I am not in the `analytics-admins` group. This was skipped because I am in the `ops` group.
[11:37:43] https://usercontent.irccloud-cdn.com/file/CXv7QUKR/image.png
[11:38:27] I guess that brouberol and stevemunene sould be in the same boat.
[11:39:03] % s/sould be/are/
[11:46:24] Data-Engineering, Machine-Learning-Team, Wikimedia Enterprise, Data Engineering and Event Platform Team (Sprint 2), and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (achou) After discu...
[11:56:05] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (gmodena)
[12:05:26] mforns: Hello! It seems the superset dashboard for data quality doesn't show any data - is that expected?
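Editor's sketch for the 11:36:02 suggestion above (diffing per-path usage between two points in time to find what is growing). This is only an illustration under stated assumptions, not the team's tooling: the function name `largest_growth`, the snapshot dictionaries, and every path and number below are hypothetical. In practice the per-path file counts might be collected with something like `hdfs dfs -count <path>` at two different dates.

```python
from typing import Dict, List, Tuple


def largest_growth(
    before: Dict[str, int], after: Dict[str, int], top_n: int = 5
) -> List[Tuple[str, int]]:
    """Rank paths by how much their file count grew between two snapshots."""
    # A path missing from one snapshot is treated as having zero files there.
    paths = set(before) | set(after)
    deltas = {p: after.get(p, 0) - before.get(p, 0) for p in paths}
    # Largest growth first; ties are broken alphabetically for stable output.
    ranked = sorted(deltas.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical snapshots (path -> number of files), one month apart.
month_ago = {"/wmf/data": 1_000_000, "/user/alice": 200_000, "/tmp": 50_000}
today = {
    "/wmf/data": 1_050_000,
    "/user/alice": 900_000,
    "/tmp": 10_000,
    "/user/bob": 300_000,
}

for path, delta in largest_growth(month_ago, today, top_n=3):
    print(f"{path}\t{delta:+d}")
```

With the made-up numbers above, the path with the largest increase comes first, which is the kind of ranking that would point at the users or datasets driving the file-count trend.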
[12:23:51] Data-Platform-SRE, DC-Ops, ops-eqiad: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (BTullis)
[12:24:12] Data-Platform-SRE, DC-Ops, ops-eqiad: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (BTullis) p:Triage→Medium
[12:33:43] Data-Engineering, Data Engineering and Event Platform Team (Sprint 3), Event-Platform: eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (lbowmaker)
[12:58:23] xcollazo: Good morning :)
[12:58:43] xcollazo: We need to talk about file-numbers on hadoop :)
[13:24:51] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[13:27:59] This --^ is thanks to Marco and Cormac, having done some cleanup on their datasets - it doesn't solve the trend generated by dumps - we need to talk about this :)
[13:28:36] Thanks joal :-)
[13:32:08] (PS14) Btullis: Update to Superset version 2.0.1 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356)
[13:46:06] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Event-Platform: Add Antoine_Quhen and jallemandou to the mw-page-content-change-enrich namespace - https://phabricator.wikimedia.org/T347296 (gmodena)
[13:46:26] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Event-Platform: Add Antoine_Quhen and jallemandou to the mw-page-content-change-enrich namespace - https://phabricator.wikimedia.org/T347296 (gmodena)
[13:46:33] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineer...
[13:48:00] joal: it doesn't show data at the first load on purpose, so that you have to choose a filter... It's a bad hack, but otherwise it takes too long to load all the data. I think we could migrate that table to iceberg, it would probably improve the loading times a lot...
[13:50:42] btullis: ack about the experiments, but maybe it is worth bumping the hdfs heap for a while, we ran into problems in the past when we didn't do so in a timely manner
[13:52:45] joal: I looked at the mobile_os_distribution anomaly_detection alert, and it seems a harmless peak, like some we've seen before.
[13:56:48] mforns: hm, so I need to create a new filter?
[14:00:27] dsaez: Hi - I'm chasing people storing a big number of files on the cluster - And it happens you and a person you manage are two of them :)
[14:00:39] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Event-Platform: Add Antoine_Quhen and jallemandou to the mw-page-content-change-enrich namespace - https://phabricator.wikimedia.org/T347296 (BTullis) I believe that @JAllemandou may already have the required group mem...
[14:01:36] btullis: forgot to mention - kudos to Wales rugby team yesterday :)
[14:02:19] joal: no, no, just remove the filter called "remove me" on the left, and choose another one.
[14:02:21] joal: Thanks, but I was busy playing unicycle hockey and didn't watch it :-)
[14:02:37] btullis: Strong win!
[14:02:50] mforns: I can't see any filter on the left :(
[14:04:07] elukey: Ack, yes good thinking. I will prepare a patch. We haven't got an awful lot of headroom on these servers now, unfortunately.
[14:19:48] joal wanna quickly google meet?
[14:19:57] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:20:12] mforns: I'm in a meeting now, after?
[14:20:18] ok!
[14:20:26] thanks for offering :)
[14:20:33] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (Jclark-ctr) @BTullis i am on site today. otherwise we can do it tomorrow
[14:50:37] Data-Engineering, Data Engineering and Event Platform Team, MediaWiki-libs-HTTP, Beta-Cluster-reproducible, and 6 others: PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource - https://phabricator.wikimedia.org/T288624 (dancy) Yay! Thanks for taking...
[14:55:35] Data-Engineering, Data Engineering and Event Platform Team, MediaWiki-libs-HTTP, Beta-Cluster-reproducible, and 6 others: PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource - https://phabricator.wikimedia.org/T288624 (Lucas_Werkmeister_WMDE) @tsta...
[15:12:23] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (Jclark-ctr)
[15:12:34] Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (Jclark-ctr)
[15:33:32] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (VRiley-WMF) wdqs1020 - E 2. U 18 CableID 2303045000257 Port 38 wdqs1021 - F 2. U 42. CableID 2303045000256 Port 20 wdqs1022 - D 2. U 13. CableID 230304500202 Port 25 wdqs1023...
[15:59:01] mforns: have a minute now for superset?
[15:59:11] joal: sure!
[15:59:17] mforns: to the cave!
[15:59:24] yeah! :-)
[16:06:54] Heya xcollazo - would you have a minute?
[17:50:18] (PS6) Sharvaniharan: New Event schema for mobile apps [schemas/event/secondary] - https://gerrit.wikimedia.org/r/960100
[18:19:58] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:12:33] Data-Platform-SRE, Discovery-Search: Investigate performance differences between elastic2037-2054 and 2055-2086 - https://phabricator.wikimedia.org/T347338 (bking)
[19:13:22] Data-Platform-SRE, Discovery-Search: Investigate performance differences between elastic2037-2054 and 2055-2086 - https://phabricator.wikimedia.org/T347338 (bking)
[19:13:26] Data-Platform-SRE, Discovery-Search, SRE-OnFire, Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (bking)
[19:14:05] Data-Platform-SRE, Discovery-Search: Investigate performance differences between elastic2037-2054 and 2055-2086 - https://phabricator.wikimedia.org/T347338 (bking)
[19:14:21] Data-Platform-SRE, Discovery-Search: Investigate performance differences between elastic2037-2054 and 2055-2086 - https://phabricator.wikimedia.org/T347338 (RKemper)
[19:24:15] Data-Platform-SRE: Root cause Archiva outage from 2024-09-24 - https://phabricator.wikimedia.org/T347343 (xcollazo)
[19:25:12] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Restore service for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347284 (bking)
[19:25:21] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Restore service for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347284 (bking) Open→Resolved a:bking @MisterSynergy After applying the last patch, https://query.wikidata.org/bigdata/ldf seems to be back u...
[19:30:01] (CR) Milimetric: [C: +1] "I had a bunch of comments before your last refactor and I think you addressed all of them, this looks awesome and I'm a little scared that" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: Aqu)
[20:00:52] Data-Platform-SRE, Patch-For-Review: Write new partman recipe for cloudelastic - https://phabricator.wikimedia.org/T342463 (bking) >>! In T342463#9194350, @fgiunchedi wrote: > Please excuse the drive-by comment, I've worked with @Muehlenhoff on standardizing our partman recipes and I'm wondering if the s...
[20:12:50] (CR) Tsevener: New Event schema for mobile apps (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/960100 (owner: Sharvaniharan)
[20:16:27] Data-Platform-SRE, Discovery-Search: Investigate performance differences between elastic2037-2054 and 2055-2086 - https://phabricator.wikimedia.org/T347338 (EBernhardson) @RKemper noted that the difference here is the available memory. everything < 2055 has 128G of memory, >=2055 has 256G of memory. The...
[20:40:00] Data-Platform-SRE, Discovery-Search: Investigate performance differences between elastic2037-2054 and 2055-2086 - https://phabricator.wikimedia.org/T347338 (bking) Open→Resolved a:bking Upon further review, it looks like we have confirmed the reason for the performance differences. I don't th...
[20:40:02] Data-Platform-SRE, Discovery-Search, SRE-OnFire, Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (bking)
[21:21:41] (PS3) Kimberly Sarabia: Adds new web fragment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/960141 (https://phabricator.wikimedia.org/T346106)
[21:24:14] (PS4) Kimberly Sarabia: Adds new web fragment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/960141 (https://phabricator.wikimedia.org/T346106)
[21:29:52] Data-Platform-SRE, Wikidata-Query-Service: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (bking)
[21:30:05] Data-Platform-SRE, Wikidata-Query-Service: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (bking)
[21:30:08] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Restore service for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347284 (bking)
[21:47:28] Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (Jclark-ctr)
[21:48:21] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (Jclark-ctr)
[21:48:44] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (Jclark-ctr)
[21:52:12] Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (Jclark-ctr)
[21:53:09] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (Jclark-ctr)
[21:54:00] Data-Platform-SRE, Discovery-Search, SRE-OnFire, Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (bking)
[21:56:50] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (Jclark-ctr)
[22:08:59] Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye
[22:11:49] Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1018.eqiad.wmnet with OS bullseye
[22:11:52] Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1019.eqiad.wmnet with OS bullseye
[22:14:04] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye
[22:14:11] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1022.eqiad.wmnet with OS bullseye
[22:14:19] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1023.eqiad.wmnet with OS bullseye
[22:14:24] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1024.eqiad.wmnet with OS bullseye
[22:19:58] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:46:13] Data-Platform-SRE, Discovery-Search, SRE-OnFire, Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (RKemper)
[23:29:15] Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye executed with errors: - wdqs1017 (**FAIL**) - Remove...
[23:32:02] Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1018.eqiad.wmnet with OS bullseye executed with errors: - wdqs1018 (**FAIL**) - Remove...
[23:32:06] Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1019.eqiad.wmnet with OS bullseye executed with errors: - wdqs1019 (**FAIL**) - Remove...
[23:34:18] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye executed with errors: - wdqs1021 (**FAIL**...
[23:34:25] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1022.eqiad.wmnet with OS bullseye executed with errors: - wdqs1022 (**FAIL**...
[23:34:30] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1023.eqiad.wmnet with OS bullseye executed with errors: - wdqs1023 (**FAIL**...
[23:34:37] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1024.eqiad.wmnet with OS bullseye executed with errors: - wdqs1024 (**FAIL**...
[23:44:03] Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye
[23:44:55] Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1018.eqiad.wmnet with OS bullseye
[23:44:59] Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1019.eqiad.wmnet with OS bullseye
[23:45:11] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1024.eqiad.wmnet with OS bullseye
[23:45:24] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye
[23:49:41] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1022.eqiad.wmnet with OS bullseye