[01:25:06] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[01:29:42] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:31:13] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:28] <jinxer-wm>	 (DiskSpace) firing: Disk space archiva1002:9100:/ 1.854% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=archiva1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:25:06] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[05:29:57] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:30:28] <jinxer-wm>	 (DiskSpace) firing: Disk space archiva1002:9100:/ 1.948% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=archiva1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:33:50] <stevemunene>	 Hi aqu, I'd like some help with the `monitor_refine_event.service` failure. Are you available?
[06:41:13] <aqu>	 Hello stevemunene!
[06:42:23] <stevemunene>	 o/
[07:18:31] * brouberol waves good morning
[08:43:11] <btullis>	 Morning all.
[08:48:50] <stevemunene>	 Morning btullis 
[08:49:49] <btullis>	 I'm checking out this archiva disk space issue first.
[08:52:56] <btullis>	 I'm going to delete the directories: `/var/cache/archiva/temp*` - I think that they were left behind by failed index operations.
[08:53:28] <btullis>	 !log `root@archiva1002:/var/cache/archiva# sudo rm -rf temp*`
[08:53:30] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
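[editor's note] Before removing leftover directories like the `/var/cache/archiva/temp*` ones mentioned above, it can help to see how much space each one holds. The following is a minimal sketch, not the procedure used on archiva1002: the helper names `dir_size_bytes` and `temp_dir_report` are hypothetical, and it only reports sizes without deleting anything.

```python
import os
from pathlib import Path


def dir_size_bytes(root):
    """Sum the sizes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def temp_dir_report(cache_dir, pattern="temp*"):
    """Return [(path, size_bytes)] for matching directories, largest first.

    cache_dir and pattern are illustrative defaults modelled on the
    `temp*` glob from the log; nothing is removed by this function.
    """
    dirs = [p for p in Path(cache_dir).glob(pattern) if p.is_dir()]
    report = [(str(p), dir_size_bytes(p)) for p in dirs]
    return sorted(report, key=lambda item: (-item[1], item[0]))
```

Running `temp_dir_report("/var/cache/archiva")` on the host would list candidate directories by size, so the amount of reclaimable space is known before any `rm -rf`.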
[08:55:13] <jinxer-wm>	 (DiskSpace) resolved: Disk space archiva1002:9100:/ 1.945% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=archiva1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:08:27] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Write new partman recipe for cloudelastic - https://phabricator.wikimedia.org/T342463 (10fgiunchedi) Please excuse the drive-by comment, I've worked with @Muehlenhoff on standardizing our partman recipes and I'm wondering if the standard raid0 recipes (i.e. raid1 for `/...
[09:08:52] <icinga-wm>	 RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[09:25:06] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[09:29:57] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:36:28] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-libs-HTTP, 10Beta-Cluster-reproducible, and 6 others: PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource - https://phabricator.wikimedia.org/T288624 (10Lucas_Werkmeister_WMDE) The m...
[09:58:09] <wikibugs>	 10Data-Engineering, 10Anti-Harassment, 10Data-Persistence, 10IP Masking, and 2 others: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (10Ladsgroup)
[10:19:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:04:30] <elukey>	 btullis, joal o/
[11:04:33] <elukey>	 did you see https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&var-hadoop_cluster=analytics-hadoop&viewPanel=28&from=now-30d&to=now ?
[11:05:17] <elukey>	 an old alert that I set for total files vs HDFS namenode heap size fired
[11:05:25] <elukey>	 but the graph looks very weird
[11:17:00] <wikibugs>	 10Data-Engineering, 10Machine-Learning-Team, 10Wikimedia Enterprise, 10Data Engineering and Event Platform Team (Sprint 2), and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10achou) a:05achou...
[11:20:38] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: [SPIKE] Can we identify indicators to inform an SLO for event emission and intake? - https://phabricator.wikimedia.org/T345195 (10gmodena) I shared an initial draft to collate systems and instrumentation info interna...
[11:36:02] <aqu>	 Here is a dashboard about HDFS usage: https://superset.wikimedia.org/superset/dashboard/409/
[11:36:02] <aqu>	 The source could be identified by computing the size diff for each node over the past month.
[11:37:38] <btullis>	 Oh, interesting, I cannot see it because I am not in the `analytics-admins` group. This was skipped because I am in the `ops` group.
[11:37:43] <btullis>	 https://usercontent.irccloud-cdn.com/file/CXv7QUKR/image.png
[11:38:27] <btullis>	 I guess that brouberol and stevemunene are in the same boat.
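[editor's note] aqu's suggestion of computing a per-node diff over a month can be sketched as follows. This is an illustration under assumptions: the two snapshots are taken to be plain mappings from an HDFS path to a file count (or size), and the function name `rank_growth` and the sample paths are hypothetical, not taken from the dashboard.

```python
def rank_growth(before, after, top=10):
    """Rank paths by how much they grew between two snapshots.

    before / after: dicts mapping a path to a count (or size).
    Paths missing from one snapshot are treated as 0 there.
    Returns up to `top` (path, delta) pairs, largest increase first;
    ties are broken alphabetically by path.
    """
    paths = set(before) | set(after)
    deltas = {p: after.get(p, 0) - before.get(p, 0) for p in paths}
    return sorted(deltas.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Hypothetical sample data, for illustration only.
month_ago = {"/wmf/data": 100, "/user/a": 50, "/tmp": 10}
today = {"/wmf/data": 120, "/user/a": 400, "/user/b": 30}
top_growers = rank_growth(month_ago, today, top=3)
```

With the sample data, the path with the largest increase comes first, which is the kind of signal needed to find who is adding the most files.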
[11:46:24] <wikibugs>	 10Data-Engineering, 10Machine-Learning-Team, 10Wikimedia Enterprise, 10Data Engineering and Event Platform Team (Sprint 2), and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10achou) After discu...
[11:56:05] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10gmodena)
[12:05:26] <joal>	 mforns: Hello! It seems the superset dashboard for data quality doesn't show any data - is that expected?
[12:23:51] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10ops-eqiad: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (10BTullis)
[12:24:12] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10ops-eqiad: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (10BTullis) p:05Triage→03Medium
[12:33:43] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10lbowmaker)
[12:58:23] <joal>	 xcollazo: Good morning :)
[12:58:43] <joal>	 xcollazo: We need to talk about file-numbers on hadoop :)
[13:24:51] <jinxer-wm>	 (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[13:27:59] <joal>	 This --^ is thanks to Marco and Cormac, having done some cleanup on their datasets - it doesn't solve the trend generated by dumps - we need to talk about this :)
[13:28:36] <btullis>	 Thanks joal :-)
[13:32:08] <wikibugs>	 (03PS14) 10Btullis: Update to Superset version 2.0.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356)
[13:46:06] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Event-Platform: Add Antoine_Quhen and jallemandou to the mw-page-content-change-enrich namespace - https://phabricator.wikimedia.org/T347296 (10gmodena)
[13:46:26] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Event-Platform: Add Antoine_Quhen and jallemandou to the mw-page-content-change-enrich namespace - https://phabricator.wikimedia.org/T347296 (10gmodena)
[13:46:33] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines, 10Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineer...
[13:48:00] <mforns>	 joal: it doesn't show data at the first load on purpose, so that you have to choose a filter... It's a bad hack, but otherwise it takes too long to load all the data. I think we could migrate that table to iceberg, it would probably improve the loading times a lot...
[13:50:42] <elukey>	 btullis: ack about the experiments, but maybe it is worth bumping the HDFS heap for a while; we ran into problems in the past when we didn't do so in a timely manner
[13:52:45] <mforns>	 joal: I looked at the mobile_os_distribution anomaly_detection alert, and it seems a harmless peak, like some we've seen before.
[13:56:48] <joal>	 mforns: hm, so I need to create a new filter?
[14:00:27] <joal>	 dsaez: Hi - I'm chasing people storing large numbers of files on the cluster - and it happens that you and a person you manage are two of them :)
[14:00:39] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Event-Platform: Add Antoine_Quhen and jallemandou to the mw-page-content-change-enrich namespace - https://phabricator.wikimedia.org/T347296 (10BTullis) I believe that @JAllemandou may already have the required group mem...
[14:01:36] <joal>	 btullis: forgot to mention - kudos to Wales rugby team yesterday :)
[14:02:19] <mforns>	 joal: no, no, just remove the filter called "remove me" on the left, and choose another one.
[14:02:21] <btullis>	 joal: Thanks, but I was busy playing unicycle hockey and didn't watch it :-)
[14:02:37] <joal>	 btullis: Strong win!
[14:02:50] <joal>	 mforns: I can't see any filter on the left :(
[14:04:07] <btullis>	 elukey: Ack, yes good thinking. I will prepare a patch. We haven't got an awful lot of headroom on these servers now, unfortunately.
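[editor's note] The relationship between file count and NameNode heap discussed above can be approximated with a back-of-the-envelope calculation. The sketch below assumes the commonly cited rule of thumb of roughly 1 GiB of heap per million files; that ratio, the 20% headroom, and the function names are assumptions for illustration, not the threshold used by the HdfsTotalFilesHeap alert.

```python
def files_supported(heap_gib, gib_per_million=1.0):
    """Estimate how many files a given heap can hold (rule of thumb)."""
    return int(heap_gib / gib_per_million * 1_000_000)


def heap_needed_gib(total_files, gib_per_million=1.0, headroom=0.2):
    """Estimate the heap (GiB) for total_files, with fractional headroom.

    gib_per_million and headroom are assumed values; tune them to the
    cluster's observed behaviour.
    """
    return total_files / 1_000_000 * gib_per_million * (1 + headroom)
```

For example, `heap_needed_gib(50_000_000)` gives about 60 GiB under these assumptions, which shows why limited server headroom makes a heap bump a short-term measure.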
[14:19:48] <mforns>	 joal wanna quickly google meet?
[14:19:57] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:20:12] <joal>	 mforns: I'm in a meeting now, after?
[14:20:18] <mforns>	 ok!
[14:20:26] <joal>	 thanks for offering :)
[14:20:33] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (10Jclark-ctr) @BTullis  i am on site today.  otherwise we can do it tomorrow
[14:50:37] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-libs-HTTP, 10Beta-Cluster-reproducible, and 6 others: PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource - https://phabricator.wikimedia.org/T288624 (10dancy) Yay! Thanks for taking...
[14:55:35] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-libs-HTTP, 10Beta-Cluster-reproducible, and 6 others: PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource - https://phabricator.wikimedia.org/T288624 (10Lucas_Werkmeister_WMDE) @tsta...
[15:12:23] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr)
[15:12:34] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10Jclark-ctr)
[15:33:32] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10VRiley-WMF) wdqs1020 - E 2. U 18 CableID 2303045000257 Port 38 wdqs1021 - F 2. U 42. CableID 2303045000256 Port 20 wdqs1022 - D 2. U 13. CableID 230304500202 Port 25 wdqs1023...
[15:59:01] <joal>	 mforns: have a minute now for superset?
[15:59:11] <mforns>	 joal: sure!
[15:59:17] <joal>	 mforns: to the cave!
[15:59:24] <mforns>	 yeah! :-)
[16:06:54] <joal>	 Heya xcollazo - would you have a minute?
[17:50:18] <wikibugs>	 (03PS6) 10Sharvaniharan: New Event schema for mobile apps [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960100
[18:19:58] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:12:33] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search: Investigate performance differences between elastic2037-2054 and 2055-2086 - https://phabricator.wikimedia.org/T347338 (10bking)
[19:13:22] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search: Investigate performance differences between elastic2037-2054 and 2055-2086 - https://phabricator.wikimedia.org/T347338 (10bking)
[19:13:26] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search, 10SRE-OnFire, 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10bking)
[19:14:05] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search: Investigate performance differences between elastic2037-2054 and 2055-2086 - https://phabricator.wikimedia.org/T347338 (10bking)
[19:14:21] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search: Investigate performance differences between elastic2037-2054 and 2055-2086 - https://phabricator.wikimedia.org/T347338 (10RKemper)
[19:24:15] <wikibugs>	 10Data-Platform-SRE: Root cause Archiva outage from 2024-09-24 - https://phabricator.wikimedia.org/T347343 (10xcollazo)
[19:25:12] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Restore service for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347284 (10bking)
[19:25:21] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Restore service for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347284 (10bking) 05Open→03Resolved a:03bking @MisterSynergy After applying the last patch, https://query.wikidata.org/bigdata/ldf seems to be back u...
[19:30:01] <wikibugs>	 (03CR) 10Milimetric: [C: 03+1] "I had a bunch of comments before your last refactor and I think you addressed all of them, this looks awesome and I'm a little scared that" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu)
[20:00:52] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Write new partman recipe for cloudelastic - https://phabricator.wikimedia.org/T342463 (10bking) >>! In T342463#9194350, @fgiunchedi wrote: > Please excuse the drive-by comment, I've worked with @Muehlenhoff on standardizing our partman recipes and I'm wondering if the s...
[20:12:50] <wikibugs>	 (03CR) 10Tsevener: New Event schema for mobile apps (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960100 (owner: 10Sharvaniharan)
[20:16:27] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search: Investigate performance differences between elastic2037-2054 and 2055-2086 - https://phabricator.wikimedia.org/T347338 (10EBernhardson) @RKemper noted that the difference here is the available memory. everything  < 2055 has 128G of memory, >=2055 has 256G of memory. The...
[20:40:00] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search: Investigate performance differences between elastic2037-2054 and 2055-2086 - https://phabricator.wikimedia.org/T347338 (10bking) 05Open→03Resolved a:03bking Upon further review, it looks like we have confirmed the reason for the performance differences. I don't th...
[20:40:02] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search, 10SRE-OnFire, 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10bking)
[21:21:41] <wikibugs>	 (03PS3) 10Kimberly Sarabia: Adds new web fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960141 (https://phabricator.wikimedia.org/T346106)
[21:24:14] <wikibugs>	 (03PS4) 10Kimberly Sarabia: Adds new web fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960141 (https://phabricator.wikimedia.org/T346106)
[21:29:52] <wikibugs>	 10Data-Platform-SRE, 10Wikidata-Query-Service: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (10bking)
[21:30:05] <wikibugs>	 10Data-Platform-SRE, 10Wikidata-Query-Service: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (10bking)
[21:30:08] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Restore service for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347284 (10bking)
[21:47:28] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10Jclark-ctr)
[21:48:21] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr)
[21:48:44] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr)
[21:52:12] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10Jclark-ctr)
[21:53:09] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr)
[21:54:00] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search, 10SRE-OnFire, 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10bking)
[21:56:50] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr)
[22:08:59] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye
[22:11:49] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1018.eqiad.wmnet with OS bullseye
[22:11:52] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1019.eqiad.wmnet with OS bullseye
[22:14:04] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye
[22:14:11] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1022.eqiad.wmnet with OS bullseye
[22:14:19] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1023.eqiad.wmnet with OS bullseye
[22:14:24] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1024.eqiad.wmnet with OS bullseye
[22:19:58] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:46:13] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search, 10SRE-OnFire, 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10RKemper)
[23:29:15] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye executed with errors: - wdqs1017 (**FAIL**)   - Remove...
[23:32:02] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1018.eqiad.wmnet with OS bullseye executed with errors: - wdqs1018 (**FAIL**)   - Remove...
[23:32:06] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1019.eqiad.wmnet with OS bullseye executed with errors: - wdqs1019 (**FAIL**)   - Remove...
[23:34:18] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye executed with errors: - wdqs1021 (**FAIL**...
[23:34:25] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1022.eqiad.wmnet with OS bullseye executed with errors: - wdqs1022 (**FAIL**...
[23:34:30] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1023.eqiad.wmnet with OS bullseye executed with errors: - wdqs1023 (**FAIL**...
[23:34:37] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1024.eqiad.wmnet with OS bullseye executed with errors: - wdqs1024 (**FAIL**...
[23:44:03] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye
[23:44:55] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1018.eqiad.wmnet with OS bullseye
[23:44:59] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1019.eqiad.wmnet with OS bullseye
[23:45:11] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1024.eqiad.wmnet with OS bullseye
[23:45:24] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye
[23:49:41] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1022.eqiad.wmnet with OS bullseye