[01:16:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:17:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:32:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:32:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:27:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:35:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[04:29:47] Quarry: Quarry exports integers as floats to wikitable - https://phabricator.wikimedia.org/T151106 (Audiodude) Documenting my investigation (no solution found). With this query against mywiki in dev: ` SELECT @rownum := @rownum + 1 AS rank, page_title FROM (SELECT page_title FROM page) t, (SELECT @rownum...
[04:33:41] Quarry: Quarry exports integers as floats to wikitable - https://phabricator.wikimedia.org/T151106 (Audiodude) Another puzzling part is that MariaDB doesn't appear to be returning results as floats. I exposed the mywiki MariaDB in docker and ran this: ` -------------- SELECT @rownum := @rownum + 1 AS rank,...
[04:34:45] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (Audiodude) I assume we need some kind of access to the Github repo too? https://github.com/toolforge/quarry
[06:28:01] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[07:47:46] Data-Platform-SRE: Package kafkakit-prometheus-metricsfetcher as a debian package - https://phabricator.wikimedia.org/T348214 (brouberol)
[07:57:47] Data-Platform-SRE, Patch-For-Review: Package kafkakit-prometheus-metricsfetcher as a debian package - https://phabricator.wikimedia.org/T348214 (CodeReviewBot) brouberol opened https://gitlab.wikimedia.org/repos/sre/kafka-kit/-/merge_requests/3 Drop metricsfetcher from the binaries installed by the kafk...
[08:19:28] elukey: just FYI, I'm going to pause my work of moving partitions away from kafka-jumbo100[1-6], as I'm starting to bump into an imbalance problem: optimizing leadership causes network imbalances, as many partitions are empty.
[08:20:17] I read https://phabricator.wikimedia.org/T341558 and saw that you used a version of metricsfetcher relying on the prometheus API, so I took the liberty to send MRs to debian package it, so we can apply the same logic and tools there as well
[08:22:11] we can see the FetchFollower metric increasing for kafka-jumbo1009, as I go through batches of topic reassignments, and I'd rather we flatten that before I continue. WDYT?
[08:23:09] Data-Platform-SRE, Patch-For-Review: Package kafkakit-prometheus-metricsfetcher as a debian package - https://phabricator.wikimedia.org/T348214 (brouberol) Debian packaging rules MR for `kafka-kit-prometheus-metricsfetcher` https://gitlab.wikimedia.org/repos/sre/kafkakit-prometheus-metricsfetcher/-/merge...
[08:23:59] Data-Platform-SRE, Patch-For-Review: Package kafkakit-prometheus-metricsfetcher as a debian package - https://phabricator.wikimedia.org/T348214 (CodeReviewBot) brouberol updated https://gitlab.wikimedia.org/repos/sre/kafkakit-prometheus-metricsfetcher/-/merge_requests/1 Add debian packaging rules
[09:49:51] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[10:12:44] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:22:44] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:32:42] brouberol: o/ I am out today but +1 to keep going!
[11:09:36] gotcha, sorry about the ping while you're OOO
[11:46:46] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Manuel) Hi @Stevemunene and @BTullis, thank you for reaching out about this! I was unaware that all the cronjobs are being hosted/started from stat1007. I'll look into this and come...
[12:05:04] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (rook) >>! In T348184#9226989, @Audiodude wrote: > I assume we need some kind of access to the Github repo too? https://github.com/toolforge/quarry Oh that would be helpful, wouldn't it :) What are yinz github accounts?
[12:38:46] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (dr0ptp4kt) Drawing from your inspiration, I downloaded with `wget` overnight and the `sha1sum` now matches that from `wdqs1016`. Deflating now,...
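Note on the 12:38:46 comment above: the check described there (download, then compare the `sha1sum` against the value reported on another host) can be sketched in Python as below. This is only an illustration under assumptions: the function names `sha1_of_file` and `matches_reference` are hypothetical, and the file path and reference digest are supplied by the caller, since neither appears in the log.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 digest of a file, reading it in fixed-size chunks.

    Chunked reads keep memory use flat, which matters for very large
    files such as database journal dumps.
    """
    digest = hashlib.sha1()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reference(path, reference_hex):
    """Compare a local file's SHA-1 with a reference digest from another host.

    The reference is normalised (whitespace stripped, lower-cased) so that
    output pasted from a terminal compares cleanly.
    """
    return sha1_of_file(path) == reference_hex.strip().lower()
```

A caller would pass the downloaded file and the digest copied from the reference host, and only start decompressing when `matches_reference` returns `True`.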
[12:42:08] btullis for when you have a bit of time, I've added you as a reviewer to https://gitlab.wikimedia.org/repos/sre/kafka-kit/-/merge_requests/3 and https://gitlab.wikimedia.org/repos/sre/kafkakit-prometheus-metricsfetcher/-/merge_requests/1, two small debian-packaging-related MRs
[12:44:09] also, unrelated question: do we know what are these hourly network spikes? https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=thanos&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&var-disk_device=All&from=1696498995792&to=1696509795792&viewPanel=31 could that be Gobblin pulling data from Kafka to
[12:44:09] HDFS?
[12:45:25] brouberol: Will look at those, thanks.
[12:45:37] thank you!
[12:46:43] brouberol: I think that's exactly what you have in mind :) Gobblin is scheduled every 10mins for webrequest data, and every hour for event data
[12:47:19] thanks for confirming my hunch!
[12:47:23] I was going to say the same thing, ish. We have the following three timers on an-launcher, which fire at 5 past each hour.
[12:47:28] https://www.irccloud.com/pastebin/ADiAMoAG/
[12:47:54] Oh gobblin-netflow is not hourly.
[12:49:33] So yes, I think that the hourly spike is probably the combination of those three. Refine will be reading from HDFS and then writing new files to HDFS as well, gobblin will be just writing to HDFS.
[12:52:02] gotcha, thank you!
[12:57:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:58:00] FYI; I'll be rebooting matomo1002 in ~ 5m
[12:58:51] moritzm: ack, thanks.
[13:09:18] and completed
[13:12:44] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:22:41] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (SD0001) My github id is `siddharthvp`. Also, how do we login to the instances where quarry runs? Doesn't seem to be documented on wikitech.
[13:53:37] Data-Engineering, Data Engineering and Event Platform Team (Sprint 3), Event-Platform, Security: user_text exposed in public event streams when it should be hidden via user blocks - https://phabricator.wikimedia.org/T348252 (Ottomata)
[13:53:44] Data-Engineering, Data Engineering and Event Platform Team (Sprint 3), Event-Platform, Security: user_text exposed in public event streams when it should be hidden via user blocks - https://phabricator.wikimedia.org/T348252 (Ottomata) p:Triage→High
[13:54:04] Data-Engineering, Data Engineering and Event Platform Team (Sprint 3), Event-Platform, Security, Vuln-Infoleak: user_text exposed in public event streams when it should be hidden via user blocks - https://phabricator.wikimedia.org/T348252 (Ottomata)
[13:57:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:06:38] Data-Platform-SRE, Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (Stevemunene) Added druid1009 and druid1010 to the role(druid::public::worker) and to the druid_public_hosts firewall block and the two were able to join the druid cluster and are currently...
[14:06:42] Data-Platform-SRE, Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (Stevemunene)
[14:12:51] (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:15:42] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (rook) >>! In T348184#9228143, @SD0001 wrote: > My github id is `siddharthvp`. Also, how do we login to the instances where quarry runs? Doesn't seem to be documented on wikitech. I've sent a github invite for 'read' could you ver...
[15:00:05] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (bking)
[15:00:12] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (bking) Open→In progress We're unblocked now, and we were able to test some flink operati...
[15:12:43] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (bking)
[15:12:49] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (bking) In progress→Resolved At this point, I am confident that we have enough informatio...
[15:13:12] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (bking)
[15:14:52] Data-Engineering, Product-Analytics, Wmfdata-Python, Wikidata Analytics (Kanban): Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (AndrewTavis_WMDE)
[15:29:59] Interesting finding: I ran `kafka preferred-replica-election` which resulted in Brokers preferred replicas imbalance count going to 0 and a much more homogenous latency / broker (see
[15:29:59] https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=thanos&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&var-disk_device=All&from=1696518744280&to=1696519565407)
[15:30:01] !log failed over test cluster hadoop namenode services to an-test-master1002
[15:30:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:30:27] I assumed that the auto.leader.rebalance.enable setting (true by default) would take care of that automatically
[15:31:47] brouberol: I would assume that too from the name of the setting. Good work anyway :-) 👍
[15:32:34] thanks. It also means that I misunderstood what that setting was doing, and I could do with more research. /me makes a note for later
[15:42:34] aah, I think that leader.imbalance.per.broker.percentage is at play here: we were < 10% imbalance, so the controller didn't trigger a leader rebalance on its own
[15:49:48] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: Add $comment and $performer to ArticleRevisionVisibilitySet params - https://phabricator.wikimedia.org/T321411 (Ottomata) Declined→Open This is valid after all. We do want this info in the hooks, we just don't want t...
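Note on the 15:42:34 remark above: with `auto.leader.rebalance.enable` on, the Kafka controller only moves leadership back to preferred replicas for a broker whose imbalance ratio is above `leader.imbalance.per.broker.percentage` (10 by default). The sketch below is a rough Python model of that per-broker check, not Kafka's own code: the function names and the `(preferred_leader, current_leader)` tuple layout are hypothetical, and the sample numbers are made up for illustration rather than taken from kafka-jumbo.

```python
def imbalance_counts(partitions, broker_id):
    """Count partitions a broker should lead but currently does not.

    `partitions` is a list of (preferred_leader, current_leader) broker ids,
    a hypothetical layout chosen for this sketch.
    Returns (not_leading, total_preferred).
    """
    current_leaders = [cur for pref, cur in partitions if pref == broker_id]
    not_leading = sum(1 for cur in current_leaders if cur != broker_id)
    return not_leading, len(current_leaders)


def exceeds_threshold(partitions, broker_id, threshold_percent=10):
    """Return True if the broker's imbalance is strictly above the threshold.

    Integer arithmetic avoids floating-point surprises at the boundary:
    exactly 10% imbalance does not count as exceeding a threshold of 10.
    """
    not_leading, total = imbalance_counts(partitions, broker_id)
    return total > 0 and not_leading * 100 > threshold_percent * total


# Made-up example: broker 1 leads 19 of its 20 preferred partitions (5%),
# broker 2 leads 8 of its 10 preferred partitions (20%).
example = [(1, 1)] * 19 + [(1, 2)] + [(2, 2)] * 8 + [(2, 1)] * 2
print(exceeds_threshold(example, 1))  # broker 1 stays under the threshold
print(exceeds_threshold(example, 2))  # broker 2 is over it
```

Under this model a cluster can sit just below the threshold indefinitely, which is consistent with a manual preferred replica election still being needed to bring the imbalance count to 0.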
[15:59:37] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (Audiodude) I'm `audiodude` on github. Thanks!
[16:06:59] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (bking) a:VRiley-WMF
[16:22:17] (PS16) Phuedx: Add analytics/metrics_platform/{app,web}/base schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[16:26:37] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (rook) >>! In T348184#9228819, @Audiodude wrote: > I'm `audiodude` on github. Thanks! Added. Similarly please confirm here that I added the right person and I'll up the permission.
[16:36:22] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (SD0001) Confirming that I got the invite. (And am able to login to the instances now.) Thanks.
[17:13:32] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:30:03] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (Audiodude) Confirmed: I got the github invite. I can also access the instances with my wikitech account, thanks!
[18:03:32] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:04:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[18:14:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[18:17:10] Data-Engineering, Data Engineering and Event Platform Team, Documentation, Epic, Event-Platform: Event Platform Value Stream Documentation Tasks - https://phabricator.wikimedia.org/T329628 (TBurmeister)
[18:18:32] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:20:36] Data-Engineering, Documentation: User-centric documentation links - https://phabricator.wikimedia.org/T329550 (TBurmeister) p:Triage→Medium a:TBurmeister
[18:22:17] Data-Engineering, Data-Catalog, Documentation: Data Catalog Documentation Style Guide - https://phabricator.wikimedia.org/T310229 (TBurmeister)
[18:42:24] (PS17) Clare Ming: Add analytics/metrics_platform/{app,web}/base schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[19:54:08] Data-Engineering, Data Engineering and Event Platform Team (Sprint 3), Event-Platform: mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (dcausse) I believe that you should still attempt a couple retries on badrevids if the event...
[20:03:26] Data-Engineering, Data Engineering and Event Platform Team (Sprint 3), Event-Platform: mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (Ottomata) Hm, yeah in hindsight, we know what the max replica lag we allow is, right? So w...
[20:14:14] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[20:14:17] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye
[20:29:05] Data-Engineering, Data Engineering and Event Platform Team (Sprint 3), Event-Platform: mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (dcausse) >>! In T347884#9229421, @Ottomata wrote: > Hm, yeah in hindsight, we know what the...
[21:22:29] Data-Engineering, Data Engineering and Event Platform Team, EventStreams, Event-Platform, Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (Ottomata) EventGate status update: - [[ https://github.com/wikimedia/eventgate | EventGate ]]...
[21:22:41] (PS1) Milimetric: Expand mediawiki_project_namespace_map table [analytics/refinery] - https://gerrit.wikimedia.org/r/963835
[21:23:24] (PS1) Milimetric: [WIP] Add siteinfo information to output XML [analytics/refinery/source] - https://gerrit.wikimedia.org/r/963836
[21:26:28] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (dr0ptp4kt) Addressing @Addshore's comment in T344905#9210122... > I think the amount of time taken to decompress the JNL file should also be ta...
[21:26:58] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (rook) Awesome, Ok I think everyone is all connected. Let me know if I missed anything, feel free to poke me with any questions.
[21:27:06] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (rook) Open→Resolved
[21:32:43] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[21:34:56] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w...
[22:18:32] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:37:27] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudelastic1007....
[22:59:39] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudelastic1007.eqia...
[23:00:20] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (Papaul) @bking I tried to do the re-images on cloudelastic1007, the re-image finished with the OS install without a...
[23:02:47] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[23:22:37] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye