[04:21:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_main_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:50:02] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:55:32] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:50] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[05:09:14] PROBLEM - MegaRAID on an-worker1132 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:09:16] ACKNOWLEDGEMENT - MegaRAID on an-worker1132 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T330971 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:09:50] (HdfsCorruptBlocks) resolved: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[05:41:02] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:08:06] (CR) Gmodena: [C: +1] "lgtm." [schemas/event/primary] - https://gerrit.wikimedia.org/r/893518 (https://phabricator.wikimedia.org/T330918) (owner: Ottomata)
[07:58:53] Data-Engineering, Data-Persistence, Infrastructure-Foundations, Machine-Learning-Team, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[08:36:54] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate slownesses on an-worker1132 - https://phabricator.wikimedia.org/T330979 (nfraison)
[08:37:15] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (nfraison)
[08:37:44] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (nfraison) a: BTullis→nfraison
[08:41:38] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate slownesses on an-worker1132 - https://phabricator.wikimedia.org/T330979 (nfraison) See https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Decommis...
[09:10:36] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate slownesses on an-worker1132 - https://phabricator.wikimedia.org/T330979 (nfraison) Node an-worker1132 under decommission
[09:32:43] PROBLEM - Hadoop NodeManager on an-worker1132 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:46:02] (CR) Gehel: [C: +1] Remove Guava from dependency (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[09:47:08] Data-Engineering, SRE, ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (Peachey88)
[10:00:25] !log commencing second attempt to upgrade airflow on an-test-client1001 to version 2.5.1
[10:00:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:10:31] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:48:04] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (nfraison)
[11:01:58] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui) @ayounsi @akosiaris @Joe to confirm, we are going to depool eqiad before this maintenance like we've done in codfw right?
[11:02:14] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[11:19:40] (CR) Kosta Harlan: [C: +2] homepagevisit: Add referer_route:postedit-panel-nonsuggested [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893098 (https://phabricator.wikimedia.org/T330727) (owner: Gergő Tisza)
[11:20:17] (Merged) jenkins-bot: homepagevisit: Add referer_route:postedit-panel-nonsuggested [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893098 (https://phabricator.wikimedia.org/T330727) (owner: Gergő Tisza)
[11:20:19] (CR) Kosta Harlan: [C: +2] helppanel: Add postedit-nonsuggested context [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893092 (https://phabricator.wikimedia.org/T330727) (owner: Gergő Tisza)
[11:20:54] (Merged) jenkins-bot: helppanel: Add postedit-nonsuggested context [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893092 (https://phabricator.wikimedia.org/T330727) (owner: Gergő Tisza)
[12:12:36] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (akosiaris) >>! In T330165#8660042, @Marostegui wrote: > @ayounsi @akosiaris @Joe to confirm, we are going to depool eqiad before this m...
[12:26:27] (CR) Joal: Remove Guava from dependency (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[12:32:54] !log Rerun mediawiki-history-denormalize-wf-2023-02
[12:32:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:37:31] Data-Engineering: Use RDD checkpointing in Mediawiki-History spark job - https://phabricator.wikimedia.org/T331003 (JAllemandou)
[12:52:12] Data-Engineering-Planning, Shared-Data-Infrastructure: Refactor analytics-meta MariaDB layout to use an-mariadb100[12] - https://phabricator.wikimedia.org/T284150 (BTullis)
[12:57:40] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate slownesses on an-worker1132 - https://phabricator.wikimedia.org/T330979 (nfraison) Downtime node to avoid false alert
[13:12:24] btullis: have you at any point managed grants on the clouddb hosts? Or is that something we leave to the data-persistence folks?
[13:15:05] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (BTullis)
[13:18:32] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui) >>! In T330165#8660202, @akosiaris wrote: >>>! In T330165#8660042, @Marostegui wrote: >> @ayounsi @akosiaris @Joe to confir...
[13:27:10] !log airflow on an-test-client1001 is migrated to version 2.5.1
[13:27:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:28:52] (PS1) Kosta Harlan: helppanel: Add support for trynewtask dialog [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893734 (https://phabricator.wikimedia.org/T330637)
[13:52:56] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): mediawiki-event-enrichment should support the latest eventutilities-python changes - https://phabricator.wikimedia.org/T330994 (lbowmaker)
[13:53:32] Data-Engineering-Planning, Event-Platform Value Stream: mediawiki-event-enrichment should support the latest eventutilities-python changes - https://phabricator.wikimedia.org/T330994 (lbowmaker)
[14:00:31] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate slownesses on an-worker1132 - https://phabricator.wikimedia.org/T330979 (nfraison) a: nfraison
[14:05:50] btullis, nfraison: FYI in case you missed my previous message in #wikimedia-operations, the icinga alert for changes on every puppet run is reporting many (all?) aqs servers in codfw. They seem to change at every run
[14:06:13] Data-Engineering, Cloud-Services, Wikimedia Enterprise: Data request: make rendered HTML page dumps available on stats machines or labs - https://phabricator.wikimedia.org/T331018 (awight)
[14:06:26] Data-Engineering-Planning, Event-Platform Value Stream, Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (achou) @Isaac @Ottomata I dig a bit more into the event schema (https://schema.wikimedia.org/#!/) today and have some thoughts...
[14:06:26] volans: ack looking
[14:06:33] it changes
[14:06:34] /etc/default/wikimedia-lvs-realserver with
[14:06:34] -LVS_SERVICE_IPS="10.0.5.3"
[14:06:34] +LVS_SERVICE_IPS=""
[14:06:41] but then puppet runs /usr/sbin/dpkg-reconfigure -p critical -f noninteractive wikimedia-lvs-realserver
[14:06:44] that I guess restores it
[14:07:24] unrelated, puppet is broken on an-test-worker1001: E: Package 'python3.7' has no installation candidate
[14:09:05] an-test-worker1001 -> bullseye reimage test on it with indeed some blocker so far. It should have been silenced by btullis
[14:11:13] has notifications disabled indeed, I'll check with j.bond as it should be skipped (don't know if the patch is already live or not)
[14:12:36] Oh thanks both for looking. Apologies for missing the ping volans.
[14:12:56] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): mediawiki-event-enrichment should support the latest eventutilities-python changes - https://phabricator.wikimedia.org/T330994 (JArguello-WMF)
[14:13:54] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): mediawiki-event-enrichment should support the latest eventutilities-python changes - https://phabricator.wikimedia.org/T330994 (JArguello-WMF)
[14:14:31] np, thx
[14:15:40] Data-Engineering-Planning, Event-Platform Value Stream, Data Pipelines (Sprint 09), Epic: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (JArguello-WMF)
[14:15:41] I think that we want to ping urandom about the aqs2* servers, he has been working on expansion of the aqs cluster from eqiad-only to multi-dc.
[14:16:35] Data-Engineering-Planning, Epic, Event-Platform Value Stream (Sprint 09): Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (JArguello-WMF)
[14:18:03] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (JArguello-WMF)
[14:19:01] Data-Engineering-Planning, Epic, Event-Platform Value Stream (Sprint 09): Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (JArguello-WMF)
[14:21:27] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 09), Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (JArguello-WMF)
[14:26:39] Data-Engineering-Planning, Event-Platform Value Stream, Epic: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (JArguello-WMF)
[14:33:15] hi all is there anyone around i can chat to about aqs?
[14:33:56] if it's for puppet I've already mentioned it above ;)
[14:35:27] ahh ok; nfraison the error is caused because there is no codfw site listed in the service::catalog for aqs
[14:36:23] https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/service.yaml#L152
[14:38:20] jbond: thanks I will take some time first to look at the puppet code impacted
[14:38:23] RECOVERY - Hadoop NodeManager on an-worker1132 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:38:51] the problem is that I think the cluster is being set up and so might not yet be ready for being added to service.yaml
[14:40:25] nfraison: sure thing. i have created the following https://gerrit.wikimedia.org/r/c/operations/puppet/+/893758 which should at least cause puppet to error, which i think is more desirable behaviour
[14:40:48] * jbond ^^ needs testing and review
[14:41:07] Data-Engineering, Data Pipelines (Sprint 09): Differential privacy airflow-dags merge request - https://phabricator.wikimedia.org/T330234 (JArguello-WMF) @Htriedman is there anything else you need from us for this ticket?
[14:41:31] jbond: volans: I have pinged urandom to talk about it. I believe that he has been working on the expansion of aqs to codfw so I don't want to make any changes without consulting him.
[14:41:45] ack, thx
[14:41:57] ack sgtm thanks
[14:43:32] It looks to me like it's ready to go multi-dc though, but I really haven't been involved for a while.
[14:59:54] jbond I would say that the issue is more due to missing codfw ip entries here: https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/service.yaml#L132, from my understanding of https://github.com/wikimedia/operations-puppet/blob/production/modules/wmflib/functions/service/get_ips_for_services.pp. But I can be wrong
[15:01:23] not only there, if you grep for aqs1010 or similar in puppet there are other places where eqiad hosts are stated
[15:01:37] in the aqs role we add profile::lvs::realserver, which doesn't work without proper LVS config
[15:02:18] nfraison: you will likely need both
[15:05:46] nfraison: profile::lvs::realserver parses the service catalog using get_ips_for_services. correcting the service catalog with updated ips should help. that said i'm not familiar with aqs so take what i say with a pinch of salt
[15:08:51] thks elukey, jbond for the pointers, let's wait first for feedback from urandom
[15:08:52] OK, I've checked with u.random and he's happy for us in DE to fix this any time and bring aqs in codfw into service. Cassandra is ready for use. The aqs nodejs application is deployed to these hosts, so the LVS config should be the last thing.
[15:10:41] I'd like to double-check with oljad that there was nothing else that prevented this from going ahead.
[15:12:37] As an aside, I'm intrigued as to where the 10.0.5.3 comes from. It must be written by confd right? But I haven't found a reference to it yet.
[15:13:58] btullis: that comes from
[15:13:59] /usr/sbin/dpkg-reconfigure -p critical -f noninteractive wikimedia-lvs-realserver
[15:15:02] i have not looked at the debian package but i'm guessing it adds that ip address as a default to LVS_SERVICE_IPS=''
[15:15:19] so puppet comes along and sets LVS_SERVICE_IPS=''.
then notifies
[15:15:24] /usr/sbin/dpkg-reconfigure -p critical -f noninteractive wikimedia-lvs-realserver
[15:15:33] which adds the 10.0.5.3 address back
[15:16:41] Ah, thanks. Makes sense. I'd assumed it was confd because it was written again immediately after the puppet run (but hadn't really looked at the code enough).
[15:23:50] nfraison: https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service should have some good info about the process (not exactly what you folks are doing but it is a good baseline)
[15:24:38] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jclark-ctr) an-worker1149 A7 U1 port CableId an-worker1150 B7 U38 port CableId an-worker1151 C7...
[15:24:50] (PS34) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[15:25:10] thks I'm indeed currently reading the doc https://wikitech.wikimedia.org/wiki/LVS and have seen that there is an aqs_7232 in eqiad lvs but not one in codfw, which is probably linked to the doc you sent :)
[15:26:16] (CR) Aqu: [C: +2] "Comments improved. Thanks all!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[15:30:11] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (Ottomata) FYI, alerts repo: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master
[15:36:32] FYI I'm also checking with the API platform team. I know that we need to fix the puppet errors regardless, even if the eventual decision might be to leave aqs depooled in codfw.
[15:38:16] I'd just like to make sure that any changes we make are expected. Certain aqs endpoints will be slower from codfw because they access druid, which is only in eqiad. Cassandra is fully multi-dc now, but Druid isn't.
[15:52:48] Data-Engineering, SRE, ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (Cmjohnson) a: Cmjohnson Submitted a ticket with Dell for a new HDD. Create Dispatch: Success You have successfully submitted request SR163405094.
[15:53:51] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate slownesses on an-worker1132 - https://phabricator.wikimedia.org/T330979 (nfraison) https://phabricator.wikimedia.org/T330971
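(Editorial sketch of the mechanism discussed above, in Python. This is only an illustrative toy model with invented catalog entries and a placeholder IP, not the real wmflib::service::get_ips_for_services Puppet function or the actual service.yaml contents; it just shows why a missing codfw entry for aqs yields an empty LVS_SERVICE_IPS that dpkg-reconfigure then rewrites, producing a reported change on every puppet run.)

# Toy model only: catalog data and IP below are assumptions for illustration.
from typing import Dict, List

TOY_CATALOG: Dict[str, Dict[str, List[str]]] = {
    "aqs": {
        "eqiad": ["10.2.2.12"],  # placeholder VIP, not a real value
        # no "codfw" key -> empty result for codfw hosts below
    },
}

def ips_for(service: str, site: str) -> List[str]:
    """Return the IPs listed for a service at a site, or [] if none are configured."""
    return TOY_CATALOG.get(service, {}).get(site, [])

# An eqiad aqs host would render a populated LVS_SERVICE_IPS...
print('LVS_SERVICE_IPS="{}"'.format(" ".join(ips_for("aqs", "eqiad"))))
# ...while a codfw host renders an empty one, which dpkg-reconfigure then
# overwrites with its default, so the file flaps on every run.
print('LVS_SERVICE_IPS="{}"'.format(" ".join(ips_for("aqs", "codfw"))))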
[16:20:20] Data-Engineering, Event-Platform Value Stream, Machine-Learning-Team, Research: Proposal: Create a stream end point for Revision Risk Model - https://phabricator.wikimedia.org/T326179 (Ottomata)
[16:26:15] Data-Engineering, Event-Platform Value Stream, Machine-Learning-Team, Research: Proposal: Create a stream end point for Revision Risk Model - https://phabricator.wikimedia.org/T326179 (diego) > We should all sync up and work on some big standardized modeling design decisions and ideas. It would b...
[16:29:04] Data-Engineering, Event-Platform Value Stream, Machine-Learning-Team, Research: Proposal: Create a stream end point for Revision Risk Model - https://phabricator.wikimedia.org/T326179 (Ottomata) +1
[17:21:37] (CR) Joal: [C: +2] "Awesome work @aqu :) Thank you!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[17:22:26] (CR) Joal: [V: +2 C: +2] "Manually merging" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[18:50:10] Data-Engineering-Planning: Use RDD checkpointing in Mediawiki-History spark job - https://phabricator.wikimedia.org/T331003 (lbowmaker)
[18:50:26] Data-Engineering-Planning, Data Pipelines: Use RDD checkpointing in Mediawiki-History spark job - https://phabricator.wikimedia.org/T331003 (lbowmaker)
[19:13:38] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jclark-ctr)
[19:29:57] Data-Engineering, Data Pipelines (Sprint 09): Differential privacy airflow-dags merge request - https://phabricator.wikimedia.org/T330234 (Htriedman) @JArguello-WMF nope! I chatted with @Milimetric a couple of days ago and he said that we're good to go (as an initial MVP release, at least). Waiting on hi...
[20:13:44] ACKNOWLEDGEMENT - MegaRAID on an-worker1132 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T331059 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:24:19] (PS1) Jennifer Ebe: T329854-Airflow]-Migrate-mediacounts-archive-Oozie-job [analytics/refinery] - https://gerrit.wikimedia.org/r/893816
[21:16:48] ACKNOWLEDGEMENT - MegaRAID on an-worker1132 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T331064 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:30:24] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:00:12] ACKNOWLEDGEMENT - MegaRAID on an-worker1132 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T331068 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:31:44] ACKNOWLEDGEMENT - MegaRAID on an-worker1132 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T331073 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring