[01:19:09] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:54:36] PROBLEM - Disk space on aqs1004 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra-a 108310 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aqs1004&var-datasource=eqiad+prometheus/ops
[04:11:36] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:46:08] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:30:57] Data-Engineering: Check home/HDFS leftovers of aniketars - https://phabricator.wikimedia.org/T312514 (Miriam) Thanks @mforns ! Would it be possible to move these files to my home directories? Thanks! Miriam
[08:55:54] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:00:46] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:17:37] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[11:19:29] RECOVERY - Kafka broker TLS certificate validity on kafka-jumbo1002 is OK: SSL OK - Certificate kafka_jumbo-eqiad_broker valid until 2022-12-04 14:47:46 +0000 (expires in 131 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[11:51:21] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:02:16] Data-Engineering, Event-Platform, SRE, serviceops: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (akosiaris) @ottomata, has there been any progress on this one? Anything (e.g. reviews) we can help with?
[12:02:34] Data-Engineering-Kanban, Data Engineering Planning, Event-Platform, SRE, and 2 others: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (akosiaris) @ottomata, has there been any progress on this one? Anything (e.g. reviews) we can help with?
[12:13:40] (Abandoned) Amire80: Update deletion stat scripts for new tags and date format [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/481245 (owner: Amire80)
[12:25:35] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:19:22] hey aqu! :] Do you know something about the MegaRAID errors? I'm about to ask for help to the operations channel.
[14:23:19] oh, I found T312626, B-en was working on this. Looking
[14:23:20] T312626: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626
[14:26:10] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:31:04] RECOVERY - Check unit status of analytics-dumps-fetch-unique_devices on clouddumps1001 is OK: OK: Status of the systemd unit analytics-dumps-fetch-unique_devices https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:45:28] Hola mforns, I know nothing except that he was working on it, and that those errors were already triggered 2 weeks ago without much worries from him.
[14:46:11] thanks aqu! :] Yes, let's wait, it doesn't seem too critical. I commented on the email.
[14:48:15] Data-Engineering-Kanban, Data Engineering Planning, Event-Platform, SRE, and 2 others: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (JArguello-WMF) Hi @akosiaris, Andrew is Out of office and will be back on Fri, Jul 29. :)
[14:50:57] Data-Engineering-Kanban, Data Engineering Planning, Event-Platform, SRE, and 2 others: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (akosiaris) >>! In T303543#8104853, @JArguello-WMF wrote: > Hi @akosiaris, Andrew is Out of office and will be back on Fri...
[14:54:22] Data-Engineering, Airflow: analytics-platform-eng admins should be able to restart airflow platform-eng systemctl services - https://phabricator.wikimedia.org/T313727 (JArguello-WMF)
[14:55:32] Data-Engineering, Airflow: analytics-research keytab on stat machines - https://phabricator.wikimedia.org/T313345 (JArguello-WMF)
[14:56:54] Data-Engineering: Add nokafor to receive analytics-alerts emails and have sudo -u hdfs rights in hdfs - https://phabricator.wikimedia.org/T313816 (mforns)
[14:58:03] Data-Engineering: Add nokafor to receive analytics-alerts emails and have sudo -u hdfs rights in hdfs - https://phabricator.wikimedia.org/T313816 (mforns)
[14:59:37] Data-Engineering: Add nokafor to receive analytics-alerts emails and have sudo -u hdfs rights in hdfs - https://phabricator.wikimedia.org/T313816 (mforns) When trying to execute `sudo -u hdfs hdfs dfs -ls` in `stat1008.eqiad.wmnet` she gets the error: ` ls: Failed on local exception: java.io.IOException: jav...
[15:00:40] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:12:08] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:53:50] Data-Engineering: Add nokafor to receive analytics-alerts emails and have sudo -u hdfs rights in hdfs - https://phabricator.wikimedia.org/T313816 (elukey) Added to the analytics-alerts@ mailing list :) The hdfs command doesn't work on the stat100x hosts, they don't have the HDFS keytab available. Try the fo...
[15:54:24] mforns: --^
[15:54:38] Data-Engineering: Add nokafor to receive analytics-alerts emails and have sudo -u hdfs rights in hdfs - https://phabricator.wikimedia.org/T313816 (elukey) Open→Resolved
[15:56:20] Data-Engineering: Add nokafor to receive analytics-alerts emails and have sudo -u hdfs rights in hdfs - https://phabricator.wikimedia.org/T313816 (mforns) Arfff, of course, sorry for that. Thanks for adding nokafor to the team's alerts!
[16:09:38] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:20:03] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:53:05] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:15:53] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:34:30] (PS4) NOkafor: Minor trailing space and back slash adjustments Cassandra Loading HQL files [Draft] Bug: T311507 [analytics/refinery] - https://gerrit.wikimedia.org/r/812095 (https://phabricator.wikimedia.org/T311507)
[17:50:11] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:01:35] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:04:39] (PS5) NOkafor: Minor trailing space and back slash adjustments Cassandra Loading HQL files [Draft] Bug: T311507 [analytics/refinery] - https://gerrit.wikimedia.org/r/812095 (https://phabricator.wikimedia.org/T311507)
[18:11:29] (PS1) NOkafor: Added meta.wikidata to the pageview allow-list [analytics/refinery] - https://gerrit.wikimedia.org/r/817323 (https://phabricator.wikimedia.org/T313834)
[18:47:19] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:51:37] (CR) Mforns: Added meta.wikidata to the pageview allow-list (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/817323 (https://phabricator.wikimedia.org/T313834) (owner: NOkafor)
[19:10:09] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:40:39] (PS1) NOkafor: Added meta.wikidata to the pageview allow-list Bug: T313834 [analytics/refinery] - https://gerrit.wikimedia.org/r/817366 (https://phabricator.wikimedia.org/T313834)
[19:42:10] (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/817366 (https://phabricator.wikimedia.org/T313834) (owner: NOkafor)
[19:44:27] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:59:05] Data-Engineering: Document destination_event_service Event Platform stream configuration - https://phabricator.wikimedia.org/T313859 (odimitrijevic)
[21:30:25] Data-Engineering, API Platform: Establish testing procedure for Druid-based endpoints - https://phabricator.wikimedia.org/T311190 (BPirkle) I have a question that may be better discussed somewhere else, But thought I'd start here as it is at least somewhat related. I have the Docker+Druid env working we...
[21:48:31] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:44:28] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:54:26] (PS2) NOkafor: Added meta.wikidata to the pageview allow-list [analytics/refinery] - https://gerrit.wikimedia.org/r/817323 (https://phabricator.wikimedia.org/T313834)
[23:15:20] (CR) NOkafor: Added meta.wikidata to the pageview allow-list (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/817323 (https://phabricator.wikimedia.org/T313834) (owner: NOkafor)
[23:34:32] Data-Engineering, Cassandra: Overly zealous aqs & aqsloader grants - https://phabricator.wikimedia.org/T313877 (Eevans)
[23:56:28] (PS6) NOkafor: Add cassandra Loading HQL loading queries for airflow [Draft] Bug: T311507 [analytics/refinery] - https://gerrit.wikimedia.org/r/812095 (https://phabricator.wikimedia.org/T311507)