[03:00:10] PROBLEM - MegaRAID on an-worker1095 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:21:16] RECOVERY - MegaRAID on an-worker1095 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:24:28] PROBLEM - MegaRAID on an-worker1095 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:45:33] RECOVERY - MegaRAID on an-worker1095 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:17:10] PROBLEM - MegaRAID on an-worker1095 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:30:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[05:38:16] RECOVERY - MegaRAID on an-worker1095 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:09:52] PROBLEM - MegaRAID on an-worker1095 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:20:24] RECOVERY - MegaRAID on an-worker1095 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:10:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:23:40] PROBLEM - MegaRAID on an-worker1095 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:34:10] RECOVERY - MegaRAID on an-worker1095 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:05:18] PROBLEM - MegaRAID on an-worker1095 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:15:32] RECOVERY - MegaRAID on an-worker1095 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:24:12] 10Data-Engineering: Check home/HDFS leftovers of tmtl.io contractors - https://phabricator.wikimedia.org/T340942 (10MoritzMuehlenhoff)
[08:42:35] 10Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (10BTullis)
[08:42:48] 10Data-Engineering: Check home/HDFS leftovers of aranyap - https://phabricator.wikimedia.org/T340945 (10MoritzMuehlenhoff)
[08:46:58] PROBLEM - MegaRAID on an-worker1095 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:08:00] RECOVERY - MegaRAID on an-worker1095 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:16:37] 10Data-Platform-SRE: Replace RAID controller battery on an-worker1095 - https://phabricator.wikimedia.org/T340946 (10BTullis)
[09:17:09] 10Data-Platform-SRE: Replace RAID controller battery on an-worker1095 - https://phabricator.wikimedia.org/T340946 (10BTullis)
[09:19:08] 10Data-Platform-SRE, 10ops-eqiad: Replace RAID controller battery on an-worker1095 - https://phabricator.wikimedia.org/T340946 (10BTullis) p:05Triage→03Medium a:03Jclark-ctr Hi @Jclark-ctr - We've had another RAID controller fail from the same batch of servers again. Would you be able to replace it pleas...
[09:19:51] ACKNOWLEDGEMENT - MegaRAID on an-worker1095 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis T340946 - Requested replacement https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:32:17] 10Data-Engineering: Check home/HDFS leftovers of appledora - https://phabricator.wikimedia.org/T340948 (10MoritzMuehlenhoff)
[09:54:40] Good morning btullis, stevemunene - I need an SRE hand on deprecating/updating WMDE cron jobs on our client machines
[09:55:08] Morning joal, I'm available. How can I help?
[09:55:18] btullis: let's batcave :)
[09:59:47] Hyper-efficient meeting 👍 :)
[10:23:37] (03PS1) 10Btullis: Fix the kafka-setup container for datahub [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935011 (https://phabricator.wikimedia.org/T329514)
[10:39:55] btullis: Heya - Is now a good time for our second round?
[10:40:13] joal: Yes, let's do it!
[10:40:20] to the cave!
[10:55:36] (03CR) 10Btullis: [C: 03+2] Fix the kafka-setup container for datahub [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935011 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[11:13:16] (03CR) 10Btullis: [C: 03+2] "recheck" [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935011 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[11:14:12] (03CR) 10CI reject: [V: 04-1] Fix the kafka-setup container for datahub [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935011 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[11:34:54] (03CR) 10Btullis: [C: 03+2] "recheck" [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935011 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[11:37:22] 10Data-Platform-SRE, 10cloud-services-team: Review and fix any bugs found in the automated bootstrap process for a ceph mon/mgr server - https://phabricator.wikimedia.org/T332987 (10BTullis) p:05Triage→03Low
[11:44:08] 10Data-Platform-SRE: Cleanup HDFS folders for departed users - https://phabricator.wikimedia.org/T332321 (10BTullis) 05Open→03Resolved a:03BTullis I'm resolving this ticket, since we have a specific ticket for each user. There was a significant backlog of home directories to remove, but that's not so bad now.
[11:45:46] 10Data-Platform-SRE: Deploy timeline server - https://phabricator.wikimedia.org/T331133 (10BTullis) 05Open→03Declined This is nice in principle, but it's not a priority. Feel free to reopen if you feel this is incorrect.
[11:46:31] 10Data-Platform-SRE: Make YARN web interface work with both primary and standby resourcemanager - https://phabricator.wikimedia.org/T331448 (10BTullis) p:05Triage→03Low
[11:49:11] (03CR) 10CI reject: [V: 04-1] Fix the kafka-setup container for datahub [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935011 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[11:49:53] 10Data-Platform-SRE, 10Patch-For-Review: Deploy spark history - https://phabricator.wikimedia.org/T330176 (10BTullis) p:05Triage→03Low This would be a nice feature and there is already a patch for some parts of the implementation, but it's not a high-priority project at the moment.
[11:51:32] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines: Add support for Iceberg to the Spark Docker Image - https://phabricator.wikimedia.org/T336012 (10BTullis) p:05Triage→03Low This is low priority for now, but will become more important when we revisit the project to run spa...
[11:52:02] 10Data-Platform-SRE: Check log rotation settings on airflow instances - https://phabricator.wikimedia.org/T339015 (10BTullis) p:05Triage→03Medium
[11:53:35] (03CR) 10Btullis: [C: 03+2] "recheck" [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935011 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[11:54:24] 10Data-Platform-SRE: [dse-k8s] Provide common spark config for spark jobs - https://phabricator.wikimedia.org/T332913 (10BTullis) p:05Triage→03Low
[12:00:11] 10Data-Platform-SRE: [dse-k8s] Provide common hive config for spark jobs - https://phabricator.wikimedia.org/T332912 (10BTullis) p:05Triage→03Low
[12:00:35] 10Data-Platform-SRE, 10Foundational Technology Requests, 10Epic: POC for Running Spark on DSE - https://phabricator.wikimedia.org/T318712 (10BTullis)
[12:01:08] 10Data-Platform-SRE: [dse-k8s] Provide common spark config for spark jobs - https://phabricator.wikimedia.org/T332913 (10BTullis)
[12:01:10] 10Data-Platform-SRE, 10Foundational Technology Requests, 10Epic: POC for Running Spark on DSE - https://phabricator.wikimedia.org/T318712 (10BTullis)
[12:02:24] 10Data-Platform-SRE, 10Patch-For-Review: [dse-k8s] Provide common hadoop config for spark jobs - https://phabricator.wikimedia.org/T332909 (10BTullis) p:05Triage→03Low
[12:02:41] 10Data-Platform-SRE, 10Patch-For-Review: [dse-k8s] Provide common hadoop config for spark jobs - https://phabricator.wikimedia.org/T332909 (10BTullis)
[12:02:43] 10Data-Platform-SRE, 10Foundational Technology Requests, 10Epic: POC for Running Spark on DSE - https://phabricator.wikimedia.org/T318712 (10BTullis)
[12:03:56] 10Data-Platform-SRE, 10Patch-For-Review: [dse-k8s] Spark-deploy needs to create secret object in spark namespace - https://phabricator.wikimedia.org/T332908 (10BTullis)
[12:03:58] 10Data-Platform-SRE, 10Foundational Technology Requests, 10Epic: POC for Running Spark on DSE - https://phabricator.wikimedia.org/T318712 (10BTullis)
[12:04:02] 10Data-Platform-SRE, 10Patch-For-Review: [dse-k8s] Spark-deploy needs to create secret object in spark namespace - https://phabricator.wikimedia.org/T332908 (10BTullis) p:05Triage→03Low
[12:04:44] 10Data-Platform-SRE: [dse-k8s] Deploy spark cli to submit jobs on DSE K8S cluster with K8S config - https://phabricator.wikimedia.org/T331971 (10BTullis) p:05Triage→03Low
[12:05:31] 10Data-Platform-SRE, 10cloud-services-team: ceph: introduce puppet logic to purge stale keyfiles - https://phabricator.wikimedia.org/T328010 (10BTullis) p:05Triage→03Low
[12:06:13] 10Data-Platform-SRE: Getting the Metrics API (K8) functioning to support Auto Scaling - https://phabricator.wikimedia.org/T318925 (10BTullis)
[12:06:48] 10Data-Engineering: Check home/HDFS leftovers of jminor - https://phabricator.wikimedia.org/T340978 (10MoritzMuehlenhoff)
[12:08:11] 10Data-Platform-SRE: [dse-k8s] Deploy spark cli to submit jobs on DSE K8S cluster with K8S config - https://phabricator.wikimedia.org/T331971 (10BTullis)
[12:08:14] 10Data-Platform-SRE, 10Foundational Technology Requests, 10Epic: POC for Running Spark on DSE - https://phabricator.wikimedia.org/T318712 (10BTullis)
[12:08:52] 10Data-Platform-SRE: Add the sparkctl binary to the stat boxes - https://phabricator.wikimedia.org/T318923 (10BTullis)
[12:08:59] 10Data-Platform-SRE: [dse-k8s] Deploy spark cli to submit jobs on DSE K8S cluster with K8S config - https://phabricator.wikimedia.org/T331971 (10BTullis)
[12:11:41] (03Merged) 10jenkins-bot: Fix the kafka-setup container for datahub [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935011 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[12:12:28] 10Data-Platform-SRE: Move archiva to private IPs + CDN - https://phabricator.wikimedia.org/T317182 (10BTullis) p:05Triage→03High Setting to high priority, as it will help not only with the bullseye upgrade, but also T323210
[12:52:35] !log restarting the aqs service to pick up mediawiki history snapshot for June
[12:52:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:08:09] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.0 - https://phabricator.wikimedia.org/T329514 (10BTullis) Well, the kafka-setup job is now doing much better than it was, but it's still producing errors. Full paste in P49498 It seems that the script may have an issue with the version of `kafka-top...
[14:12:01] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.0 - https://phabricator.wikimedia.org/T329514 (10BTullis) I've confirmed that upgrading the version of kafka fixes this incompatibility. Our current version of kafka comes from http://packages.confluent.io/deb/4.0/pool/main/c/confluent-kafka/ w...
[14:30:29] 10Analytics-Radar, 10Data-Engineering-Icebox, 10observability, 10Puppet, 10User-Elukey: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948 (10joanna_borun)
[16:10:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[20:10:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[20:35:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
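Editor's note on the flapping MegaRAID alert above: the check (documented at the wikitech MegaCli#Monitoring link) fires when any logical drive's write cache falls back from WriteBack to WriteThrough, which is the pattern logged here and which led to the RAID controller battery ticket T340946. Below is a minimal, hypothetical sketch of that counting logic. It is not the actual Wikimedia check: the `Current Cache Policy` field name and the simplified output layout are assumptions standing in for real `megacli -LDInfo -LAll -aAll` output.

```python
import re

def check_write_cache(ldinfo_output: str, expected: str = "WriteBack") -> str:
    """Count logical drives whose write cache policy is not `expected`.

    Assumes one 'Current Cache Policy: <policy>' line per logical drive,
    a simplification of real MegaCLI output.
    """
    policies = re.findall(r"Current Cache Policy:\s*(\w+)", ldinfo_output)
    bad = [p for p in policies if p != expected]
    if bad:
        # Mirrors the CRITICAL message format seen in the log above.
        return (f"CRITICAL: {len(bad)} LD(s) must have write cache policy "
                f"{expected}, currently using: {', '.join(bad)}")
    return f"OK: {len(policies)} logical, {expected} policy"

# Fabricated sample: 13 logical drives all degraded to WriteThrough,
# as during a battery-backup-unit fault.
sample = "\n".join(
    f"Virtual Drive: {i}\nCurrent Cache Policy: WriteThrough" for i in range(13)
)
print(check_write_cache(sample))
```

Running this against the fabricated sample reproduces a CRITICAL line in the same shape as the alerts above; with all drives reporting WriteBack it returns the OK form instead.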