[00:02:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[01:20:48] PROBLEM - MegaRAID on an-worker1080 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:21:44] Data-Engineering-Planning: analytics-platform-eng admins should be able to restart airflow platform-eng systemctl services - https://phabricator.wikimedia.org/T313727 (Ottomata) Open→Resolved
[01:23:01] Data-Engineering, Data Pipelines: Event Platform canary events job occasionally fails to retrieve stream config settings - https://phabricator.wikimedia.org/T326002 (Ottomata) Should be related to {T309717}. We merged a logging change to help debug, but were delayed on deploying it because of the Spark...
[01:24:13] Data-Engineering, Data Pipelines: Event Platform canary events job occasionally fails to retrieve stream config settings - https://phabricator.wikimedia.org/T326002 (Ottomata) But yeah, we should move this to airflow for sure. It's similar to but simpler than Refine -> Airflow.
[02:45:30] RECOVERY - MegaRAID on an-worker1080 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:19:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[05:03:06] PROBLEM - MegaRAID on an-worker1080 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:34:50] RECOVERY - MegaRAID on an-worker1080 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:04:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[06:06:34] PROBLEM - MegaRAID on an-worker1080 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:59:26] RECOVERY - MegaRAID on an-worker1080 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:31:06] PROBLEM - MegaRAID on an-worker1080 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:45:10] RECOVERY - MegaRAID on an-worker1080 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:16:56] PROBLEM - MegaRAID on an-worker1080 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:27:30] RECOVERY - MegaRAID on an-worker1080 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:30:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5032 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5032%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:35:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5032 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5032%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:06:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:20:17] !log restart hive-server2 and hive-metastore services on an-coord1002 prior to failover
[10:20:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:33:20] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:39:25] !log fail over hive services to an-coord1002 with change to the DNS CNAME for analytics-hive.eqiad.wmnet
[10:39:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:45:58] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:01:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[11:01:43] Data-Engineering-Radar, MW-on-K8s, serviceops, Patch-For-Review: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (Clement_Goubert) GeoIP data copied to all mw-on-k8s kubernetes hosts.
[11:05:19] Data-Engineering, SRE, ops-eqiad: Check BBU on an-worker1084 - https://phabricator.wikimedia.org/T325984 (BTullis) I've downtimed the host, shut it down, and created a hardware ticket for @Cmjohnson to replace the RAID controller battery: T326127
[11:06:13] Data-Engineering, SRE, ops-eqiad: Check BBU on an-worker1084 - https://phabricator.wikimedia.org/T325984 (BTullis) a:BTullis
[11:06:52] Data-Engineering, SRE, ops-eqiad: Check BBU on an-worker1080 and an-worker1084 - https://phabricator.wikimedia.org/T325984 (BTullis)
[11:08:14] !log restarted hive-server2 and hive-metastore services on an-coord1001 after failover to standby server
[11:08:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:09:35] Data-Engineering, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (BTullis) Given that the physical disk IDs are all sequential and well ordered according to the iDRAC card, the results from `lsscsi` in the Deb...
[12:36:01] (PS2) Thiemo Kreuz (WMDE): Limit HTTP status code to 100…599 [schemas/event/secondary] - https://gerrit.wikimedia.org/r/836736
[12:37:49] (PS2) Thiemo Kreuz (WMDE): Limit HTTP status code to 100…599 [schemas/event/primary] - https://gerrit.wikimedia.org/r/836735
[12:39:11] (PS1) Gerrit maintenance bot: Add gor.wiktionary to pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/874831 (https://phabricator.wikimedia.org/T326139)
[12:49:47] (CR) CI reject: [V: -1] Limit HTTP status code to 100…599 [schemas/event/primary] - https://gerrit.wikimedia.org/r/836735 (owner: Thiemo Kreuz (WMDE))
[12:49:55] (CR) CI reject: [V: -1] Limit HTTP status code to 100…599 [schemas/event/secondary] - https://gerrit.wikimedia.org/r/836736 (owner: Thiemo Kreuz (WMDE))
[13:09:39] Data-Engineering, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (BTullis) Here is the dmesg output from the first two devices detected: ` [ 69.924961] mpt3sas_cm0: port enable: SUCCESS [ 69.925911] scsi 0...
[13:54:32] Analytics-Kanban, Data-Engineering, Product-Analytics, Wmfdata-Python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (LSobanski) I don't believe there are any specific actions for #SRE here, untagging.
[14:34:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:45:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:07] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet wi...
[15:36:44] Data-Engineering: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (mpopov)
[15:37:25] Data-Engineering: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (mpopov)
[15:37:35] Data-Engineering, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye
[15:44:39] Data-Engineering, CheckUser, MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 4 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (Milimetric) >>! In T233004#8491382, @Zabe wrote: >>>! In T233004#7796847, @Mil...
[15:58:08] Data-Engineering: Check home/HDFS leftovers of akhatun - https://phabricator.wikimedia.org/T326157 (MoritzMuehlenhoff)
[16:03:17] Is anyone available to quickly hook up a new person on my team with a Kerberos identity? T325857
[16:03:18] T325857: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857
[16:04:22] Data-Engineering, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (BTullis) It turns out that SR-IOV was disabled in BIOS. I tried enabling it, but it didn't make any difference, so I reverted to having it disa...
[16:15:36] bearloga: Yes, I can have a look now.
[16:16:10] btullis: Thank you, Ben!
[16:16:20] Data-Engineering: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (BTullis) a:BTullis
[16:22:48] Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: AQS 2.0:Wikistats 2 service - https://phabricator.wikimedia.org/T288301 (Milimetric) +1 on T288301#8487410, @BPirkle
[16:26:40] Data-Engineering: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (BTullis) I believe that the approvals given in the linked ticket {T325004} are sufficient to permit me to create the Kerberos principal, so I'll go ahead and do it now. ` btullis@krb1001:~$ sudo manage_pr...
[16:29:11] (PS1) Ottomata: [POC] less-nested mediawiki/page/change schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/874900 (https://phabricator.wikimedia.org/T308017)
[16:29:50] (CR) CI reject: [V: -1] [POC] less-nested mediawiki/page/change schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/874900 (https://phabricator.wikimedia.org/T308017) (owner: Ottomata)
[16:30:28] Data-Engineering, Event-Platform Value Stream, Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata) Did some work today seeing what the schema would look like flattened as discussed in [[ https://phabricat...
[16:56:57] mforns: come back to talk mw history?
[16:57:07] yes!
[19:13:47] Data-Engineering, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye execute...
[21:54:58] (CR) Mforns: "Left a comment, I think this might not work in some cases.." [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614) (owner: Xcollazo)