[00:08:09] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:13:59] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:20:09] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:22:05] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:25:23] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:31:39] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:19] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:53:43] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:12:55] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:24:55] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following 
units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:44:07] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:56:09] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:53] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:15:17] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:27:15] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:46:27] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:58:25] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:04:51] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:17:35] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:29:35] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:48:05] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:02:29] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:21:37] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:33:37] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:51:11] (PS1) DLynch: New schema: editattemptsblocked [schemas/event/secondary] - https://gerrit.wikimedia.org/r/820908 (https://phabricator.wikimedia.org/T310390) [04:51:41] (CR) CI reject: [V: -1] New schema: editattemptsblocked [schemas/event/secondary] - https://gerrit.wikimedia.org/r/820908 (https://phabricator.wikimedia.org/T310390) (owner: DLynch) [04:52:47] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:53:23] (CR) DLynch: "It occurred to me that `platform` and `interface` are redundant, and I have omitted `platform` as a result."
[schemas/event/secondary] - https://gerrit.wikimedia.org/r/820908 (https://phabricator.wikimedia.org/T310390) (owner: DLynch) [04:58:39] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:58:56] (PS2) DLynch: New schema: editattemptsblocked [schemas/event/secondary] - https://gerrit.wikimedia.org/r/820908 (https://phabricator.wikimedia.org/T310390) [05:04:49] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:23:59] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:35:57] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:44:13] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:55:07] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:07:09] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:26:23] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:38:25] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:41:15] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:57:31] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:09:29] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:57:29] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:00:57] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:09:25] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [08:28:39] 
RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:35:05] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:40:37] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:53:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [08:53:57] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [08:57:51] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:58:42] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - 
https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [09:02:11] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:11] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:33:21] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:40:28] btullis: o/ [09:40:50] I didn't get from the etcd code review if the ml_etcd srv records need to be updated to unblock the work or not [09:57:21] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:30] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:18:58] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:54] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:50] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:36] elukey: no, I didn't think you need to change the 
ml_serve records to unblock the dse-k8s-etcd work. [10:37:29] It's only if you ever wished to switch that cluster from cergen/puppetCA to cfssl/PKI then you would need to. [10:46:16] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:52:06] btullis: ahhh okok nice, I'll try to do it in the future :) [11:08:57] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:45] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:42] !log rebooting an-worker1102 due to kernel soft lockups [11:43:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:03:44] PROBLEM - Host an-worker1102 is DOWN: PING CRITICAL - Packet loss = 100% [12:25:36] RECOVERY - Host an-worker1102 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [12:26:22] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:51] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [12:49:49] ^^ investigating this corruptblock alert now [12:55:09] https://www.irccloud.com/pastebin/56qPrmIe/ [12:55:46] According to this: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks the alert may be a false positive. [12:55:50] hi, was about to look too (have meeting soon tho), [12:55:57] an-worker1102 flapped a lot over the weekend [12:56:01] over last night really [12:56:42] Yeah, I just rebooted an-worker1002 and it seems better. It was a soft CPU lockup I think. However, the corrupt blocks alert appeared just after an-worker1002 booted again. [12:58:26] 1102? [12:58:42] Yep, sorry fat fingered typo. [12:58:49] okay [12:58:50] coo [12:59:56] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:01:06] ^^ These megaraid battery failures are all really annoying. We've got a whole batch of hadoop work nodes where the RAID battery is starting to fail at roughly the same time. This is about the 4th one. [13:33:52] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:21:28] this is the battery? [14:23:04] Yes, the backup battery on the RAID controller card in each host. When the charge is too low it reduces the performance, from WriteBack to WriteThrough. [14:26:25] huh, what can we do? escalate to dcops?
can we get new batteries? [14:53:04] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:17:04] Analytics-Clusters, Data Engineering Planning, Voice & Tone: Rename geoeditors_blacklist_country - https://phabricator.wikimedia.org/T259804 (odimitrijevic) [15:24:30] Hi btullis, this is the patch to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/813278 plz. I scoped the modifications on the test cluster. So we can merge it, and I can test Spark3 on [15:25:00] aqu: Thanks. Will look at it now. [15:28:32] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:49:30] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:35:56] (CR) Vivian Rook: [C: +2] Escape '|' from wikitable output [analytics/quarry/web] - https://gerrit.wikimedia.org/r/816254 (https://phabricator.wikimedia.org/T308362) (owner: WelpThatWorked) [16:40:27] (Merged) jenkins-bot: Escape '|' from wikitable output [analytics/quarry/web] - https://gerrit.wikimedia.org/r/816254 (https://phabricator.wikimedia.org/T308362) (owner: WelpThatWorked) [16:41:16] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:42:49] Quarry, Patch-For-Review,
good first task: Escape special characters in results - https://phabricator.wikimedia.org/T308362 (rook) Open→Resolved [16:46:47] aqu: That's merged, but it failed to run on some hosts: [16:46:51] `Error: Failed to apply catalog: Parameter source failed on File[/etc/spark3/conf/spark-env.sh]: Cannot use relative URLs '#!/usr/bin/env bash` [16:48:35] Ah, looks like it's here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/813278/12/modules/profile/manifests/hadoop/spark3.pp#147 I will patch it now. [16:59:03] aqu: I've deployed the fix now, so you're free to test whether or not your changes work as expected. [17:00:56] I've noticed the patch. Thanks. Will check now. [17:10:21] (CR) Michael Große: "I ran this on stat1008 with the command in the comment and got plausible results:" [analytics/refinery] - https://gerrit.wikimedia.org/r/817837 (owner: Michael Große) [17:15:57] (PS1) Vivian Rook: Switch string and pipe [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821283 (https://phabricator.wikimedia.org/T308362) [17:45:20] (CR) RhinosF1: [C: -1] Switch string and pipe (1 comment) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821283 (https://phabricator.wikimedia.org/T308362) (owner: Vivian Rook) [17:57:43] (CR) Vivian Rook: Switch string and pipe (1 comment) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821283 (https://phabricator.wikimedia.org/T308362) (owner: Vivian Rook) [18:04:27] (CR) RhinosF1: [C: -1] Switch string and pipe (1 comment) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821283 (https://phabricator.wikimedia.org/T308362) (owner: Vivian Rook) [18:13:51] (PS2) Vivian Rook: Switch string and pipe [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821283 (https://phabricator.wikimedia.org/T308362) [18:14:15] (CR) Vivian Rook: Switch string and pipe (1 comment) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821283
(https://phabricator.wikimedia.org/T308362) (owner: Vivian Rook) [18:27:56] aqu: btullis i can't totally recall, but was that patch ready to merge? [18:28:09] i think there are still issues with the .deb package [18:28:09] https://phabricator.wikimedia.org/T309227#8079678 [18:31:16] (CR) RhinosF1: [C: +1] Switch string and pipe [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821283 (https://phabricator.wikimedia.org/T308362) (owner: Vivian Rook) [18:39:47] ottomata: I thought that it was ready to merge; that was what I took from Antoine's standup anyway. [18:41:41] i guess it's got a guard on it now, but i think the .deb package isn't quite working. so the puppet maybe is okay? [18:41:51] been a while though. [18:45:01] Yeah, apologies if I jumped the gun. I thought that this was just facilitating further testing on the test cluster. The .deb itself can be iterated separately. [18:51:16] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:05:25] (PS3) RhinosF1: mypy: add to tox [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821244 [19:09:20] (CR) CI reject: [V: -1] mypy: add to tox [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821244 (owner: RhinosF1) [19:11:42] Data-Engineering, Event Metrics, GrowthExperiments-CommunityConfiguration, MediaWiki-extensions-EventLogging, and 2 others: editgrowthconfig schema: '' should NOT have additional properties, - https://phabricator.wikimedia.org/T314173 (Ottomata) [19:11:47] Data-Engineering, Event Metrics, GrowthExperiments-CommunityConfiguration, MediaWiki-extensions-EventLogging, and 2 others: editgrowthconfig schema: '' should NOT have additional properties, - https://phabricator.wikimedia.org/T314173 (Ottomata) [19:12:01] Data-Engineering, Event Metrics, GrowthExperiments-CommunityConfiguration, MediaWiki-extensions-EventLogging, and
2 others: editgrowthconfig schema: '' should NOT have additional properties, - https://phabricator.wikimedia.org/T314173 (Ottomata) [19:19:09] (PS4) RhinosF1: mypy: add to tox [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821244 [19:23:21] (CR) CI reject: [V: -1] mypy: add to tox [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821244 (owner: RhinosF1) [19:25:10] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:27:00] (PS5) RhinosF1: mypy: add to tox [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821244 [19:54:18] (PS6) RhinosF1: mypy: add to tox [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821244 [19:54:36] (PS7) RhinosF1: mypy: add to tox [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821244 [19:58:14] (CR) CI reject: [V: -1] mypy: add to tox [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821244 (owner: RhinosF1) [19:59:07] (PS8) RhinosF1: mypy: add to tox [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821244 [19:59:33] (CR) RhinosF1: "will finish in morning. brain get sleepy" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821244 (owner: RhinosF1) [20:04:34] Quarry, cloud-services-team (Kanban): quarry-nfs-1 went down; quarry is offline - https://phabricator.wikimedia.org/T302154 (RhinosF1) Open→Resolved Not happened since. Closing per IRC.
[20:11:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [20:45:44] Analytics, Analytics-Wikistats, Data-Engineering: WikiStats in Uzbek - https://phabricator.wikimedia.org/T314477 (EChetty) [20:46:05] Analytics, Analytics-Wikistats, Data-Engineering: WikiStats in Uzbek - https://phabricator.wikimedia.org/T314477 (EChetty) [20:48:14] Analytics, Analytics-Wikistats, Data-Engineering: WikiStats in Uzbek - https://phabricator.wikimedia.org/T314477 (JArguello-WMF) [21:26:14] milimetric: thanks for the ping. [21:26:31] they look like they are re-running successfully now [21:26:34] nice [21:33:53] Oh dear, I'm so sorry for the mess I made. [21:35:29] np! btullis not your fault! it's been weeks since we looked at that. iirc antoine was off right before offsite too, and he and I have not synced up [22:50:49] Analytics-Wikistats, Data-Engineering: WikiStats in Uzbek - https://phabricator.wikimedia.org/T314477 (Aklapper) @EChetty: Please keep/add valid code project tags such as #Analytics-Wikistats which allow finding tasks related to code bases, not to end up in a big unmaintainable pile of only some-team-in-...
[23:25:22] (PS4) Ottomata: WIP - Add new mediawiki entity fragments, and use them in new mediawiki page change schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017) [23:25:58] (CR) CI reject: [V: -1] WIP - Add new mediawiki entity fragments, and use them in new mediawiki page change schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017) (owner: Ottomata) [23:26:19] (CR) Ottomata: "Update: latest patch uses entity subobjects for each entity in the page change schema. page, revision, actor, etc. See also example 2 he" [schemas/event/primary] - https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017) (owner: Ottomata)