[01:10:44] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:44:32] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:48:06] 10Data-Engineering, 10Event Metrics, 10GrowthExperiments-CommunityConfiguration, 10MediaWiki-extensions-EventLogging, and 2 others: editgrowthconfig schema: '' should NOT have additional properties, - https://phabricator.wikimedia.org/T314173 (10Etonkovidova) [01:49:08] 10Data-Engineering, 10Event Metrics, 10GrowthExperiments-CommunityConfiguration, 10MediaWiki-extensions-EventLogging, and 2 others: editgrowthconfig schema: '' should NOT have additional properties, - https://phabricator.wikimedia.org/T314173 (10Etonkovidova) 05Open→03Resolved [02:34:26] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:47:21] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:55:55] (03CR) 10Ladsgroup: "In the fear of pretending that I know what I'm saying (narrator: he doesn't), are you planning to sqoop linktarget table too? 
Without it t" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/821312 (https://phabricator.wikimedia.org/T314666) (owner: 10Milimetric) [05:21:17] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:43:57] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:51:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [06:16:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [06:16:57] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [06:21:42] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [06:26:56] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [06:31:42] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [07:13:20] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:56:03] 10Data-Engineering-Kanban, 10Data Engineering Planning (Sprint 02): Create conda-base-env with last pyspark - https://phabricator.wikimedia.org/T309227 (10EChetty) [08:33:50] 10Data-Engineering: RAID battery alert in an-worker1082 - https://phabricator.wikimedia.org/T311991 (10BTullis) 05Open→03Resolved [08:39:28] ACKNOWLEDGEMENT - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis T314838 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:43:30] (03PS9) 10RhinosF1: mypy: add to tox [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/821244 [08:46:45] (03CR) 10CI reject: [V: 04-1] mypy: add to tox [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/821244 (owner: 10RhinosF1) [08:49:06] (03PS10) 10RhinosF1: mypy: add to tox [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/821244 [08:52:11] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:53:19] (03CR) 10CI reject: [V: 04-1] mypy: add to tox [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/821244 (owner: 10RhinosF1) [08:54:42] btullis: Hi, do you have a minute to explain to me what went wrong with the deployment of the spark3 conf & package on the test cluster? [08:55:39] aqu: Yes, shortly. Just deploying an etcd patch for a few minutes first. [09:00:36] aqu: Can you access this? https://puppetboard.wikimedia.org/report/an-launcher1002.eqiad.wmnet/65d2482c50c9d38d0f1f02bfc664fa4de80e014e [09:01:13] This is the puppet report for an-launcher1002 after I merged the spark3 change [09:02:07] No, "Service access denied due to missing privileges." [09:03:57] OK, I believe what happened was that the file `/etc/spark3/conf/spark-defaults.conf` was changed on an-launcher1002, which caused the launch failure for airflow jobs where spark3 is already used in production. [09:10:09] OK, yesterday I checked this: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36645/console [09:10:09] The only affected host was stat1007 with minor changes. 
[09:10:41] http://localhost:8600/log?dag_id=apis_metrics_to_graphite_hourly&task_id=generate_and_send_apis_metrics_to_graphite&execution_date=2022-08-08T18%3A00%3A00%2B00%3A00 [09:10:41] shows that "either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment" [09:11:26] Thanks, Ben. [09:12:48] Yes, if you look, I limited that pcc run to a specific 'role' of server: `./utils/pcc 813278 O:statistics::explorer` [09:13:18] So that's my fault for not checking for other unintended consequences. [09:16:31] It would have been more comprehensive to run the test against every host which has `profile::analytics::cluster::client` applied. [10:13:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [10:53:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [10:56:44] A new attempt at Spark 3 puppetization: https://gerrit.wikimedia.org/r/c/operations/puppet/+/821695 . And a big sorry for the previous crash. [11:48:51] aqu: Cool, thanks. I think I'd feel more comfortable waiting until it's not a US holiday before proceeding, but I'll happily review. 
[11:49:24] A good thing that you might like to check out is how to test your own changes with the puppet compiler: https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Catalog_compiler_local_run_(pcc_utility) [11:51:33] If you get your jenkins API you will then be able to run a compilation test against a number of hosts where it is applied like this: `./utils/pcc 813278 P:analytics::cluster::client` [11:51:51] s/Jenkins API token/ [11:58:54] (03Abandoned) 10Vivian Rook: ci test, do not merge [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/820170 (owner: 10Vivian Rook) [12:07:59] 10Analytics-Wikistats, 10Data Engineering Planning: "Pages to date" not loading with "daily" metric - https://phabricator.wikimedia.org/T312717 (10Aklapper) @EChetty: Please keep/add valid code project tags such as #Analytics-Wikistats which allow finding tasks related to code bases, not to end up in a big unm... [12:08:02] 10Analytics-Wikistats, 10Data Engineering Planning: WikiStats in Uzbek - https://phabricator.wikimedia.org/T314477 (10Aklapper) [12:32:43] (03PS1) 10Sergio Gimeno: [WIP] Instrument blocked account registration [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/821711 [12:33:07] (03CR) 10CI reject: [V: 04-1] [WIP] Instrument blocked account registration [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/821711 (owner: 10Sergio Gimeno) [12:33:38] (03PS2) 10Sergio Gimeno: [WIP] Instrument blocked account registration [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/821711 (https://phabricator.wikimedia.org/T306018) [12:34:20] (03CR) 10CI reject: [V: 04-1] api: return consistently [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/821344 (owner: 10RhinosF1) [12:34:24] (03CR) 10CI reject: [V: 04-1] [WIP] Instrument blocked account registration [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/821711 (https://phabricator.wikimedia.org/T306018) (owner: 10Sergio Gimeno) [12:35:13] (03PS3) 10RhinosF1: api: return 
consistently [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/821344 [12:37:27] (03PS4) 10RhinosF1: api: return consistently [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/821344 [12:46:00] (03CR) 10CI reject: [V: 04-1] api: return consistently [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/821344 (owner: 10RhinosF1) [12:47:29] (03PS5) 10RhinosF1: api: return consistently [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/821344 [12:48:22] (03PS6) 10RhinosF1: api: return consistently [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/821344 [12:51:34] (03CR) 10CI reject: [V: 04-1] api: return consistently [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/821344 (owner: 10RhinosF1) [12:54:06] (03PS7) 10RhinosF1: api: return consistently [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/821344 [13:22:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [13:27:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [13:30:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage 
[13:34:05] (03PS8) 10RhinosF1: api: return consistently [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/821344 [13:35:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [13:39:21] (03CR) 10RhinosF1: [C: 03+1] "locally tested" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/821344 (owner: 10RhinosF1) [13:42:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [13:57:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [14:30:00] Does anyone know how to disable logs in Jupyter Lab for PySpark? Two or three hours ago, I started seeing WARN logs and "Stage information" in Jupyter, but I don't recall changing anything. I have tried a lot of different things, but the logs are still displayed. Any idea how to turn them off? [16:10:29] (03CR) 10Vivian Rook: [C: 03+1] "Out today will see about getting this deployed tomorrow." 
[analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/821344 (owner: 10RhinosF1) [16:27:17] (03CR) 10RhinosF1: [C: 03+1] api: return consistently (031 comment) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/821344 (owner: 10RhinosF1) [16:43:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [16:45:15] aarora: I wish I did know more about this, but I'm afraid I don't know. Hopefully someone more knowledgeable than I am about Jupyter will be online tomorrow. [16:48:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [17:25:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [17:30:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [21:41:47] 10Data-Engineering-Kanban, 10Data Engineering Planning 
(Sprint 02): Update ua-parser library for traffic data - https://phabricator.wikimedia.org/T306829 (10Antoine_Quhen) The changes in the library: https://github.com/ua-parser/uap-core/compare/08745ed0da1ef0f5f6a06b8de87b543cfcdd9ab9..09e9ccca9fcfc4348ae9e89... [23:18:56] (03PS1) 10Nettrom: Update to reflect multiple possible rejection reasons [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/821801 (https://phabricator.wikimedia.org/T314899) [23:51:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage