[00:19:04] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:26] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:03:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[02:42:07] Data-Engineering, Event-Platform Value Stream, ci-test-error: Use a fake timer in EventBus unit test for PageChangeEventSerializerTest::testCreatePageChangeVisibilityEvent - https://phabricator.wikimedia.org/T325341 (Umherirrender)
[02:53:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[04:53:52] Analytics-Wikistats, Data-Engineering-Planning, Data Pipelines: [Wikistats] Add newly translated languages - https://phabricator.wikimedia.org/T311315 (Milimetric) Open→Resolved a:Milimetric
[07:14:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:19:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:26:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:31:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[08:23:09] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Migrate Search Airflow jobs to Airflow 2 and use shared supporting code from the data engineering Airflow - https://phabricator.wikimedia.org/T318414 (Gehel)
[08:34:58] No airflow failed task this morning \o/
[09:51:05] Data-Engineering, Equity-Landscape: Grants input metric - https://phabricator.wikimedia.org/T309276 (KCVelaga_WMF) a:JAnstee_WMF→ntsako
[09:51:17] Data-Engineering, Equity-Landscape: Affiliates input metric - https://phabricator.wikimedia.org/T309275 (KCVelaga_WMF) a:KCVelaga_WMF→ntsako
[12:46:22] (PS3) Matthias Mullie: Modify SearchPreview action to align with requirements [schemas/event/secondary] - https://gerrit.wikimedia.org/r/868187 (https://phabricator.wikimedia.org/T321069) (owner: Simone Cuomo)
[12:47:43] (PS4) Matthias Mullie: Modify SearchPreview action to align with requirements [schemas/event/secondary] - https://gerrit.wikimedia.org/r/868187 (https://phabricator.wikimedia.org/T321069) (owner: Simone Cuomo)
[13:48:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:59:11] Data-Engineering, Event-Platform Value Stream: Flink Restart Strategy for Enrichment Service - https://phabricator.wikimedia.org/T325359 (lbowmaker)
[14:01:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:03:14] Hi btullis - do we have any idea of which service led to too many logs on an-coord1001?
[14:06:45] it's only airflow on there, right?
[14:54:18] (PS1) Umherirrender: Using ${var} in strings is deprecated, use {$var} instead [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/868692
[15:36:03] !log deploying 'Fix subtle bug on image_suggestions when resolving varprop.' on platform_eng Airflow instance.
[15:36:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:00:05] (CR) Lucas Werkmeister (WMDE): [C: +2] Using ${var} in strings is deprecated, use {$var} instead [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/868692 (owner: Umherirrender)
[16:00:45] (Merged) jenkins-bot: Using ${var} in strings is deprecated, use {$var} instead [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/868692 (owner: Umherirrender)
[16:01:46] (PS1) Lucas Werkmeister (WMDE): Using ${var} in strings is deprecated, use {$var} instead [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/868529
[16:02:00] (CR) Lucas Werkmeister (WMDE): [V: +2 C: +2] Using ${var} in strings is deprecated, use {$var} instead [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/868529 (owner: Lucas Werkmeister (WMDE))
[16:02:25] (CR) Lucas Werkmeister (WMDE): [C: +2] "oh, this branch has gate-and-submit?" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/868529 (owner: Lucas Werkmeister (WMDE))
[16:02:33] (Merged) jenkins-bot: Using ${var} in strings is deprecated, use {$var} instead [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/868529 (owner: Lucas Werkmeister (WMDE))
[16:18:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:49:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:55:43] milimetric: Thank you again for all your help yesterday! I want to ask one more favor. I'm running a query that should be pretty innocuous? It runs for about ten minutes, then gives me an error:
[16:55:43] presto error: Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (getting task status https://an-presto1001.eqiad.wmnet:8281/v1/task/20221216_164731_00620_nsigb.1.0.11 - 35 failures, failure duration 300.42s, total failed request time 201.33s)
[16:56:01] milimetric: Here's the query
[16:56:43] https://www.irccloud.com/pastebin/u6qnl52E/
[18:10:59] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking): Sub-Saharan Africa Editors Superset Dashboard: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (odimitrijevic)
[18:31:39] Data-Engineering, Product-Analytics: Provide aggregated user device data per-country - https://phabricator.wikimedia.org/T325306 (Aklapper)
[18:35:01] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:37:38] Data-Engineering-Planning, Data Pipelines (Sprint 05-06): Build Druid Operator - https://phabricator.wikimedia.org/T309996 (mforns) a:mforns
[18:37:56] joal: milimetric: Sorry for missing the ping earlier. On leave today. The disk space issue on an-coord1001 was not related to a particular service. It's just a fairly small root partition. There's a bit of `/var/log` in there and in particular `/var/log/hive` but the main data on that server is stored under `/srv/`
[18:40:42] ACKNOWLEDGEMENT - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Due for decom - T318659 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:45:07] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:45:11] gotcha, ok, maybe Hive logs can be cleaned more often, I don't think keeping those is too useful
[18:48:50] apine: I'm not sure without fiddling more, but maybe the order by is slow? (sorry I'm not feeling well today and taking it easier)
[18:49:33] Ordering in distributed data is always hard 'cause it causes lots of movement on the network. Ideally try just an hour of data too, see how that works
[18:49:55] milimetric: That's true, we could probably clean them more frequently, but I don't think this is going to be too much of an issue. We're about to refresh an-coord100[1-2] with an-coord100[3-4]. I'll make sure that we have a) a bigger root filesystem and b) possibly even a separate partition for `/var`
[18:50:32] luxurious! :)
[22:57:44] (PS1) Aqu: Fix typo in bash script used by HDFS usage pipeline [analytics/refinery] - https://gerrit.wikimedia.org/r/868753 (https://phabricator.wikimedia.org/T324850)
[23:21:55] (PS1) Aqu: Create adhoc log4j.properties for quieter Spark logs [analytics/refinery] - https://gerrit.wikimedia.org/r/868754 (https://phabricator.wikimedia.org/T302500)
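
Annotation on the 18:49:33 remark about ordering in distributed data: the toy model below is only a sketch of why a global sort moves a lot of rows and why trying a single hour of data shrinks the problem. It is not how Presto implements ORDER BY, and the worker count, row counts, and the `global_order_by` helper are all hypothetical, chosen for illustration.

```python
import heapq
import random


def global_order_by(partitions):
    """Toy global sort: sort each partition locally, then merge on one node.

    Returns (sorted_rows, rows_moved). In this simplified model every row
    has to be sent to the merging node, so rows_moved equals the row count.
    """
    locally_sorted = [sorted(p) for p in partitions]
    rows_moved = sum(len(p) for p in locally_sorted)
    merged = list(heapq.merge(*locally_sorted))
    return merged, rows_moved


random.seed(0)

# Hypothetical layout: 4 workers, each holding 24 hours x 1000 rows.
day = [[random.random() for _ in range(24_000)] for _ in range(4)]
# The same workers restricted to one hour of data each.
hour = [p[:1_000] for p in day]

_, moved_day = global_order_by(day)
_, moved_hour = global_order_by(hour)
print(f"rows moved for a day:   {moved_day}")
print(f"rows moved for an hour: {moved_hour}")
```

In this model the rows moved grow linearly with the time range scanned, which is one reason a one-hour test run can be a cheap way to check whether the sort is the slow part.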