[00:05:32] RECOVERY - Check unit status of drop_event on an-launcher1002 is OK: OK: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:26:48] PROBLEM - Check unit status of drop_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:05:16] PROBLEM - Check unit status of drop-anomaly-detection on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-anomaly-detection https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:35:04] RECOVERY - Check unit status of refinery-drop-raw-netflow-event on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-raw-netflow-event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:41:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[04:48:42] PROBLEM - Check unit status of refinery-drop-raw-netflow-event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-raw-netflow-event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:01:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[08:41:03] !log restarted hive-server2 and hive-metastore services on an-coord1002 (standby) server
[08:41:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:52:43] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:03:59] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:06:06] Data-Engineering: Enable the Marketing Campaigns Reporting plugin for matomo - https://phabricator.wikimedia.org/T319013 (BTullis)
[11:10:24] (CR) Awight: [C: +2] "As I remember, we can self-merge this repo and it's deployed automatically..." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/835621 (https://phabricator.wikimedia.org/T315972) (owner: Awight)
[11:11:01] (Merged) jenkins-bot: Maps interaction event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/835621 (https://phabricator.wikimedia.org/T315972) (owner: Awight)
[12:03:06] heya team!
[12:03:26] joal: is there anything I can take from here?
[12:03:32] Hi mforns
[12:03:38] heya :]
[12:19:55] Hey milimetric, we've talked a few weeks ago about getting access to the analysis cluster in order to conduct some vandalism analysis. Looks like I may have the support from the french WMF, is there any place I could form an access request that would fit this type of procedure ?
[12:20:02] Thanks a lot !
[12:46:49] Hi mforns :)
[12:55:00] Data-Engineering, Foundational Technology Requests: Enable the Marketing Campaigns Reporting plugin for matomo - https://phabricator.wikimedia.org/T319013 (JArguello-WMF)
[13:07:03] hey joal! I'm available now, you?
[13:07:08] Yes!
[13:07:14] Let's batcave?!
[13:07:14] bc?
[13:07:16] ok
[13:14:33] (CR) Mforns: [C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/836813 (https://phabricator.wikimedia.org/T305841) (owner: Joal)
[13:14:39] (CR) Mforns: [V: +2 C: +2] Fix unique-devices per project-family HQL [analytics/refinery] - https://gerrit.wikimedia.org/r/836813 (https://phabricator.wikimedia.org/T305841) (owner: Joal)
[13:27:43] PROBLEM - SSH on analytics1077.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:56:53] (PS3) Joal: Fix end-of-month/year allowed_interval issue [analytics/refinery] - https://gerrit.wikimedia.org/r/836295 (https://phabricator.wikimedia.org/T316746) (owner: Mforns)
[15:04:12] RECOVERY - SSH on analytics1077.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:18:40] joal, I've tested most of the deletion jobs that failed, and all except one are running successfully and expectedly. The one that fails, does it because the data has an empty directory!
[16:19:34] At first I thought there was something wrong with the data, but then I realized that some event-based schemas are very low volume, and it could be that they have data for say the 29th but not for the last day of the month.
[16:19:50] So when the deletion job reaches the end of month, the folder is empty!
[16:19:56] And we should still delete it.
[16:24:42] (CR) Mforns: Fix end-of-month/year allowed_interval issue (6 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/836295 (https://phabricator.wikimedia.org/T316746) (owner: Mforns)
[16:24:53] also, joal ^^
[16:26:42] Ah!
[16:32:17] whatcha think joal? Is the suggestion OK?
[16:33:13] I can add it!
[17:10:42] btullis et. al., we have a backlog of tasks like T309056 on our clinic duty list. These are things that Razzi was taking care of while he was here... can I start dropping them on you instead? Or someone else on your team?
[17:10:42] T309056: Prepare and check storage layer for guwwiktionary - https://phabricator.wikimedia.org/T309056
[17:18:34] andrewbogott: probably best just to tag them with Data Engineering. I'll make sure that Emil and Jackeline know about them, then they will probably come through to me for now anyway. Feel free to remove your own tags if you need them off your clinic duty lists. Thanks.
[17:19:19] btullis: ok! thanks
[17:20:30] Data-Engineering, DBA: Prepare and check storage layer for blkwiki - https://phabricator.wikimedia.org/T310872 (Andrew)
[17:21:01] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for guwwiktionary - https://phabricator.wikimedia.org/T309056 (Andrew)
[17:21:18] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for pcmwiki - https://phabricator.wikimedia.org/T310879 (Andrew)
[17:21:39] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for bjnwiktionary - https://phabricator.wikimedia.org/T312214 (Andrew)
[17:21:55] Data-Engineering, Data-Services: Recreate views for globalblocks table - https://phabricator.wikimedia.org/T300988 (Andrew)
[17:22:28] Data-Engineering, Data-Services, Platform Engineering, Patch-For-Review: Log_param is redacted in wiki replica when only comment and/or user should be - https://phabricator.wikimedia.org/T301943 (Andrew)
[17:22:37] it's a bunch of tasks but I think they can almost all be done with a single run
[18:42:59] (PS4) Mforns: Fix end-of-month/year allowed_interval issue [analytics/refinery] - https://gerrit.wikimedia.org/r/836295 (https://phabricator.wikimedia.org/T316746)
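
Note on the 16:18 to 16:19 messages: the failing deletion job hit a day directory that existed but held no data, and the conclusion was that such a directory should still be deleted. A minimal sketch of that behaviour follows. It is not the refinery code: the function name `drop_old_partitions`, the flat `YYYY-MM-DD` directory naming, and the use of the local filesystem in place of HDFS are all assumptions made for illustration.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def drop_old_partitions(base: Path, cutoff: date) -> list[str]:
    """Remove every day-partition directory dated before `cutoff`.

    Hypothetical helper: an empty directory is a valid deletion target,
    not an error.
    """
    removed = []
    for part in sorted(base.iterdir()):
        if not part.is_dir():
            continue
        try:
            day = date.fromisoformat(part.name)  # assumed YYYY-MM-DD naming
        except ValueError:
            continue  # not a partition directory; leave it alone
        if day < cutoff:
            shutil.rmtree(part)  # handles empty and non-empty directories alike
            removed.append(part.name)
    return removed


def demo() -> list[str]:
    """Build a throwaway layout with one empty end-of-month directory."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "2022-06-29").mkdir()
        (base / "2022-06-29" / "part-0000").write_text("event")
        (base / "2022-06-30").mkdir()  # low-volume schema: no data on the last day
        (base / "2022-10-01").mkdir()  # newer than the cutoff, must survive
        return drop_old_partitions(base, cutoff=date(2022, 7, 1))
```

Calling `demo()` removes both June directories, including the empty one for the 30th, and leaves the October directory in place.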