[00:00:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:03:28] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [00:03:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:28] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [00:45:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:52:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:00:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:07:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:15:02] RECOVERY - Check systemd state on an-launcher1002 is 
OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:22:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:30:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:45:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:52:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:15:18] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:29:20] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:45:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:52:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[03:00:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:07:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:15:22] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:29:26] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:00:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:07:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:15:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:15:30] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:22:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:29:32] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the 
systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:30:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:37:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:45:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:52:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:00:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:08:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:15:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:15:36] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:23:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:29:18] PROBLEM - Check unit status of produce_canary_events 
on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:30:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:37:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:15:06] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:29:04] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:30:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:07:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:33:22] Data-Engineering, Event-Platform, Technical-Debt: Migrate usage of Database::select to SelectQueryBuilder in EventBus - 
https://phabricator.wikimedia.org/T312354 (Ladsgroup) [07:44:37] Data-Engineering, MediaWiki-extensions-EventLogging, Technical-Debt: Migrate usage of Database::select to SelectQueryBuilder in EventLogging - https://phabricator.wikimedia.org/T312335 (Ladsgroup) [07:53:54] Data-Engineering: Check home/HDFS leftovers of aniketars - https://phabricator.wikimedia.org/T312514 (MoritzMuehlenhoff) [08:00:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:54:07] Data-Engineering-Kanban, Airflow, Data Engineering Planning (Sprint 01): Investigate why airflow sensor tasks fail without sending errors - https://phabricator.wikimedia.org/T311976 (EChetty) [08:55:22] Data-Engineering-Kanban, Event-Platform, Wikidata, Wikidata-Campsite, and 2 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (EChetty) [08:55:51] Data-Engineering-Kanban, Data Engineering Planning, Event-Platform, Wikidata, and 3 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (EChetty) [09:02:55] !log restarting dbstore1003 as per announced maintenance window [09:02:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:21:08] !log rebooted dbstore1005 [09:21:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:23:06] !log rebooted dbstore1007 [09:23:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:36:25] The maintenance on the dbstore servers is now complete. 
[09:45:28] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:49:54] Hi btullis - would you be around? [09:52:20] Yes I am here. [09:53:26] btullis: I have started trying to look at the issues with HDFS update [09:53:32] do you wish we batcave? [09:53:41] Great. See you there. [09:56:44] !log restarted oozie on an-test-coord1001 [09:56:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:59:02] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:30:19] I am planning to reboot the an-worker nodes with GPS [10:30:32] Correction: I am planning to reboot the an-worker nodes with GPUs today [10:31:33] That's an-worker1[096-101] [10:32:07] I will also announce a maintenance reboot of stat1005 and stat1008 for next week, since they also need a reboot. [10:37:31] btullis: I think I have a lead about the refine error [10:37:54] Oh, I'm all ears. Do you want to go to the batcave again? 
[10:38:00] let's do that :) [10:45:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:50:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:59] Looking into the code, there seem to be a lot of places where we set refinery_job_jar (https://github.com/wikimedia/puppet/search?q=refinery_job_jar) [10:57:39] ...but only one place where the `-shaded` suffix has been specifically added (https://github.com/wikimedia/puppet/search?q=refinery_job_jar+shaded) [10:58:11] Should we be adding this `-shaded` suffix to all of these definitions? [11:03:25] The trial run of `refine_event_test` completed successfully after adding -shaded to the jar path, which I believe is defined here: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/refinery/job/test/refine.pp#L23 [11:45:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:19:22] Data-Engineering-Kanban, Data Engineering Planning (Sprint 01), Patch-For-Review: Upgrade to latest PrestoDB and enable iceberg support - https://phabricator.wikimedia.org/T311525 (JAllemandou) Awesome @Ottomata . Now we need to make it work - I get errors when querying. When doing super simple aggre... 
[12:30:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:53] Data-Engineering, Event-Platform, Platform Engineering Roadmap Decision Making, Platform Team Workboards (S&F Workboard): Need for new event-type - `user_create` and `user_rename` - https://phabricator.wikimedia.org/T262205 (Ottomata) Newer relevant tickets: T308017 T310082. We'd love to model w... [12:32:06] Analytics, Data-Engineering, Event-Platform, serviceops: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Ottomata) Yes. [12:33:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:30] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:58:53] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:06:11] btullis: I checked versions of the jar in all places where I found it, and have seen no error (versions before 0.1.18 are used without -shaded) [13:06:32] Now the weird thing is that test-refine should use version 0.1.16, not 0.1.27! [13:08:21] Ah!!! 
Actually I had an updated version of puppet :) [13:08:40] old sorry - not updated [13:10:44] ottomata: o/ [13:10:49] going to roll restart eventgate pods :) [13:12:52] btullis: I got it - https://github.com/wikimedia/puppet/commit/73955266cef030ca83b40f257b204dc04eeaac2b [13:16:04] btullis, ottomata : https://gerrit.wikimedia.org/r/c/operations/puppet/+/811990/ [13:30:08] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:31:37] joal: I've merged and deployed that. [13:31:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:57] btullis: let's triple check that --^ ok - it seems to work :) [13:34:39] Puppet run on an-test-coord looks good. Three remaining refine jobs updated the path to the refinery jar. e.g. [13:34:43] https://www.irccloud.com/pastebin/SeYhWQ3r/ [13:36:18] OK, I understand now we don't use `-shaded` everywhere now, thanks for the explanation. [13:37:47] I still find it a bit weird that we have so many versions of refine in use and don't use /latest/ for everything, but there we go. 
[13:38:25] !log restart refine_eventlogging_legacy_test.service on an-test-coord1001 [13:38:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:56:31] Data-Engineering: Add a spark-history-server to our cluster - https://phabricator.wikimedia.org/T312541 (JAllemandou) [14:09:58] joal: DOH [14:10:11] nice find btullis joal [14:10:19] that makes so much sense [14:10:50] (Abandoned) Ottomata: WIP - include missing dependencies [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811758 (https://phabricator.wikimedia.org/T311807) (owner: Ottomata) [14:24:45] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Integrate Superset with DataHub - https://phabricator.wikimedia.org/T306903 (Ottomata) Idea: What about running a second instance of superset on the same node, accessible only via backend network, but with `AUTH_TYPE = AUTH_DB` in its superset... [15:58:33] Data-Engineering: Emit lineage information about Airflow jobs to DataHub - https://phabricator.wikimedia.org/T312566 (Milimetric) [16:04:11] (PS1) Milimetric: Add base_uri config parameter to Superset source [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/812023 (https://phabricator.wikimedia.org/T306903) [16:06:01] (CR) Milimetric: "Ben, I'll do what you did in https://phabricator.wikimedia.org/T306903#7959985 to verify this, just gotta navigate all these meetings toda" [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/812023 (https://phabricator.wikimedia.org/T306903) (owner: Milimetric) [16:34:51] (CR) Aqu: Fix done file path in HDFSArchiver (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811325 (https://phabricator.wikimedia.org/T310542) (owner: Aqu) [16:49:21] Data-Engineering-Kanban, Data-Catalog, Data Engineering Planning (Sprint 01), Patch-For-Review: Integrate Superset with DataHub - https://phabricator.wikimedia.org/T306903 (Milimetric) [17:03:04] 
Data-Engineering, Data-Catalog, Epic: Data Catalog MVP - https://phabricator.wikimedia.org/T299910 (Milimetric) [17:03:10] Data-Engineering-Kanban, Data-Catalog, Data Engineering Planning (Sprint 01), Patch-For-Review: Integrate Superset with DataHub - https://phabricator.wikimedia.org/T306903 (Milimetric) duplicate→Open Ah, it wasn't showing up 'cause it's closed :) [17:58:22] Data-Engineering, Event-Platform, Technical-Debt: Migrate usage of Database::select to SelectQueryBuilder in EventStreamConfig - https://phabricator.wikimedia.org/T312398 (Umherirrender) [18:52:00] (CR) Joal: "Thanks Andrew for the review - New patch on the way, plus some comments :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811212 (https://phabricator.wikimedia.org/T311739) (owner: Joal) [19:22:01] i had heard some mention of spark 3 in the prod clusters, how hard is that to target (from scala)? [19:24:36] from T295072 it looks like it's at least partially installed. Mostly i'm curious about the skew-join optimizations they've added. Might play around with it [19:24:37] T295072: Install spark3 in analytics clusters - https://phabricator.wikimedia.org/T295072 [19:36:01] Data-Engineering-Kanban, Data Engineering Planning (Sprint 01), Patch-For-Review: Upgrade to latest PrestoDB and enable iceberg support - https://phabricator.wikimedia.org/T311525 (Ottomata) It looks like possible a compression mismatch problem? The error from the presto server is: ` Jul 07 19:18:... [20:26:03] Noticed a question on my watchlist about EventStreams rate limiting: https://wikitech.wikimedia.org/wiki/Talk:Event_Platform/EventStreams [20:26:07] ftyi :) [20:26:10] fyi, even :) [20:26:34] seems to affect consumers in WMCS [20:31:36] ebernhardson: not quite ready, we are actively working on it though! [20:34:49] Krinkle: interesting. [20:35:01] do all consumers in WMCS appear as the same IP when making requests to production services? 
[20:38:26] ottomata: not per se as some Cloud VPS hosts can be given a dedicated floating IP [20:38:30] but it's certainly not a large range [22:48:09] (PS1) NOkafor: Cassandra Loading HQL files [Draft] Bug: T311507 [analytics/refinery] - https://gerrit.wikimedia.org/r/812095 (https://phabricator.wikimedia.org/T311507) [22:50:30] (PS2) NOkafor: Cassandra Loading HQL files [Draft] Please note that descriptions are not updated yet Bug: T311507 [analytics/refinery] - https://gerrit.wikimedia.org/r/812095 (https://phabricator.wikimedia.org/T311507)