[01:30:50] (PS1) Gergő Tisza: helppanel: Add postedit-nonsuggested context [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893092 (https://phabricator.wikimedia.org/T330727)
[01:31:45] (CR) CI reject: [V: -1] helppanel: Add postedit-nonsuggested context [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893092 (https://phabricator.wikimedia.org/T330727) (owner: Gergő Tisza)
[01:48:41] (PS2) Gergő Tisza: helppanel: Add postedit-nonsuggested context [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893092 (https://phabricator.wikimedia.org/T330727)
[03:12:18] (PS1) Gergő Tisza: homepagevisit: Add referer_route:postedit-panel-nonsuggested [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893098 (https://phabricator.wikimedia.org/T330727)
[07:35:14] Data-Engineering, Event-Platform Value Stream, MediaWiki-extensions-WikimediaEvents, Product-Analytics, Technical-Debt: Decommission the EditorActivation instrument - https://phabricator.wikimedia.org/T330766 (phuedx)
[07:59:12] !log restarted hiveserver2 in analytics-test to take into account -XX:MaxMetaspaceSize=512m JVM parameter
[07:59:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:02:21] Data-Engineering, Event-Platform Value Stream, MediaWiki-extensions-WikimediaEvents, Product-Analytics, Technical-Debt: Decommission the EditorActivation instrument - https://phabricator.wikimedia.org/T330766 (phuedx) Thanks, @nettrom_WMF! @Mayakp.wiki also pointed out that the `wmf_product....
[08:12:58] nfraison: nice work :)
[08:13:07] curious to see how the setting will behave
[08:18:59] It should now be well managed by GC which should reclaim data in metaspace when it reaches max. It will only affect the Metaspace not the old gen (so we still have a leak). Let's see if we don't see any OOM now ;)
[08:26:18] +1 let's see, but afaics from the docs it should (in theory) kick off GCs around 70% of the 512MB
[08:41:16] Data-Engineering: Check home/HDFS leftovers of echetty - https://phabricator.wikimedia.org/T330834 (MoritzMuehlenhoff)
[09:46:26] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:48:28] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.564 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[09:55:26] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:57:32] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.5667 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[10:08:12] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:40] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:25:07] !log rebooting an-worker1132 being slower than other node (potential issue with raid card/disks)
[10:25:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:29:34] PROBLEM - Host an-worker1132 is DOWN: PING CRITICAL - Packet loss = 100%
[10:30:04] ^ this is OK, I think it's nfraison working on it.
[11:01:37] RECOVERY - Host an-worker1132 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[11:01:43] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:05:47] ACKNOWLEDGEMENT - SSH on an-worker1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Nicolas Fraison Reboot https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:15:06] (CR) Phuedx: [C: +2] Update metrics_event schema to 1.2.0 [schemas/event/secondary] - https://gerrit.wikimedia.org/r/891398 (https://phabricator.wikimedia.org/T330459) (owner: Clare Ming)
[11:15:42] (Merged) jenkins-bot: Update metrics_event schema to 1.2.0 [schemas/event/secondary] - https://gerrit.wikimedia.org/r/891398 (https://phabricator.wikimedia.org/T330459) (owner: Clare Ming)
[13:24:08] I'm wondering if the new HTML rendered page dumps are available from the stats cluster?
[13:34:22] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Migrate import_ttl.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329874 (pfischer) a:pfischer→EBernhardson
[14:29:35] I'm restarting AQS to pick up the latest OpenSSL/c-ares security updates
[14:29:57] moritzm: ack - Many thanks.
[14:47:01] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:49:59] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:53:09] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:58:25] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:37] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:09:15] the new AQS cookbook is working fine, shall we keep the old one or rather remove to avoid confusion? the only perk of the old one is the staged workflow, since it first restarts the canary and then asks for confirmation before moving on to the main ones
[15:09:21] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:09:32] but the same can be achieved by running the new cookbook with --alias aqs-canary first
[15:09:41] and then a second run with --alias aqs
[15:17:21] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:14] moritzm: I'm fine with deleting the old cookbook to remove the duplication and reduce confusion. We will need to update the reference to the old cookbook in the docs here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/AQS#Deploy_new_History_snapshot_for_Wikistats_Backend
[15:20:22] ok, removing a cookbook needs some manual cleanup steps anyway on the cumin hosts, I'll make a task to retire the old cookbook, ditch the old one and then I'll bounce back to the DE Kanban to clean up the docs, ok?
[15:21:32] moritzm: Yep, perfect thank you.
[15:21:59] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:26:49] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:23] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:35:59] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:36:37] Data-Engineering, Equity-Landscape: Social Progress index - https://phabricator.wikimedia.org/T330897 (ntsako)
[15:39:35] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:41:21] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:47] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:03] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:21:25] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:25:13] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:36:01] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:39:37] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:46:39] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:18:11] Data-Engineering-Planning: Check home/HDFS leftovers of echetty - https://phabricator.wikimedia.org/T330834 (lbowmaker)
[17:19:25] Data-Engineering-Planning, Event-Platform Value Stream: EventStreamCatalog removes 'topic' table option if connector = upsert-kafka - https://phabricator.wikimedia.org/T330769 (lbowmaker)
[17:20:27] Data-Engineering-Planning, Event-Platform Value Stream: Flink EventStreamCatalog should not prevent creation of VIEWs - https://phabricator.wikimedia.org/T330703 (lbowmaker)
[17:21:12] Data-Engineering-Planning, Event-Platform Value Stream, SRE-swift-storage: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (lbowmaker)
[17:22:45] Data-Engineering-Planning, Event-Platform Value Stream: Event partitions missing since 2023-02-21T10:00 for stream without events (canary events not produced?) - https://phabricator.wikimedia.org/T330236 (lbowmaker)
[17:26:05] Data-Engineering-Planning, Event-Platform Value Stream, serviceops, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (lbowmaker)
[17:33:45] Data-Engineering, Event-Platform Value Stream: Add event dt field to error event schema - https://phabricator.wikimedia.org/T330918 (Ottomata)
[17:34:05] Data-Engineering, Event-Platform Value Stream: Add event dt field to error event schema - https://phabricator.wikimedia.org/T330918 (Ottomata)
[17:34:10] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, Platform Team Workboards (Clinic Duty Team): Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (Ottomata)
[17:35:35] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:39:13] Data-Engineering-Planning, Event-Platform Value Stream: Add event dt field to error event schema - https://phabricator.wikimedia.org/T330918 (lbowmaker)
[17:39:59] Data-Engineering-Planning, Event-Platform Value Stream: Flink EventStreamCatalog should add watermark - https://phabricator.wikimedia.org/T330441 (lbowmaker)
[17:40:53] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:42:26] (PS1) Ottomata: error/2.0.0 - add dt field [schemas/event/primary] - https://gerrit.wikimedia.org/r/893518 (https://phabricator.wikimedia.org/T330918)
[17:48:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:00:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:47:38] mforns: so that you know, SandraEbele is taking care of rerunning the failed webrequest hour
[18:54:28] !log Create empty partitions in event.mediawiki_page_move table for codfw datacenter from beginning of week (2023-02-27T00 -> 2023-02-28T13)
[18:54:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:22:07] Data-Engineering-Planning: Load wmf.unique_editors_by_country_monthly into Druid for access in Turnilo & Superset - https://phabricator.wikimedia.org/T330436 (lbowmaker)
[19:22:51] Data-Engineering-Planning, Data Pipelines: Load wmf.unique_editors_by_country_monthly into Druid for access in Turnilo & Superset - https://phabricator.wikimedia.org/T330436 (lbowmaker)
[19:31:07] Data-Engineering-Planning, Data Pipelines: Load wmf.unique_editors_by_country_monthly into Druid for access in Turnilo & Superset - https://phabricator.wikimedia.org/T330436 (xcollazo) Passing by to note that you can use `wmf.unique_editors_by_country_monthly` today in Superset by creating a dataset on t...
[19:38:38] !log rerunning webrequest load text for 2023-03-01-08 hour.
[19:38:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:02:54] Data-Engineering-Planning, conftool: an-launcher1002: failed services - https://phabricator.wikimedia.org/T330652 (lbowmaker)
[20:29:55] Data-Engineering-Planning, Growth-Team, MediaWiki-extensions-EventLogging, Metrics Platform Icebox, and 4 others: [EPIC] Deprecate EventLogging::logEvent() - https://phabricator.wikimedia.org/T318263 (kostajh)
[21:03:39] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Migrate incoming_links.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329875 (EBernhardson) a:EBernhardson
[21:23:51] (CR) Mforns: [V: +2 C: +2] "Deploying!" [analytics/refinery] - https://gerrit.wikimedia.org/r/889766 (https://phabricator.wikimedia.org/T307569) (owner: Joal)
[21:26:59] !log starting refinery deploy
[21:27:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:38:38] Hey mforns - I'm about to leave for tonight - if the edit_hourly airflow job is problematic just leave it in pause, and I'll check tomorrow :)
[21:39:01] joal: don't worry, will do if it has issues I can not quickly fix
[21:48:35] !log kill edit-hourly-coord in Hue to migrate it to Airflow
[21:48:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:30:02] !log finished refinery deployment, although didn't manage to run refinery-deploy-to-hdfs without warnings...
[22:30:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:42:20] !log deployed Airflow analytics
[22:42:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:45:59] !log re-deployed airflow analytics with some forgotten changes
[22:46:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log