[01:58:16] (CR) Krinkle: [C: +2] PaintTiming: Add paint timing metrics to Navigation timing schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/902079 (https://phabricator.wikimedia.org/T328256) (owner: Barakat Ajadi)
[01:58:50] (Merged) jenkins-bot: PaintTiming: Add paint timing metrics to Navigation timing schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/902079 (https://phabricator.wikimedia.org/T328256) (owner: Barakat Ajadi)
[03:46:13] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:51:51] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:59:27] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:03:15] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:37] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:21:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:22:23] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:28:01] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:35:37] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:17] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:35:00] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by stevemunene@cumin1001 for host an-test-clien...
[08:10:26] Data-Engineering-Planning, Data Pipelines: Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (JAllemandou) This use-case is appearing more and more - We should prioritize this. To support this use-case we would use the `hdfs-rsync` tool that mimic r...
[08:18:37] (CR) Joal: [C: +1] "One nit in comment - ready to go as is." [analytics/refinery] - https://gerrit.wikimedia.org/r/900389 (https://phabricator.wikimedia.org/T330200) (owner: Snwachukwu)
[08:28:07] !log Rerun failed virtualpageview-druid-daily-wf-2023-3-22
[08:28:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:18:44] (CR) Aklapper: "T138647 is declined. Thus please also abandon this patch. Thanks." [analytics/dashiki] - https://gerrit.wikimedia.org/r/302755 (https://phabricator.wikimedia.org/T138647) (owner: Nuria)
[09:40:39] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (Stevemunene)
[09:43:29] (PS9) Jennifer Ebe: T305842-Migrate-The-Referrer-Job-Daily-hql Create archive and referrer hql for referrer daily airflow job [analytics/refinery] - https://gerrit.wikimedia.org/r/902068
[09:50:38] Data-Engineering, API Platform: Turnilo: include authentication status in request data cube - https://phabricator.wikimedia.org/T332864 (daniel)
[09:58:18] Data-Engineering-Planning, Event-Platform Value Stream, MW-1.40-notes (1.40.0-wmf.24; 2023-02-20), MW-1.41-notes (1.41.0-wmf.2; 2023-03-27), and 2 others: Remove StreamConfig::INTERNAL_SETTINGS logic from EventStreamConfig and do it in EventLogging client ... - https://phabricator.wikimedia.org/T286344
[10:55:53] joal: Have you got a sec for a question about Druid please?
[11:01:47] Ah sorry, I see from your calendar that you're busy at the moment.
[11:08:21] Hi btullis - sure
[11:35:12] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade an-test-druid1001 to bullseye - https://phabricator.wikimedia.org/T332584 (BTullis) p:Triage→Medium
[11:36:31] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade an-test-druid1001 to bullseye - https://phabricator.wikimedia.org/T332584 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by btullis@cumin1001 for host an-test-druid1001.eqiad.wmnet with OS...
[11:36:53] !log reimaging an-test-druid1001 in place to upgrade to bullseye
[11:36:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:37:54] !log we changed the retention policy on an-test-druid to `{"period":"P1M","includeFuture":true,"tieredReplicants":{"_default_tier":1},"type":"loadByPeriod"},{"type":"dropForever"}`
[11:37:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:14:50] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade an-test-druid1001 to bullseye - https://phabricator.wikimedia.org/T332584 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by btullis@cumin1001 for host an-test-druid1001.eqiad.wmnet with OS bul...
[12:44:57] hi milimetric - I'll have a question for you when you're up
[13:11:25] hi joal, cave?
[13:12:06] hey milimetric - OMW !
[13:14:44] Analytics-Radar, DC-Ops, SRE-swift-storage, ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (jbond) p:Triage→Medium
[13:27:28] (PS1) Joal: Fix pageview_hourly oozie job [analytics/refinery] - https://gerrit.wikimedia.org/r/902390
[13:27:33] milimetric: --^
[13:27:58] (CR) Milimetric: [V: +2 C: +2] Fix pageview_hourly oozie job [analytics/refinery] - https://gerrit.wikimedia.org/r/902390 (owner: Joal)
[13:28:45] thanks milimetric
[13:29:18] !log Hotfix deploy refinery
[13:29:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:40:51] Data-Engineering, WMDE-References-FocusArea, WMDE-TechWish-Sprint-2023-03-14: Shut down our previous Cloud VPS project and create a new one - https://phabricator.wikimedia.org/T332040 (taavi)
[13:47:37] !log Kill oozie virtualpageview-hourly-coord job
[13:47:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:48:16] !log Restart virtualpageview-hourly-coord with pageview_allowlist fix - starting 2023-03-21T08:00
[13:48:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:55:15] (PS1) Joal: Fix virtualpageivew oozie job [analytics/refinery] - https://gerrit.wikimedia.org/r/902398
[13:55:23] milimetric: second issue :( --^
[13:58:24] (CR) Milimetric: [V: +2 C: +2] Fix virtualpageivew oozie job [analytics/refinery] - https://gerrit.wikimedia.org/r/902398 (owner: Joal)
[14:31:23] btullis: Do you remember on which host hue runs?
[15:10:14] joal: an-tool1009
[15:10:28] joal: an-test-ui1001 in the test cluster.
[15:13:17] thanks btullis
[15:19:24] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (colewhite)
[15:20:22] (PS1) Bearloga: movement_metrics: Switch to conda-analytics [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/902413 (https://phabricator.wikimedia.org/T332896)
[15:20:47] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (colewhite)
[15:25:57] (CR) Bearloga: [V: +2 C: +2] "Tested & verified: T332896#8721700" [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/902413 (https://phabricator.wikimedia.org/T332896) (owner: Bearloga)
[16:44:07] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:51:43] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:57:17] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:00:40] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:06:06] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:08:28] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:12:16] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:18:06] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:18:28] joal: so I've got alerts from now on
[17:18:56] milimetric: cave?
[17:20:32] Data-Engineering-Radar, Data-Engineering-Wikistats, Product-Analytics, Wikipedia-Android-App-Backlog, Wikipedia-iOS-App-Backlog: Wikistats pageview data missing counts for Mobile App pageviews on Commons, going back to 2020-11 - https://phabricator.wikimedia.org/T299439 (mpopov)
[17:20:38] Data-Engineering-Radar, Data-Engineering-Wikistats, Product-Analytics, Wikipedia-Android-App-Backlog, Wikipedia-iOS-App-Backlog: Wikistats pageview data missing counts for Mobile App pageviews on Commons, going back to 2020-11 - https://phabricator.wikimedia.org/T299439 (mpopov)
[17:22:52] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:24:49] Data-Engineering-Radar, Data-Engineering-Wikistats, Product-Analytics, Wikipedia-Android-App-Backlog, Wikipedia-iOS-App-Backlog: Wikistats pageview data missing counts for Mobile App pageviews on Commons, going back to 2020-11 - https://phabricator.wikimedia.org/T299439 (mpopov) Open→I...
[17:26:16] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:43:41] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:55:41] Data-Engineering-Planning, Data-Catalog, Shared-Data-Infrastructure: Streamline CI for our fork of DataHub - https://phabricator.wikimedia.org/T303381 (BTullis)
[17:55:46] Data-Engineering-Planning: Data Engineering Pairing system - https://phabricator.wikimedia.org/T327790 (JArguello-WMF)
[17:58:38] Data-Engineering-Planning, Data-Catalog, Shared-Data-Infrastructure: Streamline CI for our fork of DataHub - https://phabricator.wikimedia.org/T303381 (BTullis) Repurposing this ticket to find a solution to the problem of the build process of DataHub.
[17:59:17] Data-Engineering-Planning, Data-Catalog, Shared-Data-Infrastructure: Review and improve the build process for DataHub - https://phabricator.wikimedia.org/T303381 (BTullis) p:Triage→High
[18:12:59] Analytics-Radar, Data-Engineering-Icebox, Machine-Learning-Team, Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (fkaelin) Upgrading the ROCm version, and ideally having all hosts with GPUs use the same version, would be great. I tried to use the recently rel...
[18:33:46] Analytics-Radar, Data-Engineering-Icebox, Machine-Learning-Team, Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (fkaelin) This is the complete snippet to use an AMD GPU for stable diffusion ` from diffusers import StableDiffusionPipeline pipe = StableDiffusionPi...
[18:40:21] Data-Engineering: Create HDFS folder wmf/data/research - https://phabricator.wikimedia.org/T332926 (fkaelin)
[21:02:31] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] eventutilities-python: issue async request from FlatMap context - https://phabricator.wikimedia.org/T332948 (gmodena)
[21:02:58] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] eventutilities-python: issue async requests from FlatMap context - https://phabricator.wikimedia.org/T332948 (gmodena)
[21:07:26] Data-Engineering, Event-Platform Value Stream (Sprint 10): [SPIKE] tune memory and latency of mediawiki-event-enrichment on k8s - https://phabricator.wikimedia.org/T332166 (gmodena) I created https://phabricator.wikimedia.org/T332948 as follow up work for this spike.
[21:24:34] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] eventutilities-python: issue async requests from MapFunction context - https://phabricator.wikimedia.org/T332948 (gmodena)
[21:41:09] Data-Engineering: Create HDFS folder wmf/data/research - https://phabricator.wikimedia.org/T332926 (JAllemandou) Done! Let's make sure other people in the data-engineering team know about this.
[21:47:07] Data-Engineering, Product-Analytics, Wmfdata-Python, Documentation: Create end-user documentation for Wmfdata-Python - https://phabricator.wikimedia.org/T298178 (nshahquinn-wmf) p:Medium→Low