[00:00:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:12:53] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): migrate mjolnir application and dag to airflow v2 and spark3 - https://phabricator.wikimedia.org/T329239 (EBernhardson) Test run has completed, looked reasonable but i needed to make a few adjustments to get things running....
[00:45:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:57:14] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08), Patch-For-Review: Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 (Ottomata) Error event work in [[ https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-py...
[01:00:57] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:06:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:20:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:39:46] Data-Engineering-Planning, Data Pipelines, Release-Engineering-Team, serviceops-collab, GitLab (CI & Job Runners): Experiencing pipeline failure due to disk-space issues - https://phabricator.wikimedia.org/T310593 (JAllemandou) Thank you !
[08:42:21] (CR) Joal: Update Webrequest table to include referer_data column. (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/887371 (https://phabricator.wikimedia.org/T327074) (owner: Snwachukwu)
[10:46:09] btullis: o/
[10:46:19] qq about etcd + pki (I am reading https://phabricator.wikimedia.org/T313129)
[10:46:32] do we need any extra cergen-based certs if we enable pki?
[10:46:35] Here. Looking now.
[10:49:13] elukey: No, I don't think we do.
[10:50:15] We just need to set profile::etcd::v3::use_pki_certs => true and I think you're good to go.
[10:50:15] super thanks, I am going to switch to pki too :)
[10:50:21] thanks for that work <3
[10:51:32] Nice! A pleasure. Let me know how it goes.
I saw your report on https://phabricator.wikimedia.org/T329556 and was duly nerdsniped :-)
[10:54:07] btullis: I *think* that we'll need an extra SAN and we should be good, with PKI it is way easier :D
[10:55:01] I changed the cookbook so in theory dse and aux should be fine when upgrading
[10:55:11] but yesterday for me it was hell with etcd
[11:07:03] Yeah, it sounded pretty rough for you yesterday.
[11:08:47] Thanks for the work on the cookbook. I haven't exactly been on my best form today, with two stupid puppet errors before getting bigtop::mysql_jdbc working.
[12:09:18] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) The puppet manifests now compile correctly, but there are still a number of errors preventing a successful run. [] `Error: /St...
[12:43:09] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) I think that the first thing to look at is `conda-analytics` because we will need this and we don't have it in bullseye at all...
[12:50:13] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) Oh, that seems to have worked, actually. ` btullis@apt1001:~$ sudo -i reprepro copy bullseye-wikimedia buster-wikimedia conda-...
[13:02:15] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis)
[13:08:12] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) The `conda-analytics` package now installs on bullseye and behaves as expected. There is a warning message about a missing hiv...
[13:12:27] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis)
[13:41:05] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) OK, it looks like [[https://packages.debian.org/buster/enchant|enchant]] (version 1.6.0) has been removed from buster and repl...
[13:43:45] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (Ottomata) It might be possible to just remove the ores classes from Hadoop nodes. I don't know if anyone actually runs ORES code on Ha...
[13:52:16] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) >>! In T329363#8614280, @Ottomata wrote: > It might be possible to just remove the ores classes from Hadoop nodes. I > don't...
[14:01:14] btullis: o/
[14:01:21] when you have a moment I added you to https://gerrit.wikimedia.org/r/c/operations/puppet/+/889084
[14:01:42] it is basically to add a SAN to the etcd's certs on bullseye
[14:11:30] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08), Patch-For-Review: Flink SQL queries should access Kafka topics from a Catalog - https://phabricator.wikimedia.org/T322022 (JArguello-WMF)
[14:18:11] (CR) Peter Fischer: [C: +1] "LGTM" [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[14:19:52] (CR) Peter Fischer: [C: +1] [WIP] cirrussearch: add fetch_failure schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[14:26:22] Data-Engineering, Event-Platform Value Stream, Epic: Event Platform Value Stream Documentation Tasks - https://phabricator.wikimedia.org/T329628 (lbowmaker)
[14:28:27] Data-Engineering, Event-Platform Value Stream: Document Flink job deployment to k8s - https://phabricator.wikimedia.org/T329629 (lbowmaker)
[15:37:36] elukey: Sorry for the delay there. +1d it but I'm still not sure why it was necessary. the dse-k8s-etcd servers were already on bullseye and I don't think we saw these errors.
[15:45:47] Anyway, all good. I see the guidance you posted here: https://phabricator.wikimedia.org/T329556#8613586
[15:46:57] (PS1) Snwachukwu: Remove Github.io from Mediasites Definition [analytics/refinery/source] - https://gerrit.wikimedia.org/r/889157 (https://phabricator.wikimedia.org/T329307)
[15:48:25] Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: AQS 2.0: Page Analytics Service - https://phabricator.wikimedia.org/T288296 (BPirkle)
[15:53:44] (CR) Mforns: [C: +1] "LGTM!"
[analytics/refinery/source] - https://gerrit.wikimedia.org/r/889157 (https://phabricator.wikimedia.org/T329307) (owner: Snwachukwu)
[16:08:35] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): migrate mjolnir application and dag to airflow v2 and spark3 - https://phabricator.wikimedia.org/T329239 (EBernhardson) Forgot to mention earlier, the library mjolnir uses for feature selection look to be abandoned, last upda...
[16:16:50] (PS3) Snwachukwu: Update Webrequest table to include referer_data column. [analytics/refinery] - https://gerrit.wikimedia.org/r/887371 (https://phabricator.wikimedia.org/T327074)
[16:20:49] btullis: yeah in theory it happens only on some weird corner cases etc.. like bootstrapping etc..
[16:20:53] better safe than sorry :)
[16:56:25] (CR) Joal: [C: +2] "LGTM - Merging!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/889157 (https://phabricator.wikimedia.org/T329307) (owner: Snwachukwu)
[16:59:27] (CR) Joal: [C: +1] "LGTM :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/887371 (https://phabricator.wikimedia.org/T327074) (owner: Snwachukwu)
[17:06:09] (Merged) jenkins-bot: Remove Github.io from Mediasites Definition [analytics/refinery/source] - https://gerrit.wikimedia.org/r/889157 (https://phabricator.wikimedia.org/T329307) (owner: Snwachukwu)
[17:28:10] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis)
[17:29:50] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) We have also decided not to fix spark2 for bullseye, but to make sure that we have migrated all jobs to...
[19:53:36] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08), Patch-For-Review: Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 (Ottomata) Trying to see if we can support multiple calls to our `process`. https://lists.apache.org/th...
[19:55:43] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (Halfak) Hi @BTullis! In the recent past (last 2 years), a lot of ORES model development happened on the stat ser...
[20:09:52] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by bking@cumin1001 for host an-airflow1005...
[20:24:59] Data-Engineering, Equity-Landscape: Population input metrics - https://phabricator.wikimedia.org/T309279 (JAnstee_WMF) @ntsako -ok here are the revised column names round 2: FROM: population_total TO population_annual_signal population_growth_rate TO population_annual_change population_wikipedia_edit...
[20:37:16] RECOVERY - Check systemd state on an-airflow1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:37:44] Data-Engineering, Equity-Landscape: Access input metrics - https://phabricator.wikimedia.org/T324968 (JAnstee_WMF) @ntsako -ok here are the revised column names round 2: FROM: mobile_subscriptions TO mobile_subscriptions_annual_signal access_to_basic_knowledge TO access_to_basic_knowledge_annual_...
[21:43:25] Data-Engineering, CheckUser, MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 4 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (Zabe)
[21:43:59] Data-Engineering, CheckUser, MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 4 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (Zabe) Open→Resolved \o/
[22:03:39] Data-Engineering, CheckUser, MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 4 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (Dreamy_Jazz) \o/