[01:33:47] (PS7) David Martin: Create wikilambda_zobject_labels & wikilambda_zobject_function_join tables in Hive [analytics/refinery] - https://gerrit.wikimedia.org/r/937997 (https://phabricator.wikimedia.org/T341728)
[02:06:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:04:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:06:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:16:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:41:43] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (Joe) Can someone please elaborate on what the "generic content store" would need to look like for this project...
[08:16:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:16:43] (SystemdUnitFailed) firing: (3) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:31] Data-Engineering, MediaWiki-extensions-EventLogging, MW-1.40-notes (1.40.0-wmf.1; 2022-09-12), Metrics-Platform-Planning (Metrics Platform Kanban): Generate $wgEventLoggingStreamNames from $wgEventStreams - https://phabricator.wikimedia.org/T303602 (phuedx)
[09:19:30] Data-Engineering, Data-Platform-SRE, Epic: Analytics Platform Future State Planing - https://phabricator.wikimedia.org/T302728 (BTullis)
[09:19:32] Data-Platform-SRE, Epic: Data Infrastructure as a Service MVP - https://phabricator.wikimedia.org/T308317 (BTullis)
[09:21:43] (SystemdUnitFailed) firing: (3) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:41:17] mforns: good morning!
[09:41:27] mforns: would you have a minute for me?
[09:42:11] I'm planning to go ahead with another couple of hadoop worker upgrades this morning, if that's OK with everyone.
[09:50:22] All good on my side :)
[09:52:55] Ack, thanks joal.
[10:00:36] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1071.eqiad.wmnet with OS bullseye
[10:19:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:20:43] Data-Engineering, Data-Platform-SRE: NEW BUG REPORT remove mysql databases from SQLLab - https://phabricator.wikimedia.org/T337056 (BTullis) {F37142595,width=70%} I've now updated the superset configuration to exclude the wikishared and mysql_staging databases from SQL Lab, as requested. If it turns out...
[10:21:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:30:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:36:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:37:32] Data-Platform-SRE, Epic: Data Infrastructure as a Service MVP - https://phabricator.wikimedia.org/T308317 (BTullis)
[10:41:37] Data-Platform-SRE, Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (BTullis)
[10:55:00] Data-Platform-SRE, Foundational Technology Requests, Epic: Data Science and Engineering Kubernetes Cluster Experiment - https://phabricator.wikimedia.org/T327267 (BTullis)
[10:59:24] Data-Platform-SRE, Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (BTullis)
[10:59:26] Data-Platform-SRE, Foundational Technology Requests, Epic: Data Science and Engineering Kubernetes Cluster Experiment - https://phabricator.wikimedia.org/T327267 (BTullis)
[10:59:28] Data-Platform-SRE, Epic: Data Infrastructure as a Service MVP - https://phabricator.wikimedia.org/T308317 (BTullis)
[11:00:20] Data-Platform-SRE, Foundational Technology Requests, Epic: Data Science and Engineering Kubernetes Cluster Experiment - https://phabricator.wikimedia.org/T327267 (BTullis)
[11:00:24] Data-Platform-SRE, Foundational Technology Requests, Epic: POC for Running Spark on DSE - https://phabricator.wikimedia.org/T318712 (BTullis)
[11:02:04] Data-Platform-SRE, Foundational Technology Requests, Epic: POC for Running Spark on DSE - https://phabricator.wikimedia.org/T318712 (BTullis)
[11:04:14] Data-Platform-SRE, Foundational Technology Requests, Epic: Data Science and Engineering Kubernetes Cluster Experiment - https://phabricator.wikimedia.org/T327267 (BTullis)
[11:08:18] Data-Platform-SRE, Foundational Technology Requests, Epic: Enable spark jobs on the dse-k8s cluster via the spark-operator - https://phabricator.wikimedia.org/T318712 (BTullis) p:Triage→Medium
[11:13:28] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1071.eqiad.wmnet with OS bullseye completed: - analytics1071 (**PASS**) - Downtimed on Icinga/Alertmanag...
[11:17:54] ping mforns for when you come online - I need some help :)
[11:17:59] Data-Engineering, Data-Platform-SRE, Product-Analytics, Epic: Reconstruct Hive & Hadoop permissions for shared database - https://phabricator.wikimedia.org/T288983 (BTullis) @Mayakp.wiki - I'm currently reviewing tickets and I wondered whether you might be able to provide an update on the continu...
[11:20:06] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye
[11:20:10] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1074.eqiad.wmnet with OS bullseye
[11:36:56] Data-Platform-SRE, Patch-For-Review: Migrate analytics_test airflow instance to bullseye an-test-client1002 - https://phabricator.wikimedia.org/T341700 (CodeReviewBot) stevemunene opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-analytics_test/-/merge_requests/1 Change test...
[12:01:28] (PS1) KCVelaga: Add client_dt to EditAttemptStep allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/939265 (https://phabricator.wikimedia.org/T341888)
[12:13:29] Data-Platform-SRE: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (AndrewTavis_WMDE) @WMF, do you all have a general estimate for when work will begin on this? We have one airflow exploration ticket already planned out in {https://phabricator.wikimedia.org/T341330}. Wou...
[12:23:55] (CR) Gmodena: [C: +1] Bump eventutilities to 1.2.12 and use new shaded artifact [analytics/refinery/source] - https://gerrit.wikimedia.org/r/937954 (owner: DCausse)
[12:31:52] Data-Platform-SRE, Foundational Technology Requests, Epic: Create a DSE Kubernetes cluster with support for persistent storage from Ceph - https://phabricator.wikimedia.org/T327267 (BTullis)
[12:33:35] Data-Platform-SRE, Foundational Technology Requests, Epic: Create a DSE Kubernetes cluster with support for persistent storage from Ceph - https://phabricator.wikimedia.org/T327267 (BTullis)
[12:40:25] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye executed with errors: - analytics1073 (**FAIL**) - Downtimed on Icinga...
[12:55:09] btullis: Hello! How are we doing about airflow 2.6.1?
[12:55:22] I see there is a patch on the deployment train
[12:55:49] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1074.eqiad.wmnet with OS bullseye completed: - analytics1074 (**PASS**) - Downtimed on Icinga/Alertmanag...
[12:56:03] Ah, well. It's airflow 2.6.3 now. Two more security patches released.
[12:56:09] :)
[12:56:37] btullis: shall I remove the deploy comment then?
[12:56:43] I've installed 2.6.3 on an-test-client1002 but I'm waiting on stevemunene to deploy airflow-dags to an-test-client1002 (bullseye)
[12:57:10] joal: Yes, I think so. Sorry, I haven't kept up-to-date with the etherpad.
[12:57:20] no prob btullis :) Thank you!
[12:58:04] I'm ready to install 2.6.3 on the analytics instance whenever we wish, followed by all of the other instances. I just thought it would be good to test a current airflow-dags on 2.6.3 first.
[12:58:34] joal: Ref: https://phabricator.wikimedia.org/T341700#9019582
[13:02:11] joal: Quick question, are you aware of any Hadoop workers that have 1 Gbps network connections instead of 10 Gbps connections? I thought they were all supposed to be 10 Gbps.
[13:03:48] btullis: I don't know about which hosts nor how many, but I know there are some with 1Gbps (oldest) and some with 10Gbps (newest)
[13:05:18] joal: OK, cool. Thanks. I'm troubleshooting the failed reimage of analytics1073 (old) and noticed that it only has a 1 Gbps connection. Although it contains a 10 Gbps card, it's not plugged in. :-)
[13:05:32] Anyway, that's fine. Many thanks.
[13:06:43] (SystemdUnitFailed) firing: (3) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:13:56] Data-Platform-SRE, Patch-For-Review: Migrate analytics_test airflow instance to bullseye an-test-client1002 - https://phabricator.wikimedia.org/T341700 (CodeReviewBot) stevemunene merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-analytics_test/-/merge_requests/1 Change test...
[13:19:19] btullis: doing the airflow-dags to an-test-client1002 deploy now
[13:19:37] stevemunene: Excellent! Many thanks.
[13:20:25] !log deploy airflow-dags to an-test-client1002 T341700
[13:20:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:20:28] T341700: Migrate analytics_test airflow instance to bullseye an-test-client1002 - https://phabricator.wikimedia.org/T341700
[13:46:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-test-hadoop:an-test-master1002:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[13:51:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[13:56:51] (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:02:00] (CR) Gmodena: [C: +1] "Changes LGTM as far as Jsonschema goes. I'm not familiar with applications using this schema though." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/934032 (https://phabricator.wikimedia.org/T337270) (owner: Kimberly Sarabia)
[14:18:27] Data-Engineering, Data Engineering and Event Platform Team (Sprint 0): ProduceCanaryEvents job should be scheduled by Airflow - https://phabricator.wikimedia.org/T341229 (Ottomata)
[14:55:24] (CR) DLynch: [C: +2] "I'll count someone else having approved of the novel $ref usage as being all I wanted to push off to specialists for this review. :D" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/934032 (https://phabricator.wikimedia.org/T337270) (owner: Kimberly Sarabia)
[14:56:01] (Merged) jenkins-bot: Fix editattemptstep ref [schemas/event/secondary] - https://gerrit.wikimedia.org/r/934032 (https://phabricator.wikimedia.org/T337270) (owner: Kimberly Sarabia)
[15:03:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:06:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:16:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:21:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:42:44] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (JMeybohm) Could you please also add the expected events/s, requests/s towards the kafka-main cluster? @elukey...
[15:47:23] (PS2) KCVelaga: Add client_dt to EditAttemptStep allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/939265 (https://phabricator.wikimedia.org/T341888)
[15:48:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:51:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:01:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:01:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:19:28] Data-Engineering, Data-Platform-SRE: 503 on Superset (reproducible) - https://phabricator.wikimedia.org/T322525 (BTullis) Open→Resolved a:BTullis I'm going to resolve this ticket, if that's OK with everyone. We know that Superset could have better error reporting for situations where the quer...
[16:20:59] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (BTullis) I've had a problem with analytics1073 which failed during the reimage and now appears to have lost connectivity. I've created T342141 to track that, but in the meantime I will continue with the upgrades.
[16:23:01] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (BTullis) 4 of 78 hosts have been upgraded so far. ` btullis@cumin1001:~$ sudo cumin --no-progress a:hadoop-worker 'cat /etc/debian_version' 78 hosts will be targeted: an-worker[1078-1148].eqiad.wmnet,analytics[...
[16:24:48] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1075.eqiad.wmnet with OS bullseye
[16:34:10] Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, SRE-Access-Requests: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (WDoranWMF)
[16:35:30] Data-Engineering, Data Engineering and Event Platform Team (Sprint 0), Event-Platform (Sprint 14 B): jsonschema-tools test should fail if fields are removed in new (non major) version - https://phabricator.wikimedia.org/T340765 (Ottomata) Added: https://wikitech.wikimedia.org/wiki/Event_Platform/Sche...
[16:36:35] Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, SRE-Access-Requests: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (WDoranWMF) I'm marking this as high because Thomas will need the access in order to be able to start supporting Ops work starting week...
[16:42:42] Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, SRE-Access-Requests: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (JAllemandou) Indeed, being part of Data Engineering team, Thomas will be in charge during his ops-week time to restart jobs as the `ana...
[16:44:30] Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, SRE-Access-Requests: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (BTullis) Happily, I can also approve this change, as per: https://gerrit.wikimedia.org/r/c/operations/puppet/+/933976 I'll merge this a...
[16:50:39] Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, SRE-Access-Requests, Patch-For-Review: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (BTullis) p:Triage→High I have created: https://gerrit.wikimedia.org/r/939348
[16:54:30] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (BTullis) I've just seen the same thing happen here with analytics1075 {T342141} I think I'd better pause the upgrades until I get that sorted, because I don't really want three nodes to be unavailable at the s...
[17:05:33] Data-Engineering, Data-Platform-SRE: Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895 (BTullis) p:Triage→High
[17:13:18] Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, SRE-Access-Requests, Patch-For-Review: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (BTullis) a:BTullis
[17:25:25] Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1002.eqiad.wmnet with OS bookworm
[17:33:25] Data-Engineering, Data-Platform-SRE: NEW BUG REPORT remove mysql databases from SQLLab - https://phabricator.wikimedia.org/T337056 (Mayakp.wiki) ty so much @BTullis for the quick turnaround on this! :)
[17:45:01] Data-Engineering, Data-Platform-SRE, Discovery-Search, Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (bking)
[17:45:09] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1075.eqiad.wmnet with OS bullseye executed with errors: - analytics1075 (**FAIL**) - Downtimed on Icinga...
[17:56:39] (CR) Milimetric: [V: +2 C: +2] "This looks good, I can show you how to test if you like, but there's no source data right now, right? So maybe this is a test in producti" [analytics/refinery] - https://gerrit.wikimedia.org/r/937568 (https://phabricator.wikimedia.org/T341724) (owner: David Martin)
[18:05:46] (CR) Milimetric: Create wikilambda_zobject_labels & wikilambda_zobject_function_join tables in Hive (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/937997 (https://phabricator.wikimedia.org/T341728) (owner: David Martin)
[18:09:12] (CR) Milimetric: [V: +2 C: +2] "Looks good, let's start with this and get a basic flow in place, if our choices end up hurting us in any way, let's move to Iceberg sooner" [analytics/refinery] - https://gerrit.wikimedia.org/r/937997 (https://phabricator.wikimedia.org/T341728) (owner: David Martin)
[18:16:32] Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1002.eqiad.wmnet with OS bookworm executed w...
[18:17:32] (CR) Jdlrobson: [C: +2] Update web_ab_test_enrollment schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/938280 (https://phabricator.wikimedia.org/T337270) (owner: Kimberly Sarabia)
[18:18:05] (Merged) jenkins-bot: Update web_ab_test_enrollment schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/938280 (https://phabricator.wikimedia.org/T337270) (owner: Kimberly Sarabia)
[18:30:16] Data-Platform-SRE, Discovery-Search: Investigate WDQS categories update failures on Bullseye hosts - https://phabricator.wikimedia.org/T342060 (bking)
[18:30:18] Data-Platform-SRE, Discovery-Search (Current work): Ensure WCQS/WDQS stack works on Bullseye - https://phabricator.wikimedia.org/T331300 (bking)
[18:32:41] Data-Engineering, Data-Platform-SRE, Movement-Insights, Product-Analytics, Epic: Reconstruct Hive & Hadoop permissions for shared database - https://phabricator.wikimedia.org/T288983 (Mayakp.wiki)
[18:37:17] (CR) Jdlrobson: [C: +1] "LGTM. Just waiting on a WikimediaEvents patch that bumps Scroll and ABEnrollement schemas as discussed." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/938284 (https://phabricator.wikimedia.org/T337270) (owner: Kimberly Sarabia)
[18:52:55] Data-Platform-SRE, Patch-For-Review: Migrate analytics_test airflow instance to bullseye an-test-client1002 - https://phabricator.wikimedia.org/T341700 (Stevemunene) Updated the [[ https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-analytics_test/-/blob/main/targets | targets ]] to `a...
[20:01:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:28:21] Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (andrea.denisse) Hi! I see this task in the SRE Clinic Duty Triage, feel free to let me know if you would like me to help with creating the VMs. :)
[20:34:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:36:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:46:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:51:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:59:26] (PS1) TChin: Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/primary] - https://gerrit.wikimedia.org/r/939366 (https://phabricator.wikimedia.org/T340765)
[20:59:40] (PS1) TChin: Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/secondary] - https://gerrit.wikimedia.org/r/939367 (https://phabricator.wikimedia.org/T340765)
[21:00:43] Data-Platform-SRE, Discovery-Search: Investigate WDQS categories update failures on Bullseye hosts - https://phabricator.wikimedia.org/T342060 (bking)
[21:05:07] Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, SRE-Access-Requests, Patch-For-Review: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (BTullis) @tchin that is now done. Welcome to the `analytics-admins` group.
[21:15:34] Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[21:36:08] Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (BTullis) I've upgraded: * the iDRAC version * the NIC firmware * the BIOS I tried two versions of the NIC firmware, in case it was t...
[22:07:00] Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w...
[22:46:30] Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (Papaul) @BTullis we had the same issue with sessionstore2001 in codfw see task below what we did was to replace the 1G RJ45/SFP convert...