[01:33:47] <wikibugs>	 (03PS7) 10David Martin: Create wikilambda_zobject_labels & wikilambda_zobject_function_join tables in Hive [analytics/refinery] - 10https://gerrit.wikimedia.org/r/937997 (https://phabricator.wikimedia.org/T341728)
[02:06:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:04:37] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:06:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:16:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:41:43] <wikibugs>	 10Data-Platform-SRE, 10serviceops, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10Joe) Can someone please elaborate on what the "generic content store" would need to look like for this project...
[08:16:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:16:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:31] <wikibugs>	 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10MW-1.40-notes (1.40.0-wmf.1; 2022-09-12), 10Metrics-Platform-Planning (Metrics Platform Kanban): Generate $wgEventLoggingStreamNames from $wgEventStreams - https://phabricator.wikimedia.org/T303602 (10phuedx)
[09:19:30] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Epic: Analytics Platform Future State Planing - https://phabricator.wikimedia.org/T302728 (10BTullis)
[09:19:32] <wikibugs>	 10Data-Platform-SRE, 10Epic: Data Infrastructure as a Service MVP - https://phabricator.wikimedia.org/T308317 (10BTullis)
[09:21:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:41:17] <joal>	 mforns: good morning!
[09:41:27] <joal>	 mforns: would you have a minute for me?
[09:42:11] <btullis>	 I'm planning to go ahead with another couple of hadoop worker upgrades this morning, if that's OK with everyone.
[09:50:22] <joal>	 All good on my side :)
[09:52:55] <btullis>	 Ack, thanks joal.
[10:00:36] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1071.eqiad.wmnet with OS bullseye
[10:19:38] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:20:43] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: NEW BUG REPORT remove mysql databases from SQLLab - https://phabricator.wikimedia.org/T337056 (10BTullis) {F37142595,width=70%} I've now updated the superset configuration to exclude the wikishared and mysql_staging databases from SQL Lab, as requested.  If it turns out...
[10:21:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:30:12] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:36:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:37:32] <wikibugs>	 10Data-Platform-SRE, 10Epic: Data Infrastructure as a Service MVP - https://phabricator.wikimedia.org/T308317 (10BTullis)
[10:41:37] <wikibugs>	 10Data-Platform-SRE, 10Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (10BTullis)
[10:55:00] <wikibugs>	 10Data-Platform-SRE, 10Foundational Technology Requests, 10Epic: Data Science and Engineering Kubernetes Cluster Experiment - https://phabricator.wikimedia.org/T327267 (10BTullis)
[10:59:24] <wikibugs>	 10Data-Platform-SRE, 10Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (10BTullis)
[10:59:26] <wikibugs>	 10Data-Platform-SRE, 10Foundational Technology Requests, 10Epic: Data Science and Engineering Kubernetes Cluster Experiment - https://phabricator.wikimedia.org/T327267 (10BTullis)
[10:59:28] <wikibugs>	 10Data-Platform-SRE, 10Epic: Data Infrastructure as a Service MVP - https://phabricator.wikimedia.org/T308317 (10BTullis)
[11:00:20] <wikibugs>	 10Data-Platform-SRE, 10Foundational Technology Requests, 10Epic: Data Science and Engineering Kubernetes Cluster Experiment - https://phabricator.wikimedia.org/T327267 (10BTullis)
[11:00:24] <wikibugs>	 10Data-Platform-SRE, 10Foundational Technology Requests, 10Epic: POC for Running Spark on DSE - https://phabricator.wikimedia.org/T318712 (10BTullis)
[11:02:04] <wikibugs>	 10Data-Platform-SRE, 10Foundational Technology Requests, 10Epic: POC for Running Spark on DSE - https://phabricator.wikimedia.org/T318712 (10BTullis)
[11:04:14] <wikibugs>	 10Data-Platform-SRE, 10Foundational Technology Requests, 10Epic: Data Science and Engineering Kubernetes Cluster Experiment - https://phabricator.wikimedia.org/T327267 (10BTullis)
[11:08:18] <wikibugs>	 10Data-Platform-SRE, 10Foundational Technology Requests, 10Epic: Enable spark jobs on the dse-k8s cluster via the spark-operator - https://phabricator.wikimedia.org/T318712 (10BTullis) p:05Triage→03Medium
[11:13:28] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1071.eqiad.wmnet with OS bullseye completed: - analytics1071 (**PASS**)   - Downtimed on Icinga/Alertmanag...
[11:17:54] <joal>	 ping mforns for when you come online - I need some help :)
[11:17:59] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Product-Analytics, 10Epic: Reconstruct Hive & Hadoop permissions for shared database - https://phabricator.wikimedia.org/T288983 (10BTullis) @Mayakp.wiki - I'm currently reviewing tickets and I wondered whether you might be able to provide an update on the continu...
[11:20:06] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye
[11:20:10] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1074.eqiad.wmnet with OS bullseye
[11:36:56] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Migrate analytics_test airflow instance to bullseye  an-test-client1002 - https://phabricator.wikimedia.org/T341700 (10CodeReviewBot) stevemunene opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-analytics_test/-/merge_requests/1  Change test...
[12:01:28] <wikibugs>	 (03PS1) 10KCVelaga: Add client_dt to EditAttemptStep allowlist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/939265 (https://phabricator.wikimedia.org/T341888)
[12:13:29] <wikibugs>	 10Data-Platform-SRE: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10AndrewTavis_WMDE) @WMF, do you all have a general estimate for when work will begin on this? We have one airflow exploration ticket already planned out in {https://phabricator.wikimedia.org/T341330}. Wou...
[12:23:55] <wikibugs>	 (03CR) 10Gmodena: [C: 03+1] Bump eventutilities to 1.2.12 and use new shaded artifact [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/937954 (owner: 10DCausse)
[12:31:52] <wikibugs>	 10Data-Platform-SRE, 10Foundational Technology Requests, 10Epic: Create a DSE Kubernetes cluster with support for persistent storage from Ceph - https://phabricator.wikimedia.org/T327267 (10BTullis)
[12:33:35] <wikibugs>	 10Data-Platform-SRE, 10Foundational Technology Requests, 10Epic: Create a DSE Kubernetes cluster with support for persistent storage from Ceph - https://phabricator.wikimedia.org/T327267 (10BTullis)
[12:40:25] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye executed with errors: - analytics1073 (**FAIL**)   - Downtimed on Icinga...
[12:55:09] <joal>	 btullis: Hello! How are we doing about airflow 2.6.1?
[12:55:22] <joal>	 I see there is a patch on the deployment train
[12:55:49] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1074.eqiad.wmnet with OS bullseye completed: - analytics1074 (**PASS**)   - Downtimed on Icinga/Alertmanag...
[12:56:03] <btullis>	 Ah, well. It's airflow 2.6.3 now. Two more security patches released.
[12:56:09] <joal>	 :)
[12:56:37] <joal>	 btullis: shall I remove the deploy comment then?
[12:56:43] <btullis>	 I've installed 2.6.3 on an-test-client1002 but I'm waiting on stevemunene to deploy airflow-dags to an-test-client1002 (bullseye)
[12:57:10] <btullis>	 joal: Yes, I think so. Sorry, I haven't kept up-to-date with the etherpad.
[12:57:20] <joal>	 no prob btullis :) Thank you!
[12:58:04] <btullis>	 I'm ready to install 2.6.3 on the analytics instance whenever we wish, followed by all of the other instances. I just thought it would be good to test a current airflow-dags on 2.6.3 first.
[12:58:34] <btullis>	 joal: Ref: https://phabricator.wikimedia.org/T341700#9019582
[13:02:11] <btullis>	 joal: Quick question, are you aware of any Hadoop workers that have 1 Gbps network connections instead of 10 Gbps connections? I thought they were all supposed to be 10 Gbps.
[13:03:48] <joal>	 btullis: I don't know about which hosts nor how many, but I know there are some with 1Gbps (oldest) and some with 10Gbps (newest)
[13:05:18] <btullis>	 joal: OK, cool. Thanks. I'm troubleshooting the failed reimage of analytics1073 (old) and noticed that it only has a 1 Gbps connection. Although it contains a 10 Gbps card, it's not plugged in. :-)
[13:05:32] <btullis>	 Anyway, that's fine. Many thanks.
[13:06:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:13:56] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Migrate analytics_test airflow instance to bullseye  an-test-client1002 - https://phabricator.wikimedia.org/T341700 (10CodeReviewBot) stevemunene merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-analytics_test/-/merge_requests/1  Change test...
[13:19:19] <stevemunene>	 btullis: doing the airflow-dags to an-test-client1002 deploy now
[13:19:37] <btullis>	 stevemunene: Excellent! Many thanks.
[13:20:25] <stevemunene>	 !log deploy airflow-dags to an-test-client1002 T341700
[13:20:28] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:20:28] <stashbot>	 T341700: Migrate analytics_test airflow instance to bullseye  an-test-client1002  - https://phabricator.wikimedia.org/T341700
[13:46:51] <jinxer-wm>	 (HdfsFSImageAge) firing: The HDFS FSImage on analytics-test-hadoop:an-test-master1002:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[13:51:51] <jinxer-wm>	 (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[13:56:51] <jinxer-wm>	 (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:02:00] <wikibugs>	 (03CR) 10Gmodena: [C: 03+1] "Changes LGTM as far as Jsonschema goes. I'm not familiar with applications using this schema though." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/934032 (https://phabricator.wikimedia.org/T337270) (owner: 10Kimberly Sarabia)
[14:18:27] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 0): ProduceCanaryEvents job should be scheduled by Airflow - https://phabricator.wikimedia.org/T341229 (10Ottomata)
[14:55:24] <wikibugs>	 (03CR) 10DLynch: [C: 03+2] "I'll count someone else having approved of the novel $ref usage as being all I wanted to push off to specialists for this review. :D" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/934032 (https://phabricator.wikimedia.org/T337270) (owner: 10Kimberly Sarabia)
[14:56:01] <wikibugs>	 (03Merged) 10jenkins-bot: Fix editattemptstep ref [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/934032 (https://phabricator.wikimedia.org/T337270) (owner: 10Kimberly Sarabia)
[15:03:15] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:06:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:16:07] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:21:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:42:44] <wikibugs>	 10Data-Platform-SRE, 10serviceops, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10JMeybohm) Could you please also add the expected events/s, requests/s towards the kafka-main cluster? @elukey...
[15:47:23] <wikibugs>	 (03PS2) 10KCVelaga: Add client_dt to EditAttemptStep allowlist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/939265 (https://phabricator.wikimedia.org/T341888)
[15:48:49] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:51:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:01:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:01:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:19:28] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: 503 on Superset (reproducible) - https://phabricator.wikimedia.org/T322525 (10BTullis) 05Open→03Resolved a:03BTullis I'm going to resolve this ticket, if that's OK with everyone. We know that Superset could have better error reporting for situations where the quer...
[16:20:59] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10BTullis) I've had a problem with analytics1073 which failed during the reimage and now appears to have lost connectivity. I've created T342141 to track that, but in the meantime I will continue with the upgrades.
[16:23:01] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10BTullis) 4 of 78 hosts have been upgraded so far. ` btullis@cumin1001:~$ sudo cumin --no-progress a:hadoop-worker 'cat /etc/debian_version' 78 hosts will be targeted: an-worker[1078-1148].eqiad.wmnet,analytics[...
[16:24:48] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1075.eqiad.wmnet with OS bullseye
[16:34:10] <wikibugs>	 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE, 10SRE-Access-Requests: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10WDoranWMF)
[16:35:30] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 0), 10Event-Platform (Sprint 14 B): jsonschema-tools test should fail if fields are removed in new (non major) version - https://phabricator.wikimedia.org/T340765 (10Ottomata) Added: https://wikitech.wikimedia.org/wiki/Event_Platform/Sche...
[16:36:35] <wikibugs>	 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE, 10SRE-Access-Requests: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10WDoranWMF) I'm marking this as high because Thomas will need the access in order to be able to start supporting Ops work starting week...
[16:42:42] <wikibugs>	 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE, 10SRE-Access-Requests: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10JAllemandou) Indeed, being part of Data Engineering team, Thomas will be in charge during his ops-week time to restart jobs as the `ana...
[16:44:30] <wikibugs>	 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE, 10SRE-Access-Requests: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10BTullis) Happily, I can also approve this change, as per: https://gerrit.wikimedia.org/r/c/operations/puppet/+/933976 I'll merge this a...
[16:50:39] <wikibugs>	 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10BTullis) p:05Triage→03High I have created: https://gerrit.wikimedia.org/r/939348
[16:54:30] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10BTullis) I've just seen the same thing happen here with analytics1075 {T342141}  I think I'd better pause the upgrades until I get that sorted, because I don't really want three nodes to be unavailable at the s...
[17:05:33] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895 (10BTullis) p:05Triage→03High
[17:13:18] <wikibugs>	 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10BTullis) a:03BTullis
[17:25:25] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1002.eqiad.wmnet with OS bookworm
[17:33:25] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: NEW BUG REPORT remove mysql databases from SQLLab - https://phabricator.wikimedia.org/T337056 (10Mayakp.wiki) ty so much @BTullis for the quick turnaround on this! :)
[17:45:01] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search, 10Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (10bking)
[17:45:09] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1075.eqiad.wmnet with OS bullseye executed with errors: - analytics1075 (**FAIL**)   - Downtimed on Icinga...
[17:56:39] <wikibugs>	 (03CR) 10Milimetric: [V: 03+2 C: 03+2] "This looks good, I can show you how to test if you like, but there's no source data right now, right?  So maybe this is a test in producti" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/937568 (https://phabricator.wikimedia.org/T341724) (owner: 10David Martin)
[18:05:46] <wikibugs>	 (03CR) 10Milimetric: Create wikilambda_zobject_labels & wikilambda_zobject_function_join tables in Hive (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/937997 (https://phabricator.wikimedia.org/T341728) (owner: 10David Martin)
[18:09:12] <wikibugs>	 (03CR) 10Milimetric: [V: 03+2 C: 03+2] "Looks good, let's start with this and get a basic flow in place, if our choices end up hurting us in any way, let's move to Iceberg sooner" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/937997 (https://phabricator.wikimedia.org/T341728) (owner: 10David Martin)
[18:16:32] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1002.eqiad.wmnet with OS bookworm executed w...
[18:17:32] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+2] Update web_ab_test_enrollment schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/938280 (https://phabricator.wikimedia.org/T337270) (owner: 10Kimberly Sarabia)
[18:18:05] <wikibugs>	 (03Merged) 10jenkins-bot: Update web_ab_test_enrollment schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/938280 (https://phabricator.wikimedia.org/T337270) (owner: 10Kimberly Sarabia)
[18:30:16] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search: Investigate WDQS categories update failures on Bullseye hosts - https://phabricator.wikimedia.org/T342060 (10bking)
[18:30:18] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work): Ensure WCQS/WDQS stack works on Bullseye - https://phabricator.wikimedia.org/T331300 (10bking)
[18:32:41] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Movement-Insights, 10Product-Analytics, 10Epic: Reconstruct Hive & Hadoop permissions for shared database - https://phabricator.wikimedia.org/T288983 (10Mayakp.wiki)
[18:37:17] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] "LGTM. Just waiting on a WikimediaEvents patch that bumps Scroll and ABEnrollement schemas as discussed." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/938284 (https://phabricator.wikimedia.org/T337270) (owner: 10Kimberly Sarabia)
[18:52:55] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Migrate analytics_test airflow instance to bullseye  an-test-client1002 - https://phabricator.wikimedia.org/T341700 (10Stevemunene) Updated the [[ https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-analytics_test/-/blob/main/targets | targets ]] to `a...
[20:01:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:28:21] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10andrea.denisse) Hi! I see this task in the SRE Clinic Duty Triage, feel free to let me know if you would like me to help with creating the VMs. :)
[20:34:06] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:36:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:46:00] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:51:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:59:26] <wikibugs>	 (03PS1) 10TChin: Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/939366 (https://phabricator.wikimedia.org/T340765)
[20:59:40] <wikibugs>	 (03PS1) 10TChin: Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/939367 (https://phabricator.wikimedia.org/T340765)
[21:00:43] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search: Investigate WDQS categories update failures on Bullseye hosts - https://phabricator.wikimedia.org/T342060 (10bking)
[21:05:07] <wikibugs>	 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10BTullis) @tchin that is now done. Welcome to the `analytics-admins` group.
[21:15:34] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[21:36:08] <wikibugs>	 10Data-Platform-SRE, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) I've upgraded:  * the iDRAC version * the NIC firmware * the BIOS  I tried two versions of the NIC firmware, in case it was t...
[22:07:00] <wikibugs>	 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w...
[22:46:30] <wikibugs>	 10Data-Platform-SRE, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10Papaul) @BTullis we had the same issue with sessionstore2001 in codfw see task below what we did was to replace the 1G RJ45/SFP convert...