[00:51:43] (SystemdUnitFailed) firing: wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:51:43] (SystemdUnitFailed) firing: wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:50:59] PROBLEM - puppet last run on kafka-test1006 is CRITICAL: CRITICAL: Puppet has been disabled for 604825 seconds, message: Elukey - elukey, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:43] (SystemdUnitFailed) firing: wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:53:24] 10Data-Platform-SRE, 10superset.wikimedia.org: Fix the LDAP integration and Superset user account creation. - https://phabricator.wikimedia.org/T298647 (10BTullis) [08:53:28] 10Data-Platform-SRE: Try to improve the LDAP integration for Superset user account creation - https://phabricator.wikimedia.org/T297120 (10BTullis) [08:56:29] RECOVERY - puppet last run on kafka-test1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:13:41] 10Data-Platform-SRE, 10serviceops-radar, 10Discovery-Search (Current work): Test version compatibility between production Kafka and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (10pfischer) Since we no longer rely on a shared ZK for kafka + flink (see [[ https://docs.google.com/document/d/1BLv3... [09:20:00] !log upgrading airflow on an-test-client1002 to version 2.6.3 [09:20:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:22:49] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10BTullis) p:05Triage→03High [09:28:50] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.3 - https://phabricator.wikimedia.org/T336286 (10BTullis) I have upgraded the analytics_test instance and executed the migrations. ` btullis@an-test-client1002:~$ sudo -u analytics airflow-analytics_test db check /usr/lib/airflow/lib/pytho... [09:29:14] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.3 - https://phabricator.wikimedia.org/T336286 (10BTullis) [09:31:27] !log restarted airflow services on an-test-client1002 in order to pick up new versions [09:31:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:32:53] joal: jennifer_ebe: (Ops week) Please be on the lookout for any unusual behaviour from airflow on the test cluster, since we're testing a new version. [09:34:34] ack btullis - thanks for letting us know :) [09:35:46] Great, sent a message via Slack as well, to be on the safe side. Airflow web interface checks out OK and all services are running. [09:35:51] Great!
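For context, the Airflow 2.6.3 upgrade steps referenced above (T336286) boil down to roughly the following. This is a sketch only: it assumes the `airflow-analytics_test` wrapper proxies the stock Airflow 2.6 CLI and runs as the `analytics` user, as in the quoted task comment; the webserver unit name is an assumption, while the scheduler unit name is taken from the alert text.

```
# Apply the 2.6.3 schema migrations, then confirm the metadata DB is reachable.
sudo -u analytics airflow-analytics_test db upgrade
sudo -u analytics airflow-analytics_test db check
# Restart the services so they pick up the new version (unit names partly assumed).
sudo systemctl restart airflow-scheduler@analytics_test.service airflow-webserver@analytics_test.service
```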
[09:58:08] 10Data-Platform-SRE: Allow 'gpu-users' to reset the GPUs on stat machines - https://phabricator.wikimedia.org/T336784 (10BTullis) [09:58:18] 10Data-Platform-SRE: Allow 'gpu-users' to reset the GPUs on stat machines - https://phabricator.wikimedia.org/T336784 (10BTullis) p:05Triage→03Medium [10:07:39] (03PS1) 10Phuedx: Migrate MediaWikiPingback schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/938814 (https://phabricator.wikimedia.org/T323828) [10:10:29] 10Data-Platform-SRE, 10conftool: Investigate and fix dconf errors shown by services run as the analytics system user - https://phabricator.wikimedia.org/T330652 (10BTullis) p:05Triage→03Low [10:18:45] 10Data-Platform-SRE, 10Data Pipelines, 10Patch-For-Review: Upgrade Airflow to version 2.6.3 - https://phabricator.wikimedia.org/T336286 (10Aklapper) [10:30:39] joal: I'm also planning on doing the first upgrade of a hadoop worker in production to bullseye today, if that's OK with you. I will start with analytics1070 on its own and verify that the partition recipe to keep existing hadoop data works correctly. [10:32:45] works for me btullis! [10:33:35] btullis: question about this (upgrade to bullseye) - can we do a 1-by-1 upgrade for journalnodes, or do they need to be done all-at-once (I always forget) [10:33:51] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10BTullis) a:03BTullis [10:36:10] joal: That's a good question. I'm *assuming* that 1-by-1 will be fine, because it's all Java and we're not upgrading Hadoop nor JVM versions, but I'll go back and see whether there were any other special precautions taken during the last upgrade. [10:36:56] ack btullis - if possible, I think it'll be a good idea to upgrade one journalnode soon [10:39:54] joal: Agreed. [10:44:15] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1070.eqiad.wmnet with OS bullseye [10:47:45] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis) p:05Triage→03High [10:53:37] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10BTullis) The bullseye installer paused here: {F37141344,width=70%} I'll search around to see if other people have faced this issue and got around it. [10:58:52] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10BTullis) Ah, bingo! {T308106} It was patched, but it doesn't look like it's working. I will see if we can fix it via another method. [11:08:51] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10BTullis) I've answered the question with 'No', given that that's what the fix is doing anyway, and allowing the installation to proceed. [11:09:09] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Pipelines, 10Data Products: "Pages to date" not loading with "daily" metric - https://phabricator.wikimedia.org/T312717 (10Milimetric) 05Open→03Resolved Yes, there will be a bunch of these, especially in the in-betweens as we changed managers and...
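On the journalnode question above: a JournalNode quorum tolerates a minority of its members being down, so a one-at-a-time reimage should be safe provided each node is healthy again before the next one is taken out. A rough way to sanity-check a worker after reimaging is sketched below; `hdfs dfsadmin -report` and `hdfs fsck` are stock Hadoop commands, but the systemd unit names are assumptions (Bigtop-style) rather than anything confirmed in this log.

```
# Hedged post-reimage checks for a Hadoop worker such as analytics1070.
sudo -u hdfs hdfs dfsadmin -report | grep -A 6 analytics1070   # datanode re-registered, capacity as expected
sudo -u hdfs hdfs fsck / | tail -n 20                          # no new missing/corrupt blocks (heavy on a large cluster)
systemctl status hadoop-hdfs-datanode hadoop-hdfs-journalnode  # unit names assumed; both roles up on a data+journal node
```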
[11:35:40] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1070.eqiad.wmnet with OS bullseye completed: - analytics1070 (**PASS**) - Dow... [12:37:09] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1067.eqiad.wmnet - https://phabricator.wikimedia.org/T341207 (10Jclark-ctr) [12:37:16] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1067.eqiad.wmnet - https://phabricator.wikimedia.org/T341207 (10Jclark-ctr) 05Open→03Resolved [12:37:21] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Jclark-ctr) [12:38:29] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1068.eqiad.wmnet - https://phabricator.wikimedia.org/T341208 (10Jclark-ctr) 05Open→03Resolved [12:38:36] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Jclark-ctr) [12:40:35] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1058.eqiad.wmnet - https://phabricator.wikimedia.org/T338227 (10Jclark-ctr) [12:41:59] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1058.eqiad.wmnet - https://phabricator.wikimedia.org/T338227 (10Jclark-ctr) 05Open→03Resolved [12:42:01] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Jclark-ctr) [12:44:11] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Jclark-ctr) [12:44:25] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1065.eqiad.wmnet - https://phabricator.wikimedia.org/T341205 (10Jclark-ctr) 05Open→03Resolved [12:46:47] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Jclark-ctr) [12:46:49] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1059.eqiad.wmnet - https://phabricator.wikimedia.org/T338408 (10Jclark-ctr) 05Open→03Resolved [12:50:11] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Jclark-ctr) [12:50:31] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1060.eqiad.wmnet - https://phabricator.wikimedia.org/T338409 (10Jclark-ctr) 05Open→03Resolved [12:51:43] (SystemdUnitFailed) firing: wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:52:15] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Jclark-ctr) [12:52:34] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1064.eqiad.wmnet - https://phabricator.wikimedia.org/T341204 (10Jclark-ctr) 
05Open→03Resolved [12:55:59] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Jclark-ctr) [12:56:01] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1069.eqiad.wmnet - https://phabricator.wikimedia.org/T341209 (10Jclark-ctr) 05Open→03Resolved [12:57:53] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1066.eqiad.wmnet - https://phabricator.wikimedia.org/T341206 (10Jclark-ctr) [12:58:26] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1066.eqiad.wmnet - https://phabricator.wikimedia.org/T341206 (10Jclark-ctr) 05Open→03Resolved [12:58:32] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Jclark-ctr) [13:16:58] joal: analytics1070 seems fine after the reimage. I propose to do analytics1072 next, which is a journalnode. OK by you? [13:19:09] 10Data-Platform-SRE, 10Patch-For-Review: Migrate analytics_test airflow instance to bullseye an-test-client1002 - https://phabricator.wikimedia.org/T341700 (10Stevemunene) could we first disable the jobs, then notify our users to start using `an-test1002` for airflow related work and all other related work... [13:19:33] 10Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (10Stevemunene) [13:20:08] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1061.eqiad.wmnet - https://phabricator.wikimedia.org/T339199 (10Jclark-ctr) [13:20:15] 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Jclark-ctr) [13:20:17] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1061.eqiad.wmnet - https://phabricator.wikimedia.org/T339199 (10Jclark-ctr) 05Open→03Resolved [13:20:51] 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Jclark-ctr) [13:20:53] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1062.eqiad.wmnet - https://phabricator.wikimedia.org/T339200 (10Jclark-ctr) 05Open→03Resolved [13:21:16] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1063.eqiad.wmnet - https://phabricator.wikimedia.org/T339201 (10Jclark-ctr) [13:21:27] 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Jclark-ctr) [13:21:30] 10Data-Platform-SRE, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1063.eqiad.wmnet - https://phabricator.wikimedia.org/T339201 (10Jclark-ctr) 05Open→03Resolved [13:21:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:28:09] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Papaul) @Jhancock.wm I will check and let you know [13:31:58] 10Data-Platform-SRE, 10Patch-For-Review: Migrate analytics_test airflow instance to bullseye an-test-client1002 -
https://phabricator.wikimedia.org/T341700 (10BTullis) We will need to update this: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-analytics_test/-/blob/main/targets >>! In... [13:32:44] !log proceeding to reimage analytics1072 (journalnode, in addition to datanode) [13:32:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:33:39] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1072.eqiad.wmnet with OS bullseye [13:34:06] !log `kill `pgrep -u appledora`` and `kill `pgrep -u akhatun`` on stat1008 to unblock puppet (offboarded users deletion) [13:34:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:34:27] elukey: thanks :-) [13:35:19] <# [13:35:20] <3 [13:35:47] btullis: migrating hadoop to bullseye? [13:36:39] elukey: Yup! First production datanode completed. Currently reimaging first data+journalnode [13:36:43] (SystemdUnitFailed) firing: (3) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:36:48] btullis: congrats :) [13:36:58] lemme know if you need any help in migrating the others! [13:37:33] Thank you. [13:38:43] Will do. Next on my list is an-test-master1002 - I think that you upgraded masters and coords before the workers, didn't you? I'm going the other way around. [13:50:39] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1072.eqiad.wmnet with OS bullseye executed with errors: - analytics1072 (**FAIL... [13:53:51] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1072.eqiad.wmnet with OS bullseye [14:06:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:08:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:58] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) @Jclark-ctr sorry missed one. edited my previous comment with the additional to keep all the info together. 
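On the analytics_test Airflow migration discussed above (T341700), the "first disable the jobs" step maps onto the standard Airflow CLI. A minimal sketch, assuming the wrapper command exposes the stock 2.x `dags` subcommands; the DAG id below is a placeholder.

```
# Pause every DAG on the old instance before repointing users and scap targets
# at an-test-client1002; repeat the pause for each active DAG.
sudo -u analytics airflow-analytics_test dags list
sudo -u analytics airflow-analytics_test dags pause <dag_id>
```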
[14:15:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:33] 10Data-Engineering, 10MediaWiki-Vagrant, 10MediaWiki-extensions-EventLogging, 10MW-1.41-notes (1.41.0-wmf.12; 2023-06-06): Interface 'Wikimedia\MetricsPlatform\EventSubmitter' not found - https://phabricator.wikimedia.org/T337383 (10phuedx) 05Open→03Resolved a:03phuedx Being **bold**. Thanks for the... [14:52:34] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1072.eqiad.wmnet with OS bullseye completed: - analytics1072 (**PASS**) - Rem... [15:07:26] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Sustainability (Incident Followup): Create Turnilo/Superset dashboards for identifying users w/ excessive WDQS queries - https://phabricator.wikimedia.org/T338159 (10bking) a:03bking [15:07:37] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Discovery-Search (Current work), 10Event-Platform, and 2 others: Add support for redirects in CirrusSearch - https://phabricator.wikimedia.org/T325315 (10Gehel) a:03pfischer [15:10:56] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Discovery-Search, 10serviceops-radar, 10Event-Platform: [NEEDS GROOMING] Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (10Gehel) [15:11:13] 10Data-Platform-SRE, 10serviceops-radar, 10Discovery-Search (Current work): Test version compatibility between production Kafka and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (10Gehel) 05Open→03Declined declined as per @pfischer comment [15:14:30] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops: Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10akosiaris) [15:16:24] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Epic, 10Event-Platform: Flink Operations - https://phabricator.wikimedia.org/T328561 (10Gehel) [15:19:10] 10Data-Platform-SRE, 10Discovery-Search: Test flink operations/failure scenarios relevant to Search Update Pipeline - https://phabricator.wikimedia.org/T342010 (10bking) [15:20:50] 10Data-Platform-SRE, 10Discovery-Search: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (10dcausse) [15:20:52] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10dcausse) [15:22:50] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10Gehel) This is blocked by T341792 at the moment. 
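For recurring SystemdUnitFailed alerts like the produce_canary_events one above, the usual triage looks roughly like this. Unit and host names are taken from the alert text; the `.eqiad.wmnet` suffix is an assumption.

```
# Inspect the failed unit named in the alert, then clear the failed state once
# the failure is understood (or after the wmf_auto_restart timer recovers it).
ssh an-launcher1002.eqiad.wmnet
systemctl status produce_canary_events.service --no-pager
journalctl -u produce_canary_events.service --since "1 hour ago" --no-pager
sudo systemctl reset-failed produce_canary_events.service
```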
[15:27:28] 10Data-Platform-SRE, 10serviceops, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10Gehel) [15:29:26] 10Data-Platform-SRE, 10serviceops, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10Gehel) p:05Triage→03High [15:29:32] 10Data-Platform-SRE, 10serviceops, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10Gehel) [15:30:14] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10Gehel) [15:31:07] 10Data-Platform-SRE, 10Discovery-Search: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (10Gehel) p:05Triage→03High [15:31:31] 10Data-Platform-SRE, 10Discovery-Search (Current work): Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (10Gehel) [15:34:18] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10BTullis) Oops. I've logged the first two reimages against T329363 instead of this task. Nevertheless, analytics1070 and 1072 are now running bullseye and showing no errors. analytics1072 was important because... [15:48:38] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10Gehel) [15:49:41] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10Gehel) [15:51:35] 10Data-Platform-SRE, 10Discovery-Search (Current work): Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (10Gehel) [16:29:46] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10TBurmeister) [16:48:36] 10Data-Engineering, 10AQS2.0, 10API Platform (Sprint 06), 10Platform Engineering Roadmap, 10User-Eevans: AQS 2.0: Pageviews: Implement Integration Tests - https://phabricator.wikimedia.org/T299735 (10Sfaci) [17:36:43] Starting build #26 for job wikimedia-event-utilities-maven-release-docker [17:40:39] Project wikimedia-event-utilities-maven-release-docker build #26: 09SUCCESS in 3 min 55 sec: https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-release-docker/26/ [17:48:06] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Configure new WDQS servers in codfw (wdqs20[13-22]) - https://phabricator.wikimedia.org/T332314 (10bking) Update: `wdqs2016.codfw.wmnet` is the last host that needs to be configured for produ... 
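Once the ZooKeeper VMs for Flink HA (T341792 / T341705) are provisioned, a quick smoke test might look like the sketch below. The hostname comes from the flink-zk1001 reimage messages later in this log, the client port 2181 is just the ZooKeeper default (an assumption here), and the four-letter-word commands may need to be whitelisted on newer ZooKeeper releases.

```
# Minimal liveness/role check against one of the new flink-zk hosts.
echo ruok | nc flink-zk1001.eqiad.wmnet 2181   # expect "imok"
echo stat | nc flink-zk1001.eqiad.wmnet 2181   # shows mode (leader/follower) and connection counts
```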
[17:49:32] (03CR) 10DCausse: "recheck" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/937954 (owner: 10DCausse) [18:11:49] 10Data-Platform-SRE, 10Discovery-Search (Current work): Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (10dcausse) [18:11:51] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10dcausse) [18:12:07] 10Data-Platform-SRE, 10Discovery-Search (Current work): Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (10dcausse) [18:12:12] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10dcausse) [18:21:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:30:03] 10Data-Platform-SRE, 10Wikidata, 10Patch-For-Review: Create WDQS Lag SLO dashboard with Grizzly && documentation - https://phabricator.wikimedia.org/T324811 (10RKemper) [18:38:03] 10Data-Platform-SRE: Decommission old WDQS servers - https://phabricator.wikimedia.org/T342035 (10Gehel) [18:38:53] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Configure new WDQS servers in codfw (wdqs20[13-22]) - https://phabricator.wikimedia.org/T332314 (10Gehel) [18:39:24] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Configure new WDQS servers in codfw (wdqs20[13-22]) - https://phabricator.wikimedia.org/T332314 (10Gehel) Decommission ticket created: T342035 [18:40:02] 10Data-Platform-SRE: Decommission old WDQS servers - https://phabricator.wikimedia.org/T342035 (10RKemper) [19:42:11] PROBLEM - AQS root url on aqs1010 is CRITICAL: connect to address 10.64.0.40 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:42:41] ^ this is me, testing an aqs deploy on the canary [19:45:43] ack! [20:03:35] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:04:14] ^ still me [21:01:39] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1001.eqiad.wmnet with OS bookworm [21:26:11] https://www.irccloud.com/pastebin/p993HI92/ [21:26:16] hi [21:26:26] Current state of AQS investigation: mid-scap deploy. aqs1010 has the new code. [21:26:26] milimetric: does this still match your test env? 
[21:27:15] milimetric: because aqs is trying to create the schema, and that code is so old, it's expecting a system table that hasn't existed in Cassandra in years [21:27:53] milimetric: that it's trying makes me skeptical that the schema matches [21:28:11] that's what it sounds like to me too, trying to validate [21:28:11] also... what version of Cassandra did you use during testing? [21:28:32] what version did you use to create that `DESCRIBE` output? [21:28:44] (this is not my dev environment, it's Nicholas the intern and I'm not 100% sure what he was using, but it's probably whatever comes with the AQS 2.0 Cassandra docker image) [21:29:32] that would be pretty recent; I would expect that to fail similarly [21:30:03] oh, you know what....we probably need the contents of meta, too [21:30:18] it'd be like one row [21:30:23] I bet that's it [21:30:32] supposedly Cassandra 3.11 [21:30:52] Ok, that is perplexing [21:31:11] oh! If I ran the docker image and connected my local AQS to it, it would give me the row we need in meta, right? [21:31:39] yes? [21:32:07] SELECT * FROM "local_group_default_T_knowledge_gap_by_category".meta; [21:40:05] success (crafting insert statement based on what I got) [21:45:15] urandom: supposedly this works https://www.irccloud.com/pastebin/8fWVltGM/ [21:46:02] interesting... see the keyspace name there? [21:46:14] local_group_test_.... ? [21:46:37] oh, poop [21:46:44] uh... is that just 'cause of my config... [21:47:17] ok [21:47:20] just making sure [21:47:52] (03PS1) 10Aqu: WIP: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) [21:48:39] urandom: ok, I configured it to write to the local_group_default one only by changing config.yaml, and I got this for value instead: [21:48:39] {"table":"knowledge.gap.by.category","version":1,"attributes":{"dt":"string","project":"string","content_gap":"string","category":"string","metric":"string","value":"int"},"index":[{"attribute":"project","type":"hash"},{"attribute":"content_gap","type":"hash"},{"attribute":"category","type":"hash"},{"attribute":"dt","type":"range","order":"asc"}],"_backend_version":1,"secondaryIndexes":{},"options":{"updates":{"pattern":"random- [21:48:39] update"}}} [21:48:42] sorry [21:48:48] https://www.irccloud.com/pastebin/2DHN0GHD/ [21:50:22] this worked https://www.irccloud.com/pastebin/wgC9NSy4/ [21:51:20] https://www.irccloud.com/pastebin/NUCDWR4N/ [21:51:40] Ok, maybe see if it (somehow) works now? [21:51:55] 🕯️ [21:51:58] hahaha [21:52:18] logging in seeing if I have rights to try and start this (I do via scap...) [21:52:22] I'll restart the service on aqs1010 with an unmodified `/etc/aqs/config.yaml` and see if it starts. I think that there is still a problem with logging. [21:52:42] 🔪 🐔 [21:53:04] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1001.eqiad.wmnet with OS bookworm executed w...
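The meta-row copy discussed above can be reconstructed roughly as follows. This is a sketch only: the column names and the row key are assumptions about how the RESTBase table module lays out its `meta` table, and the actual statements used are in the pastebins above, not here.

```
# Pull the schema row from the local docker Cassandra that the dev AQS
# populated, then craft the equivalent INSERT for the production keyspace.
# Column names (key, value) and the 'schema' row key are assumptions;
# production connection/credential details are deliberately omitted.
cqlsh localhost -e 'SELECT * FROM "local_group_default_T_knowledge_gap_by_category".meta;'
# then, against the production cluster:
# INSERT INTO "local_group_default_T_knowledge_gap_by_category".meta (key, value)
#   VALUES ('schema', '<json value captured from the dev instance>');
```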
[21:53:29] RECOVERY - AQS root url on aqs1010 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [21:53:31] thx btullis, I don't have perms to do `systemctl start aqs` [21:53:35] (03CR) 10CI reject: [V: 04-1] WIP: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [21:53:38] um [21:53:46] recovery seems like a good thing [21:53:47] Bosh! [21:53:49] ooh, it's up [21:53:52] https://www.irccloud.com/pastebin/bvxHitzh/ [21:53:56] it's at least serving pageviews [21:54:01] let's see the new endpoint :) [21:54:11] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:54:17] Can you test the canary? [21:55:06] btullis: all good, yeah, serves what it should and the new endpoint is up (haven't checked that it works yet... but it shouldn't matter, we should be able to continue deploy) [21:55:21] milimetric: ack, doing that now. [21:55:33] 10Data-Platform-SRE, 10Discovery-Search: Investigate WDQS categories update failures on Bullseye hosts - https://phabricator.wikimedia.org/T342060 (10bking) [21:56:47] oh cool, even the new data works [21:56:49] yay! [21:57:01] thanks very much Ben & urandom [21:57:15] 👍 [21:57:23] You're very welcome. [21:57:54] Deployment finished successfully. [21:59:11] <3 k, I'll kick the wikistats tire [22:00:22] I think we may need to add a `named_levels: true` to the logging config of aqs: https://github.com/wikimedia/service-runner/blob/master/README.md?plain=1#L150 [22:01:29] When it was failing, there was nothing useful in the logs. It was only when I added a new output to `stdout` with this parameter that I got the useful json logs. [22:02:30] Oh cool, +1 for actual logs [22:06:43] (SystemdUnitFailed) firing: (3) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:17:35] 10Data-Platform-SRE, 10Discovery-Search: Investigate WDQS categories update failures on Bullseye hosts - https://phabricator.wikimedia.org/T342060 (10bking) Investigation so far: `load-categories-daily.service` calls `/usr/local/bin/loadCategoriesDaily.sh wdqs` `/usr/local/bin/loadCategoriesDaily.sh` calls... [22:19:44] 10Data-Platform-SRE, 10Discovery-Search: Investigate WDQS categories update failures on Bullseye hosts - https://phabricator.wikimedia.org/T342060 (10bking)
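On the `named_levels` point above, the change being suggested would live in the service-runner `logging:` stanza of AQS's config.yaml. A sketch based on the README linked in the log; only `named_levels` itself comes from that discussion, while the level and the stdout stream layout are assumptions.

```yaml
# Hypothetical shape of the logging stanza being discussed; values are illustrative.
logging:
  name: aqs
  level: warn
  named_levels: true
  streams:
    - type: stdout
      named_levels: true
```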