[02:13:02] RECOVERY - Hadoop NodeManager on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:52:02] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:33:25] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform (Sprint 14 B): jsonschema-tools test should fail if fields are removed in new (non major) version - https://phabricator.wikimedia.org/T340765 (10tchin) [06:52:02] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:04:43] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2de210bb-f6e5-4b71-81d4-c9d978f2bed5) set by stevemunene@cumin1001 for 7 days, 0:00:00 on 9 ho... 
[07:11:52] !log disable-puppet on analytics[1061-1069] Preparing to decommission the hosts - T317861 [07:11:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:11:56] T317861: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 [07:17:59] !log stop hadoop-hdfs-datanode service on analytics[1061-1069] Preparing to decommission the hosts - T317861 [07:18:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:18:02] T317861: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 [07:21:44] !log Remove analytics1064_1069 from hdfs net_topology [07:21:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:23:41] !log run puppet agent on hadoop masters [07:23:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:50:33] 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Stevemunene) Re-enabled puppet on `analytics1069` to get it back into puppetdb to allow cumin commands to run properly. Then as per the steps discussed above, 1) disabled puppet o... [07:54:02] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar: Test version compatibility between production Kafka, Flink, and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (10JMeybohm) > Test Flink 1.16 with Kafka 1.1.0 and ZooKeeper versions 3.4, 3.5, 3.6, and 3.7 debian bullseye contains zooke... [08:54:59] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar: Test version compatibility between production Kafka, Flink, and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (10JMeybohm) From the [[ https://zookeeper.apache.org/releases.html | Zookeeper 3.8.0 release notes ]]: * ZooKeeper clients... 
[09:15:43] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar: Test version compatibility between production Kafka, Flink, and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (10pfischer) I ran a local test to check compatibility of kafka broker (KB) with zookeeper (ZK), to find out if they are comp... [09:21:45] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar: Test version compatibility between production Kafka, Flink, and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (10dcausse) [09:22:53] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar: Test version compatibility between production Kafka and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (10JMeybohm) [09:37:10] (03PS1) 10Btullis: Update the datahub-frontend container to fix path issues [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935989 (https://phabricator.wikimedia.org/T329514) [09:57:45] !log decommission analytics1061.eqiad.wmnet T339199 [09:57:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:57:48] T339199: decommission analytics1061.eqiad.wmnet - https://phabricator.wikimedia.org/T339199 [09:59:08] (03CR) 10Btullis: [C: 03+2] Update the datahub-frontend container to fix path issues [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935989 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [10:03:07] btullis, stevemunene o/ - I am thinking for https://phabricator.wikimedia.org/T341137 to upgrade zookeeper-test1002 to bookworm, anything against it? [10:04:46] hadoop test uses an-conf, so that zookeeper is only used by kafka-test [10:10:46] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1061.eqiad.wmnet - https://phabricator.wikimedia.org/T339199 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by stevemunene@cumin1001 for hosts: `analytics1061.eqiad.wmnet` - analytics1061.eqiad.wmnet (**WARN**) - Downtimed... 
[10:12:36] (03Merged) 10jenkins-bot: Update the datahub-frontend container to fix path issues [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935989 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [10:27:38] o/ elukey We might have multiple supporting package/service failures with the move [10:31:54] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10GitLab (Project Migration), 10Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (10BTullis) [10:36:26] stevemunene: the only thing that may complain is kafka test, but at steady state it shouldn't be a problem [10:39:46] I'll do it in the afternoon, please stop me in case you are against it :) [10:40:25] Nothing against it, go for it elukey :) [10:40:47] !log decommission analytics1062.eqiad.wmnet T339200 [10:40:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:40:50] T339200: decommission analytics1062.eqiad.wmnet - https://phabricator.wikimedia.org/T339200 [11:13:47] (03PS1) 10Btullis: Update the setup-elasticsearch container to fix path issue [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/936002 (https://phabricator.wikimedia.org/T329514) [11:18:36] !log decommission analytics1063.eqiad.wmnet T339201 [11:18:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:18:40] T339201: decommission analytics1063.eqiad.wmnet - https://phabricator.wikimedia.org/T339201 [11:25:32] (03CR) 10Btullis: [C: 03+2] Update the setup-elasticsearch container to fix path issue [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/936002 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [11:27:20] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1064.eqiad.wmnet - https://phabricator.wikimedia.org/T341204 (10Stevemunene) [11:28:44] 10Data-Platform-SRE, 10decommission-hardware: decommission 
analytics1065.eqiad.wmnet - https://phabricator.wikimedia.org/T341205 (10Stevemunene) [11:29:30] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1066.eqiad.wmnet - https://phabricator.wikimedia.org/T341206 (10Stevemunene) [11:30:11] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1067.eqiad.wmnet - https://phabricator.wikimedia.org/T341207 (10Stevemunene) [11:30:47] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1068.eqiad.wmnet - https://phabricator.wikimedia.org/T341208 (10Stevemunene) [11:31:22] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1069.eqiad.wmnet - https://phabricator.wikimedia.org/T341209 (10Stevemunene) [11:31:46] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1065.eqiad.wmnet - https://phabricator.wikimedia.org/T341205 (10Stevemunene) [11:31:48] 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Stevemunene) [11:31:55] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1066.eqiad.wmnet - https://phabricator.wikimedia.org/T341206 (10Stevemunene) [11:31:57] 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Stevemunene) [11:32:05] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1067.eqiad.wmnet - https://phabricator.wikimedia.org/T341207 (10Stevemunene) [11:32:08] 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Stevemunene) [11:32:14] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1068.eqiad.wmnet - https://phabricator.wikimedia.org/T341208 (10Stevemunene) [11:32:16] 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Stevemunene) [11:32:24] 10Data-Platform-SRE, 
10decommission-hardware: decommission analytics1069.eqiad.wmnet - https://phabricator.wikimedia.org/T341209 (10Stevemunene) [11:32:28] 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Stevemunene) [11:36:31] 10Data-Platform-SRE, 10Data-Catalog: Review and improve the build process for DataHub containers - https://phabricator.wikimedia.org/T303381 (10BTullis) I'm going to say that this is done. It's much better than it was, cleaner and easier to maintain. I've effectively un-forked our build process from the datahu... [11:38:27] (03Merged) 10jenkins-bot: Update the setup-elasticsearch container to fix path issue [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/936002 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [11:41:53] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10BTullis) [11:56:47] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1063.eqiad.wmnet - https://phabricator.wikimedia.org/T339201 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by stevemunene@cumin1001 for hosts: `analytics1063.eqiad.wmnet` - analytics1063.eqiad.wmnet (**FAIL**) - //Unable t... [11:59:25] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1063.eqiad.wmnet - https://phabricator.wikimedia.org/T339201 (10Stevemunene) Wipe of swraid, partition-table and filesystem signatures was performed during the first run of the playbook. 
[12:15:20] (03PS1) 10Btullis: Update the GMS container to address a path issue [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/936023 (https://phabricator.wikimedia.org/T329514) [12:15:31] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar: Test version compatibility between production Kafka and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host zookeeper-test1002.eqiad.wmnet... [12:15:35] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar: Test version compatibility between production Kafka and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host zookeeper-test1002.eqiad.wmnet with... [12:19:18] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1064.eqiad.wmnet - https://phabricator.wikimedia.org/T341204 (10Stevemunene) [12:19:21] 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Stevemunene) [12:21:07] (03CR) 10Btullis: [C: 03+2] Update the GMS container to address a path issue [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/936023 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [12:23:26] soooo I was about to start the cookbook but then I realized that I'd better dist-upgrade the zookeeper test node [12:23:31] since we have only one etc.. [12:24:04] elukey: Which cookbook, the reimage? [12:24:50] yeah [12:24:54] I'll follow https://phabricator.wikimedia.org/T332013#8724091 [12:27:52] Yeah, I suppose so. Looking at the output from `echo cons | nc localhost 2181` it looks like only the kafka-test cluster brokers are clients of it. [12:28:08] yes yes exactly [12:28:47] OK, I'm fine with a dist-upgrade then, if you are. [12:29:54] Thanks. 
[12:33:20] super thanks :) [12:33:23] will report when finished [12:34:36] (03Merged) 10jenkins-bot: Update the GMS container to address a path issue [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/936023 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [12:35:08] !log decommission analytics1064.eqiad.wmnet T341204 [12:35:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:35:11] T341204: decommission analytics1064.eqiad.wmnet - https://phabricator.wikimedia.org/T341204 [12:43:04] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar: Test version compatibility between production Kafka and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host zookeeper-test1002.eqiad.wmnet... [12:46:21] sooo of course the dist-upgrade went south, and I got locked out, sigh. Tried to get in again, but the vm seemed unusable, so I am reimaging. I'll also need to roll restart kafka test brokers afterwards [12:46:38] will let you know when things are good again [12:58:18] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1064.eqiad.wmnet - https://phabricator.wikimedia.org/T341204 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by stevemunene@cumin1001 for hosts: `analytics1064.eqiad.wmnet` - analytics1064.eqiad.wmnet (**WARN**) - Downtimed... [13:01:37] Ack, all the best elukey. Available to help where possible. [13:02:14] !log decommission analytics1065.eqiad.wmnet T341205 [13:02:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:02:17] T341205: decommission analytics1065.eqiad.wmnet - https://phabricator.wikimedia.org/T341205 [13:04:32] thanks! 
[13:04:41] filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/936032 since puppet doesn't like openjdk-17 :) [13:11:00] PROBLEM - Check systemd state on kafka-test1010 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:11:06] PROBLEM - Kafka Broker Server on kafka-test1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [13:11:33] PROBLEM - Kafka broker TLS certificate validity on kafka-test1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:12:29] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1065.eqiad.wmnet - https://phabricator.wikimedia.org/T341205 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by stevemunene@cumin1001 for hosts: `analytics1065.eqiad.wmnet` - analytics1065.eqiad.wmnet (**WARN**) - Downtimed... [13:13:05] thankssss [13:13:48] !log decommission analytics1066.eqiad.wmnet T341206 [13:13:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:13:51] T341206: decommission analytics1066.eqiad.wmnet - https://phabricator.wikimedia.org/T341206 [13:16:08] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10Papaul) @BTullis hey looked at the server yesterday everything on the server looks good so working with network team to see why the server is not getting anything DHCP. 
will let you know [13:17:46] 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Replace RAID controller battery on an-worker1095 - https://phabricator.wikimedia.org/T340946 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6f84de2d-a493-4b54-92d4-cefed7da6f97) set by btullis@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their s... [13:18:58] 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Replace RAID controller battery on an-worker1095 - https://phabricator.wikimedia.org/T340946 (10BTullis) @Jclark-ctr - I've shut down the machine and downtimed it. Feel free to boot it again normally after changing the battery. Many thanks. [13:20:27] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10BTullis) Great. Thanks for the update @Papaul [13:21:09] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data Pipelines (Sprint 14), 10Google-Chrome-User-Agent-Deprecation, 10Product-Analytics (Kanban): Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (10odimitrijevic) [13:23:01] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data Pipelines, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10odimitrijevic) 05Open→03Resolved a:03odimitrijevic [13:23:03] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data Pipelines, 10Product-Analytics, 10Epic: Replace Oozie with better workflow scheduler - https://phabricator.wikimedia.org/T271429 (10odimitrijevic) [13:23:49] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data Pipelines, 10Product-Analytics, 10Epic: Replace Oozie with better workflow scheduler - https://phabricator.wikimedia.org/T271429 (10odimitrijevic) 05Open→03Resolved a:03odimitrijevic So gratifying to be able to be closing this t... 
[13:23:52] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data Pipelines, 10Epic, 10Platform Team Workboards (Image Suggestion API): Airflow collaborations - https://phabricator.wikimedia.org/T282033 (10odimitrijevic) [13:25:23] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data Pipelines, 10Epic, 10Platform Team Workboards (Image Suggestion API): Airflow collaborations - https://phabricator.wikimedia.org/T282033 (10odimitrijevic) 05Open→03Resolved a:03odimitrijevic [13:29:14] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10Papaul) @BTullis it looks like we found the issue @cmooney has the fix at https://gerrit.wikimedia.org/r/c/operations/homer/public/+/936036 so i am waiting on the merge to re-test [13:30:26] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1066.eqiad.wmnet - https://phabricator.wikimedia.org/T341206 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by stevemunene@cumin1001 for hosts: `analytics1066.eqiad.wmnet` - analytics1066.eqiad.wmnet (**WARN**) - Downtimed... [13:32:56] 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Replace RAID controller battery on an-worker1095 - https://phabricator.wikimedia.org/T340946 (10Jclark-ctr) 05Open→03Resolved @BTullis replaced failed battery. server is booting up now [13:33:30] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10Papaul) @BTullis all yours, you are good to re-image [13:33:52] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar: Test version compatibility between production Kafka and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host zookeeper-test1002.eqiad.wmnet with... 
[13:35:51] RECOVERY - Check systemd state on kafka-test1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:55] RECOVERY - Kafka Broker Server on kafka-test1010 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [13:37:49] PROBLEM - Kafka broker TLS certificate validity on kafka-test1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:38:39] RECOVERY - Kafka broker TLS certificate validity on kafka-test1006 is OK: SSL OK - Certificate kafka-test1006.eqiad.wmnet valid until 2024-04-04 08:08:00 +0000 (expires in 272 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:38:51] I am restarting the kafka brokers [13:39:00] elukey: Ack, many thanks. [13:39:16] everything went horribly wrong of course :) [13:39:41] PROBLEM - Kafka broker TLS certificate validity on kafka-test1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:40:51] RECOVERY - Kafka broker TLS certificate validity on kafka-test1007 is OK: SSL OK - Certificate kafka-test1007.eqiad.wmnet valid until 2024-04-04 09:53:00 +0000 (expires in 272 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:41:13] 10Data-Engineering, 10Data Products, Metrics & Experimentation Team , 10Data Pipelines (Sprint 14), 10Google-Chrome-User-Agent-Deprecation, 10Product-Analytics (Kanban): Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (10odimitrijevic) [13:41:31] PROBLEM - Kafka broker TLS certificate validity on kafka-test1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL 
handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:42:03] RECOVERY - Kafka broker TLS certificate validity on kafka-test1008 is OK: SSL OK - Certificate kafka-test1008.eqiad.wmnet valid until 2024-04-04 14:13:00 +0000 (expires in 273 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:43:21] PROBLEM - Kafka broker TLS certificate validity on kafka-test1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:44:47] RECOVERY - Kafka broker TLS certificate validity on kafka-test1009 is OK: SSL OK - Certificate kafka-test1009.eqiad.wmnet valid until 2024-04-04 14:53:00 +0000 (expires in 273 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:47:11] RECOVERY - Kafka broker TLS certificate validity on kafka-test1010 is OK: SSL OK - Certificate kafka-test1010.eqiad.wmnet valid until 2024-04-04 15:44:00 +0000 (expires in 273 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:51:11] 10Data-Engineering: ProduceCanaryEvents job should be scheduled by Airflow - https://phabricator.wikimedia.org/T341229 (10Ottomata) [13:55:37] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data Pipelines: Refine jobs should be scheduled by Airflow - https://phabricator.wikimedia.org/T307505 (10Ottomata) [13:56:45] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data Pipelines: Refine jobs should be scheduled by Airflow - https://phabricator.wikimedia.org/T307505 (10Ottomata) [13:57:55] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data Pipelines: Refine jobs should be scheduled by Airflow - https://phabricator.wikimedia.org/T307505 (10Ottomata) [13:59:47] 10Data-Engineering, 10Data Engineering and Event Platform Team: 
ProduceCanaryEvents job should be scheduled by Airflow - https://phabricator.wikimedia.org/T341229 (10odimitrijevic) [14:00:20] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 0): ProduceCanaryEvents job should be scheduled by Airflow - https://phabricator.wikimedia.org/T341229 (10odimitrijevic) [14:01:38] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye [14:02:31] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10BTullis) Awesome. Many thanks @Papaul and @cmooney - Reimage under way now. [14:06:17] !log decommission analytics1067.eqiad.wmnet T341207 [14:06:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:06:20] T341207: decommission analytics1067.eqiad.wmnet - https://phabricator.wikimedia.org/T341207 [14:16:37] (03CR) 10Milimetric: [V: 03+2 C: 03+2] "Tests pass, Jenkins just can't use the required node v10 to run them. Merging this finishes the AQS Knowledge Gaps endpoint work." [analytics/aqs] - 10https://gerrit.wikimedia.org/r/933603 (https://phabricator.wikimedia.org/T337059) (owner: 10Nick Ifeajika) [14:18:47] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1067.eqiad.wmnet - https://phabricator.wikimedia.org/T341207 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by stevemunene@cumin1001 for hosts: `analytics1067.eqiad.wmnet` - analytics1067.eqiad.wmnet (**WARN**) - Downtimed... 
[14:18:50] (03PS1) 10Mazevedo: Add new property to ios_talk_page_interaction schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/936048 (https://phabricator.wikimedia.org/T334973) [14:19:53] !log decommission analytics1068.eqiad.wmnet T341208 [14:19:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:19:56] T341208: decommission analytics1068.eqiad.wmnet - https://phabricator.wikimedia.org/T341208 [14:27:35] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye executed with errors: - an-test-worker10... [14:29:07] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye [14:29:19] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1068.eqiad.wmnet - https://phabricator.wikimedia.org/T341208 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by stevemunene@cumin1001 for hosts: `analytics1068.eqiad.wmnet` - analytics1068.eqiad.wmnet (**WARN**) - Downtimed... 
[14:30:53] !log decommission analytics1069.eqiad.wmnet T341209 [14:30:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:30:56] T341209: decommission analytics1069.eqiad.wmnet - https://phabricator.wikimedia.org/T341209 [14:33:43] (03CR) 10Tsevener: [C: 03+2] Add new property to ios_talk_page_interaction schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/936048 (https://phabricator.wikimedia.org/T334973) (owner: 10Mazevedo) [14:34:14] (03Merged) 10jenkins-bot: Add new property to ios_talk_page_interaction schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/936048 (https://phabricator.wikimedia.org/T334973) (owner: 10Mazevedo) [14:46:20] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1069.eqiad.wmnet - https://phabricator.wikimedia.org/T341209 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by stevemunene@cumin1001 for hosts: `analytics1069.eqiad.wmnet` - analytics1069.eqiad.wmnet (**WARN**) - Downtimed... [14:50:15] joal: o/ [14:50:31] do we still run webrequest refine on the test cluster? [14:51:21] !log upgraded zookeeper-test1002 to bookworm, but its metadata got re-initialized as well (my bad for this) [14:51:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:51:30] elukey: I think I know this. Yes we do, it's an airflow job now. [14:51:39] so kafka test is up and running, but still not working 100% right [14:51:47] btullis: ah nice, what topic does it pull from? [14:52:06] because I think that webrequest topics are not pushed anymore on kafka test [14:52:35] ahhh kafkatee-webrequest-test.service on an-test-coord [14:52:37] Err. Now you're asking. one sec. [14:52:50] Oh you found it? [14:53:00] yep yep! 
I recalled that bit, restarted the unit [14:53:02] let's see [14:53:02] :) [14:57:07] ahh no ok the test webrequest goes to jumbo [14:57:10] not to kafka test [14:57:42] two brokers are still not receiving traffic, will try to investigate tomorrow why [14:57:47] but overall it works [14:57:51] sorry for the trouble :( [15:01:48] How's kafka-test1006 ? I'm interested because I think that my datahub staging deployment tries to use it for bootstrapping. [15:02:37] btullis: all good, I see a datahub-related topic listed by kafka topics --describe [15:03:14] btullis: but since I wiped zookeeper the consumer group (if it uses it) may be inconsistent, so if you want I can roll restart staging [15:03:51] elukey: That's OK, I think I may have bigger problems :-) [15:04:48] ack! :) [15:16:12] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye completed: - an-test-worker1003 (**PASS*... 
[15:50:40] (03PS1) 10Btullis: Update the path to the jar file for the MCE and MAE consumer images [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/936061 (https://phabricator.wikimedia.org/T329514) [17:12:01] (03CR) 10Btullis: [C: 03+2] Update the path to the jar file for the MCE and MAE consumer images [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/936061 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [17:16:08] (03Merged) 10jenkins-bot: Update the path to the jar file for the MCE and MAE consumer images [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/936061 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [17:32:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:33:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:39:24] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: mediawiki page_content_change should generate new meta.id field - https://phabricator.wikimedia.org/T341277 (10Ottomata) [18:49:47] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform, and 3 
others: Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (10Ottomata) Meeting today, discussed / decided the follow... [18:52:58] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform, and 3 others: Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (10Ottomata) [19:42:04] 10Data-Platform-SRE, 10Discovery-Search (Current work): Document SRE steps for deploying a new WDQS (and WCQS) host - https://phabricator.wikimedia.org/T330714 (10bking) a:03bking [19:44:42] 10Data-Platform-SRE, 10Discovery-Search (Current work): Document SRE steps for deploying a new WDQS (and WCQS) host - https://phabricator.wikimedia.org/T330714 (10bking) [19:44:45] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Configure new WDQS servers in codfw (wdqs20[13-22]) - https://phabricator.wikimedia.org/T332314 (10bking) [20:23:58] 10Data-Platform-SRE, 10Discovery-Search (Current work): Diagnose and fix WDQS deployment process - https://phabricator.wikimedia.org/T341290 (10bking)