[00:51:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:21:48] (CR) Milimetric: T340880 Merge visibility changes into hourly target table (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/937047 (owner: Jennifer Ebe)
[02:38:37] (CR) Milimetric: "Oh, my bad, I see this here now: https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge_requests/2/diffs" [analytics/refinery] - https://gerrit.wikimedia.org/r/937047 (owner: Jennifer Ebe)
[04:51:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:28] Data-Platform-SRE, DBA, cloud-services-team: Migrate wiki replicas (clouddb*) hosts to MariaDB 10.6 - https://phabricator.wikimedia.org/T334651 (Marostegui) Could I get an answer on this please?
[07:09:45] Data-Platform-SRE, DBA: Migrate dbstore1005 to MariaDB 10.6 - https://phabricator.wikimedia.org/T334652 (Marostegui) Could I get an answer on this please?
[08:32:52] Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (BTullis) >>! In T342141#9026115, @Papaul wrote: > @BTullis we had the same issue with sessionstore2001 in codfw see task below what we...
[08:40:39] Data-Platform-SRE, Discovery-Search (Current work), Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (Gehel) p:Triage→High
[08:51:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:58:53] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (BTullis) The value for under-replicated blocks is still at around 4.5 million, although dropping. {F37143470,width=60%} https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&viewPanel=41&from=now-2d&to=now...
[09:13:38] !log deploying airflow-dags for analytics_test to an-test-client1001
[09:13:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:13:55] !log correction: to an-test-client1002
[09:13:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:17:15] Data-Platform-SRE, Patch-For-Review: Migrate analytics_test airflow instance to bullseye an-test-client1002 - https://phabricator.wikimedia.org/T341700 (BTullis) OK, it looks like the instructions are a little incomplete. For each instance in here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Sy...
[09:26:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:31:29] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (BTullis) a:BTullis→Stevemunene @Stevemunene we're no longer going to be the early adopters of OIDC now within the foundation. There are now two other proj...
[09:34:46] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (jbond) cc @SLyngshede-WMF who worked on both the netbox and gitlab integrations, as well as the initial idm implementation.
[09:46:21] Data-Platform-SRE, DBA: Migrate dbstore1005 to MariaDB 10.6 - https://phabricator.wikimedia.org/T334652 (BTullis) Apologies for the delay @Marostegui - You can go ahead and do this any time this week or next. Thanks.
[09:57:18] Data-Platform-SRE, DBA, cloud-services-team: Migrate wiki replicas (clouddb*) hosts to MariaDB 10.6 - https://phabricator.wikimedia.org/T334651 (BTullis) @Marostegui - you can upgrade clouddb1021 whenever is convenient for you, this week or next. I also have no objections to the work on clouddb1019...
[10:06:19] !log restarting java services on an-test-coord1001 for JVM update
[10:06:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:08:08] (PS1) Jennifer Ebe: Update changelog for v0.2.18 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/939642
[10:08:54] (CR) Joal: [V: +2 C: +2] "LGTM - merging for deploy" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/939642 (owner: Jennifer Ebe)
[10:12:40] Starting build #123 for job analytics-refinery-maven-release-docker
[10:14:38] !log restarting presto-service on an-coord1001 for T329716
[10:14:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:15:20] !log restarting oozie service on an-coord1001 for T329716
[10:15:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:25:34] Project analytics-refinery-maven-release-docker build #123: SUCCESS in 12 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/123/
[10:47:39] Starting build #82 for job analytics-refinery-update-jars-docker
[10:48:00] Project analytics-refinery-update-jars-docker build #82: SUCCESS in 20 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/82/
[10:48:00] (PS1) Maven-release-user: Add refinery-source jars for v0.2.18 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/939274
[10:51:01] (CR) Jennifer Ebe: [V: +2 C: +2] "Merging for deployment" [analytics/refinery] - https://gerrit.wikimedia.org/r/939274 (owner: Maven-release-user)
[10:54:45] !log migrating hive services to an-coord1002 via DNS for T329716 (to permit restart of hive services on an-coord1001).
[10:54:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:57:32] !log deploying refinery using scap
[10:57:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:58:10] (CR) Gmodena: [C: +1] Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/secondary] - https://gerrit.wikimedia.org/r/939367 (https://phabricator.wikimedia.org/T340765) (owner: TChin)
[10:59:47] (CR) Gmodena: [C: +1] "LGTM. Could you maybe add a comment re the jsonschema-tools version bump requiring tests skip?" [schemas/event/primary] - https://gerrit.wikimedia.org/r/939366 (https://phabricator.wikimedia.org/T340765) (owner: TChin)
[11:10:39] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) My plan is to: # add a silence in alertmanager for db1108 and db1208 # stop mariadb on db1108 # configure MariaDB on db1...
[11:22:54] !log deploying refinery to hdfs
[11:22:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:33:08] jennifer_ebe: joal Do you still have to use the workaround described here: https://phabricator.wikimedia.org/T334493 when doing the refinery-deploy-to-hdfs step?
[11:33:17] btullis: we had to yes :(
[11:33:57] OK, thanks. I still haven't got a proper solution for it yet, but I'll address it soon.
[11:35:44] Thanks a lot btullis <3
[11:38:11] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) Does db1108 replicate from somewhere? If it does, you'd need to do some steps in between (I can help with).
[11:39:43] Hello btullis kindly help review and merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/939667
[11:40:29] jennifer_ebe: Ack, will do.
[11:41:16] Data-Platform-SRE, Patch-For-Review: analytics/refinery deployment broken at refinery-deploy-to-hdfs - https://phabricator.wikimedia.org/T334493 (BTullis) @MoritzMuehlenhoff made a useful suggestion on that patch, which I'll put here so I don't lose it. > If there a way to determine the hash to be used in...
[11:45:40] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) >>! In T334055#9027357, @Marostegui wrote: > Does db1108 replicate from somewhere? If it does, you'd need to do some step...
[11:49:21] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) Ok, then your steps would be: # add a silence in alertmanager for db1108 and db1208 # connect to each mysql instanc...
[11:58:11] jennifer_ebe: Merged and deployed. New version pulled on an-launcher1002.
[12:01:43] (SystemdUnitFailed) resolved: jupyter-appledora-singleuser-conda-analytics.service Failed on stat1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:03:32] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) Silences added: Slaves stopped: * Slave status of analytics_meta: P49601 * Slave status of matomo: P49602 MariaDB insta...
[12:04:52] PROBLEM - mysqld processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:05:40] ^ this is me. Sorry, I thought a silence on alertmanager would stop this alerting, but clearly it didn't.
[12:06:45] ACKNOWLEDGEMENT - mysqld processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Btullis T334055 - migrating to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:08:12] (SystemdUnitFailed) firing: (2) prometheus-mysqld-exporter.service Failed on db1208:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:33:14] thanks a lot for the refine patch btullis :)
[12:33:51] joal: It's a pleasure.
[12:37:02] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) Starting the transfer of data now. ` btullis@cumin1001:~$ sudo transfer.py --no-encrypt db1108.eqiad.wmnet:/srv db1208.eq...
[12:38:35] !log deploy Airflow analytics dags - Full revamp of cassandra loading jobs
[12:38:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:01:50] Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (Papaul) @BTullis yes that is a possibility too to use the 10G nic since those 2 nodes each has 4x1G nic and 2x10G nic. There are 2 way...
[13:04:24] Data-Platform-SRE: Alert review: SystemdUnitFailed - https://phabricator.wikimedia.org/T342247 (LSobanski)
[13:06:47] Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (Jclark-ctr) a:Jclark-ctr
[13:07:17] Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (Jclark-ctr) @BTullis I replaced both sfpt and link returned
[13:14:50] Data-Platform-SRE: analytics/refinery deployment broken at refinery-deploy-to-hdfs - https://phabricator.wikimedia.org/T334493 (Aklapper)
[13:15:36] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye
[13:19:02] Data-Platform-SRE: Alert triage - https://phabricator.wikimedia.org/T342247 (LSobanski)
[13:25:45] Data-Platform-SRE: Alert triage - https://phabricator.wikimedia.org/T342247 (BTullis) a:bking Assigning to @bking because it looks like it might be related to work he's doing in {T332314}
[13:36:04] Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (BTullis) @Jclark-ctr - many thanks for doing that. I just checked with another run of the cookbook on analytics1073 and it doesn't loo...
[13:39:12] Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, SRE-Access-Requests: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (andrea.denisse) Open→Resolved Marking as resolved. :)
[13:42:42] Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (BTullis) > There are 2 ways you will be able to switch to using the 10G nic on those servers. 1- Decommission the server and provision...
[13:44:53] Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (Papaul) the interface came up and went down ` papaul@asw2-b-eqiad> show interfaces descriptions ge-7/0/15 Interface Admin Link D...
[13:44:55] !log restarting hive-server2 and hive-metastore services on an-coord1001 (currently standby)
[13:44:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:45:06] Data-Platform-SRE, Discovery-Search (Current work): Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (bking)
[13:45:09] Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (bking)
[13:45:51] Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (Papaul) right now 1075 is showing up ` papaul@asw2-c-eqiad> show interfaces descriptions | match analytics1075 ge-7/0/5 up...
[13:46:33] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) The sync finished successfully. One warning about a small size mismatch. ` 2023-07-19 12:39:32 WARNING: Original size is...
[13:51:03] Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (BTullis) >>! In T342141#9028025, @Papaul wrote: > right now 1075 is showing up > ` > papaul@asw2-c-eqiad> show interfaces description...
[13:53:26] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) I had to move files manually, because it ended up in a `/srv/srv/` directory, but I've put it in the right place now and...
[14:01:00] Data-Platform-SRE: Alert triage - https://phabricator.wikimedia.org/T342247 (bking) Sorry for the spam....we're trying to fix [[ https://phabricator.wikimedia.org/T340793 | the issue of our cookbook removing downtimes ]] , but for now I've set a 14-day downtime for these hosts. We'll be more vigilant about...
[14:03:43] milimetric: Good morning - Would you by any chance have a minute to validate gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/462 please ?
[14:11:29] oh yeah, more jars, cool, merged
[14:14:25] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1075.eqiad.wmnet with OS bullseye
[14:14:36] Data-Engineering, Data-Platform-SRE, SRE, SRE Observability: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (lmata)
[14:19:52] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye executed with errors: - analytics1073 (**FAIL**) - Removed from Puppet...
[14:20:25] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye
[14:22:10] Data-Engineering, Data-Platform-SRE, SRE: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (herron) Untagging observability to table this wrt the kafka-logging cluster for the time being. Will need to revisit the kafka-loggin...
[14:22:54] Data-Engineering, Data Engineering and Event Platform Team (Sprint 0), Event-Platform: mw-page-content-change-enrich: alert on SLIs degradation only on active DC - https://phabricator.wikimedia.org/T342258 (gmodena)
[14:23:18] Data-Engineering, Data Engineering and Event Platform Team (Sprint 0), Event-Platform: mw-page-content-change-enrich: alert on SLIs degradation only on active DC - https://phabricator.wikimedia.org/T342258 (gmodena)
[14:35:20] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) I went to start the services, but they weren't listed, so I enabled them by name and started them by name: ` Created symlink /etc/systemd/syste...
[14:36:37] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) Just checked why they started and worked. Your setup is different from production. Your relay logs aren't linked to the host name, so that's...
[14:45:59] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) >>! In T334055#9028376, @Marostegui wrote: > Just checked why they started and worked. Your setup is different from production. Your relay logs...
[14:48:12] (SystemdUnitFailed) resolved: prometheus-mysqld-exporter.service Failed on db1208:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:09:42] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) You can just disable it and reset it so it clears the alert
[15:09:59] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) I'm also not seeing it appear on [[https://grafana-rw.wikimedia.org/d/000000273/mysql?orgId=1|here]] yet, after waiting a while and refreshing...
[15:14:13] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) >>! In T334055#9028549, @Marostegui wrote: > You can just disable it and reset it so it clears the alert Done. Thanks for clarification.
[15:15:55] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) >>! In T334055#9028552, @BTullis wrote: > I'm also not seeing it appear on [[https://grafana-rw.wikimedia.org/d/000000273/mysql?orgId=1|here...
[15:19:48] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) > I need to add it to zarcillo first Thanks. I forgot that you mentioned that.
[15:31:53] Data-Engineering, Data-Engineering-Dashiki, Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (Krinkle)
[15:34:23] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1075.eqiad.wmnet with OS bullseye executed with errors: - analytics1075 (**FAIL**) - Removed from Puppet...
[15:40:26] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye executed with errors: - analytics1073 (**FAIL**) - Removed from Puppet...
[15:41:22] Data-Engineering, Data-Engineering-Dashiki, Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (Krinkle)
[15:43:14] Data-Engineering, Data-Engineering-Dashiki, Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (Krinkle) First off. We can query the underlying pageviews Hadoop dataset, using Turnilo to get a rough sense of th...
[15:51:43] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) Just added it to zarcillo
[15:54:56] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) @BTullis can I just install mariadb 10.6 on this host before it goes to production so we don't have to do it at a later time when it might...
[15:57:09] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Data-Persistence: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui)
[16:00:40] Data-Platform-SRE, Discovery-Search (Current work): Investigate WDQS categories update failures on Bullseye hosts - https://phabricator.wikimedia.org/T342060 (bking)
[16:14:42] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Data-Persistence: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) >>! In T334055#9028772, @Marostegui wrote: > @BTullis can I just install mariadb 10.6 on this host before it goes to prod...
[16:19:42] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@matomo.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:35:50] !log Deploy airflow fix for cassandra loading jobs
[16:35:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:53:13] Data-Platform-SRE, decommission-hardware: decommission db1108.eqiad.wmnet - https://phabricator.wikimedia.org/T336254 (BTullis) Stalled→Open
[17:03:28] Data-Engineering, Data-Engineering-Dashiki, Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (Milimetric) I'm sorry I've been so slow to take this more seriously. I took it seriously, just there's always oth...
[17:04:31] (Abandoned) Stevemunene: Build datahub v0.10.0 containers [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/898956 (https://phabricator.wikimedia.org/T329514) (owner: Stevemunene)
[17:04:42] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@matomo.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:16:29] Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (BTullis) We've been investigating this extensively and discussing in some depth on #wikimedia-dcops on IRC. We've decided to go ahead...
[17:20:18] btullis re: ^^ I did a similar maintenance a few months back, posting https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Change_server_NIC_and_switch_connection,_keeping_IPs and https://etherpad.wikimedia.org/p/T322082 in hopes it might be useful
[17:22:53] Data-Platform-SRE, Data Engineering and Event Platform Team, GitLab (Project Migration), Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (BTullis) a:BTullis @thcipriani - @hashar - are you both happy for m...
[17:24:38] Data-Engineering, Data-Engineering-Dashiki, Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (Milimetric) ` presto use analytics_hive.milimetric; select browser_family, sum(view_count) as total_vie...
[17:24:42] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@matomo.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:28:28] Data-Engineering, Data-Engineering-Dashiki, Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (Krinkle) ### Down to the source I'll work backwards from the Dashiki frontend at [analytics.wikimedia.org](https:...
[17:29:34] Data-Platform-SRE, Data Engineering and Event Platform Team, GitLab (Project Migration), Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (thcipriani) >>! In T341194#9029097, @BTullis wrote: > @thcipriani - @ha...
[17:29:42] (SystemdUnitFailed) firing: (2) ferm.service Failed on druid1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:32:38] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Data-Persistence: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) a:BTullis→Marostegui I think everything is now done from our side, then. I'll proceed with {T336254} shortly. Apo...
[17:34:41] Data-Engineering, Data-Engineering-Dashiki, Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (Milimetric) > 1. Is it feasible to compute this data such that the threshold is applied last? I can think of two a...
[17:34:42] (SystemdUnitFailed) resolved: (2) ferm.service Failed on druid1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:35:22] @Krinkle: I think we found the same problem, but you found it more :) I'm all for re-computing these. Pageview_hourly is relatively small, and we can re-run jobs slowly over some period of time
[17:35:31] (gtg for a while but I left comments on the task)
[17:36:24] milimetric: awesome!
[18:07:18] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Data-Persistence: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) The host is now showing up on icinga
[18:08:23] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Data-Persistence: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) I'll get it to Mariadb 10.6 tomorrow or Friday and close this task when done
[18:57:15] Data-Engineering, Data-Platform-SRE, DC-Ops, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (RobH)
[18:58:00] Data-Engineering, Data-Platform-SRE, DC-Ops, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (RobH)
[19:36:49] Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: `flink-zk1003.eqiad.wmnet` - flink-zk1003.eqiad.wmnet...
[19:42:15] Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[20:33:43] Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w...
[20:34:21] (CR) Jdlrobson: [C: +2] Update web ui scroll [schemas/event/secondary] - https://gerrit.wikimedia.org/r/938284 (https://phabricator.wikimedia.org/T337270) (owner: Kimberly Sarabia)
[20:34:54] (Merged) jenkins-bot: Update web ui scroll [schemas/event/secondary] - https://gerrit.wikimedia.org/r/938284 (https://phabricator.wikimedia.org/T337270) (owner: Kimberly Sarabia)
[20:55:07] Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: `flink-zk1003.eqiad.wmnet` - flink-zk1003.eqiad.wmnet...
[21:41:08] Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[22:36:33] Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w...