[00:51:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:21:48] <wikibugs>	 (CR) Milimetric: T340880 Merge visibility changes into hourly target table (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/937047 (owner: Jennifer Ebe)
[02:38:37] <wikibugs>	 (CR) Milimetric: "Oh, my bad, I see this here now: https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge_requests/2/diffs" [analytics/refinery] - https://gerrit.wikimedia.org/r/937047 (owner: Jennifer Ebe)
[04:51:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:28] <wikibugs>	 Data-Platform-SRE, DBA, cloud-services-team: Migrate wiki replicas (clouddb*) hosts to MariaDB 10.6 - https://phabricator.wikimedia.org/T334651 (Marostegui) Could I get an answer on this please?
[07:09:45] <wikibugs>	 Data-Platform-SRE, DBA: Migrate dbstore1005 to MariaDB 10.6 - https://phabricator.wikimedia.org/T334652 (Marostegui) Could I get an answer on this please?
[08:32:52] <wikibugs>	 Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (BTullis) >>! In T342141#9026115, @Papaul wrote: > @BTullis we had the same issue with sessionstore2001 in codfw see task below what we...
[08:40:39] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work), Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (Gehel) p:Triage→High
[08:51:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:58:53] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (BTullis) The value for under-replicated blocks is still at around 4.5 million, although dropping. {F37143470,width=60%} https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&viewPanel=41&from=now-2d&to=now...
[09:13:38] <btullis>	 !log deploying airflow-dags for analytics_test to an-test-client1001
[09:13:40] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:13:55] <btullis>	 !log correction: to an-test-client1002
[09:13:57] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:17:15] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Migrate analytics_test airflow instance to bullseye  an-test-client1002 - https://phabricator.wikimedia.org/T341700 (BTullis) OK, it looks like the instructions are a little incomplete. For each instance in here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Sy...
[09:26:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:31:29] <wikibugs>	 Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (BTullis) a:BTullis→Stevemunene @Stevemunene we're no longer going to be the early adopters of OIDC now within the foundation.  There are now two other proj...
[09:34:46] <wikibugs>	 Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (jbond) cc @SLyngshede-WMF who worked on both the netbox and gitlab integrations, as well as the initial idm implementation.
[09:46:21] <wikibugs>	 Data-Platform-SRE, DBA: Migrate dbstore1005 to MariaDB 10.6 - https://phabricator.wikimedia.org/T334652 (BTullis) Apologies for the delay @Marostegui - You can go ahead and do this any time this week or next. Thanks.
[09:57:18] <wikibugs>	 Data-Platform-SRE, DBA, cloud-services-team: Migrate wiki replicas (clouddb*) hosts to MariaDB 10.6 - https://phabricator.wikimedia.org/T334651 (BTullis) @Marostegui - you can upgrade clouddb1021 whenever is convenient for you, this week or next.  I also have no objections to the work on clouddb1019...
[10:06:19] <btullis>	 !log restarting java services on an-test-coord1001 for JVM update
[10:06:20] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:08:08] <wikibugs>	 (PS1) Jennifer Ebe: Update changelog for v0.2.18 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/939642
[10:08:54] <wikibugs>	 (CR) Joal: [V: +2 C: +2] "LGTM - merging for deploy" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/939642 (owner: Jennifer Ebe)
[10:12:40] <wmf-insecte>	 Starting build #123 for job analytics-refinery-maven-release-docker
[10:14:38] <btullis>	 !log restarting presto-service on an-coord1001 for T329716
[10:14:40] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:15:20] <btullis>	 !log restarting oozie service on an-coord1001 for T329716
[10:15:21] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:25:34] <wmf-insecte>	 Project analytics-refinery-maven-release-docker build #123: SUCCESS in 12 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/123/
[10:47:39] <wmf-insecte>	 Starting build #82 for job analytics-refinery-update-jars-docker
[10:48:00] <wmf-insecte>	 Project analytics-refinery-update-jars-docker build #82: SUCCESS in 20 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/82/
[10:48:00] <wikibugs>	 (PS1) Maven-release-user: Add refinery-source jars for v0.2.18 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/939274
[10:51:01] <wikibugs>	 (CR) Jennifer Ebe: [V: +2 C: +2] "Merging for deployment" [analytics/refinery] - https://gerrit.wikimedia.org/r/939274 (owner: Maven-release-user)
[10:54:45] <btullis>	 !log migrating hive services to an-coord1002 via DNS for T329716 (to permit restart of hive services on an-coord1001).
[10:54:47] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:57:32] <jennifer_ebe>	 !log deploying refinery using scap
[10:57:33] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:58:10] <wikibugs>	 (CR) Gmodena: [C: +1] Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/secondary] - https://gerrit.wikimedia.org/r/939367 (https://phabricator.wikimedia.org/T340765) (owner: TChin)
[10:59:47] <wikibugs>	 (CR) Gmodena: [C: +1] "LGTM. Could you maybe add a comment re the jsonschema-tools version bump requiring tests skip?" [schemas/event/primary] - https://gerrit.wikimedia.org/r/939366 (https://phabricator.wikimedia.org/T340765) (owner: TChin)
[11:10:39] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) My plan is to: # add a silence in alertmanager for db1108 and db1208  # stop mariadb on db1108 # configure MariaDB on db1...
[11:22:54] <jennifer_ebe>	 !log deploying refinery to hdfs
[11:22:55] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:33:08] <btullis>	 jennifer_ebe: joal Do you still have to use the workaround described here: https://phabricator.wikimedia.org/T334493 when doing the refinery-deploy-to-hdfs step?
[11:33:17] <joal>	 btullis: we had to yes :(
[11:33:57] <btullis>	 OK, thanks. I still haven't got a proper solution for it yet, but I'll address it soon.
[11:35:44] <joal>	 Thanks a lot btullis <3
[11:38:11] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) Does db1108 replicate from somewhere? If it does, you'd need to do some steps in between (I can help with).
[11:39:43] <jennifer_ebe>	 Hello btullis kindly help review and merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/939667
[11:40:29] <btullis>	 jennifer_ebe: Ack, will do.
[11:41:16] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: anlytics/refinery deployment broken at refinery-deploy-to-hdfs - https://phabricator.wikimedia.org/T334493 (BTullis) @MoritzMuehlenhoff made a useful suggestion on that patch, which I'll put here so I don't lose it. > If there a way to determine the hash to be used in...
[11:45:40] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) >>! In T334055#9027357, @Marostegui wrote: > Does db1108 replicate from somewhere? If it does, you'd need to do some step...
[11:49:21] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) Ok, then your steps would be:  # add a silence in alertmanager for db1108 and db1208  # connect to each mysql instanc...
[11:58:11] <btullis>	 jennifer_ebe: Merged and deployed. New version pulled on an-launcher1002.
[12:01:43] <jinxer-wm>	 (SystemdUnitFailed) resolved: jupyter-appledora-singleuser-conda-analytics.service Failed on stat1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:03:32] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) Silences added: Slaves stopped:  * Slave status of analytics_meta: P49601 * Slave status of matomo: P49602  MariaDB insta...
[12:04:52] <icinga-wm>	 PROBLEM - mysqld processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:05:40] <btullis>	 ^ this is me. Sorry, I thought a silence on alertmanager would stop this alerting, but clearly it didn't.
[12:06:45] <icinga-wm>	 ACKNOWLEDGEMENT - mysqld processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Btullis T334055 - migrating to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:08:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) prometheus-mysqld-exporter.service Failed on db1208:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:33:14] <joal>	 thanks a lot for the refine patch btullis :)
[12:33:51] <btullis>	 joal: It's a pleasure.
[12:37:02] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) Starting the transfer of data now. ` btullis@cumin1001:~$ sudo transfer.py --no-encrypt db1108.eqiad.wmnet:/srv db1208.eq...
[12:38:35] <joal>	 !log deploy Airflow analytics dags - Full revamp of cassandra loading jobs
[12:38:37] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:01:50] <wikibugs>	 Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (Papaul) @BTullis yes that is a possibility too to use the 10G nic since those 2 nodes each has 4x1G nic and 2x10G nic. There are 2 way...
[13:04:24] <wikibugs>	 Data-Platform-SRE: Alert review: SystemdUnitFailed - https://phabricator.wikimedia.org/T342247 (LSobanski)
[13:06:47] <wikibugs>	 Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (Jclark-ctr) a:Jclark-ctr
[13:07:17] <wikibugs>	 Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (Jclark-ctr) @BTullis  I replaced both sfpt and link returned
[13:14:50] <wikibugs>	 Data-Platform-SRE: analytics/refinery deployment broken at refinery-deploy-to-hdfs - https://phabricator.wikimedia.org/T334493 (Aklapper)
[13:15:36] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye
[13:19:02] <wikibugs>	 Data-Platform-SRE: Alert triage - https://phabricator.wikimedia.org/T342247 (LSobanski)
[13:25:45] <wikibugs>	 Data-Platform-SRE: Alert triage - https://phabricator.wikimedia.org/T342247 (BTullis) a:bking Assigning to @bking because it looks like it might be related to work he's doing in {T332314}
[13:36:04] <wikibugs>	 Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (BTullis) @Jclark-ctr - many thanks for doing that. I just checked with another run of the cookbook on analytics1073 and it doesn't loo...
[13:39:12] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, SRE-Access-Requests: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (andrea.denisse) Open→Resolved Marking as resolved. :)
[13:42:42] <wikibugs>	 Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (BTullis) > There are 2 ways you will be able to switch to using the 10G nic on those servers. 1- Decommission the server and provision...
[13:44:53] <wikibugs>	 Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (Papaul) the interface came up and went down  ` papaul@asw2-b-eqiad> show interfaces descriptions ge-7/0/15 Interface       Admin Link D...
[13:44:55] <btullis>	 !log restarting hive-server2 and hive-metastore services on an-coord1001 (currently standby)
[13:44:56] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:45:06] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work): Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (bking)
[13:45:09] <wikibugs>	 Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (bking)
[13:45:51] <wikibugs>	 Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (Papaul) right now 1075 is showing up  ` papaul@asw2-c-eqiad> show interfaces descriptions | match analytics1075 ge-7/0/5        up...
[13:46:33] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) The sync finished successfully. One warning about a small size mismatch. ` 2023-07-19 12:39:32  WARNING: Original size is...
[13:51:03] <wikibugs>	 Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (BTullis) >>! In T342141#9028025, @Papaul wrote: > right now 1075 is showing up  > ` > papaul@asw2-c-eqiad> show interfaces description...
[13:53:26] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) I had to move files manually, because it ended up in a `/srv/srv/` directory, but I've put it in the right place now and...
[14:01:00] <wikibugs>	 Data-Platform-SRE: Alert triage - https://phabricator.wikimedia.org/T342247 (bking) Sorry for the spam....we're trying to fix [[ https://phabricator.wikimedia.org/T340793 | the issue of our cookbook removing downtimes ]]  , but for now I've set a 14-day downtime for these hosts. We'll be more vigilant about...
[14:03:43] <joal>	 milimetric: Good morning - Would you by any chance have a minute to validate gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/462 please ?
[14:11:29] <milimetric>	 oh yeah, more jars, cool, merged
[14:14:25] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1075.eqiad.wmnet with OS bullseye
[14:14:36] <wikibugs>	 Data-Engineering, Data-Platform-SRE, SRE, SRE Observability: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (lmata)
[14:19:52] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye executed with errors: - analytics1073 (**FAIL**)   - Removed from Puppet...
[14:20:25] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye
[14:22:10] <wikibugs>	 Data-Engineering, Data-Platform-SRE, SRE: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (herron) Untagging observability to table this wrt the kafka-logging cluster for the time being.  Will need to revisit the kafka-loggin...
[14:22:54] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 0), Event-Platform: mw-page-content-change-enrich:  alert on SLIs degradation only on active DC - https://phabricator.wikimedia.org/T342258 (gmodena)
[14:23:18] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 0), Event-Platform: mw-page-content-change-enrich:  alert on SLIs degradation only on active DC - https://phabricator.wikimedia.org/T342258 (gmodena)
[14:35:20] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) I went to start the services, but they weren't listed, so I enabled them by name and started them by name: ` Created symlink /etc/systemd/syste...
[14:36:37] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) Just checked why they started and worked. Your setup is different from production. Your relay logs aren't linked to the host name, so that's...
[14:45:59] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) >>! In T334055#9028376, @Marostegui wrote: > Just checked why they started and worked. Your setup is different from production. Your relay logs...
[14:48:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: prometheus-mysqld-exporter.service Failed on db1208:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:09:42] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) You can just disable it and reset it so it clears the alert
[15:09:59] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) I'm also not seeing it appear on [[https://grafana-rw.wikimedia.org/d/000000273/mysql?orgId=1|here]] yet, after waiting a while and refreshing...
[15:14:13] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) >>! In T334055#9028549, @Marostegui wrote: > You can just disable it and reset it so it clears the alert   Done. Thanks for clarification.
[15:15:55] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) >>! In T334055#9028552, @BTullis wrote: > I'm also not seeing it appear on [[https://grafana-rw.wikimedia.org/d/000000273/mysql?orgId=1|here...
[15:19:48] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) > I need to add it to zarcillo first Thanks. I forgot that you mentioned that.
[15:31:53] <wikibugs>	 Data-Engineering, Data-Engineering-Dashiki, Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (Krinkle)
[15:34:23] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1075.eqiad.wmnet with OS bullseye executed with errors: - analytics1075 (**FAIL**)   - Removed from Puppet...
[15:40:26] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye executed with errors: - analytics1073 (**FAIL**)   - Removed from Puppet...
[15:41:22] <wikibugs>	 Data-Engineering, Data-Engineering-Dashiki, Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (Krinkle)
[15:43:14] <wikibugs>	 Data-Engineering, Data-Engineering-Dashiki, Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (Krinkle) First off. We can query the underlying pageviews Hadoop dataset, using Turnilo to get a rough sense of th...
[15:51:43] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) Just added it to zarcillo
[15:54:56] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) @BTullis can I just install mariadb 10.6 on this host before it goes to production so we don't have to do it at a later time when it might...
[15:57:09] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Data-Persistence: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui)
[16:00:40] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work): Investigate WDQS categories update failures on Bullseye hosts - https://phabricator.wikimedia.org/T342060 (bking)
[16:14:42] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Data-Persistence: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) >>! In T334055#9028772, @Marostegui wrote: > @BTullis can I just install mariadb 10.6 on this host before it goes to prod...
[16:19:42] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@matomo.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:35:50] <joal>	 !log Deploy airflow fix for cassandra loading jobs
[16:35:52] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:53:13] <wikibugs>	 Data-Platform-SRE, decommission-hardware: decommission db1108.eqiad.wmnet - https://phabricator.wikimedia.org/T336254 (BTullis) Stalled→Open
[17:03:28] <wikibugs>	 Data-Engineering, Data-Engineering-Dashiki, Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (Milimetric) I'm sorry I've been so slow to take this more seriously.  I took it seriously, just there's always oth...
[17:04:31] <wikibugs>	 (Abandoned) Stevemunene: Build datahub v0.10.0 containers [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/898956 (https://phabricator.wikimedia.org/T329514) (owner: Stevemunene)
[17:04:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@matomo.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:16:29] <wikibugs>	 Data-Platform-SRE, Infrastructure-Foundations, SRE, ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (BTullis) We've been investigating this extensively and discussing in some depth on #wikimedia-dcops on IRC.  We've decided to go ahead...
[17:20:18] <inflatador>	 btullis re: ^^ I did a similar maintenance a few months back, posting https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Change_server_NIC_and_switch_connection,_keeping_IPs and https://etherpad.wikimedia.org/p/T322082 in hopes it might be useful
[17:22:53] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, GitLab (Project Migration), Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (BTullis) a:BTullis @thcipriani - @hashar - are you both happy for m...
[17:24:38] <wikibugs>	 Data-Engineering, Data-Engineering-Dashiki, Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (Milimetric) ` presto  use analytics_hive.milimetric;   select browser_family,         sum(view_count) as total_vie...
[17:24:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@matomo.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:28:28] <wikibugs>	 Data-Engineering, Data-Engineering-Dashiki, Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (Krinkle) ### Down to the source  I'll work backwards from the Dashiki frontend at [analytics.wikimedia.org](https:...
[17:29:34] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, GitLab (Project Migration), Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (thcipriani) >>! In T341194#9029097, @BTullis wrote: > @thcipriani - @ha...
[17:29:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) ferm.service Failed on druid1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:32:38] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Data-Persistence: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) a:BTullis→Marostegui I think everything is now done from our side, then. I'll proceed with {T336254} shortly. Apo...
[17:34:41] <wikibugs>	 Data-Engineering, Data-Engineering-Dashiki, Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (Milimetric) > 1. Is it feasible to compute this data such that the threshold is applied last? I can think of two a...
[17:34:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) ferm.service Failed on druid1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:35:22] <milimetric>	 @Krinkle: I think we found the same problem, but you found it more :)  I'm all for re-computing these.  Pageview_hourly is relatively small, and we can re-run jobs slowly over some period of time
[17:35:31] <milimetric>	 (gtg for a while but I left comments on the task)
[17:36:24] <Krinkle>	 milimetric: awesome!
[18:07:18] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Data-Persistence: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) The host is now showing up on icinga
[18:08:23] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Data-Persistence: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) I'll get it to Mariadb 10.6 tomorrow or Friday and close this task when done
[18:57:15] <wikibugs>	 Data-Engineering, Data-Platform-SRE, DC-Ops, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (RobH)
[18:58:00] <wikibugs>	 Data-Engineering, Data-Platform-SRE, DC-Ops, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (RobH)
[19:36:49] <wikibugs>	 Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: `flink-zk1003.eqiad.wmnet` - flink-zk1003.eqiad.wmnet...
[19:42:15] <wikibugs>	 Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[20:33:43] <wikibugs>	 Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w...
[20:34:21] <wikibugs>	 (CR) Jdlrobson: [C: +2] Update web ui scroll [schemas/event/secondary] - https://gerrit.wikimedia.org/r/938284 (https://phabricator.wikimedia.org/T337270) (owner: Kimberly Sarabia)
[20:34:54] <wikibugs>	 (Merged) jenkins-bot: Update web ui scroll [schemas/event/secondary] - https://gerrit.wikimedia.org/r/938284 (https://phabricator.wikimedia.org/T337270) (owner: Kimberly Sarabia)
[20:55:07] <wikibugs>	 Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: `flink-zk1003.eqiad.wmnet` - flink-zk1003.eqiad.wmnet...
[21:41:08] <wikibugs>	 Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[22:36:33] <wikibugs>	 Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w...