[02:42:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[05:32:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[05:40:25] Quarry: Error 500 when clicking "stop query" - https://phabricator.wikimedia.org/T362213#9705788 (SD0001) I don't think that's the issue. We persist the db process id in the query_run table, so even a different pod is able to execute KILL on the db to get the query to stop. The issue I suspect is that...
[08:45:31] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9705995 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=562cdaa7-9ce0-44a8-9136-a3b0778aa94f) set by btullis@c...
[09:16:36] Quarry: Error 500 when clicking "stop query" - https://phabricator.wikimedia.org/T362213#9706071 (taavi) >>! In T362213#9705787, @SD0001 wrote: > The issue I suspect is that `*.analytics.db.svc.eqiad.wmflabs` are LB endpoints behind which there could be multiple replicas (@taavi - would you be able to confir...
[10:02:15] (PS7) Aqu: Add CLI to create or update Iceberg tables [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016808 (https://phabricator.wikimedia.org/T356762)
[10:03:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:04:53] (CR) CI reject: [V:-1] Add CLI to create or update Iceberg tables [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016808 (https://phabricator.wikimedia.org/T356762) (owner: Aqu)
[10:40:57] (CR) Milimetric: "Mostly some style things" [analytics/refinery] - https://gerrit.wikimedia.org/r/1016796 (https://phabricator.wikimedia.org/T358681) (owner: Xcollazo)
[11:33:50] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9706642 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host matomo1003.eqiad.wmn...
[12:13:38] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9706755 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host matomo1003.eqiad.wmnet w...
[12:16:56] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9706777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host matomo1003.eqiad.wmn...
[12:21:04] !log deploying editor-analytics with the new aqs-http-gateway chart
[12:21:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:25:49] (PS8) Aqu: Add CLI to create or update Iceberg tables [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016808 (https://phabricator.wikimedia.org/T356762)
[12:34:59] !log deploying edit-analytics with the new aqs-http-gateway chart
[13:00:40] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9706927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host matomo1003.eqiad.wmnet w...
[13:12:53] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9706941 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host matomo1003.eqiad.wmn...
[13:13:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:31:47] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9707000 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host matomo1003.eqiad.wmnet w...
[13:32:17] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9707001 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host matomo1003.eqiad.wmn...
[13:47:03] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9707045 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host matomo1003.eqiad.wmnet w...
[13:49:33] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9707098 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host matomo1003.eqiad.wmn...
[13:55:55] hey folks!
[13:56:02] moving cassandras on aqs1010 to PKI
[14:06:11] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9707250 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host matomo1003.eqiad.wmnet w...
[14:10:47] !log move cassandra instances on aqs1010 to PKI TLS certs
[14:10:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:46:53] (PS12) Mforns: Clean up and parameterize SQL code for Common Impact Metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/1016796 (https://phabricator.wikimedia.org/T358681) (owner: Xcollazo)
[14:53:30] (CR) Xcollazo: Clean up and parameterize SQL code for Common Impact Metrics. (5 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/1016796 (https://phabricator.wikimedia.org/T358681) (owner: Xcollazo)
[14:58:30] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9707536 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host matomo1003.eqiad.wmn...
[15:00:32] elukey: Good stuff. Thanks for the heads-up. Did it all go smoothly?
[15:06:04] btullis: yes a few nits that I have to check (namely the fact that if the previous keystore is already there puppet doesn't refresh it), but after that all good
[15:06:26] the two instances are up and running, we'll see if anything comes up and then possibly keep going
[15:06:29] does it sound good?
[15:25:54] !log restarting hive-server2 and hive-metastore on an-test-coord1001 for T356382
[15:25:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:26:18] elukey: Yes, sounds perfect, thanks.
[15:30:22] elukey: I remember a similar sort of situation with puppet not refreshing an existing keystore. It was this: https://github.com/wikimedia/operations-puppet/commit/66995fb9006197028fe91981390f4367ffd28af6 - Don't know if it will help in this situation, but it might.
[15:31:12] ack will check thanks!
[16:13:35] (PS13) Xcollazo: Clean up and parameterize SQL code for Common Impact Metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/1016796 (https://phabricator.wikimedia.org/T358681)
[16:15:07] (CR) Xcollazo: Clean up and parameterize SQL code for Common Impact Metrics. (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/1016796 (https://phabricator.wikimedia.org/T358681) (owner: Xcollazo)
[16:27:17] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9707906 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host matomo1003.eqiad.wmnet w...
[16:36:42] (PS14) Xcollazo: Clean up and parameterize SQL code for Common Impact Metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/1016796 (https://phabricator.wikimedia.org/T358681)
[16:40:33] (CR) Xcollazo: Clean up and parameterize SQL code for Common Impact Metrics. (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/1016796 (https://phabricator.wikimedia.org/T358681) (owner: Xcollazo)
[17:25:32] Data-Engineering, Data Products, Patch-For-Review, Web-Team-Backlog (FY2023-24 Q4 Sprint 2): Update Sample Rates for Metrics Platform Events - https://phabricator.wikimedia.org/T361962#9708165 (ovasileva) p:Triage→High
[17:25:53] Data-Engineering, Data Products, FY2023-24-WE 2.1 Typography and palette customizations, Patch-For-Review, Web-Team-Backlog (FY2023-24 Q4 Sprint 2): Update Sample Rates for Metrics Platform Events - https://phabricator.wikimedia.org/T361962#9708176 (ovasileva)
[17:26:21] Data-Engineering, Data Products, FY2023-24-WE 2.1 Typography and palette customizations, Patch-For-Review, Web-Team-Backlog (FY2023-24 Q4 Sprint 2): Update Sample Rates for Metrics Platform Events - https://phabricator.wikimedia.org/T361962#9708178 (ovasileva)
[17:26:33] Data-Engineering, Data Products, FY2023-24-WE 2.1 Typography and palette customizations, Patch-For-Review, Web-Team-Backlog (FY2023-24 Q4 Sprint 2): Update Sample Rates for Metrics Platform Events - https://phabricator.wikimedia.org/T361962#9708182 (ovasileva)
[17:30:00] Data-Engineering, Data Products, FY2023-24-WE 2.1 Typography and palette customizations, Patch-For-Review, Web-Team-Backlog (FY2023-24 Q4 Sprint 2): Update Sample Rates for Metrics Platform Events - https://phabricator.wikimedia.org/T361962#9708192 (KSarabia-WMF)
[18:32:37] Data-Engineering (Q4 2024 April 1st - June 30th), Patch-For-Review: Extract refine schema management into a dedicated tool - https://phabricator.wikimedia.org/T356762#9708289 (Ahoelzl)
[18:32:40] Data-Engineering (Q4 2024 April 1st - June 30th), Patch-For-Review: Extract refine schema management into a dedicated tool - https://phabricator.wikimedia.org/T356762#9708292 (Ahoelzl)
[18:32:50] Data-Engineering (Q4 2024 April 1st - June 30th), Patch-For-Review: [Refine refactoring] Extract refine schema management into a dedicated tool - https://phabricator.wikimedia.org/T356762#9708293 (Ahoelzl)
[18:33:54] Data-Engineering (Q4 2024 April 1st - June 30th): [Refine Refactoring] Adjust Refine schema management code for Iceberg - https://phabricator.wikimedia.org/T362289#9708302 (Ahoelzl) closing, part of https://phabricator.wikimedia.org/T356762
[18:33:59] Data-Engineering (Q4 2024 April 1st - June 30th): [Refine Refactoring] Adjust Refine schema management code for Iceberg - https://phabricator.wikimedia.org/T362289#9708305 (Ahoelzl) Open→Resolved
[19:33:57] Quarry: store quarry state in object storage - https://phabricator.wikimedia.org/T360233#9708375 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/quarry/pull/39
[19:47:01] Quarry: store quarry state in object storage - https://phabricator.wikimedia.org/T360233#9708412 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/quarry/pull/39
[19:48:47] Quarry: store quarry state in object storage - https://phabricator.wikimedia.org/T360233#9708415 (rook) Docs in: https://wikitech.wikimedia.org/wiki/Help:Object_storage_user_guide#S3_API https://wikitech.wikimedia.org/wiki/Help:Using_OpenTofu_on_Cloud_VPS#State_management
[19:50:05] Quarry: store quarry state in object storage - https://phabricator.wikimedia.org/T360233#9708416 (rook) Open→Resolved a:rook
[20:42:45] (PS9) Aqu: Add CLI to create or update Iceberg tables [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016808 (https://phabricator.wikimedia.org/T356762)
[20:46:37] (CR) Aqu: "I've refactored the code into 2 modules. 1 to manage the table schemas, 1 to manage writing into the tables." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016808 (https://phabricator.wikimedia.org/T356762) (owner: Aqu)