[03:37:36] PROBLEM - Check unit status of refinery-import-page-history-dumps on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-import-page-history-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:36:39] Analytics, SRE: Import the openjdk8 packages in Bullseye - https://phabricator.wikimedia.org/T287960 (MoritzMuehlenhoff) p:Triage→Medium a:MoritzMuehlenhoff Sure thing, I'll take care of this next week.
[07:43:53] Analytics-Radar, SRE, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey) I have started https://wikitech.wikimedia.org/wiki/Kafka/Administration#Rebalance_...
[07:55:14] Analytics, SRE, Traffic, Patch-For-Review: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (ema) >>! In T254317#7255820, @elukey wrote: > In theory a lot of `tls = '-'` should be redirects from http to https, that hit Varnish and...
[08:01:04] Analytics, SRE, Traffic, Patch-For-Review: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (elukey) A http to https redirect is probably not really a webrequest (following https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Tr...
[08:24:00] Analytics, SRE, Traffic, Patch-For-Review: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (elukey) I had a chat with Ema on IRC, reporting a summary: * At the current state of the TLS termination layer, it is likely that ATS-TLS...
[08:25:24] (PS1) David Caro: docs: added docker compose link and minor rewording [analytics/quarry/web] - https://gerrit.wikimedia.org/r/709951
[08:38:29] Analytics-Radar, SRE, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey) Thanos graphs for topics with more than 0 msg/s for: - [[ https://thanos.wikimedi...
[08:53:50] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Incorrect log4j configuration on hive servers causing excessive logging - https://phabricator.wikimedia.org/T279304 (BTullis) I have prepared a DNS change to effect the failover from an-coord1001 to an-corrd1002. https://gerrit.wikimedia.org/r/...
[08:59:01] I'm about to proceed with the hive server restart. Plan is to restart hive services on an-coord1002, then fail over with the DNS change. Wait 5 minutes for TTL expiry. Restart hive services on an-coord1001. Prepare a DNS change for failback.
[09:00:20] !log btullis@an-coord1002:~$ sudo systemctl stop hive-server2 && sudo systemctl stop hive-metastore
[09:00:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:00:49] !log sudo systemctl start hive-metastore && sudo systemctl start hive-server2
[09:00:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:01:33] +1
[09:02:51] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Incorrect log4j configuration on hive servers causing excessive logging - https://phabricator.wikimedia.org/T279304 (BTullis) ` btullis@authdns1001:~$ sudo authdns-update Updating authdns1001.wikimedia.org (self)... Pulling the current revision...
[09:05:13] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Incorrect log4j configuration on hive servers causing excessive logging - https://phabricator.wikimedia.org/T279304 (BTullis) DNS change successfully applied. ` btullis@marlin:~/wmf/dns$ for i in 0 1 2 ; do dig @ns${i}.wikimedia.org -t any +sho...
[09:12:02] !log btullis@an-coord1001:~$ sudo systemctl stop hive-server2.service hive-metastore.service
[09:12:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:12:23] !log btullis@an-coord1001:~$ sudo systemctl start hive-metastore.service hive-server2.service
[09:12:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:13:10] Analytics-Clusters, Analytics-Kanban: Incorrect log4j configuration on hive servers causing excessive logging - https://phabricator.wikimedia.org/T279304 (BTullis) Restart of services on an-coord1001 complete. ` tullis@an-coord1001:~$ sudo systemctl stop hive-server2.service hive-metastore.service btull...
[09:40:29] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Incorrect log4j configuration on hive servers causing excessive logging - https://phabricator.wikimedia.org/T279304 (BTullis) DNS failback patch applied. ` btullis@marlin:~/wmf/dns$ for i in 0 1 2 ; do dig @ns${i}.wikimedia.org -t any +short ana...
[09:46:04] nice --^
[09:57:49] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Incorrect log4j configuration on hive servers causing excessive logging - https://phabricator.wikimedia.org/T279304 (BTullis) Tested with `hive` and `beeline` from stat1008 and all seems well with hive, following the fail-back. I've sent an ema...
[09:58:00] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Incorrect log4j configuration on hive servers causing excessive logging - https://phabricator.wikimedia.org/T279304 (BTullis) Open→Resolved
[10:01:32] Analytics, Analytics-Kanban: Reducing logging levels when running a Hive query - https://phabricator.wikimedia.org/T274914 (BTullis) Open→Resolved I'm going to mark this ticket as resolved, since I believe that the logging level for hive client queries **has** been lowered from `INFO` to `WARN`....
[10:09:11] Analytics-Clusters, Analytics-Kanban: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) p:Triage→Medium I'm beginning work on this task now. The first thing to do, I believe, is to decide how best to deal with the zookeeper cluster that is co-located with t...
[10:11:19] Analytics-Clusters, Analytics-Kanban: Incorrect log4j configuration on hive servers causing excessive logging - https://phabricator.wikimedia.org/T279304 (elukey) Quick question - are the logs on an-coord100* nodes rotated/dropped correctly now?
[11:02:42] Analytics, Analytics-Kanban: jupyter notebook causing syslog/etc.. to fill up with error messages - https://phabricator.wikimedia.org/T287339 (BTullis) Thanks @ChristineDeKock - I see that you've got some scripts running now. I've had a quick look at the arguments in depth and I can see that the problem...
[11:44:36] Analytics-Clusters, Analytics-Kanban: Incorrect log4j configuration on hive servers causing excessive logging - https://phabricator.wikimedia.org/T279304 (BTullis) Good question. It look from an-test-coord1001 like the pre-existing logs older than 14 days aren't being purged, so I could do this manually....
[11:50:36] heya teammm!
[12:48:40] hola mforns
[13:12:31] :]
[13:25:56] yooho
[13:26:22] * btullis waves at everyone
[13:27:03] Analytics, Analytics-Kanban: Reducing logging levels when running a Hive query - https://phabricator.wikimedia.org/T274914 (ssingh) Hi @BTullis: I just ran a query and it is indeed quieter as compared to the last time I ran it, when I couldn't even see the output. Thanks very much for working on this cha...
[13:42:15] Analytics, Analytics-Kanban: jupyter notebook causing syslog/etc.. to fill up with error messages - https://phabricator.wikimedia.org/T287339 (BTullis) Currently each notebook is started as a systemd transient service, ultimately started by `systemd-run` from the file: `/usr/lib/anaconda-wmf/lib/python3....
[13:44:06] Analytics, Analytics-Kanban: Reducing logging levels when running a Hive query - https://phabricator.wikimedia.org/T274914 (BTullis) Great! Thanks for confirming.
[14:21:32] Analytics, Analytics-Kanban: jupyter notebook causing syslog/etc.. to fill up with error messages - https://phabricator.wikimedia.org/T287339 (Ottomata) > One way in which we could deal with this is to work out how add these definitions and set the output to go to defined log files per notebook AH! This...
[15:44:15] Analytics, Better Use Of Data, Product-Analytics: Upgrade Superset to 1.2 - https://phabricator.wikimedia.org/T288115 (razzi)
[15:57:02] (CR) Bstorm: [C: +1] "Little nit if you want it." [analytics/quarry/web] - https://gerrit.wikimedia.org/r/709951 (owner: David Caro)
[16:32:14] Analytics, Machine-Learning-Team: Update ROCm version on GPU instances. - https://phabricator.wikimedia.org/T287267 (elukey)
[16:41:25] (CR) Michael DiPietro: [C: +1] docs: added docker compose link and minor rewording [analytics/quarry/web] - https://gerrit.wikimedia.org/r/709951 (owner: David Caro)
[16:44:18] Analytics, Machine-Learning-Team: Update ROCm version on GPU instances. - https://phabricator.wikimedia.org/T287267 (ACraze) +1 for upgrading ROCm to support ONNX runtime. It's certainly worth evaluating imo, as it seems that ONNX would help enable us to use an AMD GPU with any arbitrary ML-framework
[16:45:12] (PS1) Michael DiPietro: add stop query function [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037)
[16:47:46] (CR) Michael DiPietro: add stop query function (1 comment) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037) (owner: Michael DiPietro)
[16:51:00] (CR) jerkins-bot: [V: -1] add stop query function [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037) (owner: Michael DiPietro)
[17:01:26] (CR) Majavah: [C: -1] add stop query function (3 comments) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037) (owner: Michael DiPietro)
[17:06:18] Analytics-Radar, SRE, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (herron) Plan looks good to me! I'll suggest also spinning off a subtask or spreadsheet to...
[17:25:25] (PS2) Michael DiPietro: add stop query function [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037)
[17:28:05] (CR) jerkins-bot: [V: -1] add stop query function [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037) (owner: Michael DiPietro)
[17:30:25] (PS3) Michael DiPietro: add stop query function [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037)
[17:32:18] (CR) jerkins-bot: [V: -1] add stop query function [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037) (owner: Michael DiPietro)
[17:56:50] * razzi taking computer break then lunch, ping if you need me
[18:18:53] mholloway: your normalized_host patch is live!
[18:18:56] and working! :)
[18:18:57] thank you!
[18:20:13] Analytics, Analytics-Kanban: Fix default ownership and permissions for Hive managed databases in /user/hive/warehouse - https://phabricator.wikimedia.org/T280175 (Ottomata) In todays PA sync we decided to move forward with this.
[18:36:16] (PS4) Michael DiPietro: add stop query function [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037)
[18:37:57] (CR) jerkins-bot: [V: -1] add stop query function [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037) (owner: Michael DiPietro)
[18:39:50] Analytics, Analytics-Kanban: Fix default ownership and permissions for Hive managed databases in /user/hive/warehouse - https://phabricator.wikimedia.org/T280175 (Ottomata) Running all this now.
[18:40:58] (PS5) Michael DiPietro: add stop query function [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037)
[18:42:35] (CR) jerkins-bot: [V: -1] add stop query function [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037) (owner: Michael DiPietro)
[18:46:28] (PS6) Michael DiPietro: add stop query function [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037)
[19:19:52] (PS7) Michael DiPietro: add stop query function [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037)
[19:46:07] (CR) Bstorm: add stop query function (2 comments) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037) (owner: Michael DiPietro)
[19:48:36] Analytics-Clusters, Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (Ottomata)
[19:49:54] Analytics-Clusters, Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (Ottomata)
[19:52:51] Analytics, Analytics-EventLogging, dev-images, Patch-For-Review: EventLogging dev image should have verbose output enabled - https://phabricator.wikimedia.org/T257378 (Ottomata) Should be covered now by https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To#In_your_local_dev_en...
[19:53:37] Analytics, Analytics-EventLogging, dev-images, Patch-For-Review: EventLogging dev image should have verbose output enabled - https://phabricator.wikimedia.org/T257378 (Ottomata) Open→Resolved a:Ottomata
[19:56:58] Analytics: Alert on validation errors on new stream intake service - https://phabricator.wikimedia.org/T210457 (Ottomata) Open→Resolved a:Ottomata This was done in {T257237}
[20:22:09] Analytics, Prod-Kubernetes, SRE, serviceops, and 2 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (Ottomata)
[20:22:27] Analytics, Analytics-Kanban, Prod-Kubernetes, SRE, and 3 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (Ottomata)
[20:23:38] Analytics, Analytics-Kanban, Prod-Kubernetes, SRE, and 3 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (Ottomata) I think that will do it. helm template looks good locally. @JMeybohm is it ok that I moved the debug ports to their own Service? That'...
[20:26:38] Analytics, Code-Health-Objective, Epic, Platform Engineering Roadmap, and 2 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (Eevans)
[20:29:32] Analytics, Code-Health-Objective, Epic, Platform Engineering Roadmap, and 2 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (Eevans)
[20:31:16] Analytics, Code-Health-Objective, Epic, Platform Engineering Roadmap, and 2 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (Eevans)
[20:34:43] Quarry: quarry explain not working - https://phabricator.wikimedia.org/T288170 (mdipietro)
[20:35:34] (PS8) Michael DiPietro: add stop query function [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037)
[20:38:40] Analytics, Code-Health-Objective, Epic, Platform Engineering Roadmap, and 2 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (Eevans)
[20:39:52] Analytics, Code-Health-Objective, Epic, Platform Engineering Roadmap, and 2 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (Eevans)
[20:40:28] Analytics, Code-Health-Objective, Epic, Platform Engineering Roadmap, and 2 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (Eevans)
[20:40:45] Analytics, Code-Health-Objective, Epic, Platform Engineering Roadmap, and 2 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (Eevans)
[21:05:43] (CR) Bstorm: add stop query function (1 comment) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037) (owner: Michael DiPietro)
[22:25:00] Hmmm I see a few alerts for an-* nodes, anybody around to help me poke at them?
[22:37:12] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=an-
[22:39:25] there's a couple warnings for an-web1001, it's a work in progress that ryankemper is working on, so no worries there
[22:39:46] A few systemd units are failing on an-launcher1002:
[22:39:48] refinery-import-page-history-dumps
[22:39:55] refinery-import-page-current-dumps
[22:40:03] refinery-import-siteinfo-dumps