[06:34:29] hello people, I am upgrading an-worker1096
[06:38:40] rebooting it, new drivers look good from first checks
[06:38:42] let's see how it goes
[06:44:32] incredible, everything recognized at first try
[07:08:55] wow tf 2.6.0
[07:08:59] works fine!
[07:12:27] going to upgrade the rest of the hadoop workers in a bit
[07:41:39] I am going to try to upgrade the nodes without rebooting
[07:44:43] Analytics-Radar, Event-Platform, WMF-JobQueue, Wikibase change dispatching scripts to jobs, and 2 others: Queuing jobs is extremely slow - https://phabricator.wikimedia.org/T292048 (Ladsgroup) Increasing number of replicas definitely helped cirrusSearchLinksUpdate https://grafana.wikimedia.org/d/...
[07:48:58] nice, it seems to be working without reboot, game changer
[08:30:16] Analytics-Radar, Event-Platform, WMF-JobQueue, Wikibase change dispatching scripts to jobs, and 2 others: Queuing jobs is extremely slow - https://phabricator.wikimedia.org/T292048 (Ladsgroup) while this definitely has helped but I'm not sure it's the root cause. I have a new hypotheses. Might be...
[09:02:06] Quarry, User-dcaro, cloud-services-team (Kanban): [quarry] Fancy up the CI pipeline in Jenkins - https://phabricator.wikimedia.org/T289569 (dcaro)
[11:28:26] !log failover analytics-hive back to an-coord1001 after maintenance
[11:28:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:24:27] so tensorflow-rocm 2.6.0 (the one that builds/links with ROCm 4.3.1 that we are testing) no longer supports hdfs/s3, it needs a separate package tensorflow-io
[12:24:37] that is not provided yet by AMD..
[12:24:51] so I am going to attempt ROCm 4.2.0 sigh
[12:53:30] Analytics-Radar, Fundraising-Backlog, Product-Analytics, Wikipedia-iOS-App-Backlog, and 2 others: Understand impact of Apple's Relay Service - https://phabricator.wikimedia.org/T289795 (sgrabarczuk)
[13:04:31] Analytics-Radar, Event-Platform, WMF-JobQueue, Wikibase change dispatching scripts to jobs, and 2 others: Queuing jobs is extremely slow - https://phabricator.wikimedia.org/T292048 (Ottomata) eventgate-main does not use the remote schemas. The repositories are cloned and baked into the docker im...
[13:42:43] joal: there is an issue with the cookbook to roll restart aqs-node, checking it
[13:43:00] ack elukey
[13:48:12] joal: we can go manual, lemme prepare aqs1004
[13:48:36] joal: aqs1004 ready
[13:48:42] checking elukey
[13:49:58] elukey: not good for me :S No data for month 2021-09
[13:50:53] ah sorry
[13:51:33] joal: we upgraded aqs_next, not aqs
[13:51:39] (so the new cluster's role only)
[13:51:46] I just realized it
[13:51:47] * joal facepalm
[13:52:03] I'm sorry for that elukey :S
[13:52:19] nono I didn't think about it while reviewing, of course we have two roles
[13:52:24] can you file a new CR?
[13:52:35] elukey: doing right now
[13:57:52] Analytics: Request Kerberos credentials - https://phabricator.wikimedia.org/T292532 (CMacholan)
[14:02:08] elukey: PR sent - need to get the kids - I'll ping you when back to test aqs if ok for you
[14:02:22] ack!
[14:36:55] elukey: howdy! I saw you made a test patch for new rocm, can you load it on an instance somewhere and I can see if things will work?
[14:37:06] * ebernhardson doesn't know if that's possible, or if it's all-or-nothing
[14:37:17] or at least, easily possible :)
[14:38:06] ebernhardson: hi! I wanted to ping you later on, we are discussing it in #dse-hackathon-gpus! For the moment we have tested rocm 4.3.1 with tf-rocm 2.6.0 (failure) and now we are testing rocm 4.2 with tf-rocm 2.5
[14:38:18] the new packages are only on the hadoop worker nodes with GPUs
[14:38:29] but we are planning to upgrade the stat boxes today if all goes well
[14:39:18] ok, cool
[14:39:51] it's a bit tedious to test on the yarn instances, but possible. will see
[14:47:42] Apologies team, but I'm going to be running on about Pacific time today. Have had various car and phone issues, so I will be online from about 17:30 BST until late this evening.
[15:19:57] ottomata btullis: would you be able to CR/merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/724497 before Oct 7 00:00 UTC? I'd like to be sure by end of the week that the script scheduled with `kerberos::systemd_timer` and the test notebook are executed as expected so that Maya & Irene can start migrating ETL notebooks. please and thank you!
[16:00:06] Hi elukey! I'm back if now is a good time for you
[16:18:23] joal: o/ I have to afk earlier, ok tomorrow morning?
[16:18:42] elukey: Let me ask ottomata ;)
[16:19:56] the change has been merged, it is only a matter of running the sre.aqs.roll-restart cookbook :)
[16:20:03] in case I'll do it tomorrow!
[16:20:03] byeee
[16:20:57] thanks elukey :)
[16:31:03] ping ottomata? would you have a minute to help with that deploy?
[16:38:59] ok, gone for now, back in a while
[16:57:54] Hi all. I'll start that cookbook now.
[16:58:23] thanks Ben!
[17:03:53] > >>> Please test aqs on the canary.
[17:04:23] I'm not sure that I know the procedure for this. I think that joal: did it last time.
[17:09:09] Followed the procedure here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS#Deploy_new_History_snapshot_for_Wikistats_Backend
[17:09:09] Analytics, wmfdata-python, Product-Analytics (Kanban): wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (ldelench_wmf)
[17:09:27] Looks fine to me. Proceeding.
[17:10:32] Analytics, wmfdata-python, Product-Analytics (Kanban): wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (ldelench_wmf) a:nshahquinn-wmf
[17:13:27] Analytics-Radar, Product-Analytics: Do the messages left for unregistered or logged-out IP editors get read by those editors? - https://phabricator.wikimedia.org/T291297 (jwang) Add @Niharika as AHT team is mentioned in description.
[17:13:45] Cookbook completed.
[17:15:02] thanks btullis was making lunch and interview prepping
[17:21:09] Analytics-Radar, Product-Analytics: Do the messages left for unregistered or logged-out IP editors get read by those editors? - https://phabricator.wikimedia.org/T291297 (ldelench_wmf) a:jwang→None
[18:11:42] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) The first snapshot load has completed in 58 hours. ` Summary statistics: Connections per host : 1 Total f...
[18:22:29] Analytics: Move the Analytics/DE testing infrastructure to Pontoon - https://phabricator.wikimedia.org/T292388 (BTullis) a:BTullis Requested creation of a new Cloud VPS project in {T292563} - Hopefully this can be fast-tracked in order to allow us to bootstrap a new Pontoon server.
[18:29:25] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) Carefully removed the data from `aqs1010:/srv/cassandra-a/tmp` and now there is 2.6 GB of space remaining in `/srv/...
[18:32:28] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) The next largest directory is now `aqs1011:/srv/cassandra-a` at 67% full and 1.1 TB free. ` btullis@cumin1001:~$ su...
[18:42:56] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) The reload of `aqs1011:/srv/cassandra-a` is now under way.
[19:25:27] Wow thanks a million btullis :)
[19:25:46] A pleasure.
[19:28:19] btullis: I had not pinged you cause you're supposed to be oof this week, aren't you?
[19:31:47] I was on annual leave yesterday (so missed the hackathon kickoff) but today I just got caught out by being in the wrong place without a working phone. I couldn't even get to Slack to say "I'll be late today". But anyway I'm here now and working late :-)
[19:32:26] So am I btullis :)
[20:42:01] Analytics, Code-Health-Objective, Epic, Platform Engineering Roadmap, and 2 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (Eevans)
[23:31:23] Analytics: Request Kerberos credentials - https://phabricator.wikimedia.org/T292532 (odimitrijevic) p:Triage→High
[23:32:02] Analytics, Analytics-Kanban: Request Kerberos credentials - https://phabricator.wikimedia.org/T292532 (odimitrijevic) a:JAllemandou