[01:51:39] (PS1) Sharvaniharan: Migrate MobileWikiAppDailyStats to MEP Bug: T286000 Change-Id: Ibf65637926bb3be80c64b2c373958a52b034aedb [schemas/event/secondary] - https://gerrit.wikimedia.org/r/723339 (https://phabricator.wikimedia.org/T286000)
[01:59:12] (PS1) Sharvaniharan: Migrate MobileWikiAppDailyStats to MEP Bug: T286000 Change-Id: Ibf65637926bb3be80c64b2c373958a52b034aedb [schemas/event/secondary] - https://gerrit.wikimedia.org/r/723340 (https://phabricator.wikimedia.org/T286000)
[02:00:15] (Abandoned) Sharvaniharan: Migrate MobileWikiAppDailyStats to MEP Bug: T286000 Change-Id: Ibf65637926bb3be80c64b2c373958a52b034aedb [schemas/event/secondary] - https://gerrit.wikimedia.org/r/723340 (https://phabricator.wikimedia.org/T286000) (owner: Sharvaniharan)
[02:00:28] (Abandoned) Sharvaniharan: Migrate MobileWikiAppDailyStats to MEP Bug: T286000 Change-Id: Ibf65637926bb3be80c64b2c373958a52b034aedb [schemas/event/secondary] - https://gerrit.wikimedia.org/r/723339 (https://phabricator.wikimedia.org/T286000) (owner: Sharvaniharan)
[02:08:43] (PS1) Sharvaniharan: Migrate MobileWikiAppDailyStats to MEP Bug: T286000 Change-Id: Ibf65637926bb3be80c64b2c373958a52b034aedb [schemas/event/secondary] - https://gerrit.wikimedia.org/r/723341 (https://phabricator.wikimedia.org/T286000)
[02:09:12] (Abandoned) Sharvaniharan: Migrate MobileWikiAppDailyStats to MEP Bug: T286000 Change-Id: Ibf65637926bb3be80c64b2c373958a52b034aedb [schemas/event/secondary] - https://gerrit.wikimedia.org/r/723341 (https://phabricator.wikimedia.org/T286000) (owner: Sharvaniharan)
[04:16:58] RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_immediate on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:29:44] Analytics, Event-Platform, SRE, Wikimedia-Logstash, and 2 others: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (Marostegui) p:Triage→Medium
[06:29:29] btullis: np for me, I have time to do some restarts :)
[07:02:20] Good morning
[07:02:43] bonjour :)
[07:33:07] joal: ok if I start the druid analytics roll restart? There may be some indexation in progress, in that case they'll fail (not a big deal but we can also stop them if you prefer)
[07:33:30] give me a minute to check elukey please
[07:34:05] all good elukey - please proceed
[07:34:12] <3
[07:34:17] metrics are good, nothing outstanding
[07:34:47] yup, only tasks currently running are kafka ones, they should successfully be moved around workers
[08:13:29] Analytics, Analytics-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) It does get to 100% on the repair and then hang around. This is the last line of the output on the repair at the moment, which must have been showing for at lea...
[08:16:04] Analytics, Platform Engineering, Wikibase change dispatching scripts to jobs, serviceops: Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (Michael)
[08:19:39] elukey: I'd rather leave the aqs restarts while the cassandra3 loading is going on, but shall I take the hadoop masters, zookeeper, and kafka restarts?
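A minimal sketch (not from the log) of the kind of pre-restart check joal describes at 07:34 above, i.e. confirming that only Kafka supervisor/indexing tasks are running before the Druid roll restart. The overlord host and port below are assumptions; /druid/indexer/v1/runningTasks is the standard Druid overlord API.
    # list the indexing tasks currently running on the Druid overlord (host/port assumed)
    curl -s http://an-druid1001.eqiad.wmnet:8090/druid/indexer/v1/runningTasks | jq '.[].id'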
[08:26:50] Analytics, Platform Engineering, Wikibase change dispatching scripts to jobs, serviceops: Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (Michael) >>! In T291620#7375141, @Ottomata wrote: > Data Eng (analytics) is in the process of [[ https://pha...
[08:33:01] btullis: yes yes definitely, I am doing kafka test atm and just finished druid
[08:35:45] You're like a machine! :-)
[08:37:49] ahahaha nono we should thank Riccardo, cumin and cookbooks are too good :)
[08:38:05] updated the task with the last restarts :)
[08:51:59] qq - you use `lsof` to check to make sure that nothing is still using the old JVM, right?
[08:53:13] Analytics, Platform Engineering, Wikibase change dispatching scripts to jobs, serviceops: Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (Joe) This task mixes quite a few things. I'll start by answering your questions to the best of my knowledge....
[08:54:09] Analytics, Platform Engineering, Wikibase change dispatching scripts to jobs, serviceops: Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (Joe) I also want to underline that the problem statement is slightly misleading: while it's true observabili...
[09:06:25] btullis: sorry just seen it, yes `lsof -Xd DEL`
[09:06:57] Thanks.
[09:16:55] As this is my first ops week, I'll be verbose about things. I see five alerts that I think need attention.
[09:16:55] - Refine failures for job refine_eventlogging_legacy
[09:16:55] - RefineMonitor problem report for job monitor_refine_event_sanitized_analytics_delayed
[09:16:55] - The following units failed: hdfs-cleaner-tmp-analytics.service,hdfs-cleaner-tmp-druid.service,hdfs-cleaner-tmp.service
[09:23:26] Analytics, Analytics-Kanban: Check home/HDFS leftovers of mholloway-shell - https://phabricator.wikimedia.org/T291353 (BTullis) ` ====== stat1004 ====== total 0 ====== stat1005 ====== total 44 -rw-r--r-- 1 11963 wikidev 2478 Feb 12 2021 bundle.properties -rw-r--r-- 1 11963 wikidev 6173 Feb 12 2021 bu...
[09:23:46] btullis, joal - ok if I restart druid public?
[09:24:05] Fine by me.
[09:25:15] started
[09:37:19] `monitor_refine_event_sanitized_analytics_delayed` is the same one that ottomata wrote about yesterday. It looks like the backfilling job that he started yesterday is still running.
[09:48:47] Hmm. It's not still running. https://yarn.wikimedia.org/cluster/app/application_1629727304304_148623 - It took 5 minutes and 43 seconds. The list of targets that still need refining is different in today's alert email.
[09:51:26] That suggests to me that it didn't refine all of the right targets. I'd probably better seek assistance before trying to re-run anything here.
[09:53:13] Analytics, Platform Engineering, Wikibase change dispatching scripts to jobs, serviceops: Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (Michael) >>! In T291620#7376077, @Joe wrote: > This task mixes quite a few things. Yes, it is a wishlist o...
[09:54:24] The three `hdfs-cleaner` jobs are all failing with a similar error: `Exception in thread "main" java.lang.NoClassDefFoundError: io/circe/Decoder`
[09:56:19] That would suggest to me perhaps an issue with a recent refinery-source deploy.
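A small sketch (not from the log) of how the failed cleaner units above could be inspected on an-launcher1002 to capture the full stack trace; the unit name comes from the alert, and the systemctl/journalctl invocations are standard rather than anything specific to this setup.
    # show the last run of one of the failed units and pull the NoClassDefFoundError context from the journal
    sudo systemctl status hdfs-cleaner-tmp-analytics.service
    sudo journalctl -u hdfs-cleaner-tmp-analytics.service --since today | grep -B2 -A10 'NoClassDefFoundError'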
[09:58:16] weird, for me the other day it was a scala-related noclassdefounderror, mmm
[09:59:19] but I agree it is probably due to the new refinery-job.jar
[09:59:29] the cleaner does
[09:59:30] /usr/bin/java -cp /srv/deployment/analytics/refinery/artifacts/refinery-job.jar:$(/usr/bin/hadoop classpath) org.wikimedia.analytics.refinery.job.HDFSCleaner $@
[09:59:54] That's just where I'd arrived :-) https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/bin/hdfs-cleaner#18
[10:01:40] Analytics, Platform Engineering, Wikibase change dispatching scripts to jobs, serviceops: Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (Michael)
[10:07:26] !log btullis@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_eventlogging_legacy --ignore_failure_flag=true --table_include_regex='centralnoticeimpression' --since='2021-09-23T04:00:00.000Z' --until='2021-09-24T05:00:00.000Z'
[10:07:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:12:44] Uh oh, problem with the hadoop roll restart masters cookbook run on analytics.
[10:12:48] `Operation failed: Unable to become active. Service became unhealthy while trying to failover.`
[10:13:44] During this step: `Run manual HDFS Namenode failover from an-master1002-eqiad-wmnet to an-master1001-eqiad-wmnet.`
[10:18:53] If I run `sudo systemctl status hadoop-hdfs-namenode` on both an-master1001 and an-master1002 I get a running status on both.
[10:19:12] Analytics, Platform Engineering, Wikibase change dispatching scripts to jobs, serviceops: Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (Joe) >>! In T291620#7376163, @Michael wrote: >> [...] I do agree that tracing the jobs trees and execution t...
[10:19:39] I've still got running applications on yarn and I can still do a `hdfs dfs -ls /` from stat1004, so I don't think that things are too on fire. Might just be an issue with the cookbook.
[10:19:41] druid restarts done!
[10:20:00] I'll check which namenode the cluster believes is active.
[10:20:33] btullis: yes that step doesn't restart the services,
[10:21:03] it forces the failover (basically via zookeeper), but the services remain up
[10:21:24] ah I see above
[10:21:56] So sometimes it happened to me, but a subsequent failover succeeded, unless there are horrible error msgs in the logs :)
[10:22:27] Analytics, Platform Engineering, Wikibase change dispatching scripts to jobs, serviceops: Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (Joe) >>! In T291620#7376163, @Michael wrote: >>> *What were the (debug- or even trace-level?) logs of each j...
[10:23:10] https://www.irccloud.com/pastebin/ZbN3JaKz/
[10:23:57] Yeah, so it didn't effect the failback to an-master1001 and the cookbook has bombed out. The only nasty error in the logs is `Unable to become active. Service became unhealthy while trying to failover.`
[10:24:53] ah snap I think it is the issue that we had the last time
[10:25:07] when the number of threads were not enough
[10:25:16] I can do a manual failover, but I would also have to do a manual restart of the services on an-master1002 afterwards.
[10:25:21] there is a thread dump in /var/log/hadoop-hdfs/hadoop-hdfs-zkfc-an-master1001.log
[10:26:21] since the last restarts we added more nodes
[10:27:39] OK. Do we have to roll out a config change to the namenode then, to cope with more worker nodes?
[10:28:25] the current thread count is 90, that in theory is more than the workers, but this is maybe why it did get through eventually.. bumping it to 100 could be good
[10:28:49] but at the moment I think that we can failback to 1001 manually (when metrics are ready), and then restart 1002
[10:29:22] we can also restart 1002 without failback, but if the failover doesn't work we'll get errirs
[10:29:25] *errors
[10:30:36] OK, so should I run this now on an-master1001 to attempt the failback?
[10:30:36] `sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet`
[10:31:21] yes exactly (better with krb-run-command but if the ticket cache is ok it is not needed)
[10:32:46] `sudo -u hdfs kerberos-run-command hdfs hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet`
[10:33:00] `sudo -u hdfs kerberos-run-command hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet`
[10:34:58] yep
[10:35:23] !log btullis@an-master1001:~$ sudo -u hdfs kerberos-run-command hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
[10:35:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:35:37] `Failover to NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 successful`
[10:36:25] perfect :)
[10:37:04] my theory is that the current number of threads for the port that handles the failovers etc.. may end up in a situation in which all are locked waiting
[10:37:27] a +10 threads should be enough
[10:38:19] Right. Got it. So now I've waited 30 seconds I can restart `hadoop-hdfs-zkfc` and `hadoop-hdfs-namenode` on an-master1002. Then finally restart `hadoop-mapreduce-historyserver` on an-master1001. Is that right?
[10:39:51] +1 yes
[10:41:18] (going afk, bbl!)
[10:47:02] !log btullis@an-master1002:~$ sudo systemctl restart hadoop-hdfs-zkfc
[10:47:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:47:31] !log btullis@an-master1002:~$ sudo systemctl restart hadoop-hdfs-namenode
[10:47:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:02:22] !log btullis@an-master1001:~$ sudo systemctl restart hadoop-mapreduce-historyserver
[11:02:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:30:30] Wow - Thanks for fixing btullis and elukey!
[11:30:46] And reading the backlog, I have the solution for the HdfsCleaner problem
[11:30:50] btullis: --^
[11:30:58] btullis: let me know when you want to exchange on this
[12:37:51] joal: Any time you like...
[12:38:22] hi btullis - I'm writing a patch as we speak :)
[12:38:44] the problem is related to https://phabricator.wikimedia.org/T217967
[12:39:51] I /suspected/ so, but hadn't got as far as looking at any code.
[12:46:35] (PS1) Joal: Fix hdfs-cleaner script using shaded jar [analytics/refinery] - https://gerrit.wikimedia.org/r/723516 (https://phabricator.wikimedia.org/T217967)
[12:50:19] Analytics-Clusters, Analytics-Kanban, Patch-For-Review, User-MoritzMuehlenhoff: Improve user experience for Kerberos by creating automatic token renewal service - https://phabricator.wikimedia.org/T268985 (BTullis) Merged and deployed. First login on stat1004.eqiad.wmnet after a puppet run. ` sta...
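For reference, a hedged sketch of how the active/standby state could be double-checked after the manual failover and restarts above, reusing the kerberos-run-command pattern from the log; -getServiceState is the standard hdfs haadmin subcommand for this, and the service IDs match the ones used in the -failover commands.
    # report which namenode is currently active and which is standby
    sudo -u hdfs kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
    sudo -u hdfs kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet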
[12:52:53] (CR) Btullis: [C: +1] Fix hdfs-cleaner script using shaded jar [analytics/refinery] - https://gerrit.wikimedia.org/r/723516 (https://phabricator.wikimedia.org/T217967) (owner: Joal)
[12:54:09] Makes sense. Should we do a refinery-source deploy when this is merged, or wait for the next train?
[12:57:31] btullis: the cleaner will continue to alert until fixed - risk for data-size is low - As you wish - we can deploy now and fix (but it's Friday), or wait for next week's train
[13:01:25] If we wait it means we just get up to about 35 days' worth of tmp files instead of 31 days' - but Icinga alerts from these services until Tuesday. Right? I think I'd be in favour of waiting, personally.
[13:06:08] btullis: works for me - I'm gonna add this patch in the deploy etherpad
[13:06:12] for next week
[13:07:32] btullis: see https://etherpad.wikimedia.org/p/analytics-weekly-train
[13:08:24] also btullis, how are you doing on the various cassandra asks?
[13:08:28] +t
[13:08:34] Great. Shall I: 1) reset the failed status of the three systemd units for these cleaner tasks and 2) set downtime until the end of the train deployment window on Tuesday?
[13:09:08] btullis: If you reset the timer, it's gonna fail anew, no?
[13:09:20] btullis: and downtiming the alarms is a good idea, yes
[13:11:08] As far as the repair of the `mediarequest_per_file data table/data` table, it's been showing 100% complete on the 2nd snapshot since 03:16 this morning, but the task hasn't actually completed.
[13:11:22] `[2021-09-24 03:16:57,086] Repair session 6bfaf820-1b86-11ec-8d9d-cbcce8d668d2 for range (7955956309755473306,7959630889105931638] finished (progress: 100%)`
[13:11:42] right :) Also: if you wish I can follow up on your question on how to check if the refine job has successfully completed
[13:11:49] btullis: the email :)
[13:11:53] I can't start the repair of the next snapshot (3 of 4) until this has finished.
[13:12:09] no problem btullis - it'll start when the other is finished :)
[13:13:48] I'm still working on the scripting for the reloading of the freshly repaired smaller tables, but I should be able to set that going today. There is an open question for you here: https://phabricator.wikimedia.org/T291469#7374977
[13:14:31] oops sorry I missed it
[13:14:35] Answering!
[13:14:47] The refine job failed again with a similar error message, so I re-ran that too. But yes I'd like to learn how to check properly please.
[13:16:02] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (JAllemandou) > @JAllemandou - What about the `local_group_default_T_pageviews_per_project` tables? > Since the data table for this doesn't exist on the new cluster, w...
[13:17:11] btullis: answer in there -^
[13:17:31] then we can batcave when you wish btullis about the logs for the refine job
[13:19:19] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (BTullis) OK, thanks. We already imported the `meta` table in that keyspace didn't we? So should we delete the whole `local_group_default_T_pageviews_per_project` fro...
[13:19:36] joal: See you in the BC.
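To illustrate the fix being reviewed above (Gerrit 723516, "Fix hdfs-cleaner script using shaded jar"): a hypothetical sketch of the kind of one-line classpath change involved in bin/hdfs-cleaner. The "before" line is the one quoted at 09:59 above; the shaded artifact name in the "after" line is an assumption for illustration, not the actual content of the patch.
    # before: the thin jar, which no longer bundles dependencies such as io.circe (hence the NoClassDefFoundError)
    /usr/bin/java -cp /srv/deployment/analytics/refinery/artifacts/refinery-job.jar:$(/usr/bin/hadoop classpath) org.wikimedia.analytics.refinery.job.HDFSCleaner $@
    # after (hypothetical artifact name): point at the shaded jar, which bundles its dependencies
    /usr/bin/java -cp /srv/deployment/analytics/refinery/artifacts/refinery-job-shaded.jar:$(/usr/bin/hadoop classpath) org.wikimedia.analytics.refinery.job.HDFSCleaner $@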
[13:22:42] folks I am starting the roll restart of kafka
[13:22:47] (jumbo)
[13:22:47] ack elukey
[14:13:06] Analytics-Clusters, Analytics-Kanban, Patch-For-Review, User-MoritzMuehlenhoff: Improve user experience for Kerberos by creating automatic token renewal service - https://phabricator.wikimedia.org/T268985 (BTullis) I have updated the Kerberos User Guide with this information: https://wikitech.wik...
[14:13:49] (PS7) Michael DiPietro: add stop status [analytics/quarry/web] - https://gerrit.wikimedia.org/r/719567 (https://phabricator.wikimedia.org/T289349)
[14:14:10] (CR) Ottomata: "Ah, because this is not versioned." [analytics/refinery] - https://gerrit.wikimedia.org/r/723516 (https://phabricator.wikimedia.org/T217967) (owner: Joal)
[14:14:45] Analytics-Clusters, Analytics-Kanban, User-MoritzMuehlenhoff: Improve user experience for Kerberos by creating automatic token renewal service - https://phabricator.wikimedia.org/T268985 (BTullis)
[14:20:23] joal: Good news and bad news from the repair of mediarequest_per_file...
[14:20:28] https://www.irccloud.com/pastebin/H4By1Wn3/
[14:25:40] Analytics, Platform Engineering, Wikibase change dispatching scripts to jobs, serviceops: Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (Ladsgroup) Let me add some points here: - While I agree jobs have better o11y than the current dispatching...
[14:29:30] Analytics, Analytics-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) The repair of this table on this instance completed, but there was an unexpected warning message. ` [2021-09-24 14:04:52,302] Repair command #14 finished in 2 d...
[14:29:55] Analytics, Analytics-Kanban, Data-Engineering: SPIKE - Will Hadoop 3 container support help us for Airflow deployment pipelines? - https://phabricator.wikimedia.org/T288247 (mforns) OK :/
[14:40:18] Analytics: [Session length] Apply different sample rates per wiki - https://phabricator.wikimedia.org/T291693 (mforns)
[14:46:02] Analytics, Analytics-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) By my reading of [[this|https://stackoverflow.com/questions/28436643/lost-notification-from-nodetool-repair]] and [[this|https://issues.apache.org/jira/browse/C...
[14:47:10] !log btullis@aqs1007:~$ sudo nodetool-a repair --full local_group_default_T_mediarequest_per_file data
[14:47:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:59:20] Analytics, Analytics-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) I have started the repair of instance a on aqs1007.eqiad.wmnet ` btullis@aqs1007:~$ sudo nodetool-a repair --full local_group_default_T_mediarequest_per_file da...
[15:02:05] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (BTullis) Right, in that case we can do this: ` sudo cumin 'aqs100[4,7].eqiad.wmnet' --mode async 'nodetool-a snapshot -t T291469' 'nodetool-b snapshot -t T291469' sud...
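A rough sketch (not from the log) for keeping an eye on the full repair started at 14:47 above. nodetool-a is the per-instance wrapper already used in this channel; the instance-a system log path is an assumption about this multi-instance Cassandra layout.
    # validation compactions triggered by the running repair
    sudo nodetool-a compactionstats
    # recent repair session lines, like the one quoted at 13:11 above (log path assumed)
    sudo grep -i 'repair' /var/log/cassandra/system-a.log | tail -n 20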
[15:04:27] 5/9 kafka jumbo nodes restarted, the cookbook will likely finish in an hour or so
[15:05:55] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (BTullis) [x] The oozie loader job isn't running. Creating the snapshots now. ` btullis@cumin1001:~$ sudo cumin --mode async 'aqs100[4,7].eqiad.wmnet' 'nodetool-a snap...
[15:06:18] !log btullis@cumin1001:~$ sudo cumin --mode async 'aqs100[4,7].eqiad.wmnet' 'nodetool-a snapshot -t T291469' 'nodetool-b snapshot -t T291469'
[15:06:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:06:23] T291469: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469
[15:07:42] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (BTullis) Snapshots created successfully. ` 2 hosts will be targeted: aqs[1004,1007].eqiad.wmnet Ok to proceed on 2 hosts? Enter the number of affected hosts to confir...
[15:08:00] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (BTullis)
[15:13:08] urbanecm: o/ when you have a moment (not urgent) do you mind to stop/start your pyspark shell on stat1004? (we are rolling out the new openjdk)
[15:13:57] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (Cmjohnson)
[15:14:08] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (Cmjohnson) BIOS and iDrac setup
[15:14:18] very interesting, on stat1005 there are a lot of users in this situation
[15:22:38] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (BTullis) Here is the reworked rsync transfer script. Trying a dry-run now. ` #!/bin/bash DOMAIN='eqiad.wmnet' SOURCES=('aqs1004' 'aqs1007') DESTINATIONS=('aqs1011' '...
[15:27:50] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (BTullis) I have emptied the destination directories on aqs1010 and aqs1011. This cumin command shows that there are no files present. ` btullis@cumin1001:~$ sudo cumi...
[15:28:35] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (BTullis) Dry-run succeded. Running the rsync now.
[15:51:15] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (BTullis) The rsync appears to have been successful. No errors shown from the script output and the directory sizes all appear to be correct. ` btullis@cumin1001:~$ su...
[15:53:03] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (BTullis)
[16:01:19] \o/ btullis!
[16:14:22] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (BTullis) This is the reload script. I will run it on Monday.
` #!/bin/bash set -e DOMAIN='eqiad.wmnet' HOSTS=('aqs1010' 'aqs1011') INSTANCES=('a' 'b') KEYSPACES=$(c...
[16:29:27] Analytics, Wikipedia-iOS-App-Backlog, Product-Analytics (Kanban), User-Johan: Understand impact of Apple's Relay Service - https://phabricator.wikimedia.org/T289795 (ppelberg)
[16:29:32] Analytics, Wikipedia-iOS-App-Backlog, Product-Analytics (Kanban), User-Johan: Understand impact of Apple's Relay Service - https://phabricator.wikimedia.org/T289795 (ppelberg) //Note: I've added a link to the `Edit Attempts by Browser and OS` Superset dashboard @MNeisler created to the task descr...
[16:53:37] (PS4) Sharvaniharan: Migrate MobileWikiAppDailyStats to MEP Bug: T286000 Change-Id: Ibf65637926bb3be80c64b2c373958a52b034aedb [schemas/event/secondary] - https://gerrit.wikimedia.org/r/722964 (https://phabricator.wikimedia.org/T286000)
[16:55:37] (CR) Sharvaniharan: Migrate MobileWikiAppDailyStats to MEP Bug: T286000 Change-Id: Ibf65637926bb3be80c64b2c373958a52b034aedb (4 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/722964 (https://phabricator.wikimedia.org/T286000) (owner: Sharvaniharan)
[17:13:44] elukey: acknowledged, will do!
[17:32:09] (CR) Michael DiPietro: add stop status (1 comment) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/719567 (https://phabricator.wikimedia.org/T289349) (owner: Michael DiPietro)
[17:40:20] (CR) Bstorm: "Javascript is fun, right? I now get it hanging at `Checking query status...` on a fresh checkout, and I also have lost syntax highlighting" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/719567 (https://phabricator.wikimedia.org/T289349) (owner: Michael DiPietro)
[17:42:23] (CR) Bstorm: "Here's from my JS console:" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/719567 (https://phabricator.wikimedia.org/T289349) (owner: Michael DiPietro)
[17:44:59] (CR) Bstorm: "I just got the exact same output from master, but I think I need to clear caches." [analytics/quarry/web] - https://gerrit.wikimedia.org/r/719567 (https://phabricator.wikimedia.org/T289349) (owner: Michael DiPietro)
[17:45:01] (PS1) MewOphaswongse: Add a link: Update action_data for back and next events to account for navigation type [schemas/event/secondary] - https://gerrit.wikimedia.org/r/723601 (https://phabricator.wikimedia.org/T290316)
[17:54:02] (CR) Bstorm: add stop status (1 comment) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/719567 (https://phabricator.wikimedia.org/T289349) (owner: Michael DiPietro)
[17:56:53] (PS2) MewOphaswongse: Add a link: Update action_data for back and next events to account for navigation type [schemas/event/secondary] - https://gerrit.wikimedia.org/r/723601 (https://phabricator.wikimedia.org/T290316)
[17:57:37] (PS3) MewOphaswongse: Add a link: Update action_data for back, next, suggestion_skip actions to account for navigation type [schemas/event/secondary] - https://gerrit.wikimedia.org/r/723601 (https://phabricator.wikimedia.org/T290316)
[17:58:36] (CR) Bstorm: "Ok, when I'm using a browser that doesn't hate our dev environment, it works nicely! FYI for anyone reading this, the browser is Safari 15" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/719567 (https://phabricator.wikimedia.org/T289349) (owner: Michael DiPietro)
[18:00:30] (PS4) MewOphaswongse: Add a link: Update action_data for back, next actions to account for navigation type [schemas/event/secondary] - https://gerrit.wikimedia.org/r/723601 (https://phabricator.wikimedia.org/T290316)
[18:03:45] Analytics, Platform Engineering, Wikibase change dispatching scripts to jobs, serviceops: Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (Ottomata) > having job parameters in hadoop would be extremely useful, we keep all user requests (up to 90 d...
[18:04:26] (CR) Bstorm: [C: +1] "I think this all checks out now. Playing with it a bit seems nice and solid. It also seems like a virtue that it went to a delta of less t" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/719567 (https://phabricator.wikimedia.org/T289349) (owner: Michael DiPietro)
[18:13:59] (CR) Ottomata: [C: +1] Migrate MobileWikiAppDailyStats to MEP Bug: T286000 Change-Id: Ibf65637926bb3be80c64b2c373958a52b034aedb [schemas/event/secondary] - https://gerrit.wikimedia.org/r/722964 (https://phabricator.wikimedia.org/T286000) (owner: Sharvaniharan)
[18:56:11] hello Analytics, do you guys have your own MaxMind GeoIP license? One that is not the same as what we have on MW appservers?
[19:15:51] mutante: I'm not sure if we have our own, but I should probably be the one to know this... do you know how I could find out?
[19:15:58] Maybe I can hash the license file and post it here
[19:16:00] then you can do the same
[19:16:02] security!
[19:35:40] Analytics, Platform Engineering, Wikibase change dispatching scripts to jobs, serviceops: Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (Ladsgroup) If I can do beeline in stat1005 and look at the data, I don't care about the rest. There is a ge...
[19:50:19] mutante: see our license for yourself at `razzi@puppetmaster1001:/srv/private$ cat /srv/private/modules/secret/secrets/geoip/GeoIP.conf`
[20:21:42] (PS1) Ladsgroup: [WIP] Add script to get some data out of wb_changes [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/723613 (https://phabricator.wikimedia.org/T291276)
[20:23:16] (CR) jerkins-bot: [V: -1] [WIP] Add script to get some data out of wb_changes [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/723613 (https://phabricator.wikimedia.org/T291276) (owner: Ladsgroup)
[21:03:34] Analytics, Event-Platform, Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (mpopov) @Ottomata I would be in favor of a single `npm run build` command and removing the git hook magic to m...
[21:47:27] Analytics-Radar, Product-Analytics (Kanban): [REQUEST] Investigate decrease in New Registered Users - https://phabricator.wikimedia.org/T289799 (Sdkb) @MMiller_WMF, sorry for the delayed reply. There was discussion first [[ https://en.wikipedia.org/wiki/MediaWiki_talk:Signupstart | here ]], where I pushe...
[22:33:46] !log restart an-test-coord presto coordinator service to experiment with web-ui.authentication.type=fixed
[22:33:46] and web-ui.user=user
[22:33:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[23:32:24] razzi: thank you very much for the replies earlier, I had network trouble, reading them now. yes, that file you reference is the one I had found as well, that is installed on the puppetmaster in prod and then other clients can pull from the master instead of all reloading from maxmind. I did just add a new second license bought by the AH team and was trying to figure out how many licenses we
[23:32:30] actually have. thank you
[23:33:00] there might be more databases coming with the new license, btw
[23:33:25] T288844 if you guys are interested, cya later
[23:33:26] T288844: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844
[23:33:59] so this would change the keys above .. or .. we would have to add more than one user/key
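A hypothetical sketch of the comparison razzi floats at 19:15 above (hashing rather than pasting the license): fingerprint only the credential fields of GeoIP.conf so two licenses can be compared without exposing the key. The file path is the one given at 19:50; the field names are an assumption based on the usual geoipupdate config format.
    # print a hash of just the account/license lines, for side-by-side comparison between hosts
    sudo grep -E '^(AccountID|UserId|LicenseKey)' /srv/private/modules/secret/secrets/geoip/GeoIP.conf | sha256sum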