[06:36:55] 10Analytics: Check home/HDFS leftovers of mholloway-shell - https://phabricator.wikimedia.org/T291353 (10MoritzMuehlenhoff) [06:59:43] 10Analytics, 10Analytics-Kanban: Check home/HDFS leftovers of fdans - https://phabricator.wikimedia.org/T290231 (10elukey) @fdansv holaaaa! Anything worth keeping?? [07:01:41] 10Analytics: Check home/HDFS leftovers of kaywong - https://phabricator.wikimedia.org/T291060 (10elukey) ` ====== stat1004 ====== total 0 ====== stat1005 ====== total 4 drwxrwxr-x 6 28580 wikidev 4096 Jul 28 04:42 WikiReliability ====== stat1006 ====== total 0 ====== stat1007 ====== total 0 ====== stat1008 =... [07:03:44] 10Analytics: Check home/HDFS leftovers of jmads - https://phabricator.wikimedia.org/T290715 (10elukey) @MNovotny_WMF Hi! We are wondering if any files belonging to the old account of Jim Maddock are worth keeping/backing up. If you have context, could you please review the above? [07:06:49] 10Analytics, 10Performance-Team: Check home/HDFS leftovers of gilles - https://phabricator.wikimedia.org/T290232 (10elukey) @Krinkle ping :) To unblock this task we could either move all the old home dirs under yours (something like /home/krinkle/gilles/etc..) or only some files, and then drop the rest. What d... [07:53:24] 10Analytics-Clusters, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Product-Analytics: Upgrade Superset to 1.3 - https://phabricator.wikimedia.org/T288115 (10elukey) Tried to do more tests on this, and in my test presto database settings I didn't tick Security -> Impersonate users... [08:12:18] !log remove old /reportcard (password protected, old files from 2012) httpd settings for stats.wikimedia.org [08:12:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:19:32] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (10elukey) The following dirs/files are no longer accessible from stats.wikimedia.org (not sure if anybody has really used them in years): ` elukey@an-web1001:~... [08:54:03] joal: Based on current progress, I estimate that this full repair of the Cassandra 2 cluster is going to take a couple of months at the current rate. We might need to look at changing our technique to make this quicker. [08:54:34] Hi btullis - I had expected that :D [08:54:57] The 2nd biggest table took 2 days and 5 hours. [08:55:01] https://www.irccloud.com/pastebin/yIarQylu/ [08:55:51] And the biggest one is really bigger than the 2nd biggest, right? [08:55:57] But the biggest table (pageviews_per_article_flat) has been running for 13 hours and is only around 3% of the way through. [08:56:11] hm [08:56:41] ...and our current draft of how to do it says we repeat this procedure 4 times in total. [08:57:17] The thing is, I don't know how cassandra provides us with feedback - could it be that the beginning of the process takes longer (in regard to the percentage reported)? [08:59:40] I don't think so. I think it's linear. I can paste the whole command output as a pastebin if you like. [08:59:41] hm - the other strategy I can think of is to load 8 dumps - not sure if it'll be faster though [08:59:51] nah I trust you btullis [09:00:57] btullis: given the time it takes, we could go for full-repairs for all-tables-but-1, and take the other approach for the biggest tables?
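The progress figures quoted above come from watching the repair as it runs. As a rough illustration (not necessarily the exact commands used here), repair activity on a node can be observed with standard nodetool subcommands; the `nodetool-a`/`nodetool-b` per-instance wrappers are an assumption about this multi-instance setup, and plain `nodetool -p <jmx-port>` would be the generic form:

    # Validation compactions triggered by an in-flight repair show up here:
    nodetool-a compactionstats

    # Streaming between replicas caused by the repair shows up here:
    nodetool-a netstats

    # The `nodetool repair --full ...` command itself also logs per-range progress to its
    # stdout and to the Cassandra system.log, which is presumably where rough
    # "percent complete" estimates like the 3% above are derived from.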
[09:01:13] Other options: [09:01:13] 1) Use the `--partitioner-range` option https://cassandra.apache.org/doc/latest/cassandra/operating/repair.html#other-options to restrict the work of each repair operation [09:01:13] 2) Use more threads on the source. [09:02:18] btullis: the load has already been relatively high the past 2 days - I'm not sure if adding more threads would be a good idea for the system :S [09:02:19] This is useful: https://cassandra.apache.org/doc/latest/cassandra/operating/repair.html#usage-and-best-practices [09:02:19] > By default, repair will operate on all token ranges replicated by the node you're running repair on, which will cause duplicate work if you run it on every node. The -pr flag will only repair the "primary" ranges on a node, so you can repair your entire cluster by running nodetool repair -pr on each node in a single datacenter. [09:02:54] btullis: we wish to only run repair on 1004 and 1007 - therefore NOT using -pr [09:05:30] Yes, I see what you mean. I was just wondering if what they meant by 'datacentre' in that statement was equivalent to the 'rack' that we are using. [09:06:07] btullis: datacenter is a different concept for cassandra than the rack one [09:07:01] btullis: cassandra can replicate data across datacenters, making the DCs host different rings (therefore for repair you need to consider a full DC/ring) [09:08:53] OK, gotcha. Thanks. [09:15:09] Why 8 dumps? Do you think that would contain everything? Wouldn't we need 12 dumps to be sure? [09:16:09] you're absolutely right btullis - 12 dumps needed [09:16:14] instead of 4 [09:24:15] We could try parallel `sstableloader` loading of dumps on the destination servers. Would stress the network and the aqs101[1-5] hosts a lot more, but we can't run the repair operations in parallel at all. [09:33:26] btullis: hm - knowing that there is compaction at work I wouldn't do multi-loading - it would stress the system a great deal [09:50:46] Agreed, but they're not actually *serving* anything at the moment, so that stress wouldn't necessarily cause any threat to a production service. I mean, we could theoretically take the new servers to the red-line in terms of load, just while the data is being loaded and compacted. I'm not necessarily advocating going that far, just trying to think of ways to reduce this 2-month lag without serious risk. [09:52:30] makes sense btullis [09:52:57] btullis: Could we try launching repair on hosts for all-but-1? Do we agree on that? [09:59:44] Yes. There is no option to exclude a table from the nodetool command, so we would have to script several `nodetool repair --full` commands per instance (4-7,a-b), each specifying keyspace and tables. [10:00:48] Given that we have already repaired one `mediarequest_per_file` and it took 2 days, we can complete the remainder of these repairs in 6 more days. [10:01:25] Maybe 7 [10:03:39] right [10:04:06] I'm still wondering about the strategy [10:04:33] We should in any case go the repair way for the smaller ones (all-but-2) [10:06:06] Yes, agreed. Repair all-but-2 tables today. [10:08:36] ack btullis - thanks for that - I'm gonna do some more thinking around this [10:09:47] So are you considering what to do about `mediarequest_per_file`? Choosing between: [10:09:48] 1) 7-day repair, create 4 snapshots, then transfer and load [10:09:48] 2) create 12 snapshots, then transfer and load in parallel (or load sequentially if needed) [10:10:19] correct btullis - I can't think of any other solution [10:12:03] OK.
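A minimal sketch of the "script several `nodetool repair --full` commands per instance, skipping the biggest tables" idea agreed on above. The `tables_to_repair.txt` file, the skip pattern, and the `nodetool-a`/`nodetool-b` per-instance wrappers are illustrative assumptions, not the actual script used:

    #!/bin/bash
    # Run a full (non-incremental) repair for each keyspace/table on both local instances,
    # skipping the two largest tables, which are handled separately.
    set -euo pipefail

    SKIP_PATTERN='pageviews_per_article_flat|mediarequest_per_file'

    for instance in a b; do
        while read -r keyspace table; do
            if [[ "${keyspace}.${table}" =~ ${SKIP_PATTERN} ]]; then
                echo "Skipping ${keyspace}.${table} (too large for this pass)"
                continue
            fi
            echo "Full repair of ${keyspace}.${table} on instance ${instance}"
            # nodetool repair accepts an optional keyspace and table list after its options.
            "nodetool-${instance}" repair --full "${keyspace}" "${table}"
        done < tables_to_repair.txt   # hypothetical file of "<keyspace> <table>" pairs, one per line
    done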
Shall I interrupt the current repair operation on `pageviews_per_article_flat` and concentrate on repairing all of the other small tables? [10:13:11] btullis: I think it's wise [10:16:19] OK, will do. [13:03:23] o/ [13:03:28] \o [13:03:34] Hello. [13:04:08] "The A-team's handshake" :D [13:05:23] 10Analytics: Agree on a repository structure for Airflow-related code - https://phabricator.wikimedia.org/T290664 (10Ottomata) airflow-jobs ? [13:29:59] 10Analytics: Agree on a repository structure for Airflow-related code - https://phabricator.wikimedia.org/T290664 (10mforns) > airflow-jobs ? I like [13:32:11] hello teammm [13:36:42] helllo! [13:37:20] 10Analytics, 10Analytics-Kanban: Improve Refine bad data handling - https://phabricator.wikimedia.org/T289003 (10Ottomata) Hm, I just tried adding some tests in refinery-source for this, and everywhere I try `is_wmf_domain` gets set to false. I cannot repro :/ [13:41:44] All of the following keyspaces have now been repaired on aqs1004 and aqs1007: [13:41:49] https://www.irccloud.com/pastebin/tgJf8dTn/ [13:42:30] \o/ [13:46:38] 10Analytics: Standardize the stats system user uid - https://phabricator.wikimedia.org/T291384 (10Ottomata) [13:47:09] 10Analytics: Standardize the stats system user uid - https://phabricator.wikimedia.org/T291384 (10Ottomata) The stats user is declared in `statistics::user`. [13:47:53] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (10Ottomata) For reference, here is Dan's response from Slack: > The old geowiki data has been disabled for years and everyone I know uses geoeditors instead,... [13:55:13] joal: do we have a task for spark 3? [13:55:15] i don't think so, right? [13:55:24] hm, I think we do - let me check [13:56:05] actually you're right ottomata - we don't!!! [13:56:09] i will make one! [13:56:14] joal: .........can we do it?! [13:56:19] thank you :) [13:56:40] Well, I hope we do!!! [13:56:45] 10Analytics: Upgrade to Spark 3 - https://phabricator.wikimedia.org/T291386 (10Ottomata) [13:57:00] 10Analytics: Upgrade to Spark 3 - https://phabricator.wikimedia.org/T291386 (10Ottomata) [13:57:02] 10Analytics: Refine: Use Spark SQL instead of Hive JDBC - https://phabricator.wikimedia.org/T209453 (10Ottomata) [13:57:09] (03PS6) 10Ottomata: [WIP] Update to spark-3 and scala-2.12 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/656897 (https://phabricator.wikimedia.org/T291386) (owner: 10Joal) [13:57:35] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Update to spark-3 and scala-2.12 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/656897 (https://phabricator.wikimedia.org/T291386) (owner: 10Joal) [14:01:20] Gone for kids, back at standup [14:04:48] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Remove all debian python-* and other user requested packages installed for analytics clients, use conda instead - https://phabricator.wikimedia.org/T275786 (10Ottomata) I'm going to close this task. Remaining work can be done as part of {T286743} [14:10:19] 10Analytics, 10Analytics-Kanban, 10Discovery-Search, 10Patch-For-Review: Publish both shaded and unshaded artifacts from analytics refinery - https://phabricator.wikimedia.org/T217967 (10Ottomata) We still need to update various jobs that use these jars, but that can happen whenever we need to upgrade vers...
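For the snapshot-then-load option weighed above for the largest tables, the flow per keyspace would look roughly like the sketch below. The snapshot tag, data-directory layout, destination path, and the `.eqiad.wmnet` host suffix are assumptions for illustration; only `nodetool snapshot`, `rsync`, and `sstableloader` themselves are standard tools:

    # 1) On each source instance, snapshot the keyspace (placeholders in angle brackets).
    nodetool-a snapshot -t pre-migration <keyspace>

    # 2) Copy the snapshot SSTables to one of the destination hosts (aqs101[1-5]).
    rsync -a /srv/cassandra-a/data/<keyspace>/<table>-<uuid>/snapshots/pre-migration/ \
        aqs1011.eqiad.wmnet:/srv/loading/<keyspace>/<table>/

    # 3) On the destination, stream the SSTables into the new cluster; sstableloader
    #    expects the final two path components to be <keyspace>/<table>.
    sstableloader -d aqs1011.eqiad.wmnet /srv/loading/<keyspace>/<table>

    # Several sstableloader processes could run in parallel against different tables,
    # at the cost of the extra compaction load discussed above.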
[14:14:24] (03PS7) 10Ottomata: [WIP] Update to spark-3 and scala-2.12 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/656897 (https://phabricator.wikimedia.org/T291386) (owner: 10Joal) [14:48:52] (03PS3) 10MNeisler: Add the content_translation_event stream to the allowlist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/716339 (https://phabricator.wikimedia.org/T281511) [14:53:56] (03CR) 10MNeisler: Add the content_translation_event stream to the allowlist (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/716339 (https://phabricator.wikimedia.org/T281511) (owner: 10MNeisler) [14:54:00] (03CR) 10Mforns: "LGTM! I left an indentation comment. Once fixed, will merge! Thanks" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/716339 (https://phabricator.wikimedia.org/T281511) (owner: 10MNeisler) [14:54:31] hmm oh i thought spark 3 was in bigtop [14:54:33] elukey: ^ do you know? [14:54:42] seems like just 2.4.5? [14:55:25] ottomata: it is in the next bigtop release that should come out in a few weeks [14:55:32] ohhhh hm [14:55:33] cool [14:58:00] (03PS4) 10MNeisler: Add the content_translation_event stream to the allowlist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/716339 (https://phabricator.wikimedia.org/T281511) [15:04:08] (03CR) 10Neil P. Quinn-WMF: [C: 03+1] "Looks good! Thanks, Megan 😊" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/716339 (https://phabricator.wikimedia.org/T281511) (owner: 10MNeisler) [15:40:15] 10Analytics: Refactor analytics-meta MariaDB layout to multi instance with failover - https://phabricator.wikimedia.org/T284150 (10Ottomata) [16:20:42] 10Analytics, 10Analytics-Kanban: Check AQS with cassandra (serving + data) - https://phabricator.wikimedia.org/T290068 (10JAllemandou) [16:20:46] 10Analytics-Clusters, 10Cassandra, 10Data-Engineering, 10Data-Engineering-Kanban, and 2 others: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (10JAllemandou) [16:36:44] 10Analytics, 10Patch-For-Review: Upgrade Refinery Jobs to Spark 3 - https://phabricator.wikimedia.org/T291386 (10odimitrijevic) [16:37:41] 10Analytics, 10Data-Engineering, 10Patch-For-Review: Upgrade Refinery Jobs to Spark 3 - https://phabricator.wikimedia.org/T291386 (10odimitrijevic) p:05Triage→03Medium [16:38:09] 10Analytics, 10Data-Engineering, 10Patch-For-Review: Upgrade Refinery Jobs to Spark 3 - https://phabricator.wikimedia.org/T291386 (10odimitrijevic) [16:39:04] 10Analytics: Check home/HDFS leftovers of mholloway-shell - https://phabricator.wikimedia.org/T291353 (10odimitrijevic) p:05Triage→03High [16:39:21] 10Analytics, 10Analytics-Kanban: Check home/HDFS leftovers of kaywong - https://phabricator.wikimedia.org/T291060 (10odimitrijevic) [16:39:52] 10Analytics, 10Analytics-Kanban: Check home/HDFS leftovers of mholloway-shell - https://phabricator.wikimedia.org/T291353 (10odimitrijevic) [16:42:33] 10Analytics, 10Analytics-Kanban: Check home/HDFS leftovers of jmads - https://phabricator.wikimedia.org/T290715 (10odimitrijevic) [16:53:03] ottomata: how long before I can borrow a few minutes to talk about the event-s ticket? [17:23:45] oh joal how about now? [17:23:51] or maybe in 5 mins? [17:36:18] 10Analytics, 10Data-Engineering, 10Patch-For-Review: Upgrade Refinery Jobs to Spark 3 - https://phabricator.wikimedia.org/T291386 (10Ottomata) Hm, I think this task is also about installing and supporting Spark 3 in favor of Spark 2, with the eventual goal of removing Spark 2. This means making sure everyth...
[17:45:28] 10Analytics, 10Event-Platform, 10Patch-For-Review: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (10Ottomata) @EYener ah so, the ask is different than the steps outlined in T259163. This task is about making the WikipediaPortal code itself work with Event Platform. Ri... [17:46:03] 10Analytics, 10Event-Platform, 10Patch-For-Review: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (10Ottomata) See also {T262433} [17:49:47] 10Analytics, 10Event-Platform, 10Patch-For-Review: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (10EYener) Hi @Ottomata! Actually that would be super helpful. Would you mind picking anything on my calendar that is open and works for you? I'll remove unnecessary events,... [18:10:32] (03CR) 10Andrew Bogott: [C: 03+2] "recheck" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/721363 (owner: 10Andrew Bogott) [18:11:18] (03CR) 10jerkins-bot: [V: 04-1] Added test_user.py [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/721363 (owner: 10Andrew Bogott) [18:11:57] (03CR) 10Andrew Bogott: [C: 03+2] "recheck" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/721363 (owner: 10Andrew Bogott) [18:14:27] (03Merged) 10jenkins-bot: Added test_user.py [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/721363 (owner: 10Andrew Bogott) [19:00:23] hey ottomata - I went for dinner :) [19:01:02] ottomata: tomorrow it'll be :) [19:52:32] 10Analytics, 10Event-Platform, 10Patch-For-Review: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (10Ottomata) @EYener and I met today and we are going to have to sync up with some FRtech team members about this. @EYener, so any of the tasks listed under 'Schemas produc... [19:57:06] joal: ahhhh sorry [20:43:20] 10Analytics, 10Performance-Team: Check home/HDFS leftovers of gilles - https://phabricator.wikimedia.org/T290232 (10Krinkle) That's fine yeah, just transfer them all and I'll take care of it.