[01:22:33] RECOVERY - Check unit status of monitor_refine_event on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:04:05] elaragon: Hi - I'm gonna kill your running job - with its current setup it prevents other jobs from succeeding [07:06:15] elaragon: when using the yarn-large kernel, the rate at which spark tries to read data is very high, preventing the machine from giving other jobs the disk capacity they need [07:07:27] elaragon: I suggest using a non-standard spark configuration, with fewer executor-cores and less executor-memory, leading to a smaller number of workers executing in parallel - the job will take longer, but it should succeed and not prevent other jobs from running [07:17:12] (03CR) 10Awight: "This change is ready for review." [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/719220 (https://phabricator.wikimedia.org/T272589) (owner: 10Awight) [07:47:00] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+2] Backfill some aggregations [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/719220 (https://phabricator.wikimedia.org/T272589) (owner: 10Awight) [07:47:40] (03CR) 10Awight: [V: 03+2] Backfill some aggregations [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/719220 (https://phabricator.wikimedia.org/T272589) (owner: 10Awight) [07:48:31] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Convert `published_cx2_translations` to native HiveHQL [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/682747 (https://phabricator.wikimedia.org/T193169) (owner: 10Awight) [07:50:37] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+2] "Looks like it needs a rebase now. Otherwise let's just merge this in. It was sitting here long enough." 
[analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/682748 (https://phabricator.wikimedia.org/T193169) (owner: 10Awight) [07:50:59] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+2] Whitespace-only [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/682752 (owner: 10Awight) [07:52:19] (03PS2) 10Awight: Convert `reference-previews` to native HiveHQL [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/682748 (https://phabricator.wikimedia.org/T193169) [07:53:22] (03CR) 10Awight: "PS 2: manual rebase, over the change that removed `funnel`." [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/682748 (https://phabricator.wikimedia.org/T193169) (owner: 10Awight) [07:59:34] elaragon: Ping again - Killing the task again so that other jobs don't fail [08:10:45] elaragon: ping ping! [08:18:45] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+2] "I checked the rebase and it looks fine." [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/682748 (https://phabricator.wikimedia.org/T193169) (owner: 10Awight) [08:19:13] (03CR) 10Awight: [V: 03+2] Convert `reference-previews` to native HiveHQL [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/682748 (https://phabricator.wikimedia.org/T193169) (owner: 10Awight) [08:26:42] (03CR) 10Awight: [V: 03+2] Whitespace-only [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/682752 (owner: 10Awight) [08:33:02] 10Analytics, 10Analytics-Kanban: Update mediawiki-history jobs spark settings - https://phabricator.wikimedia.org/T290469 (10JAllemandou) [08:33:15] 10Analytics, 10Analytics-Kanban: Update mediawiki-history jobs spark settings - https://phabricator.wikimedia.org/T290469 (10JAllemandou) a:03JAllemandou [08:33:33] (03PS2) 10Joal: Grow mediawiki-history oozie jobs resources [analytics/refinery] - 10https://gerrit.wikimedia.org/r/719111 (https://phabricator.wikimedia.org/T290469) [08:43:51] @joal: I just saw your messages, thanks! 
[08:44:43] ack elaragon :) [09:13:17] Hi elukey - quick question - would you have an idea as to why new aqs hosts don't show up in cassandra dashboards? [09:14:19] checking [09:15:47] I suspect because we are not collecting metrics [09:17:29] yep [09:17:43] elaragon: you're hitting different issues with your job - Can you please grow the driver-memory to 8G, the executor-memory to 8G as well, reduce the number of executors to 80, and when trying your code do it on a single wiki instead of all? [09:18:28] joal: so we are collecting metrics for role(aqs) but not for role(aqs_next) [09:18:37] elukey: I have a last optimization for when you want to extract the text of the 14M revisions - We can ask spark to force a map-side join - it should be a lot faster [09:18:41] Ahhhh [09:19:11] woop woops sorry elukey wrong ping [09:19:19] elaragon: I have a last optimization for when you want to extract the text of the 14M revisions - We can ask spark to force a map-side join - it should be a lot faster [09:19:25] thanks a lot elukey [09:19:42] so we could add a specific target to the prometheus master config [09:19:59] and call the cluster aqs_next as well [09:20:05] (to avoid mixing the metrics) [09:20:15] but then we'll have to keep the naming after the migration [09:20:18] or [09:20:37] we could use something like profile::aqs to select nodes, but then everything will get mixed [09:22:19] elukey: we're discussing that with Ben and Hugh right now [09:22:26] btullis, hnowlan --^ [09:22:59] ah perfect [09:23:00] :) [09:25:21] (03PS1) 10GoranSMilovanovic: T283575 [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/719229 [09:25:33] (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] T283575 [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/719229 (owner: 10GoranSMilovanovic) [09:30:09] @joal: that is a spark broadcast join, right? 
[09:30:22] eh, no :) [09:31:36] (03PS1) 10GoranSMilovanovic: T283568 [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/719232 [09:32:03] (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] T283568 [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/719232 (owner: 10GoranSMilovanovic) [09:32:20] or, maybe elaragon :) I'm not sure, I need to see the code :) [09:38:23] Here is the patch which I think will add the scrape target: https://gerrit.wikimedia.org/r/c/operations/puppet/+/719233 [10:11:31] It didn't have quite the desired outcome. The metrics for aqs and aqs_next are merged in the existing Grafana dashboards, under the `aqs` cluster variable. I was expecting it to create an aqs_next cluster. [10:12:05] !log Kill cassandra-hourly loading job for cluster-migration first step [10:12:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:12:14] btullis, hnowlan --^ :) [10:12:16] https://usercontent.irccloud-cdn.com/file/zefEB7hs/image.png [10:12:49] joal: ack. Thanks. [10:19:20] btullis: I think I know why it happened - in the hiera config for role::aqs_next we have "cluster: aqs" [10:20:04] so the puppet code picked it up [10:20:19] I think it is fine for the moment, it will be clear if there is a problem [10:20:32] and old metrics will fade away eventually [10:21:58] elukey: Thanks. We thought so too. We're about to run the script to truncate the tables in the new cluster. 
[10:25:11] !log truncating data tables on aqs_next cluster [10:25:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:27:52] PROBLEM - aqs endpoints health on aqs1013 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views return [10:27:52] unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpect [10:27:53] us 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:28:06] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views return [10:28:06] unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the 
unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpect [10:28:07] us 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:28:19] ^ expected [10:28:20] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views return [10:28:20] unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpect [10:28:21] us 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:28:30] PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): 
/analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views return [10:28:30] unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpect [10:28:30] us 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:29:18] ACKNOWLEDGEMENT - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page view [10:29:18] ned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the [10:29:18] ted status 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Hnowlan Tables being reinitialised 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:29:20] ACKNOWLEDGEMENT - aqs endpoints health on aqs1013 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page view [10:29:20] ned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the [10:29:21] ted status 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Hnowlan Tables being reinitialised https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:29:23] ACKNOWLEDGEMENT - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page view [10:29:23] ned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): 
/analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the [10:29:24] ted status 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Hnowlan Tables being reinitialised https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:29:26] ACKNOWLEDGEMENT - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page view [10:29:26] ned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the [10:29:27] ted status 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Hnowlan Tables being reinitialised https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:41:23] (03PS4) 10Jgiannelos: Map tile state change event schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/716219 (https://phabricator.wikimedia.org/T289771) [10:41:57] (03CR) 10jerkins-bot: [V: 04-1] Map tile state change event schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/716219 
(https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [10:43:13] (03CR) 10Jgiannelos: Map tile state change event schema (034 comments) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/716219 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [10:45:16] (03PS5) 10Jgiannelos: Map tile state change event schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/716219 (https://phabricator.wikimedia.org/T289771) [11:11:44] elaragon: sorry I misinterpreted your question about spark broadcast join - Indeed, I was talking about map-side join, named broadcast-join in spark logical plan [11:12:38] elaragon: using this trick, the query doesn't need to shuffle the whole text to join with the ids, only to parse it - this should help significantly (but only works when one side of the join is small enough) [11:19:45] hey team :] [11:19:54] joal: is the aqs migration ongoing? [11:31:01] Hi mforns - The initial step has started yes :) [11:31:16] joal: are the alerts related then? [11:31:21] hi :] [11:31:22] correct sir :) [11:31:25] ok ok [11:32:49] (03CR) 10Mforns: [C: 03+1] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/719111 (https://phabricator.wikimedia.org/T290469) (owner: 10Joal) [11:34:09] joal: the mediawiki checker failed yesterday, but I see it's good now, you fixed it? [11:34:23] I did mforns :) [11:37:59] !log Re-Add test rows in cassandra3 cluster after tables got truncated [11:38:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:38:53] joal: can I do something re. migration? 
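For reference, the map-side (broadcast) join joal describes at 11:12:38 can be sketched in plain Python. This is only an illustration of the idea, not the query that was run: the function name `map_side_join` and the sample rows are hypothetical. The small side is loaded into an in-memory lookup table, and the large side is streamed past it once, so the large side never has to be shuffled or sorted. In Spark itself the same effect is usually requested with the `broadcast()` function from `pyspark.sql.functions` or a `MAPJOIN` SQL hint.

```python
from typing import Dict, Iterable, Iterator, Tuple


def map_side_join(
    large_rows: Iterable[Tuple[int, str]],
    small_rows: Iterable[Tuple[int, str]],
) -> Iterator[Tuple[int, str, str]]:
    """Inner-join two (key, value) streams by key.

    Hypothetical helper for illustration only. The small side is copied
    into a dict (the "broadcast" step); the large side is read once and
    probed against that dict, so it is never shuffled.
    """
    # Build the lookup from the small side; this only works if it fits in memory.
    lookup: Dict[int, str] = dict(small_rows)
    # Stream the large side and keep only the rows whose key is in the lookup.
    for key, value in large_rows:
        if key in lookup:
            yield key, value, lookup[key]


if __name__ == "__main__":
    # Made-up sample data: (revision id, text) and (revision id, label).
    revisions = [(1, "text a"), (2, "text b"), (3, "text c")]
    wanted = [(2, "keep"), (3, "keep too")]
    for row in map_side_join(revisions, wanted):
        print(row)
```

As joal notes, the approach only pays off when one side is small enough to be held in memory on every worker.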
[11:39:29] RECOVERY - aqs endpoints health on aqs1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:39:50] mforns: so far so good, no need I think [11:39:57] k [11:40:01] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:40:53] RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:41:01] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:41:31] !log Restarting cassandra hourly loading job after C2 snapshot taken and C3 tables truncated [11:41:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:07:55] elukey: heya - if you have a minute we have an interesting one :) [12:09:08] sure [12:11:28] * mforns leaving for 40 mins [12:13:22] joal: ? [12:13:33] Ooops excuse me elukey [12:14:48] elukey: we have an interesting case with meta tables - they are replicated! [12:15:21] joal: thanks for the MAPJOIN tip, the query worked perfectly :) [12:15:29] \o/ [12:16:20] elaragon: when dealing with really big data (here 20Tb), this makes a big difference - You didn't have to copy those 20Tb over [12:17:15] BTW, I'm getting this error when writing in AVRO format: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;' [12:17:36] hm [12:17:38] am I missing some spark configuration setting in my session? [12:18:12] elaragon: actually you're missing a dependency :S [12:18:16] MEH [12:18:27] this is uncool [12:20:03] elaragon: I'm trying to find a quick hack to make this work [12:20:09] elaragon: give me a minute [12:20:39] no rush and many thanks! 
[12:24:19] joal: is there anything problematic about the replication? [12:24:40] elukey: Nope - keyspace defines rep factor of 3 as expected [12:25:17] elukey: the idea I got is that the write to this table is made using 'localOne' consistency, hence making cassandra not replicate the data [12:25:21] but I'm not sure [12:26:01] the current replication for the system table is 12 IIRC for this reason, but do you recall the mess that happened when we tried to apply it? (users disappearing etc..) [12:26:23] yeah, I kinda remember [12:26:31] maybe something similar is at stake [12:26:42] interestingly, the same thing happens on the new cluster [12:26:52] I suggest doing these kinds of changes before switching, so it will be more controlled [12:26:59] after we re-created the users it worked fine [12:27:19] elukey: it's not internal tables, nothing changes for those - it's AQS meta tables [12:27:31] localOne means write locally and then return, eventual consistency will do its job right ? [12:27:55] joal: sure but I don't recall exactly what AQS meta tables are :) [12:28:28] Ah elukey - they are the extremely small tables containing data schemas, in order to handle schema change [12:30:46] joal: what is your idea? 
[12:31:17] I don't get the part of "making cassandra not replicate the data" [12:31:33] (trying to understand how I can be helpful, is a bit difficult without all the context :D) [12:31:36] elukey: the meta table is in the same keyspace as the data ones, so the rep-factor applies normally to all [12:32:05] elukey: I'm just keeping you updated on our interesting findings - I probably shouldn't and let you do other things :S [12:34:51] joal: nono I am happy to get news but I didn't get what was the input that was needed from me, I tried to read a bit above but I was confused :) [12:35:04] I can imagine elukey :) [12:35:31] The weird thing is: data of those `meta` tables is only present on single instances across the whole cluster [12:35:47] elukey: --^ [12:36:02] elukey: and while I have an idea as to possibly why, I'm completely unsure :) [12:36:17] elaragon: I have a solution for you :) [12:36:33] joal: ah ok now it is clear! [12:36:42] elaragon: in your spark session definition, add extra-settings as shown: [12:36:59] spark = wmf.spark.get_session( ..., extra_settings = {'spark.jars': '/srv/deployment/analytics/refinery/artifacts/refinery-job.jar'}) [12:37:04] elaragon: --^ [12:37:32] joal: very weird [12:37:35] normally with that you can then do: df.write.format("avro").save("/your/path") [12:37:43] elukey: indeed!!! 
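For reference, joal's fix at 12:36:59 boils down to passing one extra Spark setting, `spark.jars`, that points at a jar providing the Avro data source. A minimal sketch of the same idea with plain PySpark follows, for readers who do not use the `wmf.spark.get_session` helper. The function names `avro_extra_settings` and `build_session` are hypothetical, and the assumption that the refinery jar supplies the Avro source comes only from the messages above.

```python
from typing import Dict


def avro_extra_settings(jar_path: str) -> Dict[str, str]:
    """Return the extra Spark setting from the chat: add one jar to the session.

    Hypothetical helper; 'spark.jars' is a standard Spark property that takes
    a comma-separated list of jars to put on the driver and executor classpaths.
    """
    return {"spark.jars": jar_path}


def build_session(app_name: str, extra_settings: Dict[str, str]):
    """Create a SparkSession with the given settings applied (sketch only)."""
    # Imported lazily so the settings helper above works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in extra_settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    # Jar path copied from the chat message; everything else is illustrative.
    settings = avro_extra_settings(
        "/srv/deployment/analytics/refinery/artifacts/refinery-job.jar"
    )
    print(settings)
    # With a working Spark installation one could then try, as joal suggests:
    #   spark = build_session("avro-example", settings)
    #   df.write.format("avro").save("/your/path")
```

The setting has to be in place when the session is created; changing it on an already running session is not expected to load the jar.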
[12:38:08] as I was saying, the tables live in keyspaces defining rep-factor of 3, so normally the data should be replicated [12:38:12] WEIRDOH [12:38:27] joal: re-re-re-thanks :) [12:38:41] elaragon: happy to help :) [12:39:39] joal: FYI I also increased spark.sql.broadcastTimeout from 300 to 1000 for the MAPJOIN [12:40:29] nice tip elaragon - the data you're broadcasting is bigger than what is automatically done, so I'm surprised [12:48:19] back :] [13:00:56] (03CR) 10Michael DiPietro: [C: 03+2] update config to match for celery 6 [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/717443 (https://phabricator.wikimedia.org/T290328) (owner: 10Michael DiPietro) [13:01:14] (03CR) 10Michael DiPietro: [C: 03+2] close quarry db dropdown on tab [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/717592 (https://phabricator.wikimedia.org/T289872) (owner: 10Michael DiPietro) [13:01:40] (03Merged) 10jenkins-bot: update config to match for celery 6 [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/717443 (https://phabricator.wikimedia.org/T290328) (owner: 10Michael DiPietro) [13:01:50] (03Merged) 10jenkins-bot: close quarry db dropdown on tab [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/717592 (https://phabricator.wikimedia.org/T289872) (owner: 10Michael DiPietro) [13:11:42] 10Quarry: Close quarry db autocompletion on tab - https://phabricator.wikimedia.org/T289872 (10mdipietro) 05Open→03Resolved [13:27:22] 10Quarry: celery version six preparation - https://phabricator.wikimedia.org/T290328 (10mdipietro) 05Open→03Resolved [13:55:39] 10Analytics, 10Event-Platform, 10Metrics-Platform, 10Goal: BUOD-KR1-Q3: Require that all new schema/instruments are created with the MEP system - https://phabricator.wikimedia.org/T259157 (10Mholloway) [14:33:49] 10Analytics, 10Cassandra, 10Data-Engineering, 10Data-Engineering-Kanban, 10Epic: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (10BTullis) Here is the migration plan 
document: https://docs.google.com/document/d/1FGub_rRIrv77Miadp0Muvf6EwpbvcW2dtZ_qICSt-2o/edit The sn... [15:13:52] (03CR) 10Mholloway: [C: 03+1] Map tile state change event schema (031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/716219 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [15:21:32] 10Quarry: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (10GeoffreyT2000) p:05Triage→03High Perhaps, this should be triaged as "High" or even "Unbreak Now!" priority. For now, I am going to set this as "High" priority, but if anyone thinks that this sho... [15:49:54] 10Quarry: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (10mdipietro) We could pull the stop function. Though that would still orphan jobs stuck running, they will not be killed until something like OOM killer comes and gets them, where there was previously... [15:57:56] (03PS1) 10Joal: Add num-partitions param to mw-history checkers [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/719290 (https://phabricator.wikimedia.org/T290469) [16:19:22] (03PS1) 10GoranSMilovanovic: T283571 [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/719294 [16:19:40] (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] T283571 [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/719294 (owner: 10GoranSMilovanovic) [16:32:56] (03PS1) 10Andrew Bogott: blubber def: first pass [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/719300 [16:34:41] (03PS2) 10Andrew Bogott: blubber def: first pass [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/719300 [16:39:34] hnowlan: joal: We may have a capacity issue on the new cassandra servers when restoring the snapshot. Just holding a copy of the snapshot takes the usage up to 60% and we are supposed to be restoring the snapshot to the same disk. 
[16:39:40] https://usercontent.irccloud-cdn.com/file/R9IEGp7G/image.png [16:39:54] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&from=now-6h&to=now&var-server=aqs1011&var-datasource=thanos&var-cluster=aqs [16:41:38] hm [16:42:00] btullis, hnowlan - maybe we need to do the thing incrementally? [16:42:08] Maybe it won't be an issue if we only store 1/2 of the size of the snapshot on each host though. 6 hosts, 3 copies of the data. So maybe that only adds another 30% and takes us to 90%. [16:42:27] we can move some of the tables to the other disk as we restore maybe [16:42:41] if we load the smaller tables we can also just delete their backup data [16:42:42] possible hnowlan [16:43:30] 2.0TB is hopefully around the peak of what will be transferred for that host [16:43:41] it's another not ideal thing in a series of extremely not ideal things :) [16:44:41] Hello, I noticed a weird difference between https://dumps.wikimedia.org/other/pageview_complete and https://dumps.wikimedia.org/other/pageviews/ that I'd like to ask about. If I download data from both sources for 2020-09-07 and look for pageviews of `Jiří_Menzel` at Czech Wikipedia, both datasets give slightly different numbers, see my fiddling at https://phabricator.wikimedia.org/P17250. [16:44:42] aiui that instance's transfer is very close to finished [16:44:44] What causes this difference? 
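For reference, btullis's back-of-the-envelope estimate at 16:42:08 can be written out as a small calculation. This is a sketch under simplifying assumptions that are not stated in the chat: the 60% figure is treated as entirely due to the snapshot copy, and restored data is assumed to spread evenly across hosts. The function name `projected_disk_usage` is hypothetical.

```python
def projected_disk_usage(
    snapshot_pct: float, hosts: int, replication_factor: int
) -> float:
    """Estimate per-host disk usage (in percent) after restoring a snapshot.

    Hypothetical helper. Assumes the snapshot copy alone accounts for
    `snapshot_pct` of the disk and that restored data is spread evenly,
    so each host ends up holding replication_factor / hosts of the data.
    """
    # Share of the full dataset that each host stores after the restore.
    share_per_host = replication_factor / hosts
    # Extra usage added by the restored tables on top of the snapshot copy.
    restored_pct = snapshot_pct * share_per_host
    return snapshot_pct + restored_pct


if __name__ == "__main__":
    # Numbers from the chat: 60% used by the snapshot, 6 hosts, 3 copies.
    print(projected_disk_usage(60, hosts=6, replication_factor=3))  # 90.0
```

With the figures quoted in the chat this reproduces the 90% estimate, which is why hnowlan suggests deleting backup data for smaller tables or moving some tables to the other disk during the restore.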
[16:45:26] 40341 and 40066 is quite close to each other, but it's still a difference that I'm unable to explain [16:59:56] (03PS1) 10GoranSMilovanovic: minor [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/719306 [17:00:07] (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] minor [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/719306 (owner: 10GoranSMilovanovic) [17:06:59] (03PS3) 10Andrew Bogott: blubber def: first pass [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/719300 [17:11:30] (03PS4) 10Andrew Bogott: blubber def: first pass [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/719300 [17:20:27] I think the first big transfer is done, just checksumming now [17:20:48] I'll be starting a little later tomorrow btw but that transfer is in tmux on cumin1001 [17:32:41] (03PS5) 10Dduvall: blubber def: first pass [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/719300 (owner: 10Andrew Bogott) [18:00:34] (03PS6) 10Andrew Bogott: Add pipeline setup [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/719300 (https://phabricator.wikimedia.org/T210359) [18:14:49] (03CR) 10Andrew Bogott: "recheck" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/719300 (https://phabricator.wikimedia.org/T210359) (owner: 10Andrew Bogott) [18:17:27] (03CR) 10Andrew Bogott: "recheck" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/719300 (https://phabricator.wikimedia.org/T210359) (owner: 10Andrew Bogott) [18:21:37] (03CR) 10Dduvall: "recheck" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/719300 (https://phabricator.wikimedia.org/T210359) (owner: 10Andrew Bogott) [18:27:24] (03CR) 10Andrew Bogott: [C: 03+2] Add pipeline setup [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/719300 (https://phabricator.wikimedia.org/T210359) (owner: 10Andrew Bogott) [18:30:41] (03Merged) 10jenkins-bot: Add pipeline setup [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/719300 
(https://phabricator.wikimedia.org/T210359) (owner: 10Andrew Bogott) [18:31:28] (03PS1) 10Andrew Bogott: tox: use test-requirements.txt, same as the jenkins pipeline [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/719318 (https://phabricator.wikimedia.org/T210359) [18:31:46] (03PS7) 10Andrew Bogott: Added minimal page load test for '/' route [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/716558 [18:32:14] (03CR) 10jerkins-bot: [V: 04-1] tox: use test-requirements.txt, same as the jenkins pipeline [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/719318 (https://phabricator.wikimedia.org/T210359) (owner: 10Andrew Bogott) [18:33:40] (03CR) 10Andrew Bogott: "recheck" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/719318 (https://phabricator.wikimedia.org/T210359) (owner: 10Andrew Bogott) [18:34:42] (03CR) 10jerkins-bot: [V: 04-1] Added minimal page load test for '/' route [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/716558 (owner: 10Andrew Bogott) [18:36:25] (03CR) 10Andrew Bogott: [C: 03+2] tox: use test-requirements.txt, same as the jenkins pipeline [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/719318 (https://phabricator.wikimedia.org/T210359) (owner: 10Andrew Bogott) [18:38:59] (03PS8) 10Andrew Bogott: Added minimal page load test for '/' route [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/716558 [18:42:23] (03CR) 10jerkins-bot: [V: 04-1] Added minimal page load test for '/' route [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/716558 (owner: 10Andrew Bogott) [19:19:05] (03CR) 10Mholloway: [WIP] Metrics Platform schema (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: 10Jason Linehan) [19:24:28] (03PS1) 10GoranSMilovanovic: T283575 [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/719340 [19:24:42] (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] T283575 
[analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/719340 (owner: 10GoranSMilovanovic) [20:18:05] (03PS9) 10Andrew Bogott: Added minimal page load test for '/' route [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/716558 [20:19:02] (03PS10) 10Andrew Bogott: Added minimal page load test for '/' route [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/716558 [20:22:55] (03CR) 10jerkins-bot: [V: 04-1] Added minimal page load test for '/' route [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/716558 (owner: 10Andrew Bogott) [21:10:44] (03PS1) 10GoranSMilovanovic: T283570 [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/719367 [21:11:01] (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] T283570 [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/719367 (owner: 10GoranSMilovanovic) [22:22:09] 10Analytics, 10Data-Engineering, 10FR-Tech-Analytics, 10Privacy Engineering: event.WikipediaPortal referer modification - https://phabricator.wikimedia.org/T279952 (10EYener) Hi all, just piping in to say thank you for adding this to your queue! There is no urgent need to add this to this quarter's work (a... [22:26:09] 10Analytics, 10Event-Platform, 10Patch-For-Review: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (10EYener) Hi @Ottomata thanks for the ping - this is on my list of quarterly projects, and I've scheduled time out of this week to focus on it in earnest. I've read through... [22:30:44] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Product-Analytics: Upgrade Superset to 1.2 - https://phabricator.wikimedia.org/T288115 (10razzi) a:03razzi [23:49:15] (03Abandoned) 10BrandonXLF: Add stop button to running queries [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/682092 (https://phabricator.wikimedia.org/T71037) (owner: 10BrandonXLF)