[01:17:19] Analytics, Tool-Pageviews: Statistics for views of individual Wikimedia images - https://phabricator.wikimedia.org/T210313 (MusikAnimal)
[01:17:26] Analytics-Radar, Tool-Pageviews: Add ability to the pageview tool in labs to get mediarequests per file similar to existing functionality to get pageviews per page title - https://phabricator.wikimedia.org/T234590 (MusikAnimal) Open→Resolved This was resolved ~2 years ago. If there are any remain...
[09:39:45] Morning team. Today I intend to start depooling the old AQS servers, to complete the transition to the new cluster. I'll do them gradually and set them to 'inactive' to try to avoid tripping any alarms with pybal monitoring.
[09:40:36] morning! nice :)
[09:52:45] Good morning! Great btullis :)
[09:58:56] Analytics: Convert siteinfo dumps from json to parquet - https://phabricator.wikimedia.org/T244380 (JAllemandou) a:JAllemandou→None
[10:01:25] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (BTullis) I have begun the process of decommissioning the old AQS nodes. ` btullis@puppetmaster1001:~$ sudo -i c...
[10:04:17] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (BTullis)
[10:05:16] Analytics-Clusters, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 2 others: Set up a testing environment for the AQS Cassandra 3 migration - https://phabricator.wikimedia.org/T257572 (BTullis) Open→Resolved p:Triage→Medium a:BTullis I think that this item ca...
[10:12:19] (Abandoned) Joal: SiteLinks: Loads SiteLinks from Wikidata parquet file and resolves redirects.
[analytics/refinery/source] - https://gerrit.wikimedia.org/r/512721 (owner: Shilad Sen)
[10:12:30] (Abandoned) Joal: Placeholder for job to create page ids viewed in each session. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/376797 (https://phabricator.wikimedia.org/T174796) (owner: Shilad Sen)
[10:12:38] Analytics-Clusters, Analytics-Radar, Data-Engineering-Radar, Patch-For-Review: Productionize navigation vectors - https://phabricator.wikimedia.org/T174796 (JAllemandou) Open→Declined Reasons for which I think this should be abandoned: - code is using an old version of spark and would ne...
[10:12:43] (Abandoned) Joal: Spark job to create session event log appears to be working. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/377706 (https://phabricator.wikimedia.org/T174796) (owner: Shilad Sen)
[10:15:15] (CR) Joal: [C: +2] "Merging for deploy this week" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934) (owner: Joal)
[10:22:15] I depooled aqs1004 at 09:59 but only logged it to the operations channel.
[10:22:31] !log depooling aqs1005
[10:22:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:24:45] (Merged) jenkins-bot: Integrate SparkSQLNoCLIDriver and HiveToCassandra [analytics/refinery/source] - https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934) (owner: Joal)
[10:34:12] Hi btullis - I'm trying to follow your depooling on the charts but I can't manage to :(
[10:35:20] https://www.irccloud.com/pastebin/PZtMfaYb/
[10:35:51] btullis: I was trying to see query-requests drop on charts
[10:36:01] OK, that's the current state. The main thing I'm doing is watching for any errors or unusual behaviour here: https://grafana-rw.wikimedia.org/d/000000526/aqs?orgId=1&from=now-1h&to=now&refresh=30s
[10:36:25] Hmm. Charts? Not sure I know what you mean.
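A small worked sketch of what the depooling should do to traffic shares. It assumes the load balancer spreads requests evenly over whichever AQS instances are pooled, that the old cluster is the six hosts aqs1004 to aqs1009, and that the new cluster also has six hosts (consistent with the 12 Cassandra instances mentioned in this log). The function and constant names are made up for illustration.

```python
from fractions import Fraction

# Assumption: six old hosts (aqs1004..aqs1009) and six new hosts, all
# receiving an equal share of requests while they are pooled.
OLD_HOSTS = 6
NEW_HOSTS = 6


def old_cluster_share(old_depooled: int) -> Fraction:
    """Fraction of total AQS traffic still served by the old cluster."""
    old_pooled = OLD_HOSTS - old_depooled
    total_pooled = old_pooled + NEW_HOSTS
    return Fraction(old_pooled, total_pooled)


if __name__ == "__main__":
    for depooled in range(OLD_HOSTS + 1):
        share = old_cluster_share(depooled)
        print(f"{depooled} old hosts depooled: old cluster serves "
              f"{share} of traffic ({float(share):.1%})")
```

Under these assumptions the old cluster's share is 1/2 before any depooling, 2/5 once two hosts are out, and 0 once all six are out, so the per-host charts should only go fully flat at the end.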
[10:36:44] ack btullis - I'm looking at https://grafana.wikimedia.org/d/000000483/cassandra-client-request?orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=aqs&var-node=All&var-quantile=99p&from=now-6h&to=now
[10:38:55] Oh right, I see. Yes, that is interesting.
[10:39:00] I assume that the number of requests will drop on old hosts when they are all depooled
[10:45:35] Let's work it through. I have depooled 1/3 of the old AQS cluster, or 1/6 of the combined (old and new) clusters. Traffic hasn't decreased, so that existing load has been balanced across the remaining 10 AQS instances.
[10:46:48] In the same way as when we pooled a single aqs_next cluster and Cassandra reads were distributed evenly across all nodes of the cluster, I think that the same is happening here.
[10:51:42] btullis: do you wish we exchange a minute in the cave?
[10:51:48] So we have 4 nodejs AQS services pooled, each of which is receiving 1/10 of the total AQS traffic.
[10:51:48] Each of these AQS service daemons still knows about 12 cassandra instances to use for distributed reading, so I wouldn't expect to see Cassandra client requests stop immediately, but I would expect to see a reflection of the fact that the old AQS cluster is now only receiving 40% of the total traffic, rather than 50%.
[10:52:14] right - This is about my calculation as well :)
[10:52:44] Cool. Yes, give me 1 minute and I'll see you in the cave.
[11:03:10] !log depooled aqs1006
[11:03:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:41:11] !log depooled aqs1007
[11:41:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:54:45] btullis: the depooling is starting to be really visible :)
[11:56:18] Great. Have now depooled just over 66.6666666666666% of the old cluster.
 🙂
[11:59:32] !log depooled aqs1008
[11:59:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:08:06] btullis: also checking global metrics for aqs101X - the growth in usage is very small :)
[12:15:53] joal: Nice. I'm also reminded of the suggestion from hnow.lan about running `nodetool cleanup` at some point: https://phabricator.wikimedia.org/T291472#7412334
[12:16:04] good call btullis!
[12:16:45] joal: Are you ready for me to depool the last one of the old cluster?
[12:16:55] yessir
[12:17:17] !log depooled aqs1009
[12:17:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:19:23] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (BTullis) All six instances of the old AQS cluster depooled successfully.
[12:19:43] done :) https://grafana.wikimedia.org/d/000000418/cassandra?viewPanel=14&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=aqs&var-keyspace=local_group_default_T_pageviews_per_article_flat&var-table=data&var-quantile=99p
[12:21:15] Nice. I still can't explain the increase in rangeslice reads here: https://grafana.wikimedia.org/d/000000483/cassandra-client-request?orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=aqs&var-node=All&var-quantile=99p&from=now-3h&to=now
[12:21:15] but I guess that doesn't matter anyway. The rate is still low.
[12:21:53] yeah - that's weird
[12:22:34] btullis: beers on me tonight - Thanks a lot for the perseverance in having that migration done :)
[12:23:27] Likewise. Thanks for all of your help and support.
[12:42:53] btullis: this is a nice confirmation - https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=8&orgId=1&from=now-3h&to=now&var-server=aqs1004&var-datasource=thanos&var-cluster=aqs
[12:44:53] Oh yeah, that is nice.
I'm going to make a CR to remove the old aqs hosts from conftool-data. Then I need to give some thought to how we a) decommission and b) rename aqs_next to aqs
[12:45:05] But there's not too much hurry.
[14:38:33] !log merged Set spark maxPartitionBytes to hadoop dfs block size - T300299
[14:38:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:38:36] T300299: Set SparkmaxPartitionBytes to 256MB - https://phabricator.wikimedia.org/T300299
[14:38:39] cc aqu1 ^ :)
[14:38:47] will be applied everywhere over the next 30 mins
[14:39:36] ottomata: cool thanks
[14:41:01] --^ Nice. Are we expecting to see any bump in any graphs anywhere? Total number of tasks created, perhaps?
[14:44:14] i think it will only matter for spark jobs with lots of large files and partitions
[14:44:30] i don't think we have metrics about that kind of stuff, unless we click through to the individual job UIs
[14:44:48] maybe we'd see disk read IO reduced in half for those jobs
[14:45:04] but it'd be hard to separate that out in grafana for those jobs, unless it is a really huge gain
[14:47:10] maybe # of blocks read would go down?
[14:47:10] https://grafana-rw.wikimedia.org/d/000000585/hadoop?viewPanel=107&from=1644223617850&to=1644245217850&orgId=1
[14:47:13] Yeah, I think you're right.
[14:47:13] for those big spikes?
[14:47:18] but, also, maybe not! :)
[15:55:34] Quarry: Quarry suggests invalid database names, and doesn't suggest some valid database names - https://phabricator.wikimedia.org/T289943 (AntiCompositeNumber) `heartbeat_p` actually shouldn't be included in the list, because with the multi-instance databases it only includes lag for the current slice. So yo...
[16:02:47] Data-Engineering, Data-Services, cloud-services-team (Kanban): Recreate views for globalblocks table - https://phabricator.wikimedia.org/T300988 (nskaggs) @razzi Can you help with this again? Would you like to try automating via T297026 first?
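For the maxPartitionBytes change logged at 14:38, a rough sketch of why only jobs reading large files should notice. It assumes the setting in question is `spark.sql.files.maxPartitionBytes` (128 MiB by default in Spark) and uses a deliberately simplified model in which each file is cut into chunks of at most that size. Real Spark planning also weighs file open cost and available parallelism and can pack small files together, so the numbers are illustrative only. The file sizes and the function name are invented.

```python
import math

MIB = 1024 * 1024


def approx_input_splits(file_sizes_bytes, max_partition_bytes):
    """Simplified model: every file is split into chunks of at most
    max_partition_bytes, and each chunk becomes one read task."""
    return sum(
        math.ceil(size / max_partition_bytes)
        for size in file_sizes_bytes
        if size > 0
    )


if __name__ == "__main__":
    # Hypothetical inputs: 40 files of 1 GiB each, and 500 files of 10 MiB each.
    large_files = [1024 * MIB] * 40
    small_files = [10 * MIB] * 500

    for label, files in (("large", large_files), ("small", small_files)):
        before = approx_input_splits(files, 128 * MIB)
        after = approx_input_splits(files, 256 * MIB)
        print(f"{label} files: {before} splits at 128 MiB, {after} at 256 MiB")
```

In this model the split count for the large files halves while the small-file job is unchanged, which is in line with the expectation in the chat that the effect would be limited to jobs with lots of large files.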
[16:05:38] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Run Atlas on cloud services cluster - https://phabricator.wikimedia.org/T299166 (Milimetric) a:Milimetric→razzi
[16:12:32] Quarry: Quarry suggests invalid database names, and doesn't suggest some valid database names - https://phabricator.wikimedia.org/T289943 (RhinosF1) the wiki replicas exclude private + closed wikis so you'd need open - private
[16:28:55] Data-Engineering, Data-Services, cloud-services-team (Kanban): Recreate views for globalblocks table - https://phabricator.wikimedia.org/T300988 (razzi) Yeah I can do this. Unless it's a high priority I'd like to start automating this. I can't remember where I saw this, but there's this idea that to...
[16:38:30] Data-Engineering, Infrastructure-Foundations, Product-Analytics, Research, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (mpopov) Hello! I would prefer to not have an allowlist for external domains, but if the final decision is to have one...
[16:45:16] Data-Engineering, Airflow: Low Risk Oozie Migration: wikidata_json_entity - https://phabricator.wikimedia.org/T300026 (Snwachukwu) a:Snwachukwu
[17:30:47] Data-Engineering, Data-Engineering-Kanban, Airflow: Low Risk Oozie Migration: wikidata_item_page_link - https://phabricator.wikimedia.org/T300023 (Antoine_Quhen)
[17:31:39] Data-Engineering, Data-Engineering-Kanban, Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (Antoine_Quhen) a:Antoine_Quhen→Ottomata
[17:32:01] (PS1) Nettrom: Update documentation to reflect "skip all" availability [schemas/event/secondary] - https://gerrit.wikimedia.org/r/760624 (https://phabricator.wikimedia.org/T301159)
[17:35:30] Data-Engineering, Data-Engineering-Kanban, Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (Ottomata) Tell me precisely what to change and I will change it!
[17:35:48] Data-Engineering, Data-Engineering-Kanban, Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (Antoine_Quhen) + Let's set retries to 3 by default. We may remove this line: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/24931081b7133e62849a9f54bad4...
[17:41:27] Analytics, Analytics-Wikistats, Data-Engineering, Product-Analytics: Wikistats pageview data missing counts for Mobile App pageviews on Commons, going back to 2020-11 - https://phabricator.wikimedia.org/T299439 (Milimetric) Just a handy link to this data in Turnilo: https://w.wiki/4oZU
[17:42:42] heya ottomata let me know when you can pair, if there's something you want to share, otherwise, I'm going to review your latest code
[17:50:52] (PS17) Phuedx: [WIP] Metrics Platform event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[17:51:52] ok mforns making/eating lunch then we can go
[17:52:00] ok!
[17:52:00] if you want, try to run your dag with the latest code
[17:52:05] then we'll pick up from there
[17:52:21] i'm trying to build a new airflow env, but it looks like the pip resolver has changed in recent versions and it takes FOREVER
[17:52:34] so i'm going to let it keep going, or i'll use an old pip if it fails
[17:53:38] (PS18) Phuedx: [WIP] Metrics Platform event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[17:53:45] (CR) Phuedx: [WIP] Metrics Platform event schema (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[18:07:32] Data-Engineering: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (JAllemandou) Summary of the data loss analysis: * Between 2021-06-04 and 2021-11-03 we have lost 2.80% of webrequest-text, statv and e...
[18:22:43] mforns: ok i'm here les go!
[18:22:59] heyaaa, omw to bc ottomata
[18:38:32] Analytics, Analytics-Wikistats, Data-Engineering: [Wikistats] The permanent link is broken - https://phabricator.wikimedia.org/T245445 (mforns) @odimitrijevic I tested this now, and it worked for me too! I guess, since I created this task 2 years ago, this is no longer broken!
[18:42:31] (CR) Phuedx: [WIP] Metrics Platform event schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[18:42:56] (PS30) AGueyte: Basic ipinfo instrument setup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415)
[18:43:13] !log manually installing airflow_2.1.4-py3.7-2_amd64.deb on an-test-client1001
[18:43:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:45:55] (PS31) AGueyte: Basic ipinfo instrument setup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415)
[18:46:27] (CR) jerkins-bot: [V: -1] Basic ipinfo instrument setup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[19:00:38] (PS32) AGueyte: Basic ipinfo instrument setup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415)
[19:07:34] (CR) AGueyte: Basic ipinfo instrument setup (3 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[20:36:37] ok mforns back
[20:37:09] heya
[20:37:16] omw to bc
[20:55:51] Data-Engineering, Data-Services, cloud-services-team (Kanban): Recreate views for globalblocks table - https://phabricator.wikimedia.org/T300988 (razzi) a:razzi (Got the idea of prompting the user to do the manual steps from https://phabricator.wikimedia.org/phame/post/view/217/runnable_runbooks/...
[21:13:47] hey ottomata back
[21:16:22] okay!
[21:30:23] Data-Engineering, Product-Analytics: Keep canonical_data.wikis updated - https://phabricator.wikimedia.org/T241741 (Milimetric) This feels like something we should do sooner than later.
The data's not getting any more centralized by itself :)
[21:33:41] (CR) Ottomata: [WIP] Metrics Platform event schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[22:03:44] Data-Engineering, Data-Engineering-Kanban, Epic: Finish evaluation of Data Governance Options - https://phabricator.wikimedia.org/T296672 (Milimetric) With the rest of the technical evaluations wrapping up, we have a candidate, DataHub, and we don't need to revisit this list. Write-up on the evaluat...
[22:04:49] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Data Catalog Technical Evaluation - https://phabricator.wikimedia.org/T293643 (Milimetric)
[22:27:14] (CR) Nettrom: [C: +2] Update documentation to reflect "skip all" availability [schemas/event/secondary] - https://gerrit.wikimedia.org/r/760624 (https://phabricator.wikimedia.org/T301159) (owner: Nettrom)
[22:27:47] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Run Atlas on cloud services cluster - https://phabricator.wikimedia.org/T299166 (odimitrijevic) Atlas evaluation is complete. In summary, the main blocker on using Atlas is that the current version of Atlas is not compatible with our...
[22:27:55] (Merged) jenkins-bot: Update documentation to reflect "skip all" availability [schemas/event/secondary] - https://gerrit.wikimedia.org/r/760624 (https://phabricator.wikimedia.org/T301159) (owner: Nettrom)
[22:27:56] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Run Atlas on cloud services cluster - https://phabricator.wikimedia.org/T299166 (odimitrijevic) Open→Resolved
[22:27:58] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Evaluate Atlas - https://phabricator.wikimedia.org/T299165 (odimitrijevic)
[22:28:05] (CR) Nettrom: "Only changing documentation, self-reviewing should be fine."
[schemas/event/secondary] - https://gerrit.wikimedia.org/r/760624 (https://phabricator.wikimedia.org/T301159) (owner: Nettrom)
[22:35:46] Analytics, Analytics-Wikistats, Data-Engineering: [Wikistats] The permanent link is broken - https://phabricator.wikimedia.org/T245445 (odimitrijevic) p:High→Low Thanks @mforns. There is still a benefit in URL encoding the link if it is copied around. I'll add as a low priority on the wiki...
[22:38:46] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Run OpenMetadata in test cluster - https://phabricator.wikimedia.org/T300540 (Milimetric)
[22:39:53] Data-Engineering, Superset: [Spike] Test spark thrift-server for Superset - https://phabricator.wikimedia.org/T300611 (odimitrijevic) @JAllemandou Is this something that we wish to revisit as a possible longer term solution?
[22:41:41] Data-Engineering, Data-Engineering-Kanban: Add alert for varnishkafka low/zero messages per second to alertmanager - https://phabricator.wikimedia.org/T300246 (odimitrijevic) p:Triage→High Prioritizing for after the Catalog implementation.
[22:55:57] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (Milimetric)
[22:59:28] Analytics-Radar, Wikipedia-iOS-App-Backlog, iOS-app-v6.9-Carp-On-A-Zamboni: Metrics around existing Echo notifications volume - https://phabricator.wikimedia.org/T291663 (JMinor) Open→Resolved
[23:23:26] Analytics, Analytics-Kanban, Pageviews-Anomaly: Article on Carles Puigdemont has inflated pageviews in many projects - https://phabricator.wikimedia.org/T263908 (MusikAnimal) >>! In T263908#7655642, @jhsoby wrote: > What can be done about this? This would fall onto the Analytics team. I don't believ...