[10:11:02] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) All compactions from the 7th snapshot loading operation have completed successfully. S...
[10:19:56] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 3 others: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (BTullis)
[11:45:32] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 3 others: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (BTullis)
[14:30:07] btullis: o/
[14:30:12] we had 2 mariadb masters on one host, can you ever think of a need to do a master failover for only one of the instances?
[14:30:12] i had assumed this would be nice to be able to do...but now i can't think of a case where that would actually be useful
[14:32:04] What about if you had changed a setting in `/etc/mysql/my.cnf` for only one of the two instances? You could fail only this instance over to the replica, restart the service, then fail it back?
[14:32:35] each instance has its own conf file
[14:32:38] but hm
[14:33:33] i think the failover isn't so smooth though, you'd have to do a restart of mariadb to do the failover
[14:33:39] because you need to switch on/off read_only mode
[14:33:44] Yes, I just spotted that, I should have just said `my.cnf`
[14:34:03] and if we have to do that anyway, we might as well just restart the master
[14:34:22] Yeah, understood. I'm not really familiar with mariaDB failovers here yet.
[14:34:23] oh
[14:34:24] SET GLOBAL read_only = 1;
[14:34:25] no i guess not
[14:34:34] we could do it manually without a restart
[14:38:45] I've typically done live failovers of master->replica and back with Percona Resource Manager in the past: https://github.com/Percona-Lab/pacemaker-replication-agents/blob/master/doc/PRM-setup-guide.rst
[14:40:17] ...but this just automates exactly those steps. SET GLOBAL read_only = 1 on master, wait for replication lag=0, CHANGE MASTER to ..., CHANGE SLAVE TO ..., move VIP address to secondary host, SET GLOBAL read_only = 0 on replica
[14:42:24] I'm aware of orchestrator: https://wikitech.wikimedia.org/wiki/Orchestrator and https://orchestrator.wikimedia.org/web/clusters - Might that be useful in this setup?
[14:46:46] Analytics-Radar, Product-Analytics: Do the messages left for unregistered or logged-out IP editors get read by those editors? - https://phabricator.wikimedia.org/T291297 (Dbrant)
[14:46:48] Analytics-Radar, Product-Analytics (Kanban), Wikipedia-Android-App-Backlog (Android Release FY2021-22): What percentage of app editors are IP editors? - https://phabricator.wikimedia.org/T291866 (Dbrant) Open→Resolved
[14:50:03] Here is a description of how orchestrator does managed failover:
[14:50:03] https://github.com/openark/orchestrator/blob/master/docs/topology-recovery.md#graceful-master-promotion
[14:50:38] btullis: it might be
[14:50:43] but i was trying to do multi instance
[14:50:45] masters
[14:50:49] so 2 master instances on the same host
[14:51:02] and, while possible, data-persistence is very resistant to it, because they don't do it
[14:51:06] and I think that in our case we don't use virtual IPs, but we do use HAProxy to determine which out of a given set of servers is the writeable master: https://wikitech.wikimedia.org/wiki/HAProxy
[14:51:20] i almost gave up last week, then got it mostly working friday, but after a discussion today am going to give up again
[14:51:36] the reasons why we might failover a single instance instead of all instances on a host seem very rare
[14:51:54] and, while i think the puppet would be much better and cleaner if it were written to be host-agnostic
[14:52:12] fighting with it (and data-persistence) is seeming maybe not worth it
[14:53:04] OK, I see. I missed that bit. I knew that you were working on multi-instance, but didn't pick up that you'd restarted working on a similar kind of method.
[14:54:01] convo mostly happening in #wikimedia-data-persistence
[14:54:37] What are our two master instances that we have? I thought we were just moving the single master->replica DB from an-coord100[12]
[14:55:18] * btullis reading scrollback
[14:55:20] https://gerrit.wikimedia.org/r/c/operations/puppet/+/735688
[14:55:34] https://phabricator.wikimedia.org/T284150
[14:55:45] oh you aren't subscribed to that one
[14:55:46] adding you
[14:57:06] Ah, it was this statement that I had missed previously:
[14:57:06] > Each database, or at least database class (e.g. maybe all airflow databases on the same instance?), get their own MariaDB instance.
[14:58:16] Then this:
[14:58:16] > Instead, I'm leaning towards 2 instances, one for important data-metadata like hive and druid, and one for more user-facing stuff, like superset and airflow.
[15:04:47] a-team standup
[15:05:07] (reminder I'm in that business school event thing this morning, I'll be missing all meetings)
[15:06:38] Analytics, Data-Engineering, Product-Analytics, Structured-Data-Backlog, and 2 others: Create a Commons equivalent of the wikidata_entity table in the Data Lake - https://phabricator.wikimedia.org/T258834 (Gehel) p:Triage→High
[15:07:54] (PS2) MNeisler: Add the SearchSatisfaction legacy schema to the allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/715055 (https://phabricator.wikimedia.org/T274607)
[15:11:12] Analytics, Data-Engineering, Data-Engineering-Kanban, Wikidata, and 3 others: Events missing from event.rdf_streaming_updater_fetch_failure but present in /wmf/data/raw/event/eqiad.rdf-streaming-updater.fetch-failure - https://phabricator.wikimedia.org/T294361 (Gehel)
[15:14:16] joal: ottomata thanks for the support friday etc too, this is the tool you helped me breathe some life into to be presented at wikidata con https://wmde.github.io/wikidata-map/dist/index.html
[15:15:29] You can get an introduction to the data I was throwing around at https://github.com/wmde/wikidata-map/blob/master/docs/data/DATA.md
[15:20:43] addshore: cool!
[15:21:32] It could be cool to try and "productionize" the extraction of coordinates from the wikidata dumps each time they are loaded, and also backfill since 2013 :D but that's for another day
[15:24:04] (PS1) Clare Ming: Update web_ab_test_enrollment group property [schemas/event/secondary] - https://gerrit.wikimedia.org/r/735993 (https://phabricator.wikimedia.org/T292587)
[15:26:19] Analytics, Data-Engineering, Event-Platform, Internet-Archive, The-Wikipedia-Library: Store page-links-change data in a database table and make available through a Special page - https://phabricator.wikimedia.org/T221397 (odimitrijevic)
[15:26:20] ottomata: I'm now totally following the idea about getting multi-master, multi-instance working with neat and tidy puppet. I like the idea.
[15:27:24] Analytics, Data-Engineering, Event-Platform, Internet-Archive, The-Wikipedia-Library: Page-links-change stream doesn't capture duplicated links - https://phabricator.wikimedia.org/T216492 (odimitrijevic)
[15:27:37] I just don't quite understand yet what the perceived benefits would be from splitting up the DBs into 'important' and 'user-facing'.
[15:28:43] I think that the benefits from making the HA and failover systems smoother would be great, but I'm not sure why we would want more instances per se.
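Editor's sketch of the manual failover sequence listed at [14:40:17], written out as an ordered plan rather than executed against a database. Everything here is illustrative: `failover_plan` is a hypothetical helper, the example host names are the two an-coord hosts mentioned at [14:54:37], GTID-based replication is assumed, replication credentials are omitted, and the traffic switch (VIP or HAProxy) is only noted in a comment. It is not the procedure used in production.

```python
from typing import List, Tuple

# (host to run the statement on, SQL statement)
Step = Tuple[str, str]


def failover_plan(old_master: str, new_master: str) -> List[Step]:
    """Return the manual failover steps in order, as (host, statement) pairs."""
    return [
        # 1. Stop accepting writes on the current master.
        (old_master, "SET GLOBAL read_only = 1;"),
        # 2. Poll on the replica until Seconds_Behind_Master reports 0.
        (new_master, "SHOW SLAVE STATUS;"),
        # 3. Detach the replica from the old master.
        (new_master, "STOP SLAVE;"),
        # 4. Point the old master at the new one (assumes GTID replication;
        #    MASTER_USER / MASTER_PASSWORD are deliberately left out).
        (old_master,
         f"CHANGE MASTER TO MASTER_HOST='{new_master}', MASTER_USE_GTID=slave_pos;"),
        (old_master, "START SLAVE;"),
        # 5. Client traffic is redirected here (VIP or HAProxy), then
        #    writes are enabled on the new master.
        (new_master, "SET GLOBAL read_only = 0;"),
    ]


if __name__ == "__main__":
    for host, statement in failover_plan("an-coord1001", "an-coord1002"):
        print(f"{host}: {statement}")
```

Keeping the plan as plain data makes the ordering easy to review: writes are disabled on the old master first and enabled on the new master last, so there is never a moment with two writable instances.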
[15:33:12] Analytics, Data-Engineering, Event-Platform, Internet-Archive, and 3 others: page-links-change stream is assigning template propagation events to the wrong edits - https://phabricator.wikimedia.org/T216504 (Ottomata)
[15:42:15] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (odimitrijevic)
[15:42:18] btullis: i think you are right
[15:42:30] the benefits are few, esp given the effor
[15:42:31] t
[15:42:44] i think we naively wanted to do that since it seems like the cleaner thing to do
[15:43:32] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (odimitrijevic)
[15:47:12] Analytics, Data-Engineering, Data-Engineering-Kanban, Privacy Engineering: Implement Data Governance Tool - https://phabricator.wikimedia.org/T272060 (odimitrijevic)
[15:53:38] Analytics, Data-Engineering, Data-Engineering-Kanban, Privacy Engineering: Implement Data Governance Tool - https://phabricator.wikimedia.org/T272060 (odimitrijevic)
[15:58:36] Analytics-Radar, Data-Engineering, Event-Platform, Internet-Archive, The-Wikipedia-Library: Store page-links-change data in a database table and make available through a Special page - https://phabricator.wikimedia.org/T221397 (odimitrijevic)
[16:02:29] Analytics, Data-Engineering, Data-Engineering-Kanban, Event-Platform, and 4 others: page-links-change stream is assigning template propagation events to the wrong edits - https://phabricator.wikimedia.org/T216504 (mforns)
[16:03:13] Analytics-Radar, Data-Engineering, Event-Platform, Internet-Archive, The-Wikipedia-Library: Store page-links-change data in a database table and make available through a Special page - https://phabricator.wikimedia.org/T221397 (Ottomata) If you get access to the Analytics Hadoop cluster, you...
[16:03:21] Analytics-Radar, Data-Engineering, Event-Platform, Internet-Archive, The-Wikipedia-Library: Store page-links-change data in a database table and make available through a Special page - https://phabricator.wikimedia.org/T221397 (Ottomata) I think it's unlikely to get this feature implemented i...
[16:08:44] Analytics, Data-Engineering, Data-Engineering-Kanban: Data structuring guidance request - https://phabricator.wikimedia.org/T287402 (odimitrijevic) a:mforns @JAnstee_WMF can you please set up a meeting with @mforns @EChetty to discuss and tag this ticket in the agenda. Ideally any documentation/...
[16:13:44] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Epic: Analytics Presto improvements - https://phabricator.wikimedia.org/T266639 (mforns)
[16:17:03] Analytics, Analytics-Jupyter, Data-Engineering, Product-Analytics: conda list does not show all packages in environment - https://phabricator.wikimedia.org/T294368 (odimitrijevic) p:Triage→Lowest
[16:17:29] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Epic: Analytics Presto improvements - https://phabricator.wikimedia.org/T266639 (mforns)
[16:18:09] Analytics, Patch-For-Review: Presto should warn or prevent users from querying without Hive partition predicates - https://phabricator.wikimedia.org/T273004 (mforns)
[16:18:19] Analytics, Patch-For-Review: Decide whether to migrate from Presto to Trino - https://phabricator.wikimedia.org/T266640 (mforns)
[16:18:37] Analytics-Clusters, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Deploy an-test-coord1002 to facilitate failover testing of analytics coordinator role - https://phabricator.wikimedia.org/T287864 (odimitrijevic)
[16:34:02] Analytics, Data-Engineering, Product-Analytics, Epic: Reconstruct Hive & Hadoop permissions for shared database - https://phabricator.wikimedia.org/T288983 (odimitrijevic) Review a potential solution as part of https://phabricator.wikimedia.org/T288255
[16:36:50] Analytics, Analytics-Jupyter, Data-Engineering, Data-Engineering-Kanban: Autocomplete is very slow (unusable) in Newpyter - https://phabricator.wikimedia.org/T290008 (odimitrijevic) p:Triage→Medium
[16:37:42] Analytics, Analytics-Jupyter, Data-Engineering, Data-Engineering-Kanban: Autocomplete is very slow (unusable) in Newpyter - https://phabricator.wikimedia.org/T290008 (odimitrijevic)
[16:41:29] Analytics-Clusters, Analytics-Kanban, Cassandra, Data-Engineering, and 2 others: Set up a testing environment for the AQS Cassandra 3 migration - https://phabricator.wikimedia.org/T257572 (odimitrijevic) a:razzi→None @BTullis Is this task still relevant?
[16:44:19] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (odimitrijevic) a:razzi→None
[16:46:13] Analytics, Data-Engineering, Epic: Upgrade analytics-hadoop to Spark 3 + scala 2.12 - https://phabricator.wikimedia.org/T291464 (odimitrijevic)
[16:55:12] (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/715055 (https://phabricator.wikimedia.org/T274607) (owner: MNeisler)
[16:58:07] Analytics, Data-Engineering, Product-Analytics, Epic: Reconstruct Hive & Hadoop permissions for shared database - https://phabricator.wikimedia.org/T288983 (Mayakp.wiki) @odimitrijevic : T288255 seems to be a restricted task. Can you pls give me view access ? Thanks!
[17:29:54] dcausse: FYI, eventgate-main deployed with revision-create slot change
[17:39:32] (PS1) MNeisler: Add discussiontools_subscription query to sqoop [analytics/refinery] - https://gerrit.wikimedia.org/r/736021 (https://phabricator.wikimedia.org/T290516)
[21:24:35] Analytics, Platform Engineering: Replace Airflow's HDFS client (snakebite) with pyarrow - https://phabricator.wikimedia.org/T284566 (Harej)
[22:13:35] (PS1) MewOphaswongse: Add an image: update schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/736070 (https://phabricator.wikimedia.org/T294659)
[22:42:18] Analytics, Data-Engineering, Data-Engineering-Kanban: Data structuring guidance request - https://phabricator.wikimedia.org/T287402 (JAnstee_WMF) Set us up a couple weeks out!