[01:33:22] PROBLEM - Check unit status of monitor_refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:00:02] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic, User-razzi: Technical evaluation of Amundsen - https://phabricator.wikimedia.org/T300756 (razzi) As can be seen in the [architecture diagram](https://www.amundsen.io/amundsen/architecture/) there are 6 components: {F34941113} - D...
[07:01:48] !log kill leftover processes of decommed user on an-test-client1001
[07:01:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:04:45] Good morning team
[08:05:17] Hi btullis - for when you're online: would today be the day to pool all new AQS nodes?
[09:39:14] joal: one suggestion (if I may) when pooling/depooling nodes - always be aware that pybal is instructed to have a safe threshold of nodes down, if you cross it then it may refuse to depool more and page SRE
[09:39:45] I think that adding/removing nodes should be done incrementally and very carefully
[09:40:03] depooling is trickier, but with "inactive" (and one node at a time) I think pybal will be ok
[09:40:22] (the inactive status is not counted in the safe threshold, pooled=false is counted)
[09:40:30] ack elukey - thanks a lot for the info (forwarding to btullis, who will be acting on this)
[09:41:22] you could go 7 -> 12 and then depool the 6 old nodes one at a time (setting them inactive)
[09:41:38] or adding/removing a new/old one each time
[09:41:42] not sure what's best
[09:42:27] not sure either - I'm inclined to add all new nodes, and remove old ones next monday for instance
[09:47:26] Morning all. Yes I'm inclide to pool all remaining nodes today, then begin depooling the old ones (with inactive) on Monday.
[09:47:49] ...inclined to pool...
[09:53:33] super
[09:53:55] I can check over the weekend this time and act in case something is off :)
[09:55:07] Thanks. I'm also not far away this weekend, so I can keep a close'ish eye on it too.
[09:58:18] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (BTullis) We are now ready to carry on with this migration. The plan is to pool the remaining aqs_next servers...
[10:02:32] OK, here goes.
[10:03:06] btullis: if it fails it will be during the worst time for you, when you are away and busy, this is the rule :D
[10:04:46] Yep :-)
[10:04:59] !log pooling the remaining aqs_next nodes.
[10:05:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:08:07] https://config-master.wikimedia.org/pybal/eqiad/aqs looks good
[10:08:26] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (BTullis) Before: ` btullis@puppetmaster1001:~$ confctl select dc=eqiad,cluster=aqs get|sort {"aqs1004.eqiad.w...
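The pybal caveat elukey describes at 09:39-09:40 comes down to one invariant: nodes set to pooled=false still count against the safe depool threshold, while nodes set to inactive do not. A minimal Python sketch of that check, assuming a hypothetical 0.5 threshold and made-up node states (the real logic lives in pybal/confctl, not in a helper like this):

```python
# Illustrative only: the threshold value and the counting rules below are an
# assumption based on the 09:39-09:40 explanation, not pybal's actual code.

def can_depool(nodes, candidate, depool_threshold=0.5):
    """Return True if setting `candidate` to pooled=no would still leave enough
    nodes up. 'inactive' nodes are excluded from the calculation entirely,
    while pooled=no ('no') nodes still count toward the total."""
    counted = {name: state for name, state in nodes.items() if state != "inactive"}
    up_after = sum(1 for name, state in counted.items()
                   if state == "yes" and name != candidate)
    return up_after >= len(counted) * depool_threshold


nodes = {f"aqs10{i:02d}.eqiad.wmnet": "yes" for i in range(4, 16)}  # 12 pooled nodes
print(can_depool(nodes, "aqs1004.eqiad.wmnet"))  # True: 11 of 12 would remain up
```

This is also why setting old nodes to inactive one at a time (as suggested above) is safer than flipping them to pooled=false: inactive nodes simply drop out of the denominator instead of accumulating toward the threshold.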
[10:19:46] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (BTullis)
[10:20:49] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 3 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) Open→Resolved Pooled the new servers. Marking this investigation as complete.
[10:22:03] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 3 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) {F34941248}
[10:45:00] Analytics, Analytics-Wikistats, Data-Engineering: Offer map and queries by subdivisions of countries - https://phabricator.wikimedia.org/T284294 (A455bcd9) FYI, this ticket is a candidate on the 2022 tech wish list: https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2022/Search/Close_the_outda...
[10:45:05] Analytics, Analytics-Wikistats, Data-Engineering: "Page views by edition of Wikipedia" for each country - https://phabricator.wikimedia.org/T257071 (A455bcd9) FYI, this ticket is a candidate on the 2022 tech wish list: https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2022/Search/Close_the_ou...
[10:50:21] Analytics, Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of bumeh-ctr - https://phabricator.wikimedia.org/T300607 (BTullis) There is not a lot of data to be considered, just a few files on stat1006. ` btullis@barracuda:~/bin$ ./check_user.sh bumeh-ctr ====== stat1004 ====== to...
[11:20:47] Analytics, Analytics-Wikistats, Data-Engineering: Wikistats New Feature-Country pageview breakdown by language - https://phabricator.wikimedia.org/T250001 (A455bcd9) I think this ticket is similar (if not identical) to T257071 (and to this candidate on the 2022 tech wish list: https://meta.wikimedia....
[13:24:08] Data-Engineering, Project-Admins: Archive Analytics tag - https://phabricator.wikimedia.org/T298671 (Aklapper) I assume that #Analytics-Features is also a candidate to be archived (and removed from Herald rules)? https://phabricator.wikimedia.org/maniphest/query/osi8Qpfc0NpQ/#R
[13:40:58] hola teammm :]
[14:03:59] Hello mforns. :-)
[14:05:38] Hi mforns :)
[14:05:46] :]
[14:08:22] ottomata, milimetric - I can't make the hangtime today, conflicting meetings, sorry about that
[14:08:42] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (BTullis) p: Triage→High a: razzi→BTullis
[14:08:43] hope they're fun jo :)
[14:09:29] 'fraid I can't make the hangtime either. Doing too many things at once.
[14:09:43] milimetric: graph talking - always fun :D
[14:09:43] good luck with that btullis :S
[14:09:54] indeed, graphs of all kinds
[14:21:25] Analytics-Clusters, Analytics-Radar, Data-Engineering, Patch-For-Review: Productionize navigation vectors - https://phabricator.wikimedia.org/T174796 (Aklapper) Boldly adding #data-engineering as I'd love to know who could or should review and decide on the remaining three open patches in Gerrit....
[14:23:50] Data-Engineering, Data-Engineering-Kanban, Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (Antoine_Quhen)
[14:28:45] Data-Engineering, Airflow: Low Risk Oozie Migration: wikidata_item_page_link - https://phabricator.wikimedia.org/T300023 (Antoine_Quhen) a: Antoine_Quhen
[14:36:43] Hey team - I'm noticing weirdness with Druid indexation - In meeting now, then kids, I'll provide more details after
[14:37:06] OK.
[14:49:41] Data-Engineering, Data-Engineering-Kanban, Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (mforns) This is great @Antoine_Quhen! I agree with your default value choices. I think in Airflow2 the concurrency parameter is deprecated in favor of max_active_tasks, see: htt...
[14:53:51] (PS29) Phuedx: Basic ipinfo instrument setup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[14:55:14] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Matomo mariadb metrics are not being scraped by prometheus - https://phabricator.wikimedia.org/T299762 (BTullis) This is now fixed. Metrics are coming in and are shown in Grafana. {F34941408}
[14:57:52] Data-Engineering, Data-Engineering-Kanban, Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (Ottomata) I wonder if we should set the global `max_active_runs_per_dag` higher than 2. I could see cases where we explicitly want to do a big backfill in parallel. Since the...
[14:58:41] Data-Engineering, Data-Engineering-Kanban, Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (JAllemandou) thanks a lot for the great summary @Antoine_Quhen! I assume that we wish to set the default values you suggested, and as much as possible not use the per-dag availab...
[15:02:58] (CR) Phuedx: "Two small nitpicks inline. After they've been resolved, I'm happy to merge this." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[15:03:03] (CR) Phuedx: [C: -1] Basic ipinfo instrument setup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[15:09:46] Data-Engineering, Data-Engineering-Kanban, Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (Ottomata) Ya I'd think we could!
[15:13:45] mforns: i'm going to come back to the pyarrow stuff and focus on finishing the other airflow stuff i'm working on for now
[15:14:00] with that fix to workflow_utils, you shouldn't need to interact with fsspec at dag time
[15:14:05] right?
[15:14:22] yes, that's right
[15:14:34] I think it will be fine until we try migrating refine :]
[15:14:38] ottomata: ^
[15:14:57] mforns: why will it matter?
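For context on the T300870 thread above (14:49-15:09): in Airflow 2 the per-DAG `concurrency` argument is deprecated in favour of `max_active_tasks`, and the instance-wide defaults come from `max_active_tasks_per_dag` / `max_active_runs_per_dag` in airflow.cfg. A minimal sketch of overriding both on a single DAG; the dag_id, schedule and numeric limits are illustrative, not the defaults agreed in the task:

```python
# Sketch only: values are examples, not the T300870 defaults.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_concurrency_limits",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    max_active_runs=4,    # per-DAG override of core.max_active_runs_per_dag
    max_active_tasks=8,   # Airflow 2 replacement for the deprecated `concurrency`
    catchup=True,         # parallel backfills are why a higher max_active_runs can help
) as dag:
    BashOperator(task_id="noop", bash_command="true")
```

Setting sensible global defaults and only overriding per DAG (as suggested in the task comments) keeps backfills from starving the rest of the instance.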
[15:15:14] Data-Engineering, Airflow: Use arrow_hdfs:// fsspec protocol in workflow_utils artifact syncing - https://phabricator.wikimedia.org/T300876 (Ottomata)
[15:15:15] https://phabricator.wikimedia.org/T300876
[15:15:17] to track
[15:15:59] ottomata: because the main way of doing refine is to use pyarrow at dag definition time, to create a dynamic dag with all existing datasets
[15:16:25] oh hm right
[15:16:26] hm
[15:16:30] ok
[15:16:45] mforns: btw, if we can make fsspec and pyarrow work right
[15:16:52] i wonder, if instead of making an hdfs sensor
[15:17:04] if an FsSpecSensor might be more flexible and easier to write
[15:17:07] we could try other ways, like having a master dag that creates 1 child dag for each dataset (using pyarrow within a task)
[15:18:08] If we need it, we can always change our HdfsSensor to an fsspec sensor, or add an fsspec sensor
[15:18:27] no?
[15:18:29] ya
[15:19:00] i mean, actually, afaict as long as you use URLs that start with arrow_hdfs:// instead of hdfs://
[15:19:05] fsspec will use the new API
[15:20:37] (brb gotta restart my compy)
[15:38:55] back
[15:44:54] mforns: i wonder
[15:45:23] hmm, maybe not. haha i'm looking at ArtifactRegistry and wondering if we can do that automatically somehow
[15:45:34] because i'm also looking at the default args and dag_defaults stuff
[15:49:26] aha, but what do you mean do it automatically?
[15:49:32] ottomata: ^
[15:50:07] hmm, i take back my ping, i'm not sure what i mean, i had an idea, e.g. getting rid of the paths to the artifact config in dag_config.py
[15:50:15] but lemme mess with default_args and config stuff first
[15:50:19] i think that will inform other ideas
[15:50:33] ok :]
[15:50:40] i'm currently thinking about how to set these defaults only for our real airflow instances
[15:50:43] not for dev envs
[15:51:10] there is an env var available
[15:51:13] # AIRFLOW_INSTANCE_NAME is an arbitrary WMF Airflow concept.
[15:51:13] # We run multiple instances. User DAG code might use this to
[15:51:13] # vary dynamic config based on which instance it is running in.
[15:51:13] export AIRFLOW_INSTANCE_NAME=analytics
[15:51:39] that is set for all airflow stuff on our instances
[15:51:56] i could vary with that, if that is set, assume in wmf and load these default configs
[15:51:59] otherwise, don't load them
[15:53:06] i'm also not loving the symlink thing I did to wmf_airflow_common inside of e.g. analytics/dags
[15:53:48] hm
[15:57:09] (CR) Tchanders: Basic ipinfo instrument setup (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[16:07:13] mforns: are plugins useful for anything other than airflow UI stuff?
[16:07:29] or maybe for templating?
[16:09:15] ah right about the symlink thing. we did that so you could use wmf_airflow_common without anything more than just configuring the dag directory properly
[16:09:25] for our instances, it would be better to do it differently
[16:09:34] but, what about when using with dev instance?
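A minimal sketch of what the 15:15-15:19 exchange describes: listing datasets over HDFS at DAG-definition time via fsspec's `arrow_hdfs` protocol (backed by pyarrow's newer filesystem API), then creating one task per dataset. The HDFS host, base path, dag_id and task naming are assumptions for illustration; this is not the Refine or airflow-dags code.

```python
# Sketch only: paths and names are made up; the point is that the listing runs
# at DAG *definition* time, i.e. on every scheduler parse, as discussed above.
from datetime import datetime

import fsspec
from airflow import DAG
from airflow.operators.bash import BashOperator

# "arrow_hdfs" routes through pyarrow's newer pyarrow.fs.HadoopFileSystem
# rather than the legacy HDFS client, per the 15:19 note about URL schemes.
fs = fsspec.filesystem("arrow_hdfs", host="default")

with DAG(
    dag_id="refine_like_dynamic_dag",
    start_date=datetime(2022, 2, 1),
    schedule_interval="@hourly",
) as dag:
    for dataset_path in fs.ls("/wmf/data/raw/event", detail=False):
        dataset = dataset_path.rstrip("/").split("/")[-1]
        BashOperator(
            task_id=f"refine_{dataset}",
            bash_command=f"echo would refine {dataset_path}",
        )
```

The alternative floated at 15:17 avoids this parse-time I/O by having a parent DAG that enumerates datasets inside a task and triggers per-dataset child DAGs instead.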
[16:09:38] you'd need to have an affected PYTHONPATH
[16:09:48] and i think dev_instance doesn't do anything except for set dags_dir
[16:32:12] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (BTullis)
[16:37:29] ottomata: macros are also declared within plugins, macros to use within Airflow templating
[16:38:03] ottomata: the dev_instance script could add stuff to the PYTHONPATH if needed, no?
[17:00:30] hm - I guess everyone is in staff meeting, right?
[17:00:49] yea, joal, are we skipping standup?
[17:01:22] actually there are some folks - doing standup with present people - no big deal mforns :)
[17:02:08] ok
[17:55:02] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Event Carried State Transfer - Problem Statement - https://phabricator.wikimedia.org/T291120 (Jenlenfantwright) a: Jenlenfantwright→Ottomata
[17:57:08] mforns:
[17:57:09] yt?
[17:57:20] i think i'm ready to go over my MR and to merge it
[17:57:28] should we do it together?
[17:57:43] I've put back your original bash based operators in operators/spark_bash.py
[17:57:54] and imported them from wmf_airflow_common.factories.anomaly_detection import AnomalyDetectionDAG
[17:58:00] from AnomalyDetectionDAG *
[18:15:43] heya ottomata I'm here
[18:16:00] ottomata: yes, we can do together!
[18:16:14] ok nows good?
[18:16:47] yes! bc omw
[18:16:50] xD
[18:25:26] ok - about druid: we have 3 HiveToDruid jobs stuck from Jan 14th, and 2 oozie druid-loading subworkflows stuck from Feb 1st
[18:25:33] those are yarn apps sorry --^
[18:25:49] I'm inclined to force kill them
[18:25:56] any objection ottomata or btullis ?
[18:41:41] joal: stuck?
[18:41:54] i mean yes proceed, but i don't have any context sOooOoO :) :)
[19:05:09] ottomata: sorry was away for a minute
[19:05:55] ottomata: yes, stuck: those apps are launcher jobs for druid indexations - I don't know if it is that the indexation job failed or what, but the apps are stuck
[19:06:05] huh
[19:06:06] oh ok
[19:06:14] Ok killing them (and keeping their ids to debug after)
[19:06:22] so probably the stuff is fine (since we didn't get alerts, or if we did hopefully they were resolved)
[19:06:25] just something old was stuck
[19:06:25] okay
[19:07:10] ottomata: I think we didn't get alerts BECAUSE the stuff is stuck!
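On the plugins question at 16:07/16:37: besides UI extensions, an Airflow plugin can register macros that become available in Jinja templating. A small hypothetical sketch; the plugin name, macro and the HDFS prefix are made up for illustration, not anything that exists in airflow-dags:

```python
# Hypothetical plugin: name, macro and URI prefix are illustrative only.
from airflow.plugins_manager import AirflowPlugin


def hdfs_uri(path: str) -> str:
    """Turn an absolute HDFS path into a fully qualified URI (example prefix)."""
    return f"hdfs://analytics-hadoop{path}"


class WmfMacrosPlugin(AirflowPlugin):
    name = "wmf_macros"
    macros = [hdfs_uri]


# In an operator's templated field this would then be usable as:
#   {{ macros.wmf_macros.hdfs_uri('/wmf/data/event/navigationtiming') }}
```

Plugin macros are namespaced under the plugin's name in templates, which keeps them from colliding with Airflow's built-in `macros.*` helpers.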
Killing the apps should trigger some alerts I assume
[19:09:48] !log Kill druid-loading stuck yarn applications (3 HiveToDruid, 2 oozie launchers)
[19:09:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:11:10] ok done - let's see if alerts show up
[19:12:28] !log Kill druid indexation stuck task on Druid (from 2022-01-17T02:31)
[19:12:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:13:11] OH hh okay
[19:22:46] (PS16) Phuedx: [WIP] Metrics Platform event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[19:26:12] ok great - druid has been unlocked - trying to rerun the failed indexations
[19:26:59] (CR) Phuedx: [WIP] Metrics Platform event schema (8 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[19:27:06] !log Rerun virtualpageview-druid-daily-wf-2022-1-16
[19:27:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:32:35] !log re-running the failed refine_event job as per email.
[19:32:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:33:43] Thanks for looking into the druid issues joal. :-)
[19:34:12] Cheers btullis :)
[19:35:54] !log Rerun virtualpageview-druid-monthly-wf-2022-1
[19:35:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:18:25] mforns: still around?
[20:18:37] ottomata: yes
[20:18:41] in da cave
[20:18:47] i got a config idea, just pushed, want to see what you think
[21:14:10] okay mforns i think i want to do different MRs for these changes
[21:14:20] this config refactor will work on its own
[21:14:20] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/10
[21:14:31] i went ahead and decided to use a mergedeep library
[21:14:36] i'll get that into the airflow conda env too
[23:23:21] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic, User-razzi: Technical evaluation of Amundsen - https://phabricator.wikimedia.org/T300756 (razzi) The ssl error has been fixed, here's the current diff of the amundsen repository on stat1008.eqiad.wmnet:/srv/home/razzi/amundsen sho...
[23:29:49] could i get a `hdfs dfs -du -s /var/log/hadoop-yarn/apps`? I think that should give a ballpark on existing logs which i think are pruned at 30d?
[23:37:09] Analytics-Clusters, Wikimedia-Logstash: Evaluate storing logs from applications in yarn with the typical logging infrastructure - https://phabricator.wikimedia.org/T300937 (EBernhardson)
[23:49:08] Analytics-Clusters, Wikimedia-Logstash: Evaluate storing logs from applications in yarn with the typical logging infrastructure - https://phabricator.wikimedia.org/T300937 (EBernhardson)
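A sketch tying together two ideas from the log: keying WMF-only defaults off the AIRFLOW_INSTANCE_NAME environment variable (15:50, in the earlier exchange) and using the mergedeep library mentioned at 21:14 to deep-merge config dicts. All keys and values here are made up; this is not the airflow-dags merge request.

```python
# Sketch only: config keys/values are illustrative, not the real
# wmf_airflow_common or dag_config contents.
import os

from mergedeep import merge

WMF_DEFAULTS = {
    "default_args": {"owner": "analytics", "retries": 3},
    "max_active_tasks": 6,
}


def resolve_config(dag_config: dict) -> dict:
    """Deep-merge a DAG's config over the WMF-wide defaults on real instances;
    return it untouched in dev envs where AIRFLOW_INSTANCE_NAME is not set."""
    if os.environ.get("AIRFLOW_INSTANCE_NAME"):
        # merge() mutates and returns its first argument; later sources win on conflicts.
        return merge({}, WMF_DEFAULTS, dag_config)
    return dag_config


config = resolve_config({"default_args": {"retries": 5}})
# On a real instance this yields:
#   {"default_args": {"owner": "analytics", "retries": 5}, "max_active_tasks": 6}
```

Deep-merging (rather than dict.update) is what lets a DAG override a single nested key such as retries without clobbering the rest of default_args.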