[01:33:22] PROBLEM - Check unit status of monitor_refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:00:02] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic, User-razzi: Technical evaluation of Amundsen - https://phabricator.wikimedia.org/T300756 (razzi) As can be seen in the [architecture diagram](https://www.amundsen.io/amundsen/architecture/) there are 6 components: {F34941113} - D...
[07:01:48] !log kill leftover processes of decommed user on an-test-client1001
[07:01:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:04:45] Good morning team
[08:05:17] Hi btullis - for when you're online: would today be the day to pool all new AQS nodes?
[09:39:14] joal: one suggestion (if I may) when pooling/depooling nodes - always be aware that pybal is instructed to have a safe threshold of nodes down, if you cross it then it may refuse to depool more and page SRE
[09:39:45] I think that adding/removing nodes should be done incrementally and very carefully
[09:40:03] depooling is trickier, but with "inactive" (and one node at a time) I think pybal will be ok
[09:40:22] (the inactive status is not counted in the safe threshold, pooled=false is counted)
[09:40:30] ack elukey - thanks a lot for the info (forwarding to btullis, who will be acting on this)
[09:41:22] you could go 7 -> 12 and then depool the 6 old nodes one at a time (setting them inactive)
[09:41:38] or adding/removing a new/old one each time
[09:41:42] not sure what's best
[09:42:27] not sure either - I'm inclined to add all new nodes, and remove old ones next monday for instance
[09:47:26] Morning all. Yes I'm inclide to pool all remaining nodes today, then begin depooling the old ones (with inactive) on Monday.
[09:47:49] ...inclined to pool...
[09:53:33] super
[09:53:55] I can check over the weekend this time and act in case something is off :)
[09:55:07] Thanks. I'm also not far away this weekend, so I can keep a close'ish eye on it too.
[09:58:18] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (BTullis) We are now ready to carry on with this migration. The plan is to pool the remaining aqs_next servers...
[10:02:32] OK, here goes.
[10:03:06] btullis: if it fails it will be during the worst time for you, when you are away and busy, this is the rule :D
[10:04:46] Yep :-)
[10:04:59] !log pooling the remaining aqs_next nodes.
[10:05:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:08:07] https://config-master.wikimedia.org/pybal/eqiad/aqs looks good
[10:08:26] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (BTullis) Before: ` btullis@puppetmaster1001:~$ confctl select dc=eqiad,cluster=aqs get|sort {"aqs1004.eqiad.w...
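The pybal caveat elukey describes at 09:39-09:40 comes down to one invariant: nodes set to pooled=false still count against the safe depool threshold, while nodes set to inactive do not. A minimal Python sketch of that check, assuming a hypothetical 0.5 threshold and made-up node states (the real logic lives in pybal/confctl, not in a helper like this):

```python
# Illustrative only: the threshold value and the counting rules below are an
# assumption based on the 09:39-09:40 explanation, not pybal's actual code.

def can_depool(nodes, candidate, depool_threshold=0.5):
    """Return True if setting `candidate` to pooled=no would still leave enough
    nodes up. 'inactive' nodes are excluded from the calculation entirely,
    while pooled=no ('no') nodes still count toward the total."""
    counted = {name: state for name, state in nodes.items() if state != "inactive"}
    up_after = sum(1 for name, state in counted.items()
                   if state == "yes" and name != candidate)
    return up_after >= len(counted) * depool_threshold


nodes = {f"aqs10{i:02d}.eqiad.wmnet": "yes" for i in range(4, 16)}  # 12 pooled nodes
print(can_depool(nodes, "aqs1004.eqiad.wmnet"))  # True: 11 of 12 would remain up
```

This is also why setting old nodes to inactive one at a time (as suggested above) is safer than flipping them to pooled=false: inactive nodes simply drop out of the denominator instead of accumulating toward the threshold.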
[10:19:46] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (BTullis)
[10:20:49] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 3 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) Open→Resolved Pooled the new servers. Marking this investigation as complete.
[10:22:03] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 3 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) {F34941248}
[10:45:00] Analytics, Analytics-Wikistats, Data-Engineering: Offer map and queries by subdivisions of countries - https://phabricator.wikimedia.org/T284294 (A455bcd9) FYI, this ticket is a candidate on the 2022 tech wish list: https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2022/Search/Close_the_outda...
[10:45:05] Analytics, Analytics-Wikistats, Data-Engineering: "Page views by edition of Wikipedia" for each country - https://phabricator.wikimedia.org/T257071 (A455bcd9) FYI, this ticket is a candidate on the 2022 tech wish list: https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2022/Search/Close_the_ou...
[10:50:21] Analytics, Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of bumeh-ctr - https://phabricator.wikimedia.org/T300607 (BTullis) There is not a lot of data to be considered, just a few files on stat1006. ` btullis@barracuda:~/bin$ ./check_user.sh bumeh-ctr ====== stat1004 ====== to...
[11:20:47] Analytics, Analytics-Wikistats, Data-Engineering: Wikistats New Feature-Country pageview breakdown by language - https://phabricator.wikimedia.org/T250001 (A455bcd9) I think this ticket is similar (if not identical) to T257071 (and to this candidate on the 2022 tech wish list: https://meta.wikimedia....
[13:24:08] Data-Engineering, Project-Admins: Archive Analytics tag - https://phabricator.wikimedia.org/T298671 (Aklapper) I assume that #Analytics-Features is also a candidate to be archived (and removed from Herald rules)? https://phabricator.wikimedia.org/maniphest/query/osi8Qpfc0NpQ/#R
[13:40:58] hola teammm :]
[14:03:59] Hello mforns. :-)
[14:05:38] Hi mforns :)
[14:05:46] :]
[14:08:22] ottomata, milimetric - I can't make the hangtime today, conflicting meetings, sorry about that
[14:08:42] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (BTullis) p: Triage→High a: razzi→BTullis
[14:08:43] hope they're fun jo :)
[14:09:29] 'fraid I can't make the hangtime either. Doing too many things at once.
[14:09:43] milimetric: graph talking - always fun :D
[14:09:43] good luck with that btullis :S
[14:09:54] indeed, graphs of all kinds
[14:21:25] Analytics-Clusters, Analytics-Radar, Data-Engineering, Patch-For-Review: Productionize navigation vectors - https://phabricator.wikimedia.org/T174796 (Aklapper) Boldly adding #data-engineering as I'd love to know who could or should review and decide on the remaining three open patches in Gerrit....
[14:23:50] Data-Engineering, Data-Engineering-Kanban, Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (Antoine_Quhen)
[14:28:45] Data-Engineering, Airflow: Low Risk Oozie Migration: wikidata_item_page_link - https://phabricator.wikimedia.org/T300023 (Antoine_Quhen) a: Antoine_Quhen
[14:36:43] Hey team - I'm noticing weirdness with Druid indexation - In meeting now, then kids, I'll provide more details after
[14:37:06] OK.
[14:49:41] Data-Engineering, Data-Engineering-Kanban, Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (mforns) This is great @Antoine_Quhen! I agree with your default value choices. I think in Airflow2 the concurrency parameter is deprecated in favor of max_active_tasks, see: htt...
[14:53:51] (PS29) Phuedx: Basic ipinfo instrument setup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[14:55:14] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Matomo mariadb metrics are not being scraped by prometheus - https://phabricator.wikimedia.org/T299762 (BTullis) This is now fixed. Metrics are coming in and are shown in Grafana. {F34941408}
[14:57:52] Data-Engineering, Data-Engineering-Kanban, Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (Ottomata) I wonder if we should set the global `max_active_runs_per_dag` higher than 2. I could see cases where we explicitly want to do a big backfill in parallel. Since the...
[14:58:41] Data-Engineering, Data-Engineering-Kanban, Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (JAllemandou) thanks a lot for the great summary @Antoine_Quhen! I assume that we wish to set the default values you suggested, and as much as possible not use the per-dag availab...
[15:02:58] (CR) Phuedx: "Two small nitpicks inline. After they've been resolved, I'm happy to merge this." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[15:03:03] (CR) Phuedx: [C: -1] Basic ipinfo instrument setup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[15:09:46] Data-Engineering, Data-Engineering-Kanban, Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (Ottomata) Ya I'd think we could!
[15:13:45] mforns: i'm going to come back to the pyarrow stuff and focus on finishing the other airflow stuff i'm working on for now
[15:14:00] with that fix to workflow_utils, you shouldn't need to interact with fsspec at dag time
[15:14:05] right?
[15:14:22] yes, that's right
[15:14:34] I think it will be fine until we try migrating refine :]
[15:14:38] ottomata: ^
[15:14:57] mforns: why will it matter?
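For context on the T300870 thread above (14:49-15:09): in Airflow 2 the per-DAG `concurrency` argument is deprecated in favour of `max_active_tasks`, and the instance-wide defaults come from `max_active_tasks_per_dag` / `max_active_runs_per_dag` in airflow.cfg. A minimal sketch of overriding both on a single DAG; the dag_id, schedule and numeric limits are illustrative, not the defaults agreed in the task:

```python
# Sketch only: values are examples, not the T300870 defaults.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_concurrency_limits",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    max_active_runs=4,    # per-DAG override of core.max_active_runs_per_dag
    max_active_tasks=8,   # Airflow 2 replacement for the deprecated `concurrency`
    catchup=True,         # parallel backfills are why a higher max_active_runs can help
) as dag:
    BashOperator(task_id="noop", bash_command="true")
```

Setting sensible global defaults and only overriding per DAG (as suggested in the task comments) keeps backfills from starving the rest of the instance.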
[15:15:14] Data-Engineering, Airflow: Use arrow_hdfs:// fsspec protocol in workflow_utils artifact syncing - https://phabricator.wikimedia.org/T300876 (Ottomata)
[15:15:15] https://phabricator.wikimedia.org/T300876
[15:15:17] to track
[15:15:59] ottomata: because the main way of doing refine is to use pyarrow at dag definition time, to create a dynamic dag with all existing datasets
[15:16:25] oh hm right
[15:16:26] hm
[15:16:30] ok
[15:16:45] mforns: btw, if we can make fsspec and pyarrow work right
[15:16:52] i wonder, if instead of making an hdfs sensor
[15:17:04] if an FsSpecSensor might be more flexible and easier to write
[15:17:07] we could try other ways, like having a master dag that creates 1 child dag for each dataset (using pyarrow within a task)
[15:18:08] If we need it, we can always change our HdfsSensor to an fsspec sensor, or add an fsspec sensor
[15:18:27] no?
[15:18:29] ya
[15:19:00] i mean, actually, afaict as long as you use URLs that start with arrow_hdfs:// instead of hdfs://
[15:19:05] fsspec will use the new API
[15:20:37] (brb gotta restart my compy)
[15:38:55] back
[15:44:54] mforns: i wonder
[15:45:23] hmm, maybe not. haha i'm looking at ArtifactRegistry and wondering if we can do that automatically somehow
[15:45:34] because i'm also looking at the default args and dag_defaults stuff
[15:49:26] aha, but what do you mean do it automatically?
[15:49:32] ottomata: ^
[15:50:07] hmm, i take back my ping, i'm not sure what i mean, i had an idea, e.g. getting rid of the paths to the artifact config in dag_config.py
[15:50:15] but lemme mess with default_args and config stuff first
[15:50:19] i think that will inform other ideas
[15:50:33] ok :]
[15:50:40] i'm currently thinking about how to set these defaults only for our real airflow instances
[15:50:43] not for dev envs
[15:51:10] there is an env var available
[15:51:13] # AIRFLOW_INSTANCE_NAME is an arbitrary WMF Airflow concept.
[15:51:13] # We run multiple instances. User DAG code might use this to
[15:51:13] # vary dynamic config based on which instance it is running in.
[15:51:13] export AIRFLOW_INSTANCE_NAME=analytics
[15:51:39] that is set for all airflow stuff on our instances
[15:51:56] i could vary with that, if that is set, assume in wmf and load these default configs
[15:51:59] otherwise, don't load them
[15:53:06] i'm also not loving the symlink thing I did to wmf_airflow_common inside of e.g. analytics/dags
[15:53:48] hm
[15:57:09] (CR) Tchanders: Basic ipinfo instrument setup (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[16:07:13] mforns: are plugins useful for anything other than airflow UI stuff?
[16:07:29] or maybe for templating?
[16:09:15] ah right about the symlink thing. we did that so you could use wmf_airflow_common without anything more than just configuring the dag directory properly
[16:09:25] for our instances, it would be better to do it differently
[16:09:34] but, what about when using with dev instance?
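A minimal sketch of what the 15:15-15:19 exchange describes: listing datasets over HDFS at DAG-definition time via fsspec's `arrow_hdfs` protocol (backed by pyarrow's newer filesystem API), then creating one task per dataset. The HDFS host, base path, dag_id and task naming are assumptions for illustration; this is not the Refine or airflow-dags code.

```python
# Sketch only: paths and names are made up; the point is that the listing runs
# at DAG *definition* time, i.e. on every scheduler parse, as discussed above.
from datetime import datetime

import fsspec
from airflow import DAG
from airflow.operators.bash import BashOperator

# "arrow_hdfs" routes through pyarrow's newer pyarrow.fs.HadoopFileSystem
# rather than the legacy HDFS client, per the 15:19 note about URL schemes.
fs = fsspec.filesystem("arrow_hdfs", host="default")

with DAG(
    dag_id="refine_like_dynamic_dag",
    start_date=datetime(2022, 2, 1),
    schedule_interval="@hourly",
) as dag:
    for dataset_path in fs.ls("/wmf/data/raw/event", detail=False):
        dataset = dataset_path.rstrip("/").split("/")[-1]
        BashOperator(
            task_id=f"refine_{dataset}",
            bash_command=f"echo would refine {dataset_path}",
        )
```

The alternative floated at 15:17 avoids this parse-time I/O by having a parent DAG that enumerates datasets inside a task and triggers per-dataset child DAGs instead.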
[16:09:38] you'd need to have an affected PYTHONPATH
[16:09:48] and i think dev_instance doesn't do anything except for set dags_dir
[16:32:12] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (BTullis)
[16:37:29] ottomata: macros are also declared within plugins, macros to use within Airflow templating
[16:38:03] ottomata: the dev_instance script could add stuff to the PYTHONPATH if needed, no?
[17:00:30] hm - I guess everyone is in staff meeting, right?
[17:00:49] yea, joal, are we skipping standup?
[17:01:22] actually there are some folks - doing standup with present people - no big deal mforns :)
[17:02:08] ok
[17:55:02] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Event Carried State Transfer - Problem Statement - https://phabricator.wikimedia.org/T291120 (Jenlenfantwright) a: Jenlenfantwright→Ottomata
[17:57:08] mforns:
[17:57:09] yt?
[17:57:20] i think i'm ready to go over my MR and to merge it
[17:57:28] should we do it together?
[17:57:43] I've put back your original bash based operators in operators/spark_bash.py
[17:57:54] and imported them from wmf_airflow_common.factories.anomaly_detection import AnomalyDetectionDAG
[17:58:00] from AnomalyDetectionDAG *
[18:15:43] heya ottomata I'm here
[18:16:00] ottomata: yes, we can do together!
[18:16:14] ok nows good?
[18:16:47] yes! bc omw
[18:16:50] xD
[18:25:26] ok - about druid: we have 3 HiveToDruid jobs stuck from Jan 14th, and 2 oozie druid-loading subworkflows stuck from Feb 1st
[18:25:33] those are yarn apps sorry --^
[18:25:49] I'm inclined to force kill them
[18:25:56] any objection ottomata or btullis ?
[18:41:41] joal: stuck?
[18:41:54] i mean yes proceed, but i don't have any context sOooOoO :) :)
[19:05:09] ottomata: sorry was away for a minute
[19:05:55] ottomata: yes, stuck: those apps are launcher jobs for druid indexations - I don't know if it is that the indexation job failed or what, but the apps are stuck
[19:06:05] huh
[19:06:06] oh ok
[19:06:14] Ok killing them (and keeping their ids to debug after)
[19:06:22] so probably the stuff is fine (since we didn't get alerts, or if we did hopefully they were resolved)
[19:06:25] just something old was stuck
[19:06:25] okay
[19:07:10] ottomata: I think we didn't get alerts BECAUSE the stuff is stuck!
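On the plugins question at 16:07/16:37: besides UI extensions, an Airflow plugin can register macros that become available in Jinja templating. A small hypothetical sketch; the plugin name, macro and the HDFS prefix are made up for illustration, not anything that exists in airflow-dags:

```python
# Hypothetical plugin: name, macro and URI prefix are illustrative only.
from airflow.plugins_manager import AirflowPlugin


def hdfs_uri(path: str) -> str:
    """Turn an absolute HDFS path into a fully qualified URI (example prefix)."""
    return f"hdfs://analytics-hadoop{path}"


class WmfMacrosPlugin(AirflowPlugin):
    name = "wmf_macros"
    macros = [hdfs_uri]


# In an operator's templated field this would then be usable as:
#   {{ macros.wmf_macros.hdfs_uri('/wmf/data/event/navigationtiming') }}
```

Plugin macros are namespaced under the plugin's name in templates, which keeps them from colliding with Airflow's built-in `macros.*` helpers.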
Killing the apps should trigger some alerts I assume
[19:09:48] !log Kill druid-loading stuck yarn applications (3 HiveToDruid, 2 oozie launchers)
[19:09:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:11:10] ok done - let's see if alerts show up
[19:12:28] !log Kill druid indexation stuck task on Druid (from 2022-01-17T02:31)
[19:12:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:13:11] OH hh okay
[19:22:46] (PS16) Phuedx: [WIP] Metrics Platform event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[19:26:12] ok great - druid has been unlocked - trying to rerun the failed indexations
[19:26:59] (CR) Phuedx: [WIP] Metrics Platform event schema (8 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[19:27:06] !log Rerun virtualpageview-druid-daily-wf-2022-1-16
[19:27:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:32:35] !log re-running the failed refine_event job as per email.
[19:32:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:33:43] Thanks for looking into the druid issues joal. :-)
[19:34:12] Cheers btullis :)
[19:35:54] !log Rerun virtualpageview-druid-monthly-wf-2022-1
[19:35:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:18:25] mforns: still around?
[20:18:37] ottomata: yes
[20:18:41] in da cave
[20:18:47] i got a config idea, just pushed, want to see what you think
[21:14:10] okay mforns i think i want to do different MRs for these changes
[21:14:20] this config refactor will work on its own
[21:14:20] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/10
[21:14:31] i went ahead and decided to use a mergedeep library
[21:14:36] i'll get that into the airflow conda env too
[23:23:21] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic, User-razzi: Technical evaluation of Amundsen - https://phabricator.wikimedia.org/T300756 (razzi) The ssl error has been fixed, here's the current diff of the amundsen repository on stat1008.eqiad.wmnet:/srv/home/razzi/amundsen sho...
[23:29:49] could i get a `hdfs dfs -du -s /var/log/hadoop-yarn/apps`? I think that should give a ballpark on existing logs which i think are pruned at 30d?
[23:37:09] Analytics-Clusters, Wikimedia-Logstash: Evaluate storing logs from applications in yarn with the typical logging infrastructure - https://phabricator.wikimedia.org/T300937 (EBernhardson)
[23:49:08] Analytics-Clusters, Wikimedia-Logstash: Evaluate storing logs from applications in yarn with the typical logging infrastructure - https://phabricator.wikimedia.org/T300937 (EBernhardson)
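A sketch tying together two ideas from the log: keying WMF-only defaults off the AIRFLOW_INSTANCE_NAME environment variable (15:50, in the earlier exchange) and using the mergedeep library mentioned at 21:14 to deep-merge config dicts. All keys and values here are made up; this is not the airflow-dags merge request.

```python
# Sketch only: config keys/values are illustrative, not the real
# wmf_airflow_common or dag_config contents.
import os

from mergedeep import merge

WMF_DEFAULTS = {
    "default_args": {"owner": "analytics", "retries": 3},
    "max_active_tasks": 6,
}


def resolve_config(dag_config: dict) -> dict:
    """Deep-merge a DAG's config over the WMF-wide defaults on real instances;
    return it untouched in dev envs where AIRFLOW_INSTANCE_NAME is not set."""
    if os.environ.get("AIRFLOW_INSTANCE_NAME"):
        # merge() mutates and returns its first argument; later sources win on conflicts.
        return merge({}, WMF_DEFAULTS, dag_config)
    return dag_config


config = resolve_config({"default_args": {"retries": 5}})
# On a real instance this yields:
#   {"default_args": {"owner": "analytics", "retries": 5}, "max_active_tasks": 6}
```

Deep-merging (rather than dict.update) is what lets a DAG override a single nested key such as retries without clobbering the rest of default_args.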