[01:19:57] RECOVERY - Check unit status of monitor_refine_event on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:54:22] 10Analytics, 10Analytics-Wikistats, 10Data-Engineering: [Wikistats] The permanent link is broken - https://phabricator.wikimedia.org/T245445 (10Krinkle) I suspect this is due to the use of pipes and parentheses, both of which are often considered terminal characters when various applications parse links in t...
[04:56:11] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review, 10Product-Analytics (Kanban): Test log file and error notification - https://phabricator.wikimedia.org/T295733 (10Mayakp.wiki) Hi @BTullis, can you please go ahead and merge the patch to change the job run date to the 7th? Thank you for y...
[07:03:23] !log clean up wmf_auto_restart_prometheus-mysqld-exporter@matomo on matomo1002 (not used anymore, listed as failed)
[07:03:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:09:29] !log cleaned up wmf_auto_restart_prometheus-mysqld-exporter@analytics-meta on an-test-coord1001 and unmasked wmf_auto_restart_prometheus-mysqld-exporter (now used)
[07:09:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:09:56] hi folks
[07:10:02] on an-test-coord1001 there is one last issue
[07:10:03] Error 1045: Access denied for user 'prometheus'@'localhost'
[07:10:20] I assume that when the database was restored those grants got lost
[07:12:23] !log `GRANT PROCESS, REPLICATION CLIENT ON *.* TO `prometheus`@`localhost` IDENTIFIED VIA unix_socket WITH MAX_USER_CONNECTIONS 5` on an-test-coord1001 to allow the prometheus exporter to gather metrics
[07:12:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:12:36] ok now it looks better
[07:12:55] and I see metrics flowing in grafana :)
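[Editor's note: for reference, the effect of the GRANT above can be checked from the exporter's side. A minimal sketch, not part of the original log: it assumes the pymysql package and the default MariaDB socket path /run/mysqld/mysqld.sock, and it must run as the prometheus OS user, since IDENTIFIED VIA unix_socket maps the operating-system user onto the database user.]

    import pymysql

    # Connect over the local socket; unix_socket auth needs no password,
    # but the calling OS user must be 'prometheus'.
    conn = pymysql.connect(user='prometheus', unix_socket='/run/mysqld/mysqld.sock')
    try:
        with conn.cursor() as cur:
            # PROCESS and REPLICATION CLIENT should show up here if the
            # grant from the log took effect.
            cur.execute('SHOW GRANTS FOR CURRENT_USER()')
            for (grant,) in cur.fetchall():
                print(grant)
    finally:
        conn.close()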
[07:20:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[07:20:23] (DruidSegmentsUnavailable) firing: More than 5 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[07:30:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[07:30:18] (DruidSegmentsUnavailable) resolved: More than 5 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[07:53:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_01 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[07:53:18] (DruidSegmentsUnavailable) firing: More than 5 segments have been unavailable for mediawiki_history_reduced_2022_01 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[08:13:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_01 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[08:13:18] (DruidSegmentsUnavailable) resolved: More than 5 segments have been unavailable for mediawiki_history_reduced_2022_01 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[08:17:13] all of the spam
[11:28:52] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (10phuedx) A friendly bump.
[12:22:20] ebernhardson: we store logs for 40 days - the total log size is, as of now, 3.1 TB
[12:23:41] addshore: indeed - we need to tune those alerts - they fire each and every month when we load a new datasource
[12:55:14] !log Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2022-2-3
[12:55:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:02:20] 10Analytics-Clusters, 10Wikimedia-Logstash: Evaluate storing logs from applications in yarn with the typical logging infrastructure - https://phabricator.wikimedia.org/T300937 (10JAllemandou) Adding some information about logs stored in HDFS by YARN: - We keep them for 40 days - Today, 40 days of logs weigh...
[13:21:45] Hi btullis - for when you're around - https://gerrit.wikimedia.org/r/c/operations/puppet/+/759702/
[13:25:59] hola team!
[13:27:55] Hi mforns
[13:31:50] Gone for a run
[14:26:01] joal: Thanks for that. The new mediawiki_history snapshot is now live in AQS.
[14:36:32] yoohoo teamers!
[14:37:01] Hello there!
[14:44:00] (03CR) 10Ottomata: [WIP] Metrics Platform event schema (032 comments) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: 10Jason Linehan)
[14:46:14] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (10Ottomata) Nice bump! :) Been thinking about how I haven't moved here. I want to do this...
[14:50:29] 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, 10Research, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10Ottomata)
[15:01:25] thanks a lot btullis :)
[15:28:50] mforns: do you understand what template_ext is for?
[15:29:00] heya ottomata
[15:29:21] no, don't know what that is
[15:29:45] i think it is how airflow knows what kind of files to apply templates for
[15:29:57] but i think i'm doing something wrong
[15:30:00] but i'll read up
[15:30:01] :)
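[Editor's note: the guess above is essentially right. In Airflow, template_ext is a class attribute on an operator: templated fields whose values end in one of the listed extensions are treated as file paths, and the file contents are rendered as Jinja templates. A minimal Airflow 2.x sketch; the operator name and fields are illustrative, not from the log.]

    from airflow.models import BaseOperator

    class MySqlFileOperator(BaseOperator):
        # Constructor arguments that Airflow renders as Jinja templates.
        template_fields = ('sql',)
        # If a templated field's value ends in one of these extensions,
        # Airflow reads the file and renders its contents instead.
        template_ext = ('.sql', '.hql')

        def __init__(self, sql, **kwargs):
            super().__init__(**kwargs)
            self.sql = sql

        def execute(self, context):
            # By the time execute() runs, self.sql holds the rendered query,
            # whether it was passed inline or as a path to a .sql file.
            self.log.info('Would run: %s', self.sql)

[So passing sql='queries/report.sql' loads and renders that file, while a plain inline string is rendered as-is; confusing the two behaviors is a common source of "doing something wrong" here.]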
[15:30:05] ottomata: BTW, I'm reviewing your changes, and had an idea, I'm trying to write it as a comment, but it's getting long... do you have a moment for bc?
[15:30:16] sure
[15:30:27] ok, omw
[15:44:22] btullis: o/ this morning I have fixed the prometheus-mysqld-exporter status on an-test-coord1001 but I think that we'd need to do the same for both an-coords too
[15:44:41] namely unmask prometheus-mysqld-exporter, and clear prometheus-mysqld-exporter@analytics-meta (in theory)
[15:47:43] elukey: Thanks ever so much. I think you're right. I might have to leave it until Monday though.
[15:48:08] I can ack the alerts until then.
[15:48:41] btullis: I can clean up if you are ok, I have a few mins
[15:49:02] (I can be repaid in beers when we meet, don't worry)
[15:49:07] :D
[15:49:11] Oh if you wouldn't mind, that would be great.
[15:49:46] I owe you a good quantity of beer already :-)
[15:50:39] let's hope for 2022/23 to be a good moment to meet (at least DE + ML would be great)
[16:05:42] !log unmask prometheus-mysqld-exporter.service and clean up the old @analytics + wmf_auto_restart units (service+timer) not used anymore on an-coord100[12]
[16:05:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:05:52] metrics are flowing again :)
[16:10:17] mforns: ! wait one thought
[16:10:20] bc again?
[16:10:43] is there even a point in using this namespace/variants thing?
[16:10:46] ok
[16:10:50] okay ya bc
[16:10:51] :)
[16:10:52] omw
[16:52:56] 10Data-Engineering, 10Data-Services, 10cloud-services-team (Kanban): Recreate views for globalblocks table - https://phabricator.wikimedia.org/T300988 (10Zabe)
[17:01:04] ok mforns default_args MR updated
[17:01:04] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/10/diffs
[17:10:11] 10Data-Engineering, 10MediaWiki-Page-editing, 10Editing-team (Tracking), 10Performance-Team (Radar), 10Product-Analytics (Kanban): Update edits_hourly to ingest new legacy wikitext editor change tag - https://phabricator.wikimedia.org/T293406 (10MNeisler) a:05MNeisler→03ppelberg @ppelberg I've confi...
[17:15:52] 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, 10Research, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10jbond) Thanks for creating this task Andrew, just wanted to copy-paste the following from the parent task in case the...
[17:31:05] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Epic, 10User-razzi: Technical evaluation of Amundsen - https://phabricator.wikimedia.org/T300756 (10BTullis) I've now got Amundsen up and running, with an imported collection of Hive tables. If you'd like to check it out you can do: ` ssh -...
[17:32:58] 10Analytics-Clusters, 10Data-Engineering: Move the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10nshahquinn-wmf) Please consider prioritizing the upgrade of the [analytics clients](https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients) to Bullseye. It...
[17:35:02] mforns: i wonder... should we put keytab and principal in default args for spark if they are set in airflow config?
[17:35:10] then folks wouldn't have to configure it in their instances
[17:35:15] instance dag default config
[17:40:42] 10Analytics-Clusters, 10Data-Engineering: Move the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10Ottomata) @nshahquinn-wmf ` conda install python=3.9? ` ? :) I think we should upgrade anaconda-wmf to python 3.9 sometime soon.
[17:43:25] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Epic, 10User-razzi: Technical evaluation of Amundsen - https://phabricator.wikimedia.org/T300756 (10BTullis) p:05Triage→03High I will try a druid metadata import, to see how well that works.
[17:43:29] ottomata: hm, maybe not!
[17:43:40] mforns: ?
[17:44:12] ottomata: I mean maybe the keytab and principal should not be in default_args
[17:44:27] hm
[17:44:28] even if it is set for the airflow instance in airflow.cfg?
[17:44:47] but how do they get to the spark_submit command?
[17:44:58] via the keytab and principal params to the SparkSubmitOperator
[17:45:14] but how does the operator get them, if they are not passed?
[17:45:38] eh? they need to be passed, i'm suggesting we automate putting them into wmf_base_default_args
[17:45:44] aaaaah
[17:46:21] ok, makes sense!
[17:46:27] okay
[17:46:47] the airflow cfg is available from code via an api, let me get it
[17:49:25] yes i have it
[17:49:53] airflow.configuration.get('kerberos', 'keytab')
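[Editor's note: a sketch of what is being proposed above: read the instance-wide Kerberos settings out of airflow.cfg and fold them into the shared default args, so every SparkSubmitOperator picks them up without per-DAG configuration. wmf_base_default_args comes from the log; the fallback handling, DAG id, and application path are assumptions for illustration.]

    from datetime import datetime

    from airflow import DAG
    from airflow.configuration import conf
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    # The [kerberos] section of airflow.cfg holds keytab and principal when
    # the instance runs the airflow kerberos ticket renewer; falling back to
    # None for instances without Kerberos is an assumption, not from the log.
    wmf_base_default_args = {
        'keytab': conf.get('kerberos', 'keytab', fallback=None),
        'principal': conf.get('kerberos', 'principal', fallback=None),
    }

    # keytab and principal are real SparkSubmitOperator parameters, so any
    # task in a DAG built with these default_args inherits them.
    with DAG(
        dag_id='example_spark_dag',  # hypothetical
        default_args=wmf_base_default_args,
        start_date=datetime(2022, 2, 1),
        schedule_interval='@daily',
    ) as dag:
        SparkSubmitOperator(
            task_id='example_spark_job',
            application='hdfs:///path/to/job.py',  # hypothetical
        )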
[17:52:44] i really dislike airflow connections
[17:52:45] 10Analytics-Clusters, 10Data-Engineering: Move the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10nshahquinn-wmf) >>! In T288804#7683776, @Ottomata wrote: > @nshahquinn-wmf > ` > conda install python=3.9? > ` > > ? :) > > I think we should upgrade anaconda...
[17:52:47] i want to talk about how they are bad
[17:53:33] 10Analytics-Clusters, 10Data-Engineering: Move the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10Ottomata) Well, if we had it installed on all the workers, then you wouldn't have to ship your conda env every time you run a spark job and want python3.9.
[17:54:34] ottomata: hehe
[17:54:48] i think they are bad and we should not use them.
[17:54:54] ok
[17:54:59] however, we prob have to if we want to use built-in airflow hooks
[17:55:21] do they depend on connections?
[17:55:31] yes, hooks usually are meant to work with connections
[17:55:48] like, right now i'm trying to make sure that the useragent_distribution dag works in my dev instance
[17:55:53] but, i don't have a hive connection defined
[17:56:12] yea, needs to be configured
[17:56:14] it will be really annoying to have to redefine all connections for every new dev instance
[17:56:34] well, it's only if you use a new conda env..
[17:56:49] eh? no, if you use a new airflow.db
[17:56:51] if you reuse the same env, it remembers your connection config
[17:57:06] yes, assuming you don't reset the db
[17:57:09] yes, but i expect to break and scrap my dev env often
[17:57:14] yea..
[17:57:19] or move to a different node
[17:57:40] ottomata: I was trying to set the connections from the dev_instance script
[17:57:47] I think there's a way (see the sketch at the end of this log)
[17:57:56] there is, we have a connections.yaml file in the instances
[17:58:01] but it is rendered by puppet
[17:58:13] we could instead choose to keep that stuff in airflow-dags
[17:58:20] i think
[17:58:24] hm
[17:58:35] it would need to be varied somehow in some cases
[17:58:37] like in the test cluster
[17:58:37] hm
[18:13:47] mforns: i don't know how the currently running bash SparkSqlOperator works without a keytab
[18:14:01] OHH yes i do
[18:14:03] ok yes i do
[18:14:20] sorry, because spark-submit runs locally and inherits the process / airflow kerberos stuff
[18:14:20] okay
[18:16:38] mforns: pushed keytab/principal to dag_config MR
[19:18:29] mforns: if you are still around, i'd love to test some anomaly dags with you on analytics-test
[19:26:42] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review, 10Readers-Web-Backlog (Kanbanana-FY-2021-22): WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (10Ottomata) Checking in, @Jdrewniak how's this going? Can I help in any way?
[19:28:30] heya ottomata sure let's
[19:28:34] let's test
[19:28:36] okay!
[19:28:44] omw
[19:28:50] i guess we'll need the airflow env on analytics-test to have updates
[19:28:57] mostly workflow_utils and uhh deepmerge
[19:29:01] i should have done that already
[20:44:14] ottomata: I think you cannot hear me
[20:45:04] and I cannot hear you...
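[Editor's note: the connection-seeding idea from the 17:57 exchange above can be done programmatically. A minimal sketch for a dev_instance-style script, assuming the Airflow metadata DB is already initialized; the connection details are illustrative ('hive_cli_default' is the conventional default conn_id that Hive operators look up, but the hostname is a placeholder).]

    from airflow import settings
    from airflow.models import Connection

    def add_connection_if_missing(conn_id, conn_type, host=None, extra=None):
        """Insert a Connection into the metadata DB unless one already exists."""
        session = settings.Session()
        try:
            if not session.query(Connection).filter_by(conn_id=conn_id).first():
                session.add(Connection(conn_id=conn_id, conn_type=conn_type,
                                       host=host, extra=extra))
                session.commit()
        finally:
            session.close()

    # Hypothetical: seed the hive connection missing from the dev instance above.
    add_connection_if_missing('hive_cli_default', 'hive_cli',
                              host='an-coord1001.eqiad.wmnet')

[Keeping such seed definitions alongside the DAGs in airflow-dags, rather than in the puppet-rendered connections.yaml, would let each fresh airflow.db be re-seeded with one command, which is the trade-off being weighed above.]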