[01:19:57] RECOVERY - Check unit status of monitor_refine_event on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:54:22] 10Analytics, 10Analytics-Wikistats, 10Data-Engineering: [Wikistats] The permanent link is broken - https://phabricator.wikimedia.org/T245445 (10Krinkle) I suspect this is due to the use of pipes and parentheses, both of which are often considered terminal characters when various applications parse links in t...
[04:56:11] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review, 10Product-Analytics (Kanban): Test log file and error notification - https://phabricator.wikimedia.org/T295733 (10Mayakp.wiki) Hi @BTullis, can you please go ahead and merge the patch to change the job run date to the 7th? Thank you for y...
[07:03:23] !log clean up wmf_auto_restart_prometheus-mysqld-exporter@matomo on matomo1002 (not used anymore, listed as failed)
[07:03:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:09:29] !log cleaned up wmf_auto_restart_prometheus-mysqld-exporter@analytics-meta on an-test-coord1001 and unmasked wmf_auto_restart_prometheus-mysqld-exporter (now used)
[07:09:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:09:56] hi folks
[07:10:02] on an-test-coord1001 there is one last issue
[07:10:03] Error 1045: Access denied for user 'prometheus'@'localhost'
[07:10:20] I assume that when the database was restored those grants got lost
[07:12:23] !log `GRANT PROCESS, REPLICATION CLIENT ON *.* TO `prometheus`@`localhost` IDENTIFIED VIA unix_socket WITH MAX_USER_CONNECTIONS 5` on an-test-coord1001 to allow the prometheus exporter to gather metrics
[07:12:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:12:36] ok now it looks better
[07:12:55] and I see metrics flowing in grafana :)
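[Editor's note: for reference, the effect of the GRANT above can be checked from the exporter's side. A minimal sketch, not part of the original log: it assumes the pymysql package and the default MariaDB socket path /run/mysqld/mysqld.sock, and it must run as the prometheus OS user, since IDENTIFIED VIA unix_socket maps the operating-system user onto the database user.]

    import pymysql

    # Connect over the local socket; unix_socket auth needs no password,
    # but the calling OS user must be 'prometheus'.
    conn = pymysql.connect(user='prometheus', unix_socket='/run/mysqld/mysqld.sock')
    try:
        with conn.cursor() as cur:
            # PROCESS and REPLICATION CLIENT should show up here if the
            # grant from the log took effect.
            cur.execute('SHOW GRANTS FOR CURRENT_USER()')
            for (grant,) in cur.fetchall():
                print(grant)
    finally:
        conn.close()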
[07:20:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[07:20:23] (DruidSegmentsUnavailable) firing: More than 5 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[07:30:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[07:30:18] (DruidSegmentsUnavailable) resolved: More than 5 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[07:53:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_01 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[07:53:18] (DruidSegmentsUnavailable) firing: More than 5 segments have been unavailable for mediawiki_history_reduced_2022_01 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[08:13:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_01 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[08:13:18] (DruidSegmentsUnavailable) resolved: More than 5 segments have been unavailable for mediawiki_history_reduced_2022_01 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[08:17:13] all of the spam
[11:28:52] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (10phuedx) A friendly bump.
[12:22:20] ebernhardson: we store logs for 40 days - the total log size is, as of now, 3.1 TB
[12:23:41] addshore: indeed - we need to tune those alerts - they fire each and every month when we load a new datasource
[12:55:14] !log Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2022-2-3
[12:55:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:02:20] 10Analytics-Clusters, 10Wikimedia-Logstash: Evaluate storing logs from applications in yarn with the typical logging infrastructure - https://phabricator.wikimedia.org/T300937 (10JAllemandou) Adding some information about logs stored in HDFS by YARN: - We keep them for 40 days - Today, 40 days of logs weigh...
[13:21:45] Hi btullis - for when you're around - https://gerrit.wikimedia.org/r/c/operations/puppet/+/759702/
[13:25:59] hola team!
[13:27:55] Hi mforns
[13:31:50] Gone for a run
[14:26:01] joal: Thanks for that. The new mediawiki_history snapshot is now live in AQS.
[14:36:32] yoohoo teamers!
[14:37:01] Hello there!
[14:44:00] (03CR) 10Ottomata: [WIP] Metrics Platform event schema (032 comments) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: 10Jason Linehan)
[14:46:14] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (10Ottomata) Nice bump! :) Been thinking about how I haven't moved here. I want to do this...
[14:50:29] 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, 10Research, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10Ottomata)
[15:01:25] thanks a lot btullis :)
[15:28:50] mforns: do you understand what template_ext is for?
[15:29:00] heya ottomata
[15:29:21] no, don't know what that is
[15:29:45] i think it is how airflow knows what kind of files to apply templates for
[15:29:57] but i think i'm doing something wrong
[15:30:00] but i'll read up
[15:30:01] :)
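[Editor's note: the guess above is essentially right. In Airflow, template_ext is a class attribute on an operator: templated fields whose values end in one of the listed extensions are treated as file paths, and the file contents are rendered as Jinja templates. A minimal Airflow 2.x sketch; the operator name and fields are illustrative, not from the log.]

    from airflow.models import BaseOperator

    class MySqlFileOperator(BaseOperator):
        # Constructor arguments that Airflow renders as Jinja templates.
        template_fields = ('sql',)
        # If a templated field's value ends in one of these extensions,
        # Airflow reads the file and renders its contents instead.
        template_ext = ('.sql', '.hql')

        def __init__(self, sql, **kwargs):
            super().__init__(**kwargs)
            self.sql = sql

        def execute(self, context):
            # By the time execute() runs, self.sql holds the rendered query,
            # whether it was passed inline or as a path to a .sql file.
            self.log.info('Would run: %s', self.sql)

[So passing sql='queries/report.sql' loads and renders that file, while a plain inline string is rendered as-is; confusing the two behaviors is a common source of "doing something wrong" here.]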
[15:30:05] ottomata: BTW, I'm reviewing your changes, and had an idea, I'm trying to write it as a comment, but it's getting long... do you have a moment for bc?
[15:30:16] sure
[15:30:27] ok, omw
[15:44:22] btullis: o/ this morning I have fixed the prometheus-mysqld-exporter status on an-test-coord1001 but I think that we'd need to do the same for both an-coords too
[15:44:41] namely unmask prometheus-mysqld-exporter, and clear prometheus-mysqld-exporter@analytics-meta (in theory)
[15:47:43] elukey: Thanks ever so much. I think you're right. I might have to leave it until Monday though.
[15:48:08] I can ack the alerts until then.
[15:48:41] btullis: I can clean up if you are ok, I have a few mins
[15:49:02] (I can be repaid in beers when we meet, don't worry)
[15:49:07] :D
[15:49:11] Oh if you wouldn't mind, that would be great.
[15:49:46] I owe you a good quantity of beer already :-)
[15:50:39] let's hope for 2022/23 to be a good moment to meet (at least DE + ML would be great)
[16:05:42] !log unmask prometheus-mysqld-exporter.service and clean up the old @analytics + wmf_auto_restart units (service+timer) not used anymore on an-coord100[12]
[16:05:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:05:52] metrics are flowing again :)
[16:10:17] mforns: ! wait one thought
[16:10:20] bc again?
[16:10:43] is there even a point in using this namespace/variants thing?
[16:10:46] ok
[16:10:50] okay ya bc
[16:10:51] :)
[16:10:52] omw
[16:52:56] 10Data-Engineering, 10Data-Services, 10cloud-services-team (Kanban): Recreate views for globalblocks table - https://phabricator.wikimedia.org/T300988 (10Zabe)
[17:01:04] ok mforns default_args MR updated
[17:01:04] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/10/diffs
[17:10:11] 10Data-Engineering, 10MediaWiki-Page-editing, 10Editing-team (Tracking), 10Performance-Team (Radar), 10Product-Analytics (Kanban): Update edits_hourly to ingest new legacy wikitext editor change tag - https://phabricator.wikimedia.org/T293406 (10MNeisler) a:05MNeisler→03ppelberg @ppelberg I've confi...
[17:15:52] 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, 10Research, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10jbond) Thanks for creating this task Andrew, just wanted to copy-paste the following from the parent task in case the...
[17:31:05] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Epic, 10User-razzi: Technical evaluation of Amundsen - https://phabricator.wikimedia.org/T300756 (10BTullis) I've now got Amundsen up and running, with an imported collection of Hive tables. If you'd like to check it out you can do: ` ssh -...
[17:32:58] 10Analytics-Clusters, 10Data-Engineering: Move the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10nshahquinn-wmf) Please consider prioritizing the upgrade of the [analytics clients](https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients) to Bullseye. It...
[17:35:02] mforns: i wonder... should we put keytab and principal in default args for spark if they are set in airflow config?
[17:35:10] then folks wouldn't have to configure it in their instances
[17:35:15] instance dag default config
[17:40:42] 10Analytics-Clusters, 10Data-Engineering: Move the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10Ottomata) @nshahquinn-wmf ` conda install python=3.9? ` ? :) I think we should upgrade anaconda-wmf to python 3.9 sometime soon.
[17:43:25] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Epic, 10User-razzi: Technical evaluation of Amundsen - https://phabricator.wikimedia.org/T300756 (10BTullis) p:05Triage→03High I will try a druid metadata import, to see how well that works.
[17:43:29] ottomata: hm, maybe not!
[17:43:40] mforns: ?
[17:44:12] ottomata: I mean maybe the keytab and principal should not be in default_args
[17:44:27] hm
[17:44:28] even if it is set for the airflow instance in airflow.cfg?
[17:44:47] but how do they get to the spark_submit command?
[17:44:58] via the keytab and principal params to the SparkSubmitOperator
[17:45:14] but how does the operator get them, if they are not passed?
[17:45:38] eh? they need to be passed, i'm suggesting we automate putting them into wmf_base_default_args
[17:45:44] aaaaah
[17:46:21] ok, makes sense!
[17:46:27] okay
[17:46:47] the airflow cfg is available from code via an api, let me get it
[17:49:25] yes i have it
[17:49:53] airflow.configuration.get('kerberos', 'keytab')
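[Editor's note: a sketch of what is being proposed above: read the instance-wide Kerberos settings out of airflow.cfg and fold them into the shared default args, so every SparkSubmitOperator picks them up without per-DAG configuration. wmf_base_default_args comes from the log; the fallback handling, DAG id, and application path are assumptions for illustration.]

    from datetime import datetime

    from airflow import DAG
    from airflow.configuration import conf
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    # The [kerberos] section of airflow.cfg holds keytab and principal when
    # the instance runs the airflow kerberos ticket renewer; falling back to
    # None for instances without Kerberos is an assumption, not from the log.
    wmf_base_default_args = {
        'keytab': conf.get('kerberos', 'keytab', fallback=None),
        'principal': conf.get('kerberos', 'principal', fallback=None),
    }

    # keytab and principal are real SparkSubmitOperator parameters, so any
    # task in a DAG built with these default_args inherits them.
    with DAG(
        dag_id='example_spark_dag',  # hypothetical
        default_args=wmf_base_default_args,
        start_date=datetime(2022, 2, 1),
        schedule_interval='@daily',
    ) as dag:
        SparkSubmitOperator(
            task_id='example_spark_job',
            application='hdfs:///path/to/job.py',  # hypothetical
        )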
[17:52:44] i really dislike airflow connections
[17:52:45] 10Analytics-Clusters, 10Data-Engineering: Move the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10nshahquinn-wmf) >>! In T288804#7683776, @Ottomata wrote: > @nshahquinn-wmf > ` > conda install python=3.9? > ` > > ? :) > > I think we should upgrade anaconda...
[17:52:47] i want to talk about how they are bad
[17:53:33] 10Analytics-Clusters, 10Data-Engineering: Move the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10Ottomata) Well, if we had it installed on all the workers, then you wouldn't have to ship your conda env every time you run a spark job and want python3.9.
[17:54:34] ottomata: hehe
[17:54:48] i think they are bad and we should not use them.
[17:54:54] ok
[17:54:59] however, we prob have to if we want to use built-in airflow hooks
[17:55:21] do they depend on connections?
[17:55:31] yes, hooks usually are meant to work with connections
[17:55:48] like, right now i'm trying to make sure that the useragent_distribution dag works in my dev instance
[17:55:53] but, i don't have a hive connection defined
[17:56:12] yea, needs to be configured
[17:56:14] it will be really annoying to have to redefine all connections for every new dev instance
[17:56:34] well, it's only if you use a new conda env..
[17:56:49] eh? no, if you use a new airflow.db
[17:56:51] if you reuse the same env, it remembers your connection config
[17:57:06] yes, assuming you don't reset the db
[17:57:09] yes, but i expect to break and scrap my dev env often
[17:57:14] yea..
[17:57:19] or move to a different node
[17:57:40] ottomata: I was trying to set the connections from the dev_instance script
[17:57:47] I think there's a way (see the sketch at the end of this log)
[17:57:56] there is, we have a connections.yaml file in the instances
[17:58:01] but it is rendered by puppet
[17:58:13] we could instead choose to keep that stuff in airflow-dags
[17:58:20] i think
[17:58:24] hm
[17:58:35] it would need to be varied somehow in some cases
[17:58:37] like in the test cluster
[17:58:37] hm
[18:13:47] mforns: i don't know how the currently running bash SparkSqlOperator works without a keytab
[18:14:01] OHH yes i do
[18:14:03] ok yes i do
[18:14:20] sorry, because spark-submit runs locally and inherits the process / airflow kerberos stuff
[18:14:20] okay
[18:16:38] mforns: pushed keytab/principal to dag_config MR
[19:18:29] mforns: if you are still around, i'd love to test some anomaly dags with you on analytics-test
[19:26:42] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review, 10Readers-Web-Backlog (Kanbanana-FY-2021-22): WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (10Ottomata) Checking in, @Jdrewniak how's this going? Can I help in any way?
[19:28:30] heya ottomata sure let's
[19:28:34] let's test
[19:28:36] okay!
[19:28:44] omw
[19:28:50] i guess we'll need the airflow env on analytics-test to have updates
[19:28:57] mostly workflow_utils and uhh deepmerge
[19:29:01] i should have done that already
[20:44:14] ottomata: I think you cannot hear me
[20:45:04] and I cannot hear you...
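[Editor's note: the connection-seeding idea from the 17:57 exchange above can be done programmatically. A minimal sketch for a dev_instance-style script, assuming the Airflow metadata DB is already initialized; the connection details are illustrative ('hive_cli_default' is the conventional default conn_id that Hive operators look up, but the hostname is a placeholder).]

    from airflow import settings
    from airflow.models import Connection

    def add_connection_if_missing(conn_id, conn_type, host=None, extra=None):
        """Insert a Connection into the metadata DB unless one already exists."""
        session = settings.Session()
        try:
            if not session.query(Connection).filter_by(conn_id=conn_id).first():
                session.add(Connection(conn_id=conn_id, conn_type=conn_type,
                                       host=host, extra=extra))
                session.commit()
        finally:
            session.close()

    # Hypothetical: seed the hive connection missing from the dev instance above.
    add_connection_if_missing('hive_cli_default', 'hive_cli',
                              host='an-coord1001.eqiad.wmnet')

[Keeping such seed definitions alongside the DAGs in airflow-dags, rather than in the puppet-rendered connections.yaml, would let each fresh airflow.db be re-seeded with one command, which is the trade-off being weighed above.]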