[00:11:24] <wikibugs>	 10Analytics, 10Event-Platform: Problem with delay caused by input-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (10Reedy)
[00:17:11] <wikibugs>	 10Analytics, 10Event-Platform, 10Browser-Support-Microsoft-Edge: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (10Reedy)
[01:04:23] <wikibugs>	 10Analytics, 10Product-Analytics, 10Readers-Web-Backlog (Kanbanana-FY-2021-22): Lower sampling rate for MobileWebUIClickTracking on English Wikipedia - https://phabricator.wikimedia.org/T295432 (10Jdlrobson)
[01:04:29] <wikibugs>	 10Analytics, 10Product-Analytics, 10Readers-Web-Backlog (Kanbanana-FY-2021-22): Lower sampling rate for MobileWebUIClickTracking on English Wikipedia - https://phabricator.wikimedia.org/T295432 (10Jdlrobson) p:05Triage→03High
[01:05:20] <wikibugs>	 10Analytics, 10Product-Analytics, 10Readers-Web-Backlog (Kanbanana-FY-2021-22): Lower sampling rate for MobileWebUIClickTracking on English Wikipedia - https://phabricator.wikimedia.org/T295432 (10Jdlrobson)
[01:09:23] <wikibugs>	 10Analytics, 10Product-Analytics, 10Patch-For-Review, 10Readers-Web-Backlog (Kanbanana-FY-2021-22): Lower sampling rate for MobileWebUIClickTracking on English Wikipedia - https://phabricator.wikimedia.org/T295432 (10Jdlrobson) p:05Unbreak!→03High
[01:09:31] <wikibugs>	 10Analytics, 10Product-Analytics, 10Patch-For-Review, 10Readers-Web-Backlog (Kanbanana-FY-2021-22): Lower sampling rate for MobileWebUIClickTracking on English Wikipedia - https://phabricator.wikimedia.org/T295432 (10Jdlrobson) p:05High→03Unbreak!
[01:09:45] <wikibugs>	 10Analytics, 10Product-Analytics, 10Patch-For-Review, 10Readers-Web-Backlog (Kanbanana-FY-2021-22): Lower sampling rate for MobileWebUIClickTracking on English Wikipedia - https://phabricator.wikimedia.org/T295432 (10Jdlrobson) (This will become UBN on Thursday unless https://gerrit.wikimedia.org/r/737814...
[01:11:13] <wikibugs>	 10Analytics, 10Product-Analytics, 10Patch-For-Review, 10Readers-Web-Backlog (Kanbanana-FY-2021-22): Lower sampling rate for MobileWebUIClickTracking on English Wikipedia before wmf8 is on English Wikipedia - https://phabricator.wikimedia.org/T295432 (10Jdlrobson)
[01:34:35] <wikibugs>	 10Analytics, 10Event-Platform, 10Browser-Support-Microsoft-Edge: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (10Downsize43) yes, that is correct - sorry.  Sent from my iPad
[04:24:40] <icinga-wm>	 PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:37:57] <wikibugs>	 (03PS1) 10Gergő Tisza: image_suggestion_interaction: update documentation [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/737850 (https://phabricator.wikimedia.org/T294669)
[09:01:02] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] image_suggestion_interaction: update documentation [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/737850 (https://phabricator.wikimedia.org/T294669) (owner: 10Gergő Tisza)
[09:01:45] <wikibugs>	 (03Merged) 10jenkins-bot: image_suggestion_interaction: update documentation [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/737850 (https://phabricator.wikimedia.org/T294669) (owner: 10Gergő Tisza)
[09:18:16] <wikibugs>	 10Analytics, 10Event-Platform, 10Browser-Support-Microsoft-Edge: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (10Aklapper) @Downsize43: Does this also happen with another internet provider? Are there any add-ons or extensions installed in your browse...
[10:19:55] <btullis>	 !log btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed
[10:19:57] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:27:35] <icinga-wm>	 RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:01:18] <wikibugs>	 10Analytics-Radar, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Sprint-2021-11-10: Review existing dashboards and metrics for maps - https://phabricator.wikimedia.org/T295315 (10awight)
[11:14:19] <wikibugs>	 10Analytics-Radar, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Sprint-2021-11-10: Review existing dashboards and metrics for maps - https://phabricator.wikimedia.org/T295315 (10awight)
[11:40:18] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (10BTullis) >> Maybe I should just create the local snapshot on an-coord1001 > I don't thin...
[13:34:23] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "Simpler logic - I like it! Merge as you wish :)" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/735444 (https://phabricator.wikimedia.org/T294361) (owner: 10Ottomata)
[13:37:00] <btullis>	 Have we got to be deployed at the moment? I can't see anything in that column Kanban, nor anything on the etherpad. Does anyone have anything? Apologies for not fitting it in yesterday.
[13:37:40] <joal>	 Hi btullis - maybe ottomata will wish his change I just reviewed to be deployed? Nothing for me anyhow :)
[13:38:46] <btullis>	 joal: Thanks. Yes, that message above was what jogged my memory :-)
[13:43:24] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (10BTullis) The backup completed successfully.  Now preparing the backup with: ` root@db110...
[13:55:53] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (10BTullis) Set downtime for both MariaDB instances on db1108. Stopped slave on both instan...
[13:55:56] <ottomata>	 btullis: backup!  you are a hero!
[13:56:01] <ottomata>	 we should probably sync the settings though, right?
[13:56:18] <ottomata>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/736780
[13:57:05] <btullis>	 Yes, I was just about to merge it. Do you want to do it?
[13:57:28] <btullis>	 Still referring to your guide here: https://phabricator.wikimedia.org/T279440#7481773
[13:57:50] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Refine - don't remove records during deduplication if ids are null [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/735444 (https://phabricator.wikimedia.org/T294361) (owner: 10Ottomata)
[13:58:16] <btullis>	 ottomata: I merged it.
[13:58:16] <joal>	 good morning ottomata - Would now be a good time for a quick sync on gobblin?
[13:58:18] <ottomata>	 btullis:  if you are about to do all that then merge at will!
[13:58:19] <ottomata>	 thank you.
[13:58:38] <ottomata>	 btullis: /joal i just merged my refinery-source patch, no urgency on that but it i woulnn't mind if it went out on a train
[13:58:43] <ottomata>	 but it can wait too
[13:58:55] <ottomata>	 joal: 10 mins?
[13:59:04] <joal>	 sure ottomata 
[14:01:23] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (10BTullis) * Merged and deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/7367...
[14:15:30] <ottomata>	 joal ok bc?
[14:15:35] <joal>	 yes ottomata!
[14:19:34] <elukey>	 first kafka broker running fine with the kafka PKI! \o/
[14:19:38] <elukey>	 (kafka-test1006)
[14:19:49] <btullis>	 elukey: Awesome!
[14:20:32] <elukey>	 credits to John for the amazing puppet work
[14:20:52] <elukey>	 it seems that we can run easily mixed certs (puppet based and pki based)
[14:21:10] <elukey>	 clients need to be instructed to trust both CAs
[14:28:37] <elukey>	 query jbond 
[14:28:39] <elukey>	 uff
[14:34:40] <elukey>	 btullis: I was thinking about https://gerrit.wikimedia.org/r/c/operations/puppet/+/708739, maybe in the future we could move the hostname CN TLS certs (like an-test-presto1001) to a dedicated intermediate CA, either a data-engineering one or something dedicated to Presto/Hadoop/etc..
[14:34:45] <elukey>	 what do you think?
[14:35:01] <elukey>	 afaics the cert issued for an-test-presto1001 is issued by the discovery intermediate
[14:35:36] <elukey>	 (not even sure the long term plans for presto so we can postpone this later on)
[14:37:56] <wikibugs>	 (03CR) 10MNeisler: Add discussiontools_subscription query to sqoop (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/736021 (https://phabricator.wikimedia.org/T290516) (owner: 10MNeisler)
[14:38:15] <wikibugs>	 10Analytics-Radar, 10Data-Engineering, 10Event-Platform, 10SRE, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) Finally kafka-test1006 is running with a PKI kafka intermediate cert, and the rest of the cluster works...
[14:58:59] <wikibugs>	 (03CR) 10Milimetric: Add discussiontools_subscription query to sqoop (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/736021 (https://phabricator.wikimedia.org/T290516) (owner: 10MNeisler)
[14:59:31] <btullis>	 elukey: ack. Sorry, was in a meeting. Yes I think that having our own intermediate CA would be a good idea. The discovery one isn't ideal, but not urgent either. I think that all the experience you're getting with the Kafka stuff is going to make this much easier.
[15:00:15] <elukey>	 ack
[15:00:22] <elukey>	 also for hadoop and druid
[15:02:02] <ottomata>	 elukey:  nice! (PKI kafka!)
[15:02:51] <ottomata>	 elukey:  maybe you can tell us how it all works in sre sync today!
[15:05:53] <joal>	 ottomata: about deploy, let's wait for next week (totmorrow is bank holiday, let's not deploy today) - also ottomata - could you add a line in the dploy etherpad so that I don't forget it?
[15:07:55] <ottomata>	 joal:  yeah sounds good
[15:07:57] <ottomata>	 yes will do
[15:10:30] <ottomata>	 btullis: 
[15:10:35] <ottomata>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/737930/1/modules/profile/templates/mariadb/mysqld_config/analytics_multiinstance.my.cnf.erb
[15:10:39] <ottomata>	 we should make that value work, no?
[15:11:21] <btullis>	 It's already in this fragment, so I thought it was unnecessary:
[15:11:26] <btullis>	 https://www.irccloud.com/pastebin/259b1W6Z/
[15:11:46] <ottomata>	 hmmmm Oh
[15:11:49] <ottomata>	 oh ok
[15:11:55] <ottomata>	 phew, yeah that sounds fine then.
[15:12:01] <ottomata>	 no need for default in global cnf
[15:12:18] <btullis>	 Oh, but there is another parsing error too:
[15:12:23] <btullis>	 `Nov 10 15:10:28 db1108 mysqld[13073]: 2021-11-10 15:10:28 0 [ERROR] /opt/wmf-mariadb104/bin/mysqld: option '--innodb-large-prefix' requires an argument
[15:12:23] <btullis>	 `
[15:13:01] <ottomata>	 should = 1
[15:13:14] <ottomata>	 can change in global file i think
[15:13:33] <btullis>	 OK, thanks.
[15:24:23] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (10BTullis) There were a couple of syntax errors in the `/etc/my.cnf` file that had been ca...
[15:39:30] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (10BTullis) ` MariaDB [(none)]> CHANGE MASTER TO MASTER_HOST='an-coord1001.eqiad.wmnet',MAS...
[15:40:30] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (10BTullis) Oh, that didn't quite work. ` Last_IO_Error: Got fatal error 1236 from master w...
[15:50:28] <ottomata>	 btullis:  lemme know if i can help
[15:50:35] <ottomata>	 is it possibly becauase the binlog filenames have changed?
[15:52:26] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (10BTullis) I reset the GTID position and restarted the slave with: ` MariaDB [(none)]> sto...
[15:52:59] <btullis>	 No, it's fine now, thanks. I hadn't set the GTID position properly. I think it's all good now.
[15:53:34] <ottomata>	 ok phew
[15:55:28] <btullis>	 I see that we're excluding the superset_staging database from replication. `Replicate_Wild_Ignore_Table: superset\_staging.%`
[15:56:50] <ottomata>	 yes
[15:57:08] <ottomata>	 actually, it probably only needs excluded on db1108 for backup
[15:57:17] <ottomata>	 it probalby shoudl be replicated to failover on an-coord1002 (an-db1002)
[15:58:06] <btullis>	 Yeah, I just wonder why we exclude it at all, really. But it's not a biggie.
[15:58:54] <ottomata>	 yeah i guess, we don't have to?  can't hurt to backup, even if we never restore it
[15:59:52] <elukey>	 btullis: I had to do it in the past since testing superset releases often led to weird replication issues
[16:00:12] <elukey>	 and I was more free to use it as playground without worrying to much
[16:00:55] <elukey>	 (dropping dbs, restoring, etc.. way easier without risking to break replication)
[16:01:26] <ottomata>	 btullis:  if you have space, would appreciate a puppet brain bounce..mostly about some  weird stuff i'm doing to DRY airflow deployments...not sure if it is worth the complexity
[16:01:32] <ottomata>	 no urgency though if you are still doing db stuff
[16:01:35] <btullis>	 Yeah, fair enough. I've just always stayed away from replication filters in the past, but anyway.
[16:01:57] <btullis>	 No, I think I'm done with the DB for now. Happy to look at puppet with you. Batcave?
[16:02:00] <ottomata>	 ya
[16:02:11] <ottomata>	 1 min
[16:02:41] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (10BTullis) All looks OK. Removed downtimes on services.
[16:03:49] <ottomata>	 k btullis 
[16:06:00] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (10BTullis) Removed the backups. ` btullis@db1108:/srv$ sudo rm -rf sqldata.analytics_meta....
[16:23:00] <tanny411>	 I've been getting a lot of pyspark errors recently (Lost Task, Container exited with a non-zero exit code 1. Error file: prelaunch.err.) that fills the entire notebook. (using jupyterhub). I am not sure how to start to get this addressed, please let me know. 
[16:23:00] <tanny411>	 Sometimes the code does execute successfully and I know it because the file get saves (I had run a query and saved the results in a file) but the long error messages still showed up. 
[16:23:00] <tanny411>	 At the moment, I am getting "timeout waiting for task" and the cell is still running. I am not sure if there's an error, something is stuck, or something else....
[16:25:20] <ottomata>	 tanny411:  i betcah that is because the cluster looks pretty busy right now
[16:25:50] <ottomata>	 hmmm actually i take it back, no not that busy, 
[16:25:55] <ottomata>	 > half full
[16:25:58] <ottomata>	 greater than
[16:26:25] <ottomata>	 but, if in the past few days / week, it will have been periodicaly very busy, as there are a lot of beginning of  month computations that aree done
[16:26:44] <ottomata>	 mforns: o/
[16:26:51] <ottomata>	 where inside of airflow-dags will team specific dags be?
[16:27:00] <ottomata>	 dags/<team-name>
[16:27:00] <mforns>	 I believe on the root?
[16:27:00] <ottomata>	 ?
[16:27:03] <ottomata>	 or
[16:27:07] <ottomata>	 <team-name>-dags
[16:27:08] <ottomata>	 ?
[16:27:19] <ottomata>	 dags-team-name
[16:27:19] <ottomata>	 ?
[16:27:25] <wikibugs>	 10Analytics: Data drifts between superset_production on an-coord1001 and db1108 - https://phabricator.wikimedia.org/T279440 (10BTullis) I have rebuilt the replica on db1108 from a backup of an-coord1001, so that should all be good now and we will be in a better place in terms of having known good backups.
[16:27:25] <mforns>	  /team-name/dags
[16:27:36] <ottomata>	 ok
[16:27:41] <mforns>	 also potentially  /team-name/tests
[16:27:46] <ottomata>	 got it
[16:27:49] <mforns>	 and  /team-name/others
[16:27:55] <mforns>	 other stuff
[16:28:07] <mforns>	 does it make sense to you?
[16:28:09] <ottomata>	 so, since we will have a scap target deployment for each instance
[16:28:12] <ottomata>	 the full path will end up being
[16:28:15] <ottomata>	 e.g. for research
[16:28:42] <ottomata>	  /srv/deployment/airflow-dags/research/research/dags
[16:28:53] <ottomata>	   /srv/deployment/airflow-dags/research is path to repo checkout
[16:29:01] <ottomata>	 since the checkout is per instance too
[16:29:20] <ottomata>	 that look ok?
[16:29:22] <btullis>	 tanny411: Is this new code, or have these queries been working successfully before? I'm wondering if it's something to do with the queries themselves, or the cluster. 
[16:31:26] <tanny411>	 humm... the queries are new. but success/failure of the queries doesn't seem to be consistent. Also its  weird the query results come ingrained in a swarm of error messages when successful. Hard to keep that output in a notebook.
[16:38:56] <tanny411>	 Does this have to do something with changing verbosity level of spark tasks? 
[16:45:01] <wikibugs>	 10Analytics, 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (10elukey) While checking atskafka logs I found something interesting:  ` elukey@cp3050:~$ sudo journalctl -u atskafka-webrequest.serv...
[16:58:16] <joal>	 Hi tanny411 - errors in the spark-UI are about shufflw-fetch errors
[16:59:52] <joal>	 tanny411: this can come from multiple reasons, from too few workers (too much data shuffled across too few nodes) to too few partition to skewed data and too big stuff ending up on a single worker
[17:04:17] <tanny411>	 I nee to change up the configs then? 
[17:04:40] <tanny411>	 need*
[17:05:54] <joal>	 tanny411: hm, more precisions on what the job and its current config could help in tring to get a better idea of why the shuffle failed :)
[17:06:21] <tanny411>	 let me pull it up...
[17:08:47] <wikibugs>	 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban: Results have expired error in Hue - https://phabricator.wikimedia.org/T294144 (10BTullis) a:03BTullis
[17:10:23] <tanny411>	 This is my current config: ```"spark.driver.memory": "16g",
[17:10:23] <tanny411>	                                             "spark.executor.memory": "16g",
[17:10:23] <tanny411>	                                             "spark.sql.shuffle.partitions": 1024```
[17:10:23] <tanny411>	 Besides setting yarn-large in wmfdata. 
[17:11:26] <joal>	 hm - IIRC yarn-large in wmfdata makes 4 cores executors, right?
[17:12:00] <tanny411>	 Yes
[17:12:23] <joal>	 hm
[17:12:35] <tanny411>	 wmfdata yarn-large configs: 
[17:12:35] <tanny411>	 ```"spark.driver.memory": "4g",
[17:12:35] <tanny411>	             "spark.dynamicAllocation.maxExecutors": 128,
[17:12:35] <tanny411>	             "spark.executor.memory": "8g",
[17:12:35] <tanny411>	             "spark.executor.cores": 4,
[17:12:36] <tanny411>	             "spark.sql.shuffle.partitions": 512
[17:12:36] <tanny411>	 ```
[17:18:04] <joal>	 tanny411: Could it be that your data is skewed?
[17:19:01] <tanny411>	 joal: how can I verify that? I did a bunch of filtering and merging on wikidata and query datasets
[17:20:40] <joal>	 tanny411: it requires knowledge of your data, and also an understanding of how data is shuffled (keys)
[17:21:31] <tanny411>	 Ah, that I don't think I can know for sure, but it could be.
[17:22:09] <tanny411>	 thats what i want to find out with the query actually, haha
[17:22:37] <joal>	 tanny411: :D
[17:26:29] <tanny411>	 joal: I am running the query again now, it doesn't seem to be producing errors yet. I'll let you know if I run into errors again. That is the problem, things are inconsistent
[17:26:54] <joal>	 tanny411: can very weel be linked to cluster usage as well :S
[17:27:22] <tanny411>	 indeed. 
[17:27:28] <tanny411>	 where can I see cluster usage though?
[17:27:38] <joal>	 tanny411: I'm monitoring 
[17:27:41] <joal>	 tanny411: https://yarn.wikimedia.org/cluster/scheduler
[17:28:55] <tanny411>	 Thanks!
[17:29:30] <joal>	 and then tanny411 when you have found your job in the page, click on the "Application Master" link at the end of the job line
[17:29:40] <joal>	 tanny411: this leads to - https://yarn.wikimedia.org/proxy/application_1633985963344_161408/stages/
[17:29:51] <joal>	 where you can see many details about yout job
[17:30:36] <joal>	 tanny411: from what I can see in the job, the query is very complex (many steps)
[17:31:01] <joal>	 tanny411: would it be something to create intermediate materialized results (written files)? 
[17:31:31] <joal>	 If there is no reusable bit, then probably no, but would there be?
[17:32:52] <tanny411>	 Humm... so I did compute another query that ran successfully that contained most of the same tasks. The new query simply uses different filtering and grouping. So I don't think it is because of the complexity. Data is too large to be stored actually, but I will look into ways I can reframe the query.
[17:34:14] <joal>	 tanny411: idea behind materializing intermediate views is to allow to not have to recompute everything in case of failure , and also use HDFS instead of shuffle-storage for those intermediate results
[17:35:32] <tanny411>	 humm...why didn't i think of hdfs?! Is there any limit to what I can/cannot store in hdfs? 
[17:36:30] <joal>	 tanny411: there are limits yes, but they are high enough so that you shouldn't be too much restrained :)
[17:36:52] <tanny411>	 okay, I will try and so that.
[17:37:30] <joal>	 tanny411: another point about storing on hdfs- don't forget to delete it when you don't need it anymore ;)
[17:37:40] <joal>	 tanny411: it's particularly true when very big :)
[17:38:11] <tanny411>	 joal: true! thanks!
[17:38:50] <joal>	 tanny411: I'll be off this end of week (I think you know it :) - I'll gladly help when back on monday!
[17:39:07] <tanny411>	 joal: yes, sure.
[18:01:08] <mforns>	 milimetric: you said there was a PS about airflow today, can I ask what was it about?
[18:03:36] <icinga-wm>	 PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:16:37] <tanny411>	 joal: the query succeeded. but again, the output is printed in between a lot of error messages. 
[18:20:09] <btullis>	 !log btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed.service
[18:20:14] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:25:46] <icinga-wm>	 RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:31:42] <milimetric>	 mforns: we didn't get to it yet, I was going to ping after this round of meetings
[18:31:50] <milimetric>	 it might be too late, we can talk tomorrow
[18:32:14] <milimetric>	 nothing urgent, just some casual discussion around Airflow as a Service vs. ETL as a Service from the point of view of Product Analytics
[18:36:12] <ottomata>	 mforns: mostly just this convo from PA sync https://docs.google.com/document/d/1EnyRx1mr9pitxGlfnSbBE1pYA9q8xENjzIO9GOyg4Yo/edit
[18:36:49] <mforns>	 OK, I read that already, cool we can discuss, will be here for a bit still
[19:30:38] <ottomata>	 mforns:  
[19:30:59] <ottomata>	 so, just for ease of maintenance, i'm going to use a single ssh key for scap deployments of airflow-dags to different instances.
[19:31:31] <ottomata>	 that will mean that any airflow user group, e.g. analytics-research-admins, analytics-platform-eng-admins, etc. can deploy any scap deploy any instance airflow-edags
[19:31:50] <ottomata>	 if we need more restriced deployment later, we can add more ssh keys, but I think this will be fine for now (and probably for good)
[19:31:54] <ottomata>	 that ok with you?
[19:46:48] <wikibugs>	 (03CR) 10Joal: Add discussiontools_subscription query to sqoop (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/736021 (https://phabricator.wikimedia.org/T290516) (owner: 10MNeisler)
[19:47:26] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "Forgot to add the +1 :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/736021 (https://phabricator.wikimedia.org/T290516) (owner: 10MNeisler)
[19:58:07] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (10JAllemandou) Updating the task description with the latest decisions after talking with @Ottomata (thanks again :).
[20:09:25] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (10Ottomata)
[20:09:29] <ottomata>	 joal: updated to note that we will be using task_number, not task_id
[20:20:28] <joal>	 ottomata: I'm fully revamping the content :S
[20:20:40] <ottomata>	 oh k! :)
[20:21:11] <joal>	 I'll overwrite your patch sorry for that :)
[20:24:42] <ottomata>	 no worries!
[21:43:54] <joal>	 ottomata: heya - still here?
[21:44:13] <ottomata>	 ya
[21:44:16] <ottomata>	 joal hellooo 
[21:44:26] <joal>	 ottomata: batcave for a minute?
[21:44:47] <ottomata>	 yup
[21:55:48] <mforns>	 ottomata: agree re. scap groups!
[21:55:52] <mforns>	 thanks :]
[22:04:21] <ottomata>	 ok gr8!
[22:12:09] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (10JAllemandou)
[22:12:19] <joal>	 \o/ doc done :) --^
[22:12:32] <joal>	 ottomata: if you wish to review when you have free time :)
[22:12:47] <joal>	 With that, I wish ou all a good end of week!
[22:12:47] <ottomata>	 joal:  will do!  gotta run now tho!  thank you!
[22:20:22] <wikibugs>	 (03PS1) 10Bearloga: movement_metrics: Fail job if a notebook fails [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/738018 (https://phabricator.wikimedia.org/T295513)
[22:21:57] <wikibugs>	 (03CR) 10Bearloga: [V: 03+2 C: 03+2] "Tested that it works as expected after introducing bugs in repo clone" [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/738018 (https://phabricator.wikimedia.org/T295513) (owner: 10Bearloga)
[23:17:06] <wikibugs>	 10Analytics-Data-Quality, 10Analytics-EventLogging, 10Analytics-Radar, 10Editing-team, and 3 others: WikiEditor records all edits as platform = desktop in EventLogging - https://phabricator.wikimedia.org/T249944 (10ppelberg)
[23:35:57] <wikibugs>	 10Analytics-Data-Quality, 10Analytics-EventLogging, 10Analytics-Radar, 10Editing-team, and 3 others: WikiEditor records all edits as platform = desktop in EventLogging - https://phabricator.wikimedia.org/T249944 (10DLynch) I commented [elsewhere](https://phabricator.wikimedia.org/T294503#7493119) but I'm a...
[23:50:54] <wikibugs>	 10Analytics-Data-Quality, 10Analytics-EventLogging, 10Analytics-Radar, 10Product-Analytics, and 3 others: WikiEditor records all edits as platform = desktop in EventLogging - https://phabricator.wikimedia.org/T249944 (10ppelberg)