[00:11:19] PROBLEM - Check unit status of refinery-sqoop-whole-mediawiki on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-sqoop-whole-mediawiki https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:42:40] 10Analytics, 10SRE, 10SRE-Access-Requests, 10User-Ladsgroup: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (10Ladsgroup) a:03Ladsgroup According to https://github.com/wikimedia/puppet/blob/production/modules/admin/data/data.yaml your shell username is dannyh, or you...
[05:40:24] 10Analytics, 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Ladsgroup: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (10Ladsgroup) I made the patch for it, please confirm that the correct LDAP username is dannyh and I will merge it. Keep it in mind this is...
[08:20:18] 10Analytics: Check home/HDFS leftovers of bumeh-ctr - https://phabricator.wikimedia.org/T300607 (10MoritzMuehlenhoff)
[08:50:40] 10Analytics, 10MediaWiki-extensions-EventLogging, 10MediaWiki-extensions-WikimediaEvents: Remove userAgent from Schema:PageContentSaveComplete - https://phabricator.wikimedia.org/T104863 (10kostajh) Untagging #contributors-team per T300558.
[08:54:20] 10Data-Engineering, 10Airflow, 10Platform Engineering: Catalog, Categorize, and Templetize existing scheduled workflows - https://phabricator.wikimedia.org/T282035 (10JAllemandou)
[08:55:54] 10Data-Engineering: [Spike] Test spark thrift-server for Superset - https://phabricator.wikimedia.org/T300611 (10JAllemandou)
[09:26:24] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban: Check home/HDFS leftovers of bumeh-ctr - https://phabricator.wikimedia.org/T300607 (10BTullis) a:03BTullis
[09:26:57] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban: Check home/HDFS leftovers of bumeh-ctr - https://phabricator.wikimedia.org/T300607 (10BTullis) p:05Triage→03Medium
[09:32:01] I am investigating the `refinery-sqoop-whole-mediawiki` job failure.
[09:32:37] thanks btullis - I'm here if needed
[09:36:42] Thanks joal. It's not immediately apparent what caused this. I'll look at the logs of the Mariadb server to see if it had a wobble.
[09:36:47] https://www.irccloud.com/pastebin/vTKhOI80/
[09:38:38] ACKNOWLEDGEMENT - Check unit status of refinery-sqoop-whole-mediawiki on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-sqoop-whole-mediawiki Btullis Investigating this sqoop failure. https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:47:45] (03Abandoned) 10DCausse: Add wikibase/rdf/update_stream/1.0.0 [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/594098 (owner: 10DCausse)
[10:00:46] I'm going to re-run the service,
[10:00:55] btullis: please don't
[10:01:02] OK.
[10:02:17] btullis: I'm investigating as well :)
[10:02:22] I'm unsure what to do at the moment. I've looked in `an-launcher1002:/var/log/refinery/sqoop-mediawiki.log.1`
[10:02:38] Ah, great. Let me know if you'd like to chat about it.
[10:06:48] btullis: https://phabricator.wikimedia.org/T297191
[10:07:11] btullis: batcave?
[10:07:28] On my way.
[10:20:57] (03PS1) 10Joal: Fix sqoop page_restriction schema [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758798 (https://phabricator.wikimedia.org/T297191)
[10:30:52] btullis: --^
[10:31:58] btullis: I'm gone for a swim, back in ~2h
[10:39:19] btullis: o/ there are some DE nodes in icinga showing up issues (little ones, nothing burning, just wanted to ping)
[11:06:58] Thanks elukey. Will check them out. I know about the matomo one, which I'm planning to do anyway. The change I made on Friday had to be reverted because the prometheus change wasn't quite right.
[11:09:24] !log btullis@an-test-coord1001:~$ sudo apt-get -f install
[11:09:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:28:49] !log kill processes related to offboarded user on stat1006 to unblock puppet
[12:28:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:05:42] hi teammm!
[13:06:56] 10Analytics, 10Data-Engineering: Kerberos identity for bmansurov - https://phabricator.wikimedia.org/T300450 (10fkaelin)
[13:25:05] Hi mforns.
[13:34:40] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Set SparkmaxPartitionBytes to 256MB - https://phabricator.wikimedia.org/T300299 (10Antoine_Quhen) a:03Ottomata
[13:35:17] (03CR) 10Btullis: [C: 03+1] "Looks good to me. I still don't have +2 rights on this repo." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758798 (https://phabricator.wikimedia.org/T297191) (owner: 10Joal)
[13:40:41] 10Data-Engineering-Radar, 10Release-Engineering-Team: Requesting membership of the analytics group in gerrit for 'btullis' - https://phabricator.wikimedia.org/T300631 (10BTullis)
[13:41:16] I have made a request for membership of the analytics group in gerrit --^
[14:08:13] 10Data-Engineering-Radar, 10Gerrit-Privilege-Requests, 10Release-Engineering-Team: Requesting membership of the analytics group in gerrit for 'btullis' - https://phabricator.wikimedia.org/T300631 (10Zabe)
[14:16:36] (03CR) 10Ottomata: [C: 03+2] Fix sqoop page_restriction schema [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758798 (https://phabricator.wikimedia.org/T297191) (owner: 10Joal)
[14:16:38] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix sqoop page_restriction schema [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758798 (https://phabricator.wikimedia.org/T297191) (owner: 10Joal)
[14:17:01] (03CR) 10Ottomata: "Huh, Ben I just tried to give you merge rights, but gerrit wouldn't let me! I'm not sure how to do that then." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758798 (https://phabricator.wikimedia.org/T297191) (owner: 10Joal)
[14:18:19] (03CR) 10Btullis: Fix sqoop page_restriction schema (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758798 (https://phabricator.wikimedia.org/T297191) (owner: 10Joal)
[14:29:20] 10Analytics-Clusters, 10Data-Engineering, 10Data-Engineering-Kanban, 10Cassandra, and 3 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (10BTullis) Looks good to me too. {F34939480,width=60%} I'm happy to wait a couple more days to be sure,...
[15:00:28] (03CR) 10Ottomata: "Left some inline Qs and comments." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: 10Jason Linehan)
[15:45:14] heya ottomata :] I've been testing the dependency thing, and in the end it seems all tasks that use artifacts are working!
However, I've encountered a weird error in the should_alert task (a python one that uses pyarrow). This task is silently freezing, but only when using the new setup. I suspect it has something to do with the way the workflow_utils lib uses pyarrow?
[15:45:45] Do you have 10 minutes later today to help me troubleshoot?
[15:52:46] hmmmm
[15:52:51] yeah for sure lets figure it out mforns
[15:53:17] it is possible workflow_utils pyarrow ultimately uses the old HDFS API, since i'm just using it via fsspec
[15:53:25] and i think fsspec might not have upgraded to use the new pyarrow HDFS API
[15:53:30] yes, it does use the old api
[15:53:59] maybe that conflicts with the pyarrow task
[15:57:10] yeah maybe
[17:01:22] I have one thing to deploy after standup: Fix sqoop page_restriction schema | https://gerrit.wikimedia.org/r/c/analytics/refinery/+/758798
[17:01:28] Anything for anyone else?
[17:01:40] I can't think of anything else for me btullis :)
[17:02:00] btullis: I'll be in meetings all evening - I'll be able to help through IRC async
[17:04:19] (03CR) 10Milimetric: Explore kafka data with visidata (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/673648 (https://phabricator.wikimedia.org/T265765) (owner: 10Milimetric)
[17:14:28] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Epic: Low Risk Oozie Migration: Mediawiki Geoeditors Monthly - https://phabricator.wikimedia.org/T300282 (10ntsako) 05Open→03In progress
[17:14:30] 10Data-Engineering, 10Data-Engineering-Kanban: Create Custom Hdfssensor - https://phabricator.wikimedia.org/T300276 (10BTullis)
[17:14:32] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10ntsako)
[17:15:31] 10Data-Engineering, 10Data-Engineering-Kanban: Create Custom Hdfssensor - https://phabricator.wikimedia.org/T300276 (10BTullis) p:05Triage→03Medium
[17:16:56] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10Antoine_Quhen) a:05Antoine_Quhen→03None
[17:23:03] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Epic: Low Risk Oozie Migration: Mediawiki Geoeditors Monthly - https://phabricator.wikimedia.org/T300282 (10BTullis) p:05Triage→03Medium
[17:24:35] 10Data-Engineering, 10Airflow, 10Platform Engineering: Replace Airflow's HDFS client (snakebite) with pyarrow - https://phabricator.wikimedia.org/T284566 (10Ottomata) A PR already exists in airflow to do this. We could reopen it and follow up: https://github.com/apache/airflow/pull/3560/files
[17:29:22] !log about to deploy analytics/refinery
[17:29:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:43:15] !log btullis@an-launcher1002:~$ sudo systemctl start refinery-sqoop-whole-mediawiki.service
[17:43:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:07:16] I am not sure how to re-run the two oozie jobs that have failed. Should I be using the hue interface like this? If so, should I be selecting 'All or skip successful' or 'only failed'?
[18:07:21] https://usercontent.irccloud-cdn.com/file/ovFZLM89/image.png
[18:09:11] Also I cannot see the `virtualpageview-druid-daily` job in Hue to rerun. Sorry.
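For context on the "old HDFS API" versus "new pyarrow HDFS API" distinction discussed above, here is a minimal sketch of the two pyarrow entry points. It assumes a host with a working Hadoop client environment (CLASSPATH and ARROW_LIBHDFS_DIR set); the hostname and path are illustrative placeholders, and the comment about duplicated libhdfs initialisation is a hypothesis, not a confirmed root cause of the freeze described above.

```python
# Minimal sketch of the two pyarrow HDFS entry points discussed above.
# Assumes a working Hadoop client environment; host/path values are placeholders.
import pyarrow.hdfs            # legacy API, deprecated upstream in favour of pyarrow.fs
import pyarrow.fs as pafs      # newer filesystem API

# Legacy API: what fsspec-backed code (e.g. workflow_utils) reportedly still goes through.
legacy_fs = pyarrow.hdfs.connect(host="default")
print(legacy_fs.ls("/tmp"))

# Newer API: what a task calling pyarrow directly may end up using instead.
new_fs = pafs.HadoopFileSystem("default")
print(new_fs.get_file_info(pafs.FileSelector("/tmp")))

# Hypothesis only: using both in one process means two separate libhdfs/JNI
# initialisations, which is one plausible place for a silent hang to originate.
```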
[18:11:34] I deployed analytics/refinery and then ran `refinery-deploy-to-hdfs` as described here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Deploy/Refinery#How_to_deploy
[18:11:34] But it still seems to be trying to get the `pr_long` table.
[18:11:38] https://www.irccloud.com/pastebin/bbSej7tl/
[18:11:56] I haven't done very well with these alerts today.
[18:25:33] Arf sorry btullis - I wasn't looking
[18:25:41] Using hue for reruns is usually what I do
[18:26:25] I rerun the failed instance, in the parent coordinator UI (not within the workflow)
[18:28:51] Ah btullis - I've been too fast with my patch (CR didn't catch it either :( - new patch on the
[18:28:54] way
[18:30:15] (03PS1) 10Joal: Fix sqoop page_restriction job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758940 (https://phabricator.wikimedia.org/T297191)
[18:30:24] btullis: --^
[18:34:27] !log rerun webrequest-druid-hourly-wf-2022-2-1-12
[18:34:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:35:52] (03CR) 10Tullis: [C: 03+1] "Lgtm." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758940 (https://phabricator.wikimedia.org/T297191) (owner: 10Joal)
[18:36:30] !log Rerun virtualpageview-druid-daily-wf-2022-1-16
[18:36:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:36:57] I +1'd the change with my volunteer account in Gerrit by mistake. :-)
[18:45:17] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging hotfix" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758940 (https://phabricator.wikimedia.org/T297191) (owner: 10Joal)
[18:45:42] it's merged btullis - let me know if you wish me to redeploy
[18:53:43] Please do if you have time. I should be able to rerun the service in just over two hours.
[19:01:29] !log Deploying refinery with scap
[19:01:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:19:42] joal: yt?
[19:21:01] mforns: got a few mins if you want to talk airflow stuff
[19:21:12] ottomata: here!
[19:23:11] joal https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/192e882a9f51897f25d8244abf5b2ea1311c8bf6/wmf_airflow_common/operators/spark.py#L173
[19:23:42] so far these Operators don't depend on custom other stuff, like refinery
[19:23:52] i guess.... do we want to always use NoCLI?
[19:23:54] i guess we should?
[19:23:58] or, should we default to not using NoCLI
[19:24:00] since we can run in client mode
[19:24:09] with CLIDriver
[19:24:15] ottomata: if we use Skein, NoCLI is not mandatory anymore
[19:24:21] okay
[19:24:26] perhaps i will default to CLIDriver
[19:24:32] and we can override in our default args / dags
[19:24:43] joal also
[19:24:47] params on line 187 there
[19:24:51] does val need quoting?
[19:24:54] or should it not be?
[19:24:57] mforns: bc?
[19:25:02] in it!
[19:25:03] ottomata: I'd do it the other way - CLIDriver will break our instance if not used with Skein
[19:25:24] hm, no? it will break our instance if not used in skein but used in cluster mode
[19:25:32] i'd like the code here to not reference refinery if possible
[19:25:34] ottomata: assuming the vals are "simple string", no quote needed
[19:25:44] right, but they could have spaces maybe?
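One way to picture the quoting question at the end of the exchange above (a hypothetical sketch only, not the actual wmf_airflow_common/operators/spark.py code): if key/value params are rendered into a command string that is later parsed by a shell, shlex.quote protects values containing spaces while leaving simple strings untouched. Whether the downstream option and Hive parsing then accepts those quotes, or whether the string is split without shell-style parsing so the quotes arrive literally, is a separate question, which is what the reply just below is about.

```python
# Hypothetical sketch, not the real operator code: render {'key': 'val'} params
# into a "key=value" argument string, quoting defensively for shell parsing.
import shlex

def render_params(params: dict) -> str:
    """Render params as 'key=value ...'; shlex.quote leaves simple strings unchanged."""
    return " ".join(f"{key}={shlex.quote(str(val))}" for key, val in params.items())

# Simple values pass through untouched; values with spaces get single-quoted.
print(render_params({"snapshot": "2022-01", "comment": "monthly run"}))
# -> snapshot=2022-01 comment='monthly run'
```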
[19:26:32] ottomata: I don't know how our opt-parsing (and hive-parsing) would accept that - I'm pretty sure it'll fail as of now because of my translation
[20:05:55] !log btullis@an-launcher1002:~$ sudo systemctl restart refinery-sqoop-whole-mediawiki.service
[20:05:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:14:49] RECOVERY - Check unit status of refinery-sqoop-whole-mediawiki on an-launcher1002 is OK: OK: Status of the systemd unit refinery-sqoop-whole-mediawiki https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:17:30] ok! looks like sqoop is back on track :) thanks a lot for the restart btullis
[20:17:36] Gone for tonight folks
[20:27:59] mforns: back for a bit how goes?
[20:28:37] ottomata: I have a question for you if you have a minute: I'm seeing an alert `CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@matomo.service`
[20:28:46] oh
[20:29:15] hm razzi i can't quite recall what wmf_auto_restart is, but i think it's an sre thing...maybe for debmonitor?
[20:29:20] but, it should be in puppet, lets see
[20:29:30] want to do a quick voice chat ottomata ?
[20:29:33] sure
[21:08:49] (03CR) 10Milimetric: [C: 03+1] "Looks good to me, just two questions but you can merge if you're in a rush." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934) (owner: 10Joal)
[21:11:48] joal: i don't suppose we can somehow get NoCLIDriver into these specialClasses eh?
[21:11:49] https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L77-L89
[21:11:52] private static final :(
[21:16:26] razzi: I've been working on that alert. It's related to the matomo one in my in progress column. You can ack the alert if you want, or just ignore.
[21:17:02] Sounds good btullis, I'll ack it
[22:11:24] 10Analytics, 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Ladsgroup: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (10DannyH) I'm not sure how to check this. On Superset, my profile is https://superset.wikimedia.org/superset/profile/dannyh/ To log in, I...
[22:27:27] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (10razzi)
[22:27:33] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Apache atlas build fails due to expired certificate (https://maven.restlet.com) - https://phabricator.wikimedia.org/T297841 (10razzi) 05Open→03Resolved I also got the build to work at one point, and we're pausing on Atlas due to the hive incomp...
[22:28:19] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Epic: Evaluate Atlas - https://phabricator.wikimedia.org/T299165 (10razzi)
[22:28:30] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (10razzi) 05Open→03Resolved We're calling this done, since the latest Atlas not supporting the hive version we're running is enough of a blocker that we're pausing with Atlas...
[23:08:33] 10Data-Engineering, 10Data-Engineering-Kanban: Kerberos identity for bmansurov - https://phabricator.wikimedia.org/T300450 (10odimitrijevic) p:05Triage→03High
[23:35:18] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Set SparkmaxPartitionBytes to 256MB - https://phabricator.wikimedia.org/T300299 (10odimitrijevic) p:05Triage→03High
[23:36:35] 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Conda's CPPFLAGS may not be correct when pip installing a package that needs c/cpp compilation - https://phabricator.wikimedia.org/T292699 (10Ottomata) This didn't quite work! After activating a stacked env now, CPPFLAGS are: ` -DNDEBUG -D_F...
[23:38:01] 10Data-Engineering, 10Superset: [Spike] Test spark thrift-server for Superset - https://phabricator.wikimedia.org/T300611 (10odimitrijevic)
[23:43:14] 10Data-Engineering, 10Product-Analytics, 10Superset: Investigate Superset query templating as a mean to optimize partition pruning - https://phabricator.wikimedia.org/T299961 (10odimitrijevic)
[23:50:41] 10Data-Engineering, 10Project-Admins: Archive Analytics tag - https://phabricator.wikimedia.org/T298671 (10odimitrijevic) Hi @Aklapper apologies for the very late response on this and thanks for the list above. I propose the following changes: * Can H126 be changed to add Data-Engineering instead of Analytics...