[00:11:19] PROBLEM - Check unit status of refinery-sqoop-whole-mediawiki on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-sqoop-whole-mediawiki https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:42:40] 10Analytics, 10SRE, 10SRE-Access-Requests, 10User-Ladsgroup: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (10Ladsgroup) a:03Ladsgroup According to https://github.com/wikimedia/puppet/blob/production/modules/admin/data/data.yaml your shell username is dannyh, or you...
[05:40:24] 10Analytics, 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Ladsgroup: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (10Ladsgroup) I made the patch for it, please confirm that the correct LDAP username is dannyh and I will merge it. Keep it in mind this is...
[08:20:18] 10Analytics: Check home/HDFS leftovers of bumeh-ctr - https://phabricator.wikimedia.org/T300607 (10MoritzMuehlenhoff)
[08:50:40] 10Analytics, 10MediaWiki-extensions-EventLogging, 10MediaWiki-extensions-WikimediaEvents: Remove userAgent from Schema:PageContentSaveComplete - https://phabricator.wikimedia.org/T104863 (10kostajh) Untagging #contributors-team per T300558.
[08:54:20] 10Data-Engineering, 10Airflow, 10Platform Engineering: Catalog, Categorize, and Templetize existing scheduled workflows - https://phabricator.wikimedia.org/T282035 (10JAllemandou)
[08:55:54] 10Data-Engineering: [Spike] Test spark thrift-server for Superset - https://phabricator.wikimedia.org/T300611 (10JAllemandou)
[09:26:24] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban: Check home/HDFS leftovers of bumeh-ctr - https://phabricator.wikimedia.org/T300607 (10BTullis) a:03BTullis
[09:26:57] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban: Check home/HDFS leftovers of bumeh-ctr - https://phabricator.wikimedia.org/T300607 (10BTullis) p:05Triage→03Medium
[09:32:01] I am investigating the `refinery-sqoop-whole-mediawiki` job failure.
[09:32:37] thanks btullis - I'm here if needed
[09:36:42] Thanks joal. It's not immediately apparent what caused this. I'll look at the logs of the Mariadb server to see if it had a wobble.
[09:36:47] https://www.irccloud.com/pastebin/vTKhOI80/
[09:38:38] ACKNOWLEDGEMENT - Check unit status of refinery-sqoop-whole-mediawiki on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-sqoop-whole-mediawiki Btullis Investigating this sqoop failure. https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:47:45] (03Abandoned) 10DCausse: Add wikibase/rdf/update_stream/1.0.0 [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/594098 (owner: 10DCausse)
[10:00:46] I'm going to re-run the service,
[10:00:55] btullis: please don't
[10:01:02] OK.
[10:02:17] btullis: I'm investigating as well :)
[10:02:22] I'm unsure what to do at the moment. I've looked in `an-launcher1002:/var/log/refinery/sqoop-mediawiki.log.1`
[10:02:38] Ah, great. Let me know if you'd like to chat about it.
[10:06:48] btullis: https://phabricator.wikimedia.org/T297191
[10:07:11] btullis: batcave?
[10:07:28] On my way.
[10:20:57] (03PS1) 10Joal: Fix sqoop page_restriction schema [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758798 (https://phabricator.wikimedia.org/T297191)
[10:30:52] btullis: --^
[10:31:58] btullis: I'm gone for a swim, back in ~2h
[10:39:19] btullis: o/ there are some DE nodes in icinga showing up issues (little ones, nothing burning, just wanted to ping)
[11:06:58] Thanks elukey. Will check them out. I know about the matomo one, which I'm planning to do anyway. The change I made on Friday had to be reverted because the prometheus change wasn't quite right.
[11:09:24] !log btullis@an-test-coord1001:~$ sudo apt-get -f install
[11:09:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:28:49] !log kill processes related to offboarded user on stat1006 to unblock puppet
[12:28:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:05:42] hi teammm!
[13:06:56] 10Analytics, 10Data-Engineering: Kerberos identity for bmansurov - https://phabricator.wikimedia.org/T300450 (10fkaelin)
[13:25:05] Hi mforns.
[13:34:40] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Set SparkmaxPartitionBytes to 256MB - https://phabricator.wikimedia.org/T300299 (10Antoine_Quhen) a:03Ottomata
[13:35:17] (03CR) 10Btullis: [C: 03+1] "Looks good to me. I still don't have +2 rights on this repo." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758798 (https://phabricator.wikimedia.org/T297191) (owner: 10Joal)
[13:40:41] 10Data-Engineering-Radar, 10Release-Engineering-Team: Requesting membership of the analytics group in gerrit for 'btullis' - https://phabricator.wikimedia.org/T300631 (10BTullis)
[13:41:16] I have made a request for membership of the analytics group in gerrit --^
[14:08:13] 10Data-Engineering-Radar, 10Gerrit-Privilege-Requests, 10Release-Engineering-Team: Requesting membership of the analytics group in gerrit for 'btullis' - https://phabricator.wikimedia.org/T300631 (10Zabe)
[14:16:36] (03CR) 10Ottomata: [C: 03+2] Fix sqoop page_restriction schema [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758798 (https://phabricator.wikimedia.org/T297191) (owner: 10Joal)
[14:16:38] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix sqoop page_restriction schema [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758798 (https://phabricator.wikimedia.org/T297191) (owner: 10Joal)
[14:17:01] (03CR) 10Ottomata: "Huh, Ben I just tried to give you merge rights, but gerrit wouldn't let me! I'm not sure how to do that then." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758798 (https://phabricator.wikimedia.org/T297191) (owner: 10Joal)
[14:18:19] (03CR) 10Btullis: Fix sqoop page_restriction schema (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758798 (https://phabricator.wikimedia.org/T297191) (owner: 10Joal)
[14:29:20] 10Analytics-Clusters, 10Data-Engineering, 10Data-Engineering-Kanban, 10Cassandra, and 3 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (10BTullis) Looks good to me too. {F34939480,width=60%} I'm happy to wait a couple more days to be sure,...
[15:00:28] (03CR) 10Ottomata: "Left some inline Qs and comments." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: 10Jason Linehan)
[15:45:14] heya ottomata :] I've been testing the dependency thing, and in the end it seems all tasks that use artifacts are working!
However, I've encountered a weird error in the should_alert task (a python one that uses pyarrow). This task is silently freezing, but only when using the new setup. I suspect it has something to do with the way the workflow_utils lib uses pyarrow?
[15:45:45] Do you have 10 minutes later today to help me troubleshoot?
[15:52:46] hmmmm
[15:52:51] yeah for sure lets figure it out mforns
[15:53:17] it is possible workflow_utils pyarrow ultimately uses the old HDFS API, since i'm just using it via fsspec
[15:53:25] and i think fsspec might not have upgraded to use the new pyarrow HDFS API
[15:53:30] yes, it does use the old api
[15:53:59] maybe that conflicts with the pyarrow task
[15:57:10] yeah maybe
[17:01:22] I have one thing to deploy after standup: Fix sqoop page_restriction schema | https://gerrit.wikimedia.org/r/c/analytics/refinery/+/758798
[17:01:28] Anything for anyone else?
[17:01:40] I can't think of anything else for me btullis :)
[17:02:00] btullis: I'll be in meetings all evening - I'll be able to help through IRC async
[17:04:19] (03CR) 10Milimetric: Explore kafka data with visidata (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/673648 (https://phabricator.wikimedia.org/T265765) (owner: 10Milimetric)
[17:14:28] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Epic: Low Risk Oozie Migration: Mediawiki Geoeditors Monthly - https://phabricator.wikimedia.org/T300282 (10ntsako) 05Open→03In progress
[17:14:30] 10Data-Engineering, 10Data-Engineering-Kanban: Create Custom Hdfssensor - https://phabricator.wikimedia.org/T300276 (10BTullis)
[17:14:32] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10ntsako)
[17:15:31] 10Data-Engineering, 10Data-Engineering-Kanban: Create Custom Hdfssensor - https://phabricator.wikimedia.org/T300276 (10BTullis) p:05Triage→03Medium
[17:16:56] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10Antoine_Quhen) a:05Antoine_Quhen→03None
[17:23:03] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Epic: Low Risk Oozie Migration: Mediawiki Geoeditors Monthly - https://phabricator.wikimedia.org/T300282 (10BTullis) p:05Triage→03Medium
[17:24:35] 10Data-Engineering, 10Airflow, 10Platform Engineering: Replace Airflow's HDFS client (snakebite) with pyarrow - https://phabricator.wikimedia.org/T284566 (10Ottomata) A PR already exists in airflow to do this. We could reopen it and follow up: https://github.com/apache/airflow/pull/3560/files
[17:29:22] !log about to deploy analytics/refinery
[17:29:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:43:15] !log btullis@an-launcher1002:~$ sudo systemctl start refinery-sqoop-whole-mediawiki.service
[17:43:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:07:16] I am not sure how to re-run the two oozie jobs that have failed. Should I be using the hue interface like this? If so, should I be selecting 'All or skip successful' or 'only failed'?
[18:07:21] https://usercontent.irccloud-cdn.com/file/ovFZLM89/image.png
[18:09:11] Also I cannot see the `virtualpageview-druid-daily` job in Hue to rerun. Sorry.
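For context on the "old HDFS API" versus "new pyarrow HDFS API" distinction discussed above, here is a minimal sketch of the two pyarrow entry points. It assumes a host with a working Hadoop client environment (CLASSPATH and ARROW_LIBHDFS_DIR set); the hostname and path are illustrative placeholders, and the comment about duplicated libhdfs initialisation is a hypothesis, not a confirmed root cause of the freeze described above.

```python
# Minimal sketch of the two pyarrow HDFS entry points discussed above.
# Assumes a working Hadoop client environment; host/path values are placeholders.
import pyarrow.hdfs            # legacy API, deprecated upstream in favour of pyarrow.fs
import pyarrow.fs as pafs      # newer filesystem API

# Legacy API: what fsspec-backed code (e.g. workflow_utils) reportedly still goes through.
legacy_fs = pyarrow.hdfs.connect(host="default")
print(legacy_fs.ls("/tmp"))

# Newer API: what a task calling pyarrow directly may end up using instead.
new_fs = pafs.HadoopFileSystem("default")
print(new_fs.get_file_info(pafs.FileSelector("/tmp")))

# Hypothesis only: using both in one process means two separate libhdfs/JNI
# initialisations, which is one plausible place for a silent hang to originate.
```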
[18:11:34] I deployed analytics/refinery and then ran `refinery-deploy-to-hdfs` as described here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Deploy/Refinery#How_to_deploy
[18:11:34] But it still seems to be trying to get the `pr_long` table.
[18:11:38] https://www.irccloud.com/pastebin/bbSej7tl/
[18:11:56] I haven't done very well with these alerts today.
[18:25:33] Arf sorry btullis - I wasn't looking
[18:25:41] Using hue for reruns is usually what I do
[18:26:25] I rerun the failed instance, in the parent coordinator UI (not within the workflow)
[18:28:51] Ah btullis - I've been too fast with my patch (CR didn't catch it either :( - new patch on the
[18:28:54] way
[18:30:15] (03PS1) 10Joal: Fix sqoop page_restriction job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758940 (https://phabricator.wikimedia.org/T297191)
[18:30:24] btullis: --^
[18:34:27] !log rerun webrequest-druid-hourly-wf-2022-2-1-12
[18:34:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:35:52] (03CR) 10Tullis: [C: 03+1] "Lgtm." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758940 (https://phabricator.wikimedia.org/T297191) (owner: 10Joal)
[18:36:30] !log Rerun virtualpageview-druid-daily-wf-2022-1-16
[18:36:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:36:57] I +1'd the change with my volunteer account in Gerrit by mistake. :-)
[18:45:17] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging hotfix" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/758940 (https://phabricator.wikimedia.org/T297191) (owner: 10Joal)
[18:45:42] it's merged btullis - let me know if you wish me to redeploy
[18:53:43] Please do if you have time. I should be able to rerun the service in just over two hours.
[19:01:29] !log Deploying refinery with scap
[19:01:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:19:42] joal: yt?
[19:21:01] mforns: got a few mins if you want to talk airflow stuff
[19:21:12] ottomata: here!
[19:23:11] joal https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/192e882a9f51897f25d8244abf5b2ea1311c8bf6/wmf_airflow_common/operators/spark.py#L173
[19:23:42] so far these Operators don't depend on custom other stuff, like refinery
[19:23:52] i guess.... do we want to always use NoCLI?
[19:23:54] i guess we should?
[19:23:58] or, should we default to not using NoCLI
[19:24:00] since we can run in client mode
[19:24:09] with CLIDriver
[19:24:15] ottomata: if we use Skein, NoCLI is not mandatory anymore
[19:24:21] okay
[19:24:26] perhaps i will default to CLIDriver
[19:24:32] and we can override in our default args / dags
[19:24:43] joal also
[19:24:47] params on line 187 there
[19:24:51] does val need quoting?
[19:24:54] or should it not be?
[19:24:57] mforns: bc?
[19:25:02] in it!
[19:25:03] ottomata: I'd do it the other way - CLIDriver will break our instance if not used with Skein
[19:25:24] hm, no? it will break our instance if not used in skein but used in cluster mode
[19:25:32] i'd like the code here to not reference refinery if possible
[19:25:34] ottomata: assuming the vals are "simple string", no quote needed
[19:25:44] right, but they could have spaces maybe?
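One way to picture the quoting question at the end of the exchange above (a hypothetical sketch only, not the actual wmf_airflow_common/operators/spark.py code): if key/value params are rendered into a command string that is later parsed by a shell, shlex.quote protects values containing spaces while leaving simple strings untouched. Whether the downstream option and Hive parsing then accepts those quotes, or whether the string is split without shell-style parsing so the quotes arrive literally, is a separate question, which is what the reply just below is about.

```python
# Hypothetical sketch, not the real operator code: render {'key': 'val'} params
# into a "key=value" argument string, quoting defensively for shell parsing.
import shlex

def render_params(params: dict) -> str:
    """Render params as 'key=value ...'; shlex.quote leaves simple strings unchanged."""
    return " ".join(f"{key}={shlex.quote(str(val))}" for key, val in params.items())

# Simple values pass through untouched; values with spaces get single-quoted.
print(render_params({"snapshot": "2022-01", "comment": "monthly run"}))
# -> snapshot=2022-01 comment='monthly run'
```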
[19:26:32] ottomata: I don't know how our opt-parsing (and hive-parsing) would accept that - I'm pretty sure it'll fail as of now because of my translation
[20:05:55] !log btullis@an-launcher1002:~$ sudo systemctl restart refinery-sqoop-whole-mediawiki.service
[20:05:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:14:49] RECOVERY - Check unit status of refinery-sqoop-whole-mediawiki on an-launcher1002 is OK: OK: Status of the systemd unit refinery-sqoop-whole-mediawiki https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:17:30] ok! looks like sqoop is back on track :) thanks a lot for the restart btullis
[20:17:36] Gone for tonight folks
[20:27:59] mforns: back for a bit how goes?
[20:28:37] ottomata: I have a question for you if you have a minute: I'm seeing an alert `CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@matomo.service`
[20:28:46] oh
[20:29:15] hm razzi i can't quite recall what wmf_auto_restart is, but i think it's an sre thing...maybe for debmonitor?
[20:29:20] but, it should be in puppet, lets see
[20:29:30] want to do a quick voice chat ottomata ?
[20:29:33] sure
[21:08:49] (03CR) 10Milimetric: [C: 03+1] "Looks good to me, just two questions but you can merge if you're in a rush." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934) (owner: 10Joal)
[21:11:48] joal: i don't suppose we can somehow get NoCLIDriver into these specialClasses eh?
[21:11:49] https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L77-L89
[21:11:52] private static final :(
[21:16:26] razzi: I've been working on that alert. It's related to the matomo one in my in progress column. You can ack the alert if you want, or just ignore.
[21:17:02] Sounds good btullis, I'll ack it
[22:11:24] 10Analytics, 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Ladsgroup: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (10DannyH) I'm not sure how to check this. On Superset, my profile is https://superset.wikimedia.org/superset/profile/dannyh/ To log in, I...
[22:27:27] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (10razzi)
[22:27:33] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Apache atlas build fails due to expired certificate (https://maven.restlet.com) - https://phabricator.wikimedia.org/T297841 (10razzi) 05Open→03Resolved I also got the build to work at one point, and we're pausing on Atlas due to the hive incomp...
[22:28:19] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Epic: Evaluate Atlas - https://phabricator.wikimedia.org/T299165 (10razzi)
[22:28:30] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (10razzi) 05Open→03Resolved We're calling this done, since the latest Atlas not supporting the hive version we're running is enough of a blocker that we're pausing with Atlas...
[23:08:33] 10Data-Engineering, 10Data-Engineering-Kanban: Kerberos identity for bmansurov - https://phabricator.wikimedia.org/T300450 (10odimitrijevic) p:05Triage→03High
[23:35:18] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Set SparkmaxPartitionBytes to 256MB - https://phabricator.wikimedia.org/T300299 (10odimitrijevic) p:05Triage→03High
[23:36:35] 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Conda's CPPFLAGS may not be correct when pip installing a package that needs c/cpp compilation - https://phabricator.wikimedia.org/T292699 (10Ottomata) This didn't quite work! After activating a stacked env now, CPPFLAGS are: ` -DNDEBUG -D_F...
[23:38:01] 10Data-Engineering, 10Superset: [Spike] Test spark thrift-server for Superset - https://phabricator.wikimedia.org/T300611 (10odimitrijevic)
[23:43:14] 10Data-Engineering, 10Product-Analytics, 10Superset: Investigate Superset query templating as a mean to optimize partition pruning - https://phabricator.wikimedia.org/T299961 (10odimitrijevic)
[23:50:41] 10Data-Engineering, 10Project-Admins: Archive Analytics tag - https://phabricator.wikimedia.org/T298671 (10odimitrijevic) Hi @Aklapper apologies for the very late response on this and thanks for the list above. I propose the following changes: * Can H126 be changed to add Data-Engineering instead of Analytics...