[08:59:26] (PS9) Martaannaj: Create wd_propertysuggester/client_ab_testing and wd_propertysuggester/server_ab_testing [schemas/event/secondary] - https://gerrit.wikimedia.org/r/689152
[09:00:00] (CR) jerkins-bot: [V: -1] Create wd_propertysuggester/client_ab_testing and wd_propertysuggester/server_ab_testing [schemas/event/secondary] - https://gerrit.wikimedia.org/r/689152 (owner: Martaannaj)
[09:00:24] (PS10) Martaannaj: Create wd_propertysuggester/client_side_property_request and wd_propertysuggester/server_side_property_request [schemas/event/secondary] - https://gerrit.wikimedia.org/r/689152
[09:00:55] (CR) jerkins-bot: [V: -1] Create wd_propertysuggester/client_side_property_request and wd_propertysuggester/server_side_property_request [schemas/event/secondary] - https://gerrit.wikimedia.org/r/689152 (owner: Martaannaj)
[09:10:17] (PS11) Martaannaj: Create wd_propertysuggester/client_side_property_request and wd_propertysuggester/server_side_property_request [schemas/event/secondary] - https://gerrit.wikimedia.org/r/689152
[09:10:51] (CR) jerkins-bot: [V: -1] Create wd_propertysuggester/client_side_property_request and wd_propertysuggester/server_side_property_request [schemas/event/secondary] - https://gerrit.wikimedia.org/r/689152 (owner: Martaannaj)
[09:15:08] (PS12) Martaannaj: Create wd_propertysuggester/client_side_property_request and wd_propertysuggester/server_side_property_request [schemas/event/secondary] - https://gerrit.wikimedia.org/r/689152
[09:15:50] (CR) jerkins-bot: [V: -1] Create wd_propertysuggester/client_side_property_request and wd_propertysuggester/server_side_property_request [schemas/event/secondary] - https://gerrit.wikimedia.org/r/689152 (owner: Martaannaj)
[09:33:53] (CR) Martaannaj: "Looking at the test output it seems like the schemas which are causing the tests to fail are not actually the ones created here (client_si" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/689152 (owner: Martaannaj)
[10:41:19] (CR) Svantje Lilienthal: [C: +1] Add aggregations for template data usage in VE's template dialog [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/703753 (https://phabricator.wikimedia.org/T272589) (owner: Andrew-WMDE)
[10:41:28] (CR) Svantje Lilienthal: [C: +1] Add aggregations for template data usage in TemplateWizard [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/703838 (https://phabricator.wikimedia.org/T272589) (owner: Andrew-WMDE)
[11:20:28] Kill stuck refine application
[11:20:31] !log Kill stuck refine application
[11:20:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:27:21] I still find it a bit nerve-wracking to run these cookbooks in dry-run mode. I guess that doesn't go away for a while, eh? :)
[12:00:34] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (BTullis)
[12:08:08] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (BTullis) Dry run succeeded for hadoop masters and workers cookbooks.
[12:19:01] btullis: dry-run mode is super fine, there is zero chance to get any change live, don't worry :)
[12:19:20] (it was built in the spicerack API by Riccardo very well)
[12:20:48] also the cookbooks have a lot of checks etc..
[12:42:50] joal o/
[12:43:06] qq about alluxio/presto
[12:43:07] https://phabricator.wikimedia.org/T286591
[12:43:20] in hardware planning, we wrote
[12:43:21] 512GB RAM, 32core x 2 procs, minimal storage
[12:43:38] is there any need for something like ssd storage for alluxio disk caching?
[12:43:47] hm - minimal storage is wrong
[12:43:49] or is it always in memory
[12:43:58] https://docs.google.com/spreadsheets/d/1123OTmek4eRriBkZrAjbp06aH0RMmR0e69TMUlVF84s/edit#gid=978022032
[12:44:17] ottomata: alluxio does multi-tier caching, ordering tiers by access-speed
[12:44:38] joal: in the beginning we thought we'd have cached all in memory, it is probably why we didn't focus on disks
[12:44:39] so we don't need hdfs size storage, but somethign would be good?
[12:45:12] ottomata: Having a lot of RAM is good, we should also have storage - If we want best, we would have RAM + SSDs + HDDS, but it would be ok to have RAM + HDDs
[12:45:39] HDFS size storage seems a lot, but having quite some storage would be useful
[12:45:50] and actually a lot - I don't know!
[12:46:07] 500Gb total not enough?
[12:46:10] ottomata: most of our data is immutable - If we can cache it on alluxio on disk, well that is!
[12:46:31] ottomata: more space for caching = more local data for alluxio
[12:46:36] huh
[12:46:45] there is also the ? about the alluxio master nodes, that are basically hdfs-namenode-like nodes
[12:46:46] so basically, hdfs nodes but with tons of ram
[12:47:15] joal: wasn't the initial plan to just fetch from HDFS if RAM was filled up?
[12:47:15] oh ya? we don't have anythin allocated for alluxio master type nodes
[12:47:19] ottomata: indeed - we have no clue how alluxio handles leader-node metadata
[12:48:01] elukey: hm, the nodes we have for alluxio have disks, so the plan was to use disks as well - no?
[12:48:13] hey all!
[12:48:19] hi mforns :)
[12:48:44] Hello mforns.
[12:48:48] hello!
[12:48:49] :]
[12:48:55] joal: I am asking, at the beginning we just wanted ram-caching IIRC, we didn't discuss about multi-tier storage.. I'd say that some is fine, but I wouldn't buy hadoop-like nodes
[12:49:06] joal: not in the hw plan we made a few months ago
[12:49:13] we were going to reallocate the existing presto nodes as hadoop workers
[12:49:21] and get 10 new presto nodes, more ram, less storage
[12:49:58] ottomata: it is my bad for the master nodes, I didn't have a clear view in mind at the time when we discussed it. We could ask to Rob if it is possible to order some standard nodes with a lot of ram, and maybe reduce a little the presto/alluxio workers (like 2 master + 8 beefy workers? depending on the cost)
[12:50:21] elukey: can we collcate at first?
[12:50:26] collocate*
[12:50:34] maybe on an-coords or just on a couple of presto nodes?
[12:50:50] also...we are doing both presto and alluxio collocated....right?
[12:50:55] elukey: I had a different understanding about tiered caching - but that's no big deal - If we can have some storage, I think it's better
[12:50:55] ottomata: we could yes, but I'd prefer the latter, the coords are already a little busy
[12:51:12] they'll be less busy in a quarter or two
[12:51:12] right ottomata - alluxio and presto workers collocated
[12:51:15] getting mysql off of them
[12:51:17] ok
[12:51:26] maybe we can think of a new node name for them then
[12:51:44] ok, let's pick a target total storage size
[12:51:48] so I can tell rob something
[12:51:56] the main point is how much heap those jvms will eat (the masters I mean), they are close to namenodes so we may need a lot of space
[12:52:11] aye
[12:52:12] ok
[12:52:25] elukey: but they only have to maintain references to accessed files, right/
[12:52:28] ?
[12:52:31] not everything in hdfs?
[12:52:33] elukey: we don't know how the metadata is structured for those - maybe it works in a different way and will require less memory?
[12:52:48] yes we don't know :)
[12:52:55] but better safe than sorry :D
[12:53:00] elukey: I get your point nonetheless :)
[12:53:40] how about I ask for ~2-4T of working SSDs storage (after RAID)?
[12:53:46] ottomata: no idea how they manage metadata (as Joseph pointed out), maybe we could do a little research or contact upstream
[12:54:21] if they are lightweight coords or co-location may be enough (so no need for extra special nodes)
[12:54:24] or even VMs
[12:54:30] (but the last one is not great)
[12:55:13] just asking about worker storage for now
[12:55:50] ottomata: I have no clue about storage price etc - Is it better to ask for SSDs or more HDDs?
[12:57:01] also no idea, i think it doesn't matter at this point, but we should just give them something
[12:57:27] joal and elukey can you make a recommendation on https://phabricator.wikimedia.org/T286591 ?
[12:57:47] not visible for me ottomata
[12:59:32] asked rob to make it so
[13:00:20] ottomata: I think that we should figure out how to add Joseph to https://phabricator.wikimedia.org/project/profile/1586/
[13:00:29] so he'll be able to see procurement tasks from now on
[13:01:25] perfect Rob needs to approve, we'll see
[13:03:05] ottomata: is it ok if we do it say on Monday?
[13:03:18] (so Joseph gets access, time for reviews/discussion/etc..)
[13:04:56] i think there's tiem
[13:05:09] perfect
[13:05:12] i think there was a mistake anyway; we weren't planning on working on these nodes wil Q3
[13:05:14] told rob as much
[13:06:17] how do we organize on this? Do we setup a meeting?
[13:06:53] (CR) Ottomata: [C: +2] Rematerialize fragment schemas with generated examples. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/702700 (https://phabricator.wikimedia.org/T270134) (owner: Ottomata)
[13:07:13] (PS2) Ottomata: Rematerialize fragment schemas with generated examples. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/702700 (https://phabricator.wikimedia.org/T270134)
[13:07:42] (CR) jerkins-bot: [V: -1] Rematerialize fragment schemas with generated examples. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/702700 (https://phabricator.wikimedia.org/T270134) (owner: Ottomata)
[13:07:58] ottomata, elukey - how do we organize? meeting?
[13:08:17] joal?
[13:08:42] joal: we can discuss the specs on monday after you get access to the task
[13:08:46] does it sound good?
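For context on the tiered-cache discussion above: Alluxio workers can be configured with several storage tiers ordered by access speed (RAM, then SSD, then HDD), which is why the thread weighs RAM-only caching against also provisioning local disks. Below is only a rough illustration of what such a worker tiering section could look like in alluxio-site.properties; the paths and quotas are invented placeholders and the property names are quoted from memory of the Alluxio documentation, so treat the whole block as a sketch rather than a proposed configuration.

```properties
# Hypothetical alluxio-site.properties worker tiering sketch (illustrative only).
# Tier 0 = RAM, tier 1 = SSD, tier 2 = HDD, ordered by access speed.
alluxio.worker.tieredstore.levels=3

alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=256GB

alluxio.worker.tieredstore.level1.alias=SSD
alluxio.worker.tieredstore.level1.dirs.path=/srv/alluxio/ssd
alluxio.worker.tieredstore.level1.dirs.quota=2TB

alluxio.worker.tieredstore.level2.alias=HDD
alluxio.worker.tieredstore.level2.dirs.path=/srv/alluxio/hdd
alluxio.worker.tieredstore.level2.dirs.quota=8TB
```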
[13:08:56] that works for me :)
[13:09:03] thanks elukey
[13:14:01] (PS3) Ottomata: Rematerialize fragment schemas with generated examples. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/702700 (https://phabricator.wikimedia.org/T270134)
[13:14:47] (CR) Ottomata: "Hm, yeah there are some in flight changes I think that need to be merged first. On it..." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/689152 (owner: Martaannaj)
[13:15:23] (CR) Ottomata: [C: +2] Rematerialize fragment schemas with generated examples. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/702700 (https://phabricator.wikimedia.org/T270134) (owner: Ottomata)
[13:15:59] (Merged) jenkins-bot: Rematerialize fragment schemas with generated examples. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/702700 (https://phabricator.wikimedia.org/T270134) (owner: Ottomata)
[13:19:46] ottomata: refine job is stuck, failing on NPE errors with geocoding
[13:21:41] (PS3) Ottomata: Use latest version of jsonschema-tools and run tests on analytics/legacy schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/702736 (https://phabricator.wikimedia.org/T285975)
[13:21:47] oof k joal will look in a sec
[13:22:33] (PS4) Ottomata: Use latest version of jsonschema-tools and run tests on analytics/legacy schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/702736 (https://phabricator.wikimedia.org/T285975)
[13:22:38] ottomata: To my understanding this is due to the UAParser using a static object for underneath parsing, and reinstantiating it at creation
[13:22:56] joal geocoding or uaparsing?
[13:23:08] oops - us-parsing my bad
[13:23:11] aye k
[13:23:16] (CR) Ottomata: [C: +2] Use latest version of jsonschema-tools and run tests on analytics/legacy schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/702736 (https://phabricator.wikimedia.org/T285975) (owner: Ottomata)
[13:23:54] (Merged) jenkins-bot: Use latest version of jsonschema-tools and run tests on analytics/legacy schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/702736 (https://phabricator.wikimedia.org/T285975) (owner: Ottomata)
[13:25:24] (PS13) Ottomata: Create wd_propertysuggester/client_side_property_request and wd_propertysuggester/server_side_property_request [schemas/event/secondary] - https://gerrit.wikimedia.org/r/689152 (owner: Martaannaj)
[13:25:56] joal i don't have any notices about refine failling
[13:26:22] (PS1) Joal: Fix ua-parser initialization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/704785
[13:26:40] ottomata: interestingly it doesn't fail - I manually killed a job earlier that had been running for 10h
[13:26:48] (CR) Ottomata: [C: +1] Fix ua-parser initialization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/704785 (owner: Joal)
[13:26:55] huh
[13:26:56] weird
[13:27:03] And the one I restarted is also stuck, running on a task for more than 1h
[13:27:14] and logs are full of that NPE error
[13:27:32] ottomata: shall we test a run of refine with my patch?
[13:27:42] ottomata: I'm in meeting in 3 minutes :S
[13:28:51] joal sure lets merge that and mine :)
[13:28:56] i can then build and test
[13:29:17] (CR) Ottomata: [V: +2 C: +2] Fix ua-parser initialization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/704785 (owner: Joal)
[13:29:34] (CR) Ottomata: [C: +2] Refine - explicitly uncache DataFrame when done [analytics/refinery/source] - https://gerrit.wikimedia.org/r/704576 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[13:32:16] (PS1) Ottomata: Update changelog [analytics/refinery/source] - https://gerrit.wikimedia.org/r/704787
[13:32:20] ottomata: killing the failing current run
[13:34:40] Analytics, Analytics-Kanban, Platform Engineering, Research, User-razzi: Create airflow instances for Platform Engineering and Research - https://phabricator.wikimedia.org/T284225 (Ottomata) Hey yall, quick update: We haven't been working on this as the Gobblin migration has been taking a lo...
[13:34:52] joal which refine job?
[13:35:21] !log Kill currently running refine job (application_1623774792907_154014)
[13:35:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:35:28] refine_event?
[13:35:31] el legacy?
[13:35:34] el analytics?
[13:35:40] refine_event
[13:35:43] ottomata: --^
[13:35:43] k
[13:37:06] ok joal running
[13:37:06] application_1623774792907_154475
[13:37:17] wow - deployed etc already?
[13:37:25] manual build ottomata ?
[13:37:27] yup
[13:37:29] ack
[13:37:32] thanks for doing that
[13:37:36] /home/otto/refinery-source/refinery-job/target/refinery-job-0.1.15-SNAPSHOT.jar \
[13:37:45] on an-launcher1002
[13:38:23] ottomata: puppet relaunches refine_event, we have 2 isntances running
[13:38:31] can you disable timer please?
[13:38:34] ottomata: --^
[13:38:44] 2 now?
[13:38:48] i'm running with the same job name
[13:38:49] Yes - application_1623774792907_154469
[13:38:52] would have expdcted it not to
[13:39:10] It started just before
[13:39:15] Killing it now
[13:39:21] oh wow so fast
[13:39:22] thx
[13:39:23] bad timing
[13:39:34] !lof Kill refine_event application_1623774792907_154469 to let manual run finish
[13:39:40] !log Kill refine_event application_1623774792907_154469 to let manual run finish
[13:39:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:41:38] PROBLEM - Check unit status of refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:58:42] (CR) Joal: [V: +2 C: +2] "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/703454 (https://phabricator.wikimedia.org/T286241) (owner: Gerrit maintenance bot)
[14:00:45] (CR) Joal: [V: +2 C: +2] "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/702877 (owner: Joal)
[14:05:50] ottomata: not fixed :(
[14:07:06] hmm joal yeah its still going
[14:07:15] ottomata: it contains the same error
[14:07:21] joal how can you tell?
[14:07:23] the job is sitll running?
[14:07:24] ottomata: I'm looking at logs on the fly
[14:07:26] you looking on workers?
[14:07:26] aye
[14:07:32] is it for some special UA maybe?
[14:08:38] ottomata: for multiple UAs
[14:09:56] joal why is this happening now?
[14:10:06] ottomata: I have no idea
[14:10:26] ottomata: I imagine it is related to our change (multi-cores executors), but I'm not sure
[14:10:44] right
[14:10:48] hm
[14:12:39] Analytics, Analytics-EventLogging, MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), User-Zabe, Wikimedia-production-error: TypeError: Return value of JsonSchemaHooks::onEditFilterMergedContent() must be of the type boolean, none returned - https://phabricator.wikimedia.org/T286611 (Zabe) Open→...
[14:18:08] It's weird ottomata - I don't get it :(
[14:19:18] ottomata: ok I get it
[14:19:22] hahah
[14:19:23] wow so fast!
[14:19:27] Note that LRUMap is not synchronized and is not thread-safe. If you wish to use this map from multiple threads concurrently, you must use appropriate synchronization. The simplest approach is to wrap this map using Collections.synchronizedMap(Map). This class may throw NullPointerException's when accessed by concurrent threads.
[14:19:37] this explains that
[14:19:44] I'm gonna add a patch
[14:19:57] hm - can I?
[14:20:36] ottomata: also, this will mean waiting if I synchronize
[14:20:39] We can try
[14:21:08] joal try it for sure!
[14:21:15] what do you mean waiting?
[14:23:16] RECOVERY - Check unit status of refine_event on an-launcher1002 is OK: OK: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:23:45] (PS1) Joal: Fix ua-parser race-condition by synchronizing its usage [analytics/refinery/source] - https://gerrit.wikimedia.org/r/704794
[14:23:50] ottomata: --^
[14:25:24] (PS2) Joal: Fix ua-parser race-condition by synchronizing its usage [analytics/refinery/source] - https://gerrit.wikimedia.org/r/704794
[14:25:31] ottomata: with some comments--^
[14:26:57] ok joal, building, will relaunch job
[14:27:13] ack ottomata - waiting for our new launch to kill the current one
[14:27:26] joal i'll kill the current one right before i launch
[14:27:37] as you wish, I can also do it :)
[14:32:38] ok killin gand staring
[14:33:45] joal i just noticed in yarn ui that someone else is using Apache Toree!
[14:33:53] yesir :)
[14:34:09] ottomata: I try to push people in using scala :)
[14:34:15] maybe someone other than just us will appreciate spark 3 + scala 2.12 + almond
[14:34:20] (and fab)
[14:34:54] application_1623774792907_154630
[14:35:45] (CR) Ottomata: "https://phabricator.wikimedia.org/T286655" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/689152 (owner: Martaannaj)
[14:38:44] Analytics, Analytics-Wikistats, Voice & Tone, good first task: Wikistats Bug - easy to understand language for pageviews - https://phabricator.wikimedia.org/T263973 (Samtar) a:Samtar
[14:41:56] joal: I added some queue-related metrics to https://grafana-rw.wikimedia.org/d/000000585/hadoop?orgId=1, really interesting
[14:42:07] (all averages for namenode)
[14:42:29] the avg queue time (mentioned in RPC https://engineering.linkedin.com/blog/2021/the-exabyte-club--linkedin-s-journey-of-scaling-the-hadoop-distr) is really tiny
[14:42:37] I thought it was a scale error but those are ms
[14:49:42] Analytics, Analytics-Wikistats, User-Samtar, Voice & Tone, good first task: Wikistats Bug - easy to understand language for pageviews - https://phabricator.wikimedia.org/T263973 (Samtar) Open→Stalled It would be great if we could agree on the wording to use - I've played around with a...
[14:52:47] joal job still running...
[14:52:56] maybe its just got a lot to do?
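The LRUMap javadoc quoted at 14:19:27 is the root cause: the shared user-agent parser keeps its results in an LRUMap, which is not thread-safe, so once several Refine tasks run as threads in the same executor JVM they can corrupt the cache and throw NullPointerExceptions. The actual Gerrit change (704794) is not reproduced here; the following is only a minimal sketch of the general "synchronize its usage" idea, with the object name, cache size, and the commons-collections4 generic LRUMap chosen for illustration rather than taken from refinery-source.

```scala
import org.apache.commons.collections4.map.LRUMap
import ua_parser.{Client, Parser}

// Minimal sketch, not the actual refinery-source patch: one parser and one
// bounded LRU cache are shared by every task in the executor JVM. Because
// LRUMap is not thread-safe, each lookup/insert is wrapped in a synchronized
// block, trading some contention (the "waiting" mentioned above) for
// correctness under multi-core executors.
object SynchronizedUAParser {
  private val parser = new Parser()
  private val cache = new LRUMap[String, Client](10000)

  def parse(userAgent: String): Client = cache.synchronized {
    var client = cache.get(userAgent)
    if (client == null) {
      client = parser.parse(userAgent)
      cache.put(userAgent, client)
    }
    client
  }
}
```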
[14:55:07] checking ottomata
[14:55:42] ottomata: it's making progress
[14:56:09] ottomata: there was indeed a lot to do - the first job I killed was stuck from today hour 2am
[14:56:13] aye
[14:59:02] ottomata: I have tried gobblin with graphite metrics - Let's talk about that poststandup
[15:02:06] hey a-team I just spoke with Olja and we are going to skip stand-up today because of staff meeting, see you in grooming!
[15:02:23] oh yeah, this was the important meeting right?
[15:02:29] yes
[15:02:32] and long
[15:02:33] :]
[15:14:31] ottomata: job finished successfully!
[15:14:41] ottomata: there might folders with failure-flag?
[16:15:26] joal from previous runs?
[16:15:36] i actually ruan that job with --ignoire-failure-flag
[16:15:52] Perfect :) thanks a lot for that
[16:16:00] checking output
[16:16:27] joal it looks lke only mediawiki_api_request was lagging
[16:16:27] 21/07/15 15:14:00 INFO Refine: Successfully refined 21 of 21 dataset partitions into table `event`.`mediawiki_api_request` (total # refined records: 155015338)
[16:16:35] the rest only did one or two datasets
[16:16:39] that one is large and lots of ua parsing
[16:16:42] so makes sense
[16:16:48] but, no failures!
[16:16:49] ottomata: other ones have been worked successfully from my rerun
[16:16:49] great!
[16:16:52] lets merge and release
[16:16:53] right
[16:16:56] OH
[16:16:57] \o/
[16:16:59] ?
[16:16:59] you mean they were stuck before?
[16:17:05] yes!
[16:17:24] the job that started at 2am lasted 11h --> no refine in between!
[16:17:31] (CR) Ottomata: [C: +2] Fix ua-parser race-condition by synchronizing its usage [analytics/refinery/source] - https://gerrit.wikimedia.org/r/704794 (owner: Joal)
[16:17:55] then I ran my manual try, al small ones were done - Then again your manual run
[16:18:23] (PS2) Ottomata: Update changelog [analytics/refinery/source] - https://gerrit.wikimedia.org/r/704787
[16:18:34] (CR) Ottomata: [V: +2 C: +2] Update changelog [analytics/refinery/source] - https://gerrit.wikimedia.org/r/704787 (owner: Ottomata)
[16:18:41] So stuff have been backfilled, except for the failing one
[16:19:08] Starting build #92 for job analytics-refinery-maven-release-docker
[16:21:32] Analytics, Analytics-EventLogging, MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), User-Zabe, Wikimedia-production-error: TypeError: Return value of JsonSchemaHooks::onEditFilterMergedContent() must be of the type boolean, none returned - https://phabricator.wikimedia.org/T286611 (matmarex) I'll b...
[16:24:38] Thank you ottomata for the deploy <3
[16:24:49] thanks for the patches!
[16:31:53] Project analytics-refinery-maven-release-docker build #92: SUCCESS in 12 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/92/
[16:36:22] Starting build #50 for job analytics-refinery-update-jars-docker
[16:36:49] (PS1) Maven-release-user: Add refinery-source jars for v0.1.15 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/704839
[16:36:49] Project analytics-refinery-update-jars-docker build #50: SUCCESS in 27 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/50/
[16:39:29] (CR) Ottomata: [V: +2 C: +2] Add refinery-source jars for v0.1.15 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/704839 (owner: Maven-release-user)
[16:44:50] !log deploying refinery and refinery-source 0.1.15 for refine job fixes - T271232
[16:44:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:44:53] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232
[16:46:23] Analytics, Analytics-Kanban, Patch-For-Review: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 (Ottomata)
[17:00:10] Analytics, Analytics-EventLogging, MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), Patch-For-Review, and 2 others: TypeError: Return value of JsonSchemaHooks::onEditFilterMergedContent() must be of the type boolean, none returned - https://phabricator.wikimedia.org/T286611 (matmarex) THanks for prepar...
[17:27:48] ottomata: do we spend a minute on metrics, or tomorrow?
[17:28:02] 3 mins joal ? :)
[17:28:21] in bc
[17:28:26] sure ottomata joining
[17:29:29] we could build Apache Centipede: a curated collection of Apache tools glued together into a turn-key open data platform
[17:45:57] I'm signing off for now folks. Catch you all later or tomorrow.
[17:48:45] joal: anything to hand off for ops week?
[18:06:39] heya mforns - ongoing gobblin affecting refine, sqoop patch merged and to be deployed (etherpad updated) - that's all I think of )
[18:07:22] joal: do we want to deploy sqoop changes today, or can wait until tuesday?
[18:07:35] regular deploy is super fine mforns :)
[18:07:51] and joal is anything to be done with gobblin?
[18:08:25] meaning anything that you started that I can continue or pair?
[18:08:58] mforns: the next thing to be done for gobblin is adding metrics to Prometheus
[18:09:12] ok
[18:09:29] and regarding the ongoing refine issues?
[18:10:09] mforns: I have tested the metrics system with Graphite, and the bulk of the task will be to devise a prometheus data sender, as well as defining how we wish to "define" metrics (define as in name, organize etc)
[18:10:27] About refine I think we have every known issue under control
[18:10:31] Here is what happened:
[18:11:11] The move to gobblin lead to some issues in refine, we assume due to files being gzipped (more work on single wokers)
[18:11:58] We changed refine spark execution settings, adding multi-cores executors and more ram, and this lead to a race-condition in ua-parser
[18:12:40] this has been solved today, Andrew has deployed patches, and so far so good (ou know everything I think mforns)
[18:13:19] ok, thanks a lot joal!
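To illustrate the "multi-cores executors" point in the hand-off above: when a Spark executor is given more than one core, that many tasks run concurrently as threads inside the same executor JVM, which is how a JVM-wide singleton like a shared user-agent parser cache suddenly gets hit from multiple threads. The values below are invented examples, not the real refine_event settings; only a sketch of the kind of configuration change being described.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only - not the production refine_event configuration.
// With spark.executor.cores > 1, several tasks run as threads in one executor
// JVM, which is what exposed the ua-parser race condition described above.
val spark = SparkSession.builder()
  .appName("refine_event_sketch")
  .config("spark.executor.cores", "4")    // concurrent tasks per executor JVM
  .config("spark.executor.memory", "8g")  // "more ram" per executor
  .getOrCreate()
```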
[18:13:31] np mforns - thank you for taking over :)
[20:12:28] Analytics, Analytics-Kanban, Event-Platform, Services, and 2 others: EventGate should use recent service-runner (^2.8.1) with Prometheus support - https://phabricator.wikimedia.org/T272714 (Ottomata) ^ Done and deployed to eventgate-analytics staging. Looks good. Will deploy to production the w...
[20:19:42] Analytics: Refinery python code should use anaconda-wmf - https://phabricator.wikimedia.org/T286743 (Ottomata)
[20:20:12] Analytics-Clusters, Analytics-Kanban: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (razzi)