[07:19:28] 10Analytics-Radar, 10WMDE-Templates-FocusArea, 10WMDE-TechWish-Sprint-2021-07-07: Backfill metrics for TemplateWizard and VisualEditor - https://phabricator.wikimedia.org/T274988 (10WMDE-Fisch) This might be done or not relevant anymore. Jobs seem to run for current data and I'm not sure if back fill is need... [10:30:46] PROBLEM - Check unit status of check_webrequest_partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:58:56] (03PS1) 10Joal: Fix refinery-dump_status-webrequest-partition after gobblin [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703730 (https://phabricator.wikimedia.org/T271232) [12:34:51] hello joal! :) [12:37:34] (03CR) 10Ottomata: Fix refinery-dump_status-webrequest-partition after gobblin (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703730 (https://phabricator.wikimedia.org/T271232) (owner: 10Joal) [12:53:45] Hi ottomata :) [12:54:27] yoohooo [12:54:41] so, netflow today? [12:56:10] yes! [12:56:17] rdy whenever you are! [12:56:34] let's do it - I'll update the patch for the checker later [12:56:52] I assume it's only about moving data and merging your puppet patch, right? [12:56:59] yup i think so [12:57:04] ok :) [12:57:18] so should I stop jobs? [12:57:21] Let's stop camus ang gobblin for netflow then [12:57:29] k [12:57:47] done [12:58:14] moving data [12:59:05] making patch for gobblin job new location [12:59:47] (03PS1) 10Ottomata: Make gobblin-netflow use production directory [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703738 (https://phabricator.wikimedia.org/T271232) [13:00:00] joal gonna merge and deploy ^ so its ready [13:00:13] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Make gobblin-netflow use production directory [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703738 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [13:00:14] Please go :) [13:00:33] ottomata: data ready! [13:00:47] when deploy is done we can restart gobblin and absent camus :) [13:01:46] deploying to launvher [13:05:01] joal added other bits to same refine puppet patch [13:05:02] https://gerrit.wikimedia.org/r/c/operations/puppet/+/703623 [13:05:08] also, deploy to launcher finished [13:05:19] shall I merge ^^ ? [13:06:07] please ottomata :) [13:07:46] oh i need to change data purge format [13:08:01] joal, puppet applied, camus should be absented, gobblni running, refine config changed too [13:08:18] i guess...i should be able to launch a netflow refine, right? [13:08:19] ottomata: ok let;s monitor [13:08:25] and it should find the new data you just moved in? [13:08:42] ottomata: actually it should not do anything: there is no 'new' data yet :) [13:08:51] right, but the directories changed, right? [13:08:54] so it shoudl pick up new mtimes? [13:08:59] true [13:09:02] we can try [13:09:09] gonna run it manually now just to see what it woudl do without waiting :) [13:09:15] ok [13:09:28] ok running [13:10:57] hm 21/07/08 13:09:42 INFO Refine: No targets needing refinement were found in /wmf/data/raw/netflow. [13:10:59] ok... [13:11:06] doesn't sound right [13:11:21] ottomata: timestamps before the camus ones? [13:11:31] ? [13:11:33] hm [13:11:34] oh [13:11:34] hm [13:12:13] ottomata: there also ia a weird thing - there is no _IMPORTED flag in the folders [13:12:16] this is unexpected [13:12:32] oh in the gobblin imported folders? 
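(For reference on the missing per-hour _IMPORTED done-flags being discussed here, one quick way to spot-check an imported hour; the exact directory layout under /wmf/data/raw/netflow is assumed for illustration:)

    # list an imported hour and test whether the gobblin publisher wrote its done-flag
    hdfs dfs -ls /wmf/data/raw/netflow/netflow/year=2021/month=7/day=8
    hdfs dfs -test -e /wmf/data/raw/netflow/netflow/year=2021/month=7/day=8/hour=9/_IMPORTED \
      && echo '_IMPORTED present' || echo '_IMPORTED missing'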
[13:12:37] yeah [13:12:58] that's done by the timestamp checker? [13:13:03] in gobblin when the job is finishing? [13:13:27] ottomata: it's done by gobblin itself (publisher step) [13:13:35] Will cdheck a gobblin run [13:14:53] indeed joal _REFINED timestamps greater than gobblin import time [13:15:01] i guess that makes sense [13:15:05] ok [13:16:16] ottomata: I'm gonna do some triple check on this data -something is weird [13:16:26] opk [13:25:54] ottomata: data looks ok - The fact that there is no _IMPORTED flag in not ok though :( [13:26:20] Oh! I think I get it [13:26:22] ya why would that be? some bug in the timestamop calc? [13:26:29] Interesting! [13:26:34] okkk. :) [13:27:07] https://gerrit.wikimedia.org/r/c/operations/puppet/+/703742/ [13:28:16] hm [13:29:22] wassup? [13:29:40] trying to make sure my understanding is correct [13:32:02] ottomata: batcave for a minute? [13:32:07] sure [13:45:35] o/ [13:54:01] o/ [13:54:28] 10Analytics: Fix gobblin not writing _IMPORTED flags when runs don't overlap hours - https://phabricator.wikimedia.org/T286343 (10JAllemandou) [13:54:54] ottomata: --^ [13:54:59] hi milimetric [13:55:33] aye [13:56:59] makes sense. I was trying to figure out if the data loss warnings and the webrequest partition check problems were related joal [13:56:59] (03PS2) 10Joal: Fix refinery-dump_status-webrequest-partition after gobblin [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703730 (https://phabricator.wikimedia.org/T271232) [13:57:07] milimetric: --^ [13:57:12] :) [13:57:54] milimetric: the probability of the problem happening for webrequest (or any relatively high volume stream not timed at the minute)is really slow [14:00:40] right, I conflated two things. I should've said: [14:01:01] it seemed like the partition check would be affected by gobblin, and it looks like you have a fix for that [14:01:24] and second, it looks like there's some minor data loss, is that related to you all migrating from camus to gobblin? [14:01:46] Ah - I had not seen that one milimetric - It shouldn't :( [14:01:49] (as in, did you do that for webrequest at about that time, 06 UTC today? No... wait, nobody was awake then? [14:02:03] ok, I'll do the false positive thingy on it, brb with more info [14:02:10] Thanks a lot milimetric [14:05:48] joal https://gerrit.wikimedia.org/r/c/operations/puppet/+/703750/2/modules/profile/manifests/analytics/refinery/job/gobblin.pp [14:07:00] Perfect ottomata - Thank you! [14:21:05] (no loss when running the false positive checker) [14:22:11] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Product-Data-Infrastructure: Remove StreamConfig::INTERNAL_SETTINGS logic from EventStreamConfig and do it in EventLogging client instead - https://phabricator.wikimedia.org/T286344 (10Ottomata) [14:22:50] joal lets discuss https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/668124 [14:23:40] hmm actually joal we have a few other camus jobs we could migrate [14:23:44] eventlogging-client-side [14:23:53] atskafka_test_webrequest_text [14:24:03] oh maybe we can remvoe atskafka [14:24:08] # TODO(klausman): Remove this once we are confident that ATSKafka and [14:24:08] # VarnishKafka report the same event streams (cf. 
T254317) [14:24:09] T254317: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 [14:24:14] I thought ats_kafka was not used [14:24:19] yeah [14:24:28] We can also keep it if needed [14:24:47] lets not migrate now then [14:24:54] joal....did you just manually launche gobblin-netflow [14:24:55] ? [14:25:06] OH [14:25:07] no [14:25:07] huh [14:25:09] i think puppet did [14:25:10] weird [14:25:10] eventlogging-client-side is independent on event-config, right? It's legacy events that will be migrted [14:25:18] I didn't do anything ottomata [14:25:21] yeah joal eventlogging-client-side is not even refined [14:25:27] its just imported raw for backfilling purposes just in case [14:25:33] ok [14:25:34] and will be gone after event platform migration is done [14:25:38] i guess we could do that last too [14:25:49] doable now if you wish - not complicated [14:26:13] hmm [14:26:16] lets focus on events [14:26:19] ok lets look at camus test [14:26:36] there we explicitly oly includelist of a few streams [14:27:18] joalyou wanted to discuss job names in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/668124 [14:27:19] ? [14:28:10] ottomata: I wondered if job-names should be more specific - event_general for instance seems very generic [14:28:36] aye, its the default one [14:28:38] other jobs would be like [14:28:44] event_high_volume? [14:29:15] the other being eventlogging, would this one be eventstream ? [14:30:00] ? [14:30:17] i think gonna call the eventlogging one [14:30:22] eventlogging_legacy_test [14:30:25] sorry [14:30:27] eventlogging_legacy [14:30:34] right [14:31:11] ok - eventlogging_legacy and event then? [14:31:23] but event is not specific at all :( [14:31:58] well [14:32:04] i want event_something [14:32:09] because there may be more than one event job [14:32:20] because we may need to vary based on volume [14:32:24] approximate volume [14:32:30] right [14:32:40] event_general is what I came up with but I don't love it either [14:32:49] event_default? [14:32:52] hm [14:32:53] yeah maybe [14:34:23] joal does the consumer name 'analytics_hadoop_ingestion' work for you? [14:34:40] would calling it something 'gobblin' be better? [14:34:46] it does - I like that it's functionally defining without technology name [14:34:51] aye me too ok [14:35:04] No need to change when moving to flink :) [14:40:04] (03PS1) 10Ottomata: Add gobblin/jobs/eventlogging_legacy_test.pull [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703751 (https://phabricator.wikimedia.org/T271232) [14:40:19] joal ^ will that work? [14:42:19] reading ottomata [14:43:56] ottomata: it seems correct :) [14:44:04] ottomata: do you wish me to test quickly? [14:45:32] joa yeah not sure if we tested the stream_names thing? [14:45:34] does that work? [14:45:45] joal i gotta merge and deploy the stream config change too [14:45:47] ottomata: we have not tested no [14:45:53] Ah right ottomata [14:46:00] updaged [14:46:01] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/668124/7/wmf-config/InitialiseSettings.php [14:46:04] ok, let's try with the patches then :) [14:46:06] 'job_name' => 'event_default', [14:46:06] anbd [14:46:10] 'job_name' => 'eventlogging_legacy', [14:46:16] perfect - thank you for that [14:46:46] k gimme +1 and i will merge and deploy [14:47:21] (03CR) 10Joal: [C: 03+1] "I think it should work ottomata!" 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/703751 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [14:52:04] (03PS1) 10Andrew-WMDE: Add aggregations for template data usage in VisualEditor's template dialog [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/703753 (https://phabricator.wikimedia.org/T272589) [14:52:42] ok joal done [14:52:46] stream config has those settingis now [14:52:48] can you test? [14:58:10] joal: ? :) [14:58:18] sure ottomata - sorry was away for aminute [15:01:53] ottomata: test successfull :) [15:03:52] wow great [15:04:10] ok joal lets mrege deploy, you can do first run [15:04:15] i will make puppet gobblin jobin test [15:04:24] (03CR) 10Ottomata: [C: 03+2] Add gobblin/jobs/eventlogging_legacy_test.pull [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703751 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [15:04:26] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add gobblin/jobs/eventlogging_legacy_test.pull [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703751 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [15:04:42] so FYI joal that patch is assuming we are taking this moment to rename the import dir eventlogging_legacy [15:04:46] instead of just eventlogging [15:04:50] that shouldn't matter, right? [15:04:56] we have to change refine job anyway [15:05:19] ottomata: as long as we change the refine job and the deletion job, no problem for me :) [15:05:23] yup [15:05:23] k [15:06:18] ottomata: shall we merge the patch for webrequest-status to deploy it at the same time? [15:06:47] sure oh [15:06:49] i already started deploy [15:06:52] well, that is to test anyway [15:06:56] separate deploys [15:06:56] but ya [15:06:57] Ah true [15:07:25] joal i still don't undesrtand teh padded date thing [15:07:38] the former is used to extract percent loss? [15:08:18] new data format is used for success-file check, and old-date format to extract date values to run hive queries [15:09:19] oh wow [15:09:22] hacky bash stuff [15:09:23] i see [15:09:37] and we can't do it easily with the new date format as it is unpadded [15:10:02] beacuse it is extracting the stuff with hardcoded lengths [15:10:03] huh [15:10:07] i guess we could make that part betterr [15:10:09] but then yuck [15:10:20] at that point we'd just rewrite this entire thing [15:10:22] and yuck [15:10:50] ottomata: given how often the thing is useful lately, I wonder if we shouldn't actually bin it [15:10:57] oh? [15:11:09] oh right this is the CLI tool to do it? [15:11:19] to help debug seq stats? 
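(To illustrate the "hardcoded lengths" problem mentioned above: extracting date parts by fixed string offsets only works when every field has a fixed width, i.e. is zero-padded. A small sketch of an offset-independent extraction, purely illustrative and not the actual refinery-dump_status-webrequest-partition code:)

    # pull year/month/day/hour out of a partition path via a regex instead of fixed offsets,
    # so padded and unpadded directory names parse the same way
    path='/wmf/data/raw/webrequest/webrequest_text/year=2021/month=7/day=8/hour=9'
    if [[ "$path" =~ year=([0-9]{4})/month=([0-9]{1,2})/day=([0-9]{1,2})/hour=([0-9]{1,2}) ]]; then
      echo "year=${BASH_REMATCH[1]} month=${BASH_REMATCH[2]} day=${BASH_REMATCH[3]} hour=${BASH_REMATCH[4]}"
    fi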
[15:11:21] (03CR) 10Ottomata: [C: 03+2] "This script is really hard to read and work with in general; i see many ways it could be refactored and re-written, but now is not the tim" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703730 (https://phabricator.wikimedia.org/T271232) (owner: 10Joal) [15:11:23] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix refinery-dump_status-webrequest-partition after gobblin [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703730 (https://phabricator.wikimedia.org/T271232) (owner: 10Joal) [15:11:44] joal, other thing [15:11:44] This is the original checker we used to monitor webrequest [15:11:58] now we use oozie SLAs and errors [15:11:58] just remembered that we can't migrate all eventlogigng legacy stuuff until ALL eventlogging streams are migrated to eventplatform [15:12:00] it'll work fine in test [15:12:06] because it only imports navtiming [15:12:13] so we can proceed with this one [15:12:24] aye [15:12:33] joal: i merged this, let's keep it for now but consider dropping later? [15:12:39] sure ottomata [15:13:10] So about events: we can deploy for test, but we'll need a job using regexes-topic for eventlogging - is that right? [15:13:30] yes, lets put that off til the end though and decide [15:13:49] https://gerrit.wikimedia.org/r/c/operations/puppet/+/703742 [15:13:52] oops [15:13:54] wrong paste [15:14:02] https://gerrit.wikimedia.org/r/c/operations/puppet/+/703755 [15:14:05] add job ^ [15:16:41] joal: ok job declared, i guess it doesn't really matter if you run it manually for first run, right? [15:17:11] ottomata: I need to create the folder IIRC - test cluster has weird perms [15:17:25] oh ok [15:17:49] joal i'm going to make other test jobs too, no reason to not start importing via gobblin there now [15:18:23] ottomata: low volume only eh :) [15:18:35] ottomata: and very regular deletion (cluster is small) [15:19:08] ottomata: all set in test (folder perms changed [15:19:20] ottomata: then it'll be about changing the refine test job [15:19:28] well, and moving data [15:19:41] ok joal ok if i start the gobblin job for eventlogging legacy test? [15:19:42] ottomata: will we? [15:19:43] via systemd? [15:19:47] please ottomata [15:19:58] joal: ? OH i still named this thing _gobblin. [15:19:59] i see [15:20:03] beacuse we use a new name [15:20:04] we don't have to do that [15:20:07] right. [15:20:10] right? [15:20:13] ottomata: indeed!P [15:20:15] ohhh [15:20:21] ottomata: I didn't notice the gobblin sorry :S [15:20:22] ok re-deloying then without _gobblin [15:20:39] /o\ [15:20:55] joal remiind me how writer.partition.timestamp.columns=meta.dt [15:20:55] works? [15:21:09] i guess with [15:21:09] org.wikimedia.gobblin.kafka.Kafka1TimestampedRecordSource [15:21:17] kafka timestamp takes precedence anyway? 
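(As context for the property being discussed, a minimal sketch of how it might appear in a gobblin .pull file; only writer.partition.timestamp.columns=meta.dt and the Kafka1TimestampedRecordSource class name are taken from this conversation, the rest of a real job file is omitted, and the multi-column variant shown in the comment is hypothetical:)

    # event-time field used to build the hourly output partition when writing;
    # per the discussion, the Kafka-provided timestamp (carried by
    # org.wikimedia.gobblin.kafka.Kafka1TimestampedRecordSource) takes precedence when present
    writer.partition.timestamp.columns=meta.dt
    # several fallback columns can reportedly be listed, e.g. (hypothetical):
    # writer.partition.timestamp.columns=meta.dt,dt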
[15:21:20] yessir [15:21:34] you can also define multiple columns [15:22:38] ok, since we'll only be importing migrated data, i will define just meta.dt [15:22:48] (03PS1) 10Ottomata: gobblin eventlogging_legacy_test - use final import path [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703757 (https://phabricator.wikimedia.org/T271232) [15:23:14] (03CR) 10Joal: [C: 03+1] gobblin eventlogging_legacy_test - use final import path [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703757 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [15:23:16] (03CR) 10Ottomata: [V: 03+2 C: 03+2] gobblin eventlogging_legacy_test - use final import path [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703757 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [15:24:33] joal hm [15:24:35] the camus job has [15:24:38] 'mapreduce.job.queuename' => 'essential', [15:24:55] do we need that for gobblin? [15:25:37] Arf I had forgotten about that one :( [15:25:56] can we just set that in all the relvant .pull files or is it a different setting? [15:26:04] it currently runs in production [15:26:19] I have not tested ottomata - testing now [15:26:22] it does joal? [15:26:30] i don't see that configure anyway [15:26:42] ottomata: run by analytics, so run in prod by default :) [15:26:47] oh really? [15:26:51] huh cool [15:27:00] must be a new capacity sched thing? [15:27:34] correct ottomata [15:29:33] ottomata: the setting is correct for gobblin to run in the queue [15:29:45] We can add it to the pull files [15:29:52] doing it now [15:30:48] (03PS1) 10Ottomata: Add gobblin event_default_test job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703759 (https://phabricator.wikimedia.org/T271232) [15:30:50] ok gr8 [15:32:26] (03PS2) 10Ottomata: Add gobblin event_default_test job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703759 (https://phabricator.wikimedia.org/T271232) [15:32:28] (03PS1) 10Joal: Make gobblin jobs run in essential queue [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703760 (https://phabricator.wikimedia.org/T271232) [15:32:33] ottomata: --^ [15:33:03] (03CR) 10Joal: [C: 03+1] "LGTM :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703759 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [15:33:16] (03CR) 10Ottomata: [C: 03+2] Make gobblin jobs run in essential queue [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703760 (https://phabricator.wikimedia.org/T271232) (owner: 10Joal) [15:33:18] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Make gobblin jobs run in essential queue [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703760 (https://phabricator.wikimedia.org/T271232) (owner: 10Joal) [15:33:39] (03PS3) 10Ottomata: Add gobblin event_default_test job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703759 (https://phabricator.wikimedia.org/T271232) [15:34:05] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add gobblin event_default_test job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703759 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [15:35:17] https://gerrit.wikimedia.org/r/c/operations/puppet/+/703761 [15:37:10] oh joal i guess we could put that setting in one of hte common files? [15:37:19] or, maybe per job explicit is better? [15:37:35] ottomata: you're so rigth! 
I should have thought about it [15:37:49] might as well, i guess analytics-common.properties [15:37:59] go ahead make patch and we'll just get it out with next whatever deploy [15:38:08] joal ok, that stuff is on test cluster [15:38:08] ottomata: gobblin should run in essential by default - ok doing [15:38:12] ok if i run first gobblin eventlogging_legacy ? [15:38:16] via systemd? [15:38:31] eventlogging_legacy_test job [15:38:31] ? [15:38:56] sure! [15:40:01] hm joial didn't you say you had to create some dirs? [15:40:06] i don't see the raw/eventlogging_legacy dir [15:40:12] i guess you meant gobblin state dirs? [15:40:43] no no - I updated /wmf/data/raw perms so that analytics could write into in - it was incorrectly set [15:40:47] so all good [15:40:51] ah k [15:40:59] hmm first run did not create dir [15:41:03] i guess thats expecgted? [15:41:06] normal - no data :) [15:41:10] ok [15:41:12] running again [15:41:16] next one will have data [15:41:31] first run just create a non-empty state-store for the job [15:42:07] aye makes sense [15:42:18] pfct [15:42:19] it worked [15:42:35] joal https://gerrit.wikimedia.org/r/c/operations/puppet/+/703761 should be good to go too [15:42:38] (03PS1) 10Joal: Put gobblin essential-queue property in common file [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703762 (https://phabricator.wikimedia.org/T271232) [15:42:44] that will start us with the event_default_test job [15:42:52] and we can have some data coming in in prep for data move [15:43:09] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Put gobblin essential-queue property in common file [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703762 (https://phabricator.wikimedia.org/T271232) (owner: 10Joal) [15:43:20] super ottomata :) [15:43:39] ok will get that running tthen will make patches for el legacy refine and data purge [15:43:41] ottomata: I assume next thing is to migrate the refine job for NavTiming? [15:43:48] awesome [15:44:02] I had forgotten data-purge [15:44:42] yeah, it'd be more obvious if the purge jobs were declared (maybe automatically!) with the import jobs [15:44:45] but...airflow. [15:44:45] :) [15:45:02] airflow first, then Atlas maybe? [15:45:11] hheh [15:45:13] ya [15:48:53] hm! there iis no data_purge job for raw eventlogging in test cluster :o [15:49:00] making one :p [15:49:12] WUT? [15:51:52] we have NavTiming there back utnil 2020/11/08 [15:51:55] in raw [15:52:02] meh:( [16:02:06] joal https://gerrit.wikimedia.org/r/c/operations/puppet/+/703766 [16:05:02] MEH - there must have been a problem with a gobblin run - will investigate - a webrequest hour was missing _IMPORTED flag [16:05:37] !log Manually add /wmf/data/raw/webrequest/webrequest_text/year=2021/month=7/day=8/hour=9/_IMPORTED [16:05:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:07:17] ^ this probably explained the SLA alerts downstream [16:07:51] correct milimetric - I had missed those (google account issue), I checked just now, and there was a job stuck (timedout) [16:08:18] no worries joal, I am checking everything and would've pinged you [16:15:01] Wow - I know what happens [16:15:12] Unexpected mess due to unpadded dates [16:18:31] Ok, I reran the culprit (hour 10) [16:20:02] ottomata: I should have tripple thought the change of date- format :( [16:20:07] joal whats the issue? 
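(Regarding the !log entry above about manually adding the missing _IMPORTED flag for webrequest hour 9: the exact command used is not shown in this log, but a sketch of how such a done-flag can be created is:)

    # create an empty _IMPORTED done-flag for the hour whose gobblin run missed it
    sudo -u analytics hdfs dfs -touchz \
      /wmf/data/raw/webrequest/webrequest_text/year=2021/month=7/day=8/hour=9/_IMPORTED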
[16:20:39] the publisher for flags rely on alphabetical ordered = time order [16:20:51] oh [16:20:56] huh [16:20:57] and this is not the case anymore for hour=9->10 [16:20:57] weird [16:21:03] i mean, i like padded date better than unpadded for sure [16:21:10] its just that everything else uses unpadded [16:21:14] yeah [16:21:16] but that is a good reason to use padded everywhere [16:21:34] joal, we can switch back [16:21:45] and document somewhere intentions [16:21:58] ottomata: There'll be issues I think - Hive won't create partitions with padded names [16:22:33] When set them ourselves it works, but when we do ADD PARTITION and hive creates the folder, it'll be back to unpaded I think [16:22:52] grmbl [16:23:36] hmmm [16:23:38] Ok - I'm gonna rework the publisher [16:24:01] hmmm [16:24:02] joal [16:24:04] Or, we get back to old camus date format without partitions [16:24:04] hmm [16:24:07] nonono [16:24:10] ok [16:24:10] :) [16:24:23] is it possible to make the thing just sort numerically? [16:24:26] i guess not since it is files [16:24:34] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:24:41] and now we have hour=9 vs hour=10 [16:24:46] I have folders [16:24:53] right [16:24:57] ok [16:25:15] The way to make it wokr is to provide a pattern for sorting based on regex extraction I assume [16:25:21] oof [16:25:22] yeah [16:25:26] i dunno, i think its ok to change back [16:25:31] most things will still work...i think [16:25:53] ooo eventlogging_to_druid_netflow_hourly interesting [16:25:57] milimetric: am investigating [16:25:59] ^^ [16:26:03] ottomata: refine was working with padded for camus, so I assume it could work here as well [16:26:08] we just migrated netflow today [16:27:12] ok ottomata - sending a patch [16:27:20] What a mess :( [16:27:34] not too bad! [16:27:36] could be worse. [16:28:03] actually ottomata - Could we delay the patch - kids are back home and I'm of food duty [16:28:07] yup i think its fine [16:28:15] problem with netflow refine, will figure it out [16:28:39] Thank you ottomata [16:28:47] will prepare patches [16:34:57] (03PS1) 10Joal: Revert gobblin time-pattern to use padded hours [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703776 (https://phabricator.wikimedia.org/T271232) [16:35:00] ottomata: ---^ [16:35:19] Actually, we can mkae it happen now ottomata if ou wish, I'll go for kids just after [16:35:50] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Revert gobblin time-pattern to use padded hours [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703776 (https://phabricator.wikimedia.org/T271232) (owner: 10Joal) [16:36:01] joal might be good so we don't break downstream jobs later [16:36:13] merged, can you deploy and apply? [16:42:18] ottomata: Will deploy and apply, yes! [16:42:28] ottomata: sorry for the delay, doing multiple things at once [16:43:13] np! [16:44:36] !log Deploying refinery to an-launcher and hadoop-test [16:44:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:51:16] joal refine_netflow also broken because of padded! 
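(To make the ordering problem described above concrete -- the flag publisher relying on alphabetical order matching time order -- this is what plain lexicographic sorting does to unpadded versus zero-padded hour directories:)

    # unpadded names: alphabetical order no longer matches time order
    printf 'hour=9\nhour=10\n' | sort      # prints hour=10 before hour=9
    # zero-padded names: the two orders agree again
    printf 'hour=09\nhour=10\n' | sort     # prints hour=09 then hour=10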
[16:51:17] input_path_datetime_format = 'year='yyyy/'month='MM/'day='dd/'hour='HH [16:51:30] that it isn't used to match, it is used to generate possible input paths [16:51:43] for hours between since and until [16:52:26] i'll fix Refine config to use non padded [16:52:32] and it willl work after your patch is applied [16:52:38] i guess we need to manually move existing dirs? [16:52:45] to padded? [16:52:51] (sorry fix refine config to use padded*) [16:53:09] oh no wait, ack. your change makes it padded, and refine is configured to use padded. ok. [16:53:13] so jsut with your fix and moving data [16:54:03] ok joal i see your change with padded is deployed on an-launcher [16:54:08] so new netflow data should be padded [16:54:14] i will move existent dirs to padded too [16:56:26] ottomata: thank you for that [16:56:34] ottomata: deploying on test-cluster now [16:57:09] ottomata: Killing webrequest oozie job [16:57:25] !log Kill-restart webrequest oozie job after gobblin time-format change [16:57:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:57:46] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 (10Ottomata) Pasting here for posterity, had to revert non-padded import style and move already gobblin imported netflow data back to padded style: ` sudo -u analytics hdfs dfs -mv /wmf/... [16:59:11] ottomata: I'll also fix hive partitions (the changed folders mean broken hive partitions [16:59:23] for webrequset you mean? [16:59:27] yes [16:59:28] +1 [17:02:59] ottomata: you have forgotten to move hours for day 7 and 8 :) [17:03:43] or is it still moving stuff ottomata ? [17:04:45] probably moving stuff [17:04:47] ? [17:04:52] no hours for day 7 needed moved [17:04:54] they are all > 10 [17:04:57] day 8 I did [17:04:59] 0-9 [17:05:09] ottomata: in wrong parent folder though :) [17:05:17] ottomata: let me fix that :) [17:05:20] ?> [17:05:22] i just did netflow [17:05:26] joal [17:05:26] https://phabricator.wikimedia.org/T271232#7201888 [17:05:28] is what I did [17:05:48] looks right to me what am I missing? [17:05:51] checking netflwo (I was looking at webrequest) [17:06:07] i haven't moved webrequest [17:06:10] only doing netflow [17:06:50] Indeed all good otto [17:06:53] k [17:06:54] sorry :) [17:07:04] ottomata: shall I do webrequest? [17:07:09] yes plz! [17:07:12] ack doing [17:07:13] i was just focusing on netflow while you were doing web [17:07:38] am running a refine_netflow now [17:07:41] i think it is working after move [17:07:52] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:09:37] wow and it immediately starts working ^ ? huh. [17:11:26] \o/ [17:12:26] wild ride. Feels more like surfing than ops week [17:13:17] joal ok we need to move some stuff in test right? 
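(The actual move commands are pasted in T271232; purely as an illustration of the shape of each rename, with assumed paths, converting a directory written with unpadded date parts to the padded layout that matches the 'year='yyyy/'month='MM/'day='dd/'hour='HH format quoted above:)

    # illustrative only -- see the paste in T271232 for the real commands
    sudo -u analytics hdfs dfs -mv \
      /wmf/data/raw/netflow/netflow/year=2021/month=7/day=8/hour=9 \
      /wmf/data/raw/netflow/netflow/year=2021/month=07/day=08/hour=09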
[17:13:26] looks lik eonly a few hours in webrequest form today [17:13:30] ottomata: I think it;s not important [17:13:45] ottomata: test is not about data being correct [17:13:56] ok maybe for webrequset [17:14:06] ottomata: I'll fix the oozie job (it currently fails), but no need to move data IMO [17:14:07] but for the event stuff refine needs it padded to work [17:14:12] i'll move the event stuff [17:14:15] ack [17:14:17] we aren't using any reifne yet [17:22:45] !log Deploy refinery to HDFS [17:22:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:26:29] ok joal test cluster looking better [17:26:40] i'm going to merge the refine and purge job change for eventlogging_legacy, ok? [17:27:10] work for me ottomata :) [17:27:24] i realized that the event jobs are going to need an input_path_regex change! [17:27:32] since the topic dirs are now exaclyt the topic names [17:27:37] instead of 'normalized' ones like camus writes [17:28:25] Ah wow - I had not forseen that :S [17:31:10] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:31:16] hmmm [17:31:19] Arf :( [17:31:27] well, refine_netflow still running [17:31:34] i was surprised it succeeded before [17:32:09] ottomata: webrequest data moved, job restarted (both prod and test-clsuter) [17:32:15] great [17:41:22] ok, partitions fixed on prod cluster [17:41:37] ottomata: anything else we should figure out? [17:42:09] am making sure eventlogging_legacy_test works [17:42:14] and purge job [17:42:24] I think we can do event_default on test cluster today [17:42:28] how much longer you workin? [17:42:42] I was planning on stopping soon [17:42:47] i'll follow up with the druid job after refine_netflow finishes [17:42:59] joal, i think we're good then; i might keep going with event_default on test cluster [17:43:00] but stop therre [17:43:01] ottomata: ok - how come it takes so long? [17:43:22] not sure, it is a decent amount of data? and it hadn't been running since we migrated this morning? [17:43:35] it hadn't been running? [17:43:44] oh, I thought it was [17:43:49] nevermind - mybad [17:43:49] no, because of the unpadded dates [17:43:55] right right [17:43:56] refine didn't see any new data [17:44:01] ok [17:44:17] or, when did we migrate netflow? was tht today or yesterday :p [17:44:20] i guess today [17:44:22] i can't remember anymore :p [17:44:29] huhuhu [17:44:34] its been running for 40 mins [17:44:53] it was today we migrated I think [17:45:08] maybe it re-refines old folders? [17:47:18] hm maybe [17:47:35] it shouldn't though, because that data has been moved away [17:49:40] ok I confirm pageview has been unlocked as well (problem due to the hour=9 webrequest not refined) [17:50:11] We seem back in regular state - until nex alert [17:50:30] ottomata: I'm gonna stop for today - Thanks again for all the help :) [17:51:37] ok [17:51:39] thanks joal! [17:51:43] ttyl! 
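(One note on the input_path_regex change mentioned above: camus wrote 'normalized' topic directories while gobblin names directories exactly after the topic, so any regex capturing the stream/table name from the directory has to match the raw topic name. The directory names below are hypothetical examples, and the exact normalization camus applied, dots to underscores, is an assumption:)

    # camus-era raw event directory (normalized topic name, hypothetical example):
    #   /wmf/data/raw/event/eqiad_mediawiki_api-request/year=2021/month=07/day=08/hour=09
    # gobblin raw event directory (exact topic name, hypothetical example):
    #   /wmf/data/raw/event/eqiad.mediawiki.api-request/year=2021/month=07/day=08/hour=09
    # => the Refine input_path_regex capture group for the topic must now allow literal dots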
[18:03:35] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:27:18] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:49:06] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:10:30] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:32:18] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:12:14] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:22:58] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers