[07:31:52] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (fgiunchedi)
[07:33:42] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (fgiunchedi)
[07:39:32] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (fgiunchedi)
[08:45:23] joal: im trying to re run this old query but it appears to not like it any more D: spot anything obvious?
[08:45:27] https://www.irccloud.com/pastebin/QN9d8UXH/
[08:45:32] `Error while compiling statement: FAILED: ParseException line 4:33 cannot recognize input near 'claim' '.' 'references' in select expression`
[08:45:44] did something change and i should be quoting more things now?
[08:54:22] addshore: o/ I am reading https://cwiki.apache.org/confluence/display/hive/languagemanual+udf, the usage examples seems to indicate that "array()" or "map()" may be needed
[08:54:35] for the explode() I mean
[08:55:03] ah no maybe for an inline declaration, nevermind
[08:55:32] how old is the query? Pre-bigtop migration? (so pre hive 2 upgrade?)
[08:55:43] It ran as is 1 year ago
[08:56:21] It might have something to do with the . actually in the second explode, as it doesn't get tripped up by the first explode
[08:56:42] addshore: can you try with `claim/references` ?
[08:56:48] err `claim.references`
[08:57:11] without the explode?
[08:58:04] nono I mean something like explode(`claim.references`)
[08:58:20] thats what it currently is!
[08:58:50] what do you mean? In the paste I don't see backticks
[08:59:01] oh, haha, my irc client got rid of the backticks :P
[08:59:21] Error while compiling statement: FAILED: SemanticException [Error 10004]: line 4:27 Invalid table alias or column reference 'claim.references': (possible column names are: wikidata_entity.id, wikidata_entity.typ, wikidata_entity.labels, wikidata_entity.descriptions, wikidata_entity.aliases, wikidata_entity.claims, wikidata_entity.sitelinks, wikidata_entity.snapshot, t.claim)
[08:59:36] aaah, it works with more backticks
[08:59:48] `claim`.`references`
[08:59:57] thanks! :)))))
[09:00:10] addshore: as curiosity, what do you mean with more backticks??
[09:00:32] so we know next time, this is why I am asking, it may be a problem that others will surface
[09:00:40] I ttied `claim.references` first
[09:00:42] *tried
[09:03:15] are we talking about the same query as in https://www.irccloud.com/pastebin/QN9d8UXH/ ?
[09:03:27] I am a little lost, anyway, don't want to waste your time :)
[09:03:30] if it works good
[09:25:49] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[09:34:25] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney) @Dwisehaupt I have to apologize for a newbie error I made here. The FR-Tech server's interfaces don't show what they're connected to in Netb...
[09:40:30] elukey: I have submitted a patch for a cookbook. Is there a way that I can safely test it?
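[Note: the ParseException/SemanticException above come down to how Hive 2 quotes identifiers: "references" is a reserved word, and wrapping the whole dotted path in one pair of backticks makes Hive look for a single column literally named "claim.references". A minimal sketch against wmf.wikidata_entity, assuming the usual claims/references struct layout; the aliases, snapshot value and LIMIT are illustrative, not the query from the paste:]

    # Fails unquoted: "references" is reserved (ParseException);
    # fails as `claim.references`: treated as one column name (SemanticException).
    # Works: quote each identifier separately, so Hive reads field "references"
    # of the exploded "claim" struct.
    hive -e '
      SELECT ref
      FROM wmf.wikidata_entity
      LATERAL VIEW explode(claims) t AS claim
      LATERAL VIEW explode(`claim`.`references`) r AS ref
      WHERE snapshot = "2021-06-28"   -- placeholder snapshot partition
      LIMIT 10;
    '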
[09:44:13] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (BTullis) @elukey Can I ask, what would you consider to be the best way to test this change locally, before pushing to gerrit?
[09:45:17] btullis: there is no good way to test locally, what I usually do is the following
[09:45:33] 1) make sure that all the linters/etc.. pass correctly locally via tox
[09:45:53] 2) submit the code review and wait for Riccardo's feedback and others interested
[09:46:06] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Volans)
[09:46:16] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[09:46:23] 3) merge and then run the cookbook in dry-run mode, that basically emits only the steps to take
[09:46:38] 4) test the cookbook on a test cluster if we have one
[09:47:38] so far I have been following this procedure and it went really fine.. the main problem is that mocking all the various services to get data from (like puppetdb etc..) is difficult locally
[09:48:29] for the AQS cookbook we should be fine with the dry-run, it is relatively simple
[09:49:16] (not sure if I have answered to the question, I can follow up more if you want)
[09:50:45] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Volans)
[09:51:57] elukey: That's great, thanks. tox is new to me so I'll read some more and set up local linting before pushing again.
[09:53:47] btullis: also left some comments to the change, but it looks good!
[10:10:30] RECOVERY - Check unit status of check_webrequest_partitions on an-launcher1002 is OK: OK: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:10:57] btullis: are you subscribed to analytics-alerts@ ?
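[Note: a sketch of steps 1) and 3) above as commands. The checkout path is a placeholder, and the cookbook name sre.aqs.roll-restart is inferred from the traceback pasted later in this log; check the cookbook's own argument parser for any required arguments:]

    # 1) run the linters/tests locally in a checkout of the cookbooks repo
    cd ~/src/operations-cookbooks && tox
    # 3) once merged and deployed, run it from a cumin host in dry-run mode,
    #    which only prints the actions it would take
    sudo cookbook --dry-run sre.aqs.roll-restart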
[10:11:28] yes you are, just checked, super :)
[10:11:41] 👍
[10:11:49] (I asked since it is a special config in puppet private so we could have done a change on it together)
[10:12:58] It was done here: T285936
[10:12:59] T285936: Please add btullis@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T285936
[10:13:18] ahhh okok
[10:16:30] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[10:17:00] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[10:44:47] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[11:17:02] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (MoritzMuehlenhoff)
[11:19:06] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (MoritzMuehlenhoff)
[11:27:06] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (MoritzMuehlenhoff)
[11:38:37] Hi team
[11:38:51] Hi addshore - sorry was away this morning - Looking at your query now
[11:39:29] Hi joal
[11:40:15] hey joal ! elu_key got me to the solution!!!
[11:40:28] Ah great!
[11:40:34] Hi btullis :)
[11:40:47] addshore: backticks, was it?
[11:40:53] yup!
[11:40:57] great
[11:52:08] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Kormat)
[12:04:21] (CR) Mforns: Set shouldGenerateExample: true and rematerialize schemas to get examples everywhere (2 comments) [schemas/event/primary] - https://gerrit.wikimedia.org/r/703250 (https://phabricator.wikimedia.org/T270134) (owner: Ottomata)
[12:16:38] (CR) Mforns: [C: +1] Rematerialize fragment schemas with generated examples. (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/702700 (https://phabricator.wikimedia.org/T270134) (owner: Ottomata)
[12:21:00] (CR) Mforns: [C: +1] "LGTM!" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/703838 (https://phabricator.wikimedia.org/T272589) (owner: Andrew-WMDE)
[12:24:25] (CR) Mforns: [C: +1] "LGTM!" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/703753 (https://phabricator.wikimedia.org/T272589) (owner: Andrew-WMDE)
[12:33:36] (CR) Mforns: [C: +1] "LGTM!" (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/702736 (https://phabricator.wikimedia.org/T285975) (owner: Ottomata)
[12:38:45] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (ema)
[12:40:53] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (ema)
[12:41:18] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (ema)
[12:42:24] heya ottomata - let me know when you're up :)
[13:00:31] hello!
[13:00:33] just getting on now!
[13:00:44] Heya ottomata - issue with gobblin job :)
[13:00:51] ok
[13:01:03] ottomata: I'm investigating now: the gobblin job got stuck in locked mode
[13:01:11] which one?
[13:01:13] event?
[13:01:46] ottomata: correct - a job instance failed yesterday at hour 18:15, leaving the lock file up - and all other runs are not running
[13:01:51] Will remove the file now
[13:02:08] huh
[13:02:17] that's strange
[13:02:18] ottomata: this leads us to know that gobblin job failure doesn't lead to alert-email
[13:02:24] interrrresting
[13:02:30] ottomata: sudo journalctl -u gobblin-event_default --since "2021-07-12 17:00" | less
[13:02:42] ottomata: Unlocking the thing now
[13:03:31] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (ssingh)
[13:03:43] !log remove /wmf/gobblin/locks/event_default.lock to unlock gobblin event job
[13:03:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:04:01] ottomata: will monitor next gobblin run - a lot of data to gather
[13:04:53] joal ok. i don't see the job failure
[13:06:12] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (ssingh)
[13:09:12] ottomata: you need to get down quite some
[13:11:08] joal got a string I can search for? i think i scrolled through that whole job that started at that time
[13:11:28] ottomata: Jul 12 18:15:50 an-launcher1002 systemd[1]: gobblin-event_default.service: Main process exited, code=killed, status=15/TERM
[13:12:08] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Vgutierrez)
[13:12:22] ottomata: from hadoop job logs: Operation category READ is not supported in state standby
[13:12:27] joal thats normal
[13:12:32] because an-launcher1001 is standby right now
[13:12:37] and it is contacting that one first i think
[13:13:04] the code=killed happened around the time we were doing the migration yesterday, right?
[13:13:34] hm, I think you're right ottomata
[13:13:44] i am not sure but there may have been a moment when I did a systemctl start gobblin-event_default and then a very quick systemctl stop gobblin-event_default
[13:13:52] because i hadn't deployed? but i don't remember exactly if that is true
[13:13:57] i might be remembering a refine
[13:14:00] ottomata: weird
[13:14:06] but, that shouldn't cause a total breakage even if I did that, right?
[13:14:45] this one is the problem?
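[Note: the lock named in the !log entry above lives on HDFS; a minimal sketch of checking and clearing it as the analytics user, using the same kerberos-run-command pattern that appears elsewhere in this log. Confirming no Gobblin application is actually running first is the important part:]

    # see which locks exist and when they were last touched
    sudo -u analytics kerberos-run-command analytics hdfs dfs -ls /wmf/gobblin/locks/
    # make sure no gobblin mapreduce job is still running for this source
    sudo -u analytics kerberos-run-command analytics yarn application -list | grep -i gobblin
    # only then remove the stale lock so the next scheduled run can proceed
    sudo -u analytics kerberos-run-command analytics hdfs dfs -rm /wmf/gobblin/locks/event_default.lock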
[13:14:45] Jul 12 18:15:52 an-launcher1002 gobblin-event_default[29997]: 2021-07-12 18:15:52 UTC ERROR [main] org.apache.gobblin.runtime.AbstractJobLauncher - Failed to unlock for job job_event_default_1626113714109: org.apache.gobblin.runtime.locks.JobLockException: java.io.IOException: Filesystem closed
[13:15:55] yeah we gonna need those gobblin metrics :)
[13:16:36] ottomata: indeed!
[13:16:51] so
[13:16:52] Failed to unlock for job job_event_default_1626113714109: org.apache.gobblin.runtime.locks.JobLockException: java.io.IOException: Filesystem closed
[13:17:24] so it looks like because the job was stopped/killed, maybe the hdfs fs connection was closed on its way down before the unlock was supposed to happen
[13:17:43] which, should be fixed? a manual stop of the job should not prevent future jobs from running
[13:17:48] joal should I make a task?
[13:17:57] possible - I wonder if at some point yesterday there have not been glitches on HDFS
[13:17:58] also
[13:18:05] we need a non zero exit code!
[13:18:14] for the next job runs
[13:18:21] Exception in thread "main" org.apache.gobblin.runtime.JobException: Previous instance of job event_default is still running, skipping this scheduled run should exit 1?
[13:18:23] or hmmmm
[13:18:30] joal maybe this is a reason to have the is_yarn_application_running!
[13:18:46] ottomata: the mapreduce gobblin job failed with errors Operation category READ is not supported in state standby for HDFS
[13:19:04] joal you sure that caused it to fail? or was that just in the logs?
[13:19:23] ottomata: sudo -u analytics kerberos-run-command analytics yarn logs --applicationId application_1623774792907_141590 | less
[13:19:28] i've seen that error in lots of logs
[13:19:47] joal did you know that there is yarn-logs wrapper in refinery? :)
[13:20:07] /srv/deployment/analytics/refinery/bin/yarn-logs -u analytics application_1623774792907_141590 | less
[13:20:32] oh ha, but maybe it now needs to work with kerberos-run-command nm
[13:21:06] joal those messages about standby are all WARN
[13:23:42] ottomata: true - the job failed nonetheless :S
[13:23:43] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (jbond)
[13:24:12] meh
[13:24:24] its status failed in yarn?
[13:24:29] could be related to the launch/kill, but weird
[13:24:34] checking
[13:25:39] ottomata: no more info in yarn UI :(
[13:26:22] ottomata: the job didn't manage to send its info to yarn
[13:26:44] joal maybe thats the one that was killed?
[13:26:47] yarn application -status application_1623774792907_141590
[13:26:52] State : KILLED
[13:26:52] Final-State : KILLED
[13:27:19] i betcha i killed it during the migration for some reason
[13:28:02] (CR) Ottomata: Set shouldGenerateExample: true and rematerialize schemas to get examples everywhere (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/703250 (https://phabricator.wikimedia.org/T270134) (owner: Ottomata)
[13:28:53] ottomata: if killing a job makes gobblin fail that way, we need to be super careful :)
[13:29:17] yeah
[13:29:21] well, we should fix that
[13:29:25] it should remove its lock if killed
[13:29:31] also, we should know about it
[13:29:34] it looks like what happened was
[13:29:41] i killed, which did not cause a nonzero exit code
[13:29:44] so no alert.
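[Note: for reference, the diagnosis above in one place: checking the final state YARN recorded for the application and pulling its logs, via plain yarn or the refinery wrapper, run as the analytics user:]

    # final state as recorded by YARN (KILLED in this case)
    sudo -u analytics kerberos-run-command analytics yarn application -status application_1623774792907_141590
    # full container logs for the same application
    sudo -u analytics kerberos-run-command analytics yarn logs --applicationId application_1623774792907_141590 | less
    # or the refinery wrapper
    /srv/deployment/analytics/refinery/bin/yarn-logs -u analytics application_1623774792907_141590 | less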
[13:29:44] ottomata: we definitely need metrics, and possibly also about job success/failure
[13:29:50] right
[13:29:55] then, future run attempts just didn't run because lock file
[13:29:59] and those also were non-zero
[13:30:08] (CR) Mforns: "AAaaaaaaahhh :]" [schemas/event/primary] - https://gerrit.wikimedia.org/r/703250 (https://phabricator.wikimedia.org/T270134) (owner: Ottomata)
[13:30:54] ottomata: current gobblin run seems ok
[13:30:59] (CR) Ottomata: "> Patch Set 2:" [schemas/event/primary] - https://gerrit.wikimedia.org/r/703250 (https://phabricator.wikimedia.org/T270134) (owner: Ottomata)
[13:31:05] ok cool
[13:31:17] I'm following the execution
[13:31:32] the re-run with just api_request includelisted succeeded tho
[13:31:38] so we'll watch that
[13:32:14] ottomata: refine succeeded every time, as nothing more was to be refined
[13:32:35] right, i mean the first refines i ran yesterday
[13:32:41] so, after this big import
[13:32:44] we should watch for that too
[13:32:53] Ah I get it ok, makes sense
[13:34:38] joal I forget do we have gobblin max run times?
[13:34:47] no we don't ottomata
[13:34:49] ok
[13:34:51] we let it run
[13:36:43] Analytics: When gobblin fails, we should know about it - https://phabricator.wikimedia.org/T286559 (Ottomata)
[13:36:47] joal ^
[13:42:35] yessir - This is a must
[13:42:54] ottomata: the job moves forward (93% done now)
[13:43:02] that's great!
[13:43:07] is that % of volume or streams? :)
[13:43:25] no, in number of tasks (related, but not exact)
[13:43:31] 10 mappers to go
[13:45:49] aye, so probably those are the big ones
[13:46:21] joal do you think we should hold off on further migrations (was going to do eventlogging_legacy today)?
[13:47:26] ottomata: hm, I don't know!
[13:48:07] ottomata: I feel we possibly can have the migration finished, and in the meantime focus on getting metrics/info to make sure we are alerted
[13:48:14] yeah i think so too
[13:48:24] ok lets go then, i'm going to deploy and start the gobblin job so we start getting data
[13:48:33] ack ottomata - thanks for that
[13:48:39] ottomata: how many jobs left to do?
[13:49:12] let's see
[13:49:20] just one
[13:49:27] after el legacy
[13:49:31] eventlogging-client-side
[13:49:34] which we only import just in case
[13:49:36] actually...
[13:49:41] maybe we can stop importing that?
[13:49:47] and that last one is 'less-important', as in only useful for backfilling - right?
[13:49:48] its been a long time since we used it, and we won't have it anymore after EL migration
[13:49:50] ya
[13:49:59] ottomata: as you prefer
[13:50:02] cause we said we're not going to do mw job, right?
[13:50:10] ok, lets do el legacy now then think about it
[13:50:24] joal
[13:50:24] https://gerrit.wikimedia.org/r/c/analytics/refinery/+/704157
[13:51:09] (CR) Joal: [C: +1] "LGTM" [analytics/refinery] - https://gerrit.wikimedia.org/r/704157 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[13:51:12] ottomata: to be sure
[13:51:17] the topic thing works like that with a regex?
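[Note: on the task filed above at 13:36 (T286559), one cheap way to surface the "lock left behind, later runs silently skipped" failure mode would be a freshness check on the lock file itself. This is an illustrative sketch only, not an existing check; path and schedule come from this log, and it needs to run as a user with HDFS/Kerberos access:]

    #!/bin/bash
    # Hypothetical check: alert if the gobblin lock file is much older than the
    # job's schedule, i.e. a run likely died without unlocking.
    LOCK=/wmf/gobblin/locks/event_default.lock
    MAX_AGE_SECS=$((3 * 3600))   # event_default runs hourly; allow a few cycles
    mtime_ms=$(hdfs dfs -stat '%Y' "$LOCK" 2>/dev/null) || { echo "OK: no lock present"; exit 0; }
    age_secs=$(( $(date +%s) - mtime_ms / 1000 ))
    if [ "$age_secs" -gt "$MAX_AGE_SECS" ]; then
        echo "CRITICAL: $LOCK is ${age_secs}s old - previous run probably died without unlocking"
        exit 2
    fi
    echo "OK: $LOCK is ${age_secs}s old"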
[13:51:44] ottomata: we use the EL-legacy to get all EL stuff, even the ones moved to event-gate
[13:51:49] yes
[13:52:06] otherwise we'd have to maintain an exclude list
[13:52:07] ottomata: aye makes sense - When migration will be done we'll move to using stream-config
[13:52:11] exactly
[13:52:16] Perfect :)
[13:52:35] Yes, the topic works as a regex - let me double check it anew
[13:52:35] (CR) Ottomata: [V: +2 C: +2] Add gobblin job eventlogging_legacy [analytics/refinery] - https://gerrit.wikimedia.org/r/704157 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[13:52:38] ok
[13:53:13] going to deploy lemme know if you find otherwise
[13:53:40] all good ottot
[13:53:45] ottomata:--^ sorry :)
[13:53:49] k
[13:53:51] np
[13:54:38] and then same pattern as yesterday: check data correctness, check folders correctness (number), switch
[13:54:43] ya
[13:55:02] and in this case (just like in test), we don't need to move anything (we should probably move eventlogging -> eventlogging_camus)
[13:55:10] but the finalize will just be switching refine and data purge
[13:55:22] ack ottomata - sounds good
[13:55:30] only 1 mapper to go ottomata
[13:55:34] cool
[13:55:49] that's pretty fast i guess. i wonder how long it would have taken for webrequest if it had happened for that job
[14:00:02] ottomata: 1372 time-partition flags published \o/
[14:00:06] nice!
[14:00:41] refine should launch in 20 mins
[14:00:49] perfect, will monitor
[14:01:23] going to start the first couple of el legacy gobblin runs
[14:05:30] ottomata: idea - should we change gobblin-event_default time? it runs at HH:15, and refine_event runs at HH:20 - the probability of gobblin finishing just after refine has started is pretty high - would be better to have gobblin start at HH:05, or have refine run at HH:30 maybe
[14:05:54] ya makes sense
[14:07:26] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (BTullis) My dry-run fails. The error seems to be here: ` File "/srv/deployment/spicerack/cookbooks/sre/aqs/roll-restart.py", line...
[14:11:11] joal incorporated into this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/704161
[14:11:18] ok, so el legacy gobblin running
[14:11:30] we will wait for hour 15 to be fully imported before proceeding
[14:11:31] ottomata: first run?
[14:11:44] first and second (and third, i think gobblin was scheduled at :05)
[14:11:47] so we have dirs!
[14:11:50] \o/
[14:18:55] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (BTullis) Here is the full traceback. ` Traceback (most recent call last): File "/usr/lib/python3/dist-packages/spicerack/_menu.py",...
[14:19:43] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Kormat)
[14:27:22] hm virtualpageview refine failed, but that is a totally separate job
[14:27:23] rerunning
[14:34:45] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (elukey) @BTullis good finding! Can you send a follow up patch to update the argparse settings?
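[Note: a sketch of the "check folders correctness" step mentioned at 13:54: compare the stream directories produced by the old camus job and the new gobblin job, and spot-check one hour on the gobblin side. The paths are the ones used in this log, the hour is just an example, and it assumes a host with HDFS/Kerberos access; the camus layout nests its hours differently, so only the gobblin side is checked per hour here:]

    # stream directories should match between the two raw locations
    diff <(hdfs dfs -ls -C /wmf/data/raw/eventlogging | awk -F/ '{print $NF}' | sort) \
         <(hdfs dfs -ls -C /wmf/data/raw/eventlogging_legacy | awk -F/ '{print $NF}' | sort)
    # and each stream should have the hour partition on the gobblin side
    H='year=2021/month=07/day=13/hour=18'
    for d in $(hdfs dfs -ls -C /wmf/data/raw/eventlogging_legacy); do
        hdfs dfs -test -d "$d/$H" || echo "missing: $d/$H"
    done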
[14:35:16] Analytics, Product-Analytics: Investigate running Stan models on GPU - https://phabricator.wikimedia.org/T286493 (nettrom_WMF) Case in point, I'm fitting one of these models for NEWTEA revisited (T270786) and as of yesterday it's running time is 338 hours. Here's a screenshot of `htop`, running time is t...
[14:56:28] this refine run is neverending
[14:58:09] aye
[14:58:11] lots to do
[14:58:14] lots of geocoding :/
[15:34:32] btullis: razzi o/ I'm working on updating some prometheus metrics for eventgate and fixing up a grafana dashboard
[15:34:50] wonder if that might be interesting to show and talk about in the few mins before standup right now?
[15:35:15] Yes please.
[15:35:19] ok! batcave
[15:35:34] https://meet.google.com/rxb-bjxn-nip
[15:53:54] Awesome, thanks.
[16:04:50] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Dwisehaupt)
[16:06:41] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Dwisehaupt) @cmooney Not a problem. I have updated the task to remove the pre/post tasks we were looking at. Always good for us to think about failure...
[17:10:37] Analytics, Product-Analytics: Investigate running Stan models on GPU - https://phabricator.wikimedia.org/T286493 (mpopov) a:mpopov
[17:15:55] Analytics, Product-Analytics: Investigate running Stan models on GPU - https://phabricator.wikimedia.org/T286493 (ldelench_wmf) p:Triage→Medium
[17:51:27] ottomata: Heya - May I ask you to rerun the failed instances of refine please (not being able to copy paste from my phone doesn't help :S_
[17:51:57] also ottomata - I think we are ready to finalize eventlogging_legacy when you wish
[17:57:08] am doing!
[18:22:48] Going to start the java process restarts on hadoop workers for security updates soon!
[18:27:35] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Bstorm)
[18:35:22] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Bstorm) cloudstore servers are both in the same rack. They are a cluster, and it will simply be offline. We will make a task to verify that it come...
[18:42:04] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm)
[18:46:51] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm)
[18:48:03] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm) Switching cloudmetrics to just eat the brief outage. I don't think it will be a big deal. We can just check it after.
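[Note: on the Stan-on-GPU investigation above (T286493): CmdStan's GPU backend is enabled at build time via OpenCL flags in make/local; whether the ROCm/OpenCL stack on the stat hosts actually supports it is exactly the open question in the task. A sketch of the build toggle, with placeholder paths and device ids:]

    cd ~/cmdstan                     # placeholder checkout location
    cat >> make/local <<'EOF'
    STAN_OPENCL=true
    OPENCL_PLATFORM_ID=0             # pick the ids that clinfo reports for the GPU
    OPENCL_DEVICE_ID=0
    EOF
    make clean-all && make build -j4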
[18:48:31] (PS1) MewOphaswongse: Add a link: Update schema to support edit mode toggle [schemas/event/secondary] - https://gerrit.wikimedia.org/r/704402 (https://phabricator.wikimedia.org/T278115)
[18:49:03] (CR) jerkins-bot: [V: -1] Add a link: Update schema to support edit mode toggle [schemas/event/secondary] - https://gerrit.wikimedia.org/r/704402 (https://phabricator.wikimedia.org/T278115) (owner: MewOphaswongse)
[18:49:32] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm) The WMCS-owned dbproxy1018 and 1019 make up the entire cluster, so that's just an outage no matter what for wikireplicas. It will just need do...
[19:02:05] Here I go with the sre.hadoop.roll-restart-workers analytics cookbook
[19:02:57] !log razzi@cumin1001:~$ sudo cookbook sre.hadoop.roll-restart-workers analytics
[19:03:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:03:31] 78 workers in progress!!
[19:03:59] Analytics, Product-Analytics: Investigate running Stan models on GPU - https://phabricator.wikimedia.org/T286493 (mpopov)
[19:05:36] I'm also going to finish the java restarts for the hadoop test cluster: an-test-druid1001 and an-test-client1001
[19:10:31] the production druid cluster I will restart tomorrow before standup when elukey is sure to be around, anybody else who wants to join for that feel free to do so! Homework is to read the cookbook: https://gerrit.wikimedia.org/g/operations/cookbooks/+/master/cookbooks/sre/druid/roll-restart-workers.py
[19:11:17] btullis: I assume (and hope) you're signed off for the day, but if you have some spare time before standup tomorrow that's when I'll be kicking off the druid restarts
[19:16:35] Analytics, Product-Analytics: Investigate running Stan models on GPU - https://phabricator.wikimedia.org/T286493 (mpopov)
[19:19:23] joal lets see what we can do now
[19:19:26] should be easy enough
[19:19:42] Ack ottomata - works for me :)
[19:19:43] ok
[19:19:52] stopping timers on an-launcher
[19:20:03] When done, I'll move folders
[19:20:30] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Bstorm)
[19:20:57] actually I will: move /wmf/data/raw/eventlogging --> /wmf/data/raw/eventlogging_camus, and drop /wmf/data/raw/eventlogging_legacy/*/year=2021/month=07/day=13/hour=14
[19:21:48] Analytics, Product-Analytics: Investigate running Stan models on GPU - https://phabricator.wikimedia.org/T286493 (mpopov) **Additional details**: I've built CmdStan from source a bunch of times (including on the stat nodes), just never with OpenCL support and don't know if the installed ROCm driver suppo...
[19:21:59] joal, lets drop everything except for hours that are not yet refined?
[19:22:08] or, wait, i forget nowhhaa
[19:22:09] hm, why?
[19:22:14] yesterday did refine end up doing extra work?
[19:22:18] or did it not?
[19:22:19] I don't think so
[19:22:38] we drop the first hour cause it's incomplete, but that's all
[19:22:43] ok
[19:23:09] ottomata: is this a +1 for actions? --^ ?
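[Note: the folder move and the incomplete-hour drop planned at 19:20 above, spelled out as hdfs commands, run as the analytics user with the same kerberos-run-command pattern used earlier in this log; whether to bypass the trash is a judgment call:]

    # keep the old camus-imported data around under a new name
    sudo -u analytics kerberos-run-command analytics \
        hdfs dfs -mv /wmf/data/raw/eventlogging /wmf/data/raw/eventlogging_camus
    # drop the first, incomplete gobblin hour for every stream (add -skipTrash to bypass .Trash)
    sudo -u analytics kerberos-run-command analytics \
        hdfs dfs -rm -r '/wmf/data/raw/eventlogging_legacy/*/year=2021/month=07/day=13/hour=14'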
[19:23:55] +1 ya hang on let me real quick do the dir comparisons
[19:24:14] Sure ottomata - waiting for your signal
[19:28:22] looks good joal, hour 18 same streams in both camus and gobblin
[19:28:31] awesome
[19:28:35] ok starting
[19:29:16] !log move /wmf/data/raw/eventlogging --> /wmf/data/raw/eventlogging_camus and drop /wmf/data/raw/eventlogging_legacy/*/year=2021/month=07/day=13/hour=14
[19:29:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:29:50] ok ottomata - all good
[19:30:02] k verifying puppet patch
[19:30:11] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm)
[19:31:43] ok merged and running puppet, this will make refine and data purge use eventlogging_legacy dir
[19:32:34] ok ottomata - Given the hour and the schedule of refine, I assume we could launch a manual run with the new settings, right?
[19:32:49] ya sounds good. oo i forgot to absent the camus job, doing now
[19:38:39] PROBLEM - Check unit status of refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:38:54] Arf - :(
[19:40:15] ottomata: failure is due to base folder having moved
[19:40:24] looking
[19:40:40] java.io.FileNotFoundException: File /wmf/data/raw/eventlogging does not exist.
[19:40:46] which is expected
[19:41:07] i think maybe the job was running when you moved
[19:41:12] lets relaunch with new configs
[19:41:22] Ah crap I didn't check - my bad
[19:44:38] joal...lots more failures than recently, and only on larger datasets (i think)
[19:44:54] is it possible that refine with gzip input files needs more capacity?
[19:45:10] ottomata: I've been thinking about this - could it be gzip?
[19:45:14] yeah
[19:45:45] there isn't that much different about the input
[19:45:50] about the job i mean
[19:45:54] gzip is not splittable and compresses, so indeed quite some more data ends up on single workers
[19:46:04] oh
[19:46:10] joal that is why we used sequence files, right?
[19:46:25] nah ottomata - text without gzip is splittable
[19:46:37] there was some reason
[19:46:43] was it that snappy could compress them?
[19:46:48] i think they were snappy compressed sequence files
[19:46:54] we didn't compress them
[19:46:54] and still split
[19:46:56] no?
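[Note: on the refine_eventlogging_analytics CRITICAL above: the refine jobs run as systemd timer/service units on an-launcher1002, so the usual first look and the manual relaunch are plain systemd operations - a sketch. Re-processing hours already flagged as failed was later done with a manual run passing --ignore_failure_flag=true (see 19:49 below):]

    # what failed, and why
    sudo systemctl status refine_eventlogging_analytics
    sudo journalctl -u refine_eventlogging_analytics --since "1 hour ago" | tail -n 100
    # once puppet has switched the unit to the new eventlogging_legacy base path,
    # kick a run by hand instead of waiting for the timer
    sudo systemctl start refine_eventlogging_analytics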
[19:47:08] compression/split is not related to underlying format
[19:47:16] a compression format is either splittable or not
[19:47:25] hm
[19:47:30] k
[19:49:28] ottomata: this still doesn't make a lot of sense - workers need to decompress, but then sends that data through shuffle-sort (dedup) - so even if data is bigger, it feels bizarre that it fails :(
[19:49:29] RECOVERY - Check unit status of refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:49:34] \o/
[19:49:38] yay
[19:49:49] ya after just a rerun with --ignore_failure_flag=true
[19:50:00] awesome
[19:51:30] I was also looking at config for the spark jobs, I think we could benefit from some re-tuning of the jobs themselves, and possibly of the choice of number of output partitions
[19:52:01] aye
[19:52:25] As an example: hdfs dfs -ls /wmf/data/event/mediawiki_api_request/datacenter=codfw/year=2021/month=7/day=13/hour=16 tells me that every generated file is 2x our block size
[19:53:04] It should be ok (it's not 10x), but this also means more pressure on the workers
[19:53:26] ottomata: Let's maybe review config for refine later this week?
[19:53:28] k
[19:54:07] I'll try to investigate more on failures, to see if I can get a better sense of why it now fails
[19:54:21] ottomata: are we done with eventlogging_legacy?
[19:55:22] i think so...been dealing with these refine reruns
[19:55:27] i need to test the data purge
[19:57:35] ack ottomata - is it ok for you if I sign off?
[19:57:37] yes joal
[19:57:39] thank you!
[19:57:54] Thank you ottomata :) We're almost done!
[19:58:09] Have a good end of day folks
[19:59:04] l8rs joal! :)
[20:45:36] Analytics-Kanban, Analytics-Radar, Product-Analytics, Pageviews-Anomaly: Analyse possible bot traffic for ptwiki article Ambev - https://phabricator.wikimedia.org/T282502 (Milimetric) I took a quick look and I agree the requests seem diverse enough to be organic, even if they're really not. I'm...
[20:45:51] Analytics-Kanban, Analytics-Radar, Product-Analytics, Pageviews-Anomaly: Analyse possible bot traffic for ptwiki article Ambev - https://phabricator.wikimedia.org/T282502 (Milimetric)
[20:45:53] Analytics, Research-Backlog: [Open question] Improve bot identification at scale - https://phabricator.wikimedia.org/T138207 (Milimetric)
[21:13:56] Analytics, Analytics-EventLogging, Wikimedia-production-error: TypeError: Return value of JsonSchemaHooks::onEditFilterMergedContent() must be of the type boolean, none returned - https://phabricator.wikimedia.org/T286611 (Aklapper) Assuming this is about #analytics-eventlogging
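[Note: on the 2x-block-size observation at 19:52: a quick way to compare refined output file sizes against the cluster's configured block size, using the same example path from the log. Whether to change the refine job's output partition count is the re-tuning question left open above:]

    # the configured default block size, in bytes (e.g. 268435456 for 256MB)
    hdfs getconf -confKey dfs.blocksize
    # per-file sizes for the hour mentioned above
    hdfs dfs -du -h /wmf/data/event/mediawiki_api_request/datacenter=codfw/year=2021/month=7/day=13/hour=16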