[06:14:54] Analytics-Clusters: ROCm can't find clang on stat1005 - https://phabricator.wikimedia.org/T285495 (elukey) p:Triage→Medium a:elukey
[06:30:45] folks there is an alert about jvm usage for hive-server that started some days ago
[06:30:52] https://grafana.wikimedia.org/d/000000379/hive?orgId=1&from=now-60d&to=now&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=analytics&var-instance=an-coord1001
[06:31:15] we've set in the past a warning if 80% of the heap was constantly crossed, we are at around 90%
[06:31:28] the main downside is that a peak of traffic will surely cause OOMs
[06:31:47] the Metaspace that grows and grows looks really strange
[06:32:23] we have to restart hive due to jvm upgrades, so this will surely alleviate the issue, but we should keep an eye on it
[06:34:20] razzi: --^
[06:54:35] Good morning
[06:54:46] Thanks elukey for pointing it out - I'll try to not forget :S
[07:48:47] (PS1) Joal: Update webrequest raw create table to text files [analytics/refinery] - https://gerrit.wikimedia.org/r/702073 (https://phabricator.wikimedia.org/T271232)
[08:23:09] (PS1) Joal: Add webrequest and netflow gobblin jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/702075 (https://phabricator.wikimedia.org/T271232)
[09:01:34] (CR) Joal: "Another small bunch of comments after testing." (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/701463 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[09:16:04] Hi btullis :)
[09:16:09] welcome!
[09:16:36] Hiya. Thanks, great to be here.
[09:33:59] btullis: o/ (Luca)
[09:34:31] btullis: chans suggested for you - #wikimedia-operations and #wikimedia-sre
[09:34:39] (if you are not already on them)
[09:35:17] Oh, great. Thanks elukey. Will join now.
[09:37:10] also something interesting - on most of the IRC chans we have a bot that listens for !log events, that end up in https://sal.toolforge.org
[09:37:32] for analytics: https://sal.toolforge.org/analytics
[09:37:51] it is very useful to have a list of things done to check/verify/etc..
[09:38:21] the #operations chan is full of bots, most of the alarms go there, etc.. meanwhile the #sre one is more to chat
[09:39:54] OK, thanks. So if I literally type '! log something' (without the space) it will end up in the sal? Is this what you do when you're actively working on something?
[09:40:57] btullis: yes correct, for example say that you restart a service, etc..
[09:42:04] Nice. Got it. Any other commands beginning with a ! that I should remember? Forgive me, it's been a while since I spent any real time on IRC.
[09:42:05] usually I use !log when I want to save an action that I did that may interest others, or that could be useful when it is my night time if any regression pops up and people start to look for "what changed"
[09:42:34] btullis: I think that !log is the most useful one, nothing really important that I can think of
[11:16:22] morning team :)
[11:16:30] hola fran
[11:42:03] Hi fdans :)
[13:06:25] Hi fdans.
[13:06:32] btullis: yoohooo!
[13:06:35] o/
[13:06:55] Sorry for the delay, I'm still getting the hang of IRC again.
[13:08:22] I use https://meetfranz.com/ so I don't have to have so many messaging apps open
[13:09:27] Nice, thanks. I will give that a try.
[13:12:35] Although I just found https://getferdi.com/ and maybe it is better?
[13:12:36] not sure
[13:13:36] ottomata: hello! I have stuff to discuss with you when you'll have a minute :)
[13:13:59] yes! i am trying to do emails but the stuff with you is blocking, let's just talk now!
[13:14:09] bc?
[13:14:20] sure joining ottomata
[13:25:32] hey a-team, i'll be doing the refinery deploy train a little early today (basically now), looks like only one simple thing needs deployed
[13:25:33] https://etherpad.wikimedia.org/p/analytics-weekly-train
[13:25:39] lemme know if there is anything else!
[13:26:50] Ah ottomata - forgot to mention - Could you please confirm (or not!) my say about camus-checker alerts please?
[13:27:27] ottomata: Would it be worthwhile for me to watch over your shoulder while you're doing it?
[13:28:11] joal: i am responding right now!
[13:28:15] \o/
[13:28:17] thanks ottomata
[13:28:57] oh joal this sqoop is just refinery, not refinery source?
[13:29:03] correct ottomata
[13:29:07] ok very easy then
[13:29:11] indeed
[13:29:20] btullis: i was going to say there were a lot of async changes that take a while, but not in this case
[13:29:26] so, sure!
[13:29:30] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Deploy/Refinery
[13:29:54] actually one of the steps does take a while, let me start and finish that one
[13:30:04] then we can jump in a hangout and I'll show you what I did and we can do the rest together
[13:30:21] Cool. Will read the links until then.
[13:31:03] !log deploying refinery for weekly train
[13:31:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:39:48] hey teammm :]
[13:40:24] o/ Hello.
[13:49:30] Analytics, Analytics-Kanban: Retroactively edit Airflow decision document to include discarded schedulers - https://phabricator.wikimedia.org/T285691 (mforns)
[13:49:47] ok btullis!
[13:49:55] Analytics, Analytics-Kanban: Write a job entirely in Airflow with spark and/or sparkSQL - https://phabricator.wikimedia.org/T285692 (mforns)
[13:50:03] https://meet.google.com/rxb-bjxn-nip
[13:50:09] this is the 'batcave'
[13:50:10] hangout
[14:10:58] Thanks for that ottomata. All makes sense.
[14:11:24] :)
[14:36:02] btullis: FYI i just switched to ferdi, it seems better. if you are going to use one of those start with that :)
[15:09:50] joal: qq did you ever get to try gobblin include with sysconfig file?
[15:10:27] or joal hmm, maybe we can just make the test jobs override the fs.uri with java opts -Dfs.uri=... ?
[15:10:43] I think that would be the only thing that would need to be overridden for sysconfig
[15:17:59] (PS2) Ottomata: Add bin/gobblin wrapper and initial gobblin/ common properties files [analytics/refinery] - https://gerrit.wikimedia.org/r/701463 (https://phabricator.wikimedia.org/T271232)
[15:18:23] (CR) Ottomata: "Hm, I wonder if env vars are not propagating through to os.system like docs say they would. Will have to investigate." (9 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/701463 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[15:22:01] (PS3) Ottomata: Add bin/gobblin wrapper and initial gobblin/ common properties files [analytics/refinery] - https://gerrit.wikimedia.org/r/701463 (https://phabricator.wikimedia.org/T271232)
[15:29:06] (CR) Ottomata: [C: +1] "+1 for idea, code only peripherally reviewed but looks good." [analytics/refinery] - https://gerrit.wikimedia.org/r/694547 (https://phabricator.wikimedia.org/T270433) (owner: Mforns)
[15:29:28] (CR) Ottomata: [C: +1] "Let's merge this after we do it :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/702073 (https://phabricator.wikimedia.org/T271232) (owner: Joal)
[15:29:41] thanks ottomata :]
[15:30:02] I still have to apply a couple suggestions from joal
[15:36:50] Analytics, Analytics-Kanban, Patch-For-Review: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 (Ottomata) @JAllemandou I know we've talked about a migration plan, but let's write it down somewhere. Let's do test cluster first, and just worry about that migration before we write...
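Editor's note on the env-var question raised in the 15:18:23 review comment: Python's documentation says that changes made through os.environ are passed on to child processes started with os.system(). The following is a minimal sketch of how that behavior could be checked in isolation. It is not the code from the bin/gobblin wrapper; the function names and the GR_DEMO_* variable names are hypothetical, and it assumes a POSIX shell that provides `test`.

```python
import os


def visible_via_os_environ(name: str, value: str) -> bool:
    """Set name through os.environ and ask an os.system() child whether it sees it."""
    os.environ[name] = value  # assignment here is exported to the process environment
    # The shell's `test` exits 0 only when the child sees the expected value.
    return os.system(f'test "${name}" = "{value}"') == 0


def visible_via_dict_copy(name: str, value: str) -> bool:
    """Set name only in a copy of the environment; os.system() never sees the copy."""
    os.environ.pop(name, None)  # make sure the variable is not already exported
    env_copy = dict(os.environ)
    env_copy[name] = value  # changes the copy only, not the real environment
    return os.system(f'test "${name}" = "{value}"') == 0


if __name__ == "__main__":
    print(visible_via_os_environ("GR_DEMO_EXPORTED", "hello"))
    print(visible_via_dict_copy("GR_DEMO_COPY_ONLY", "hello"))
```

On a POSIX system the first check should succeed and the second should fail, which illustrates one common way a variable can appear "set" in Python without reaching the child process. Whether that is what happened in the patch under review is not something the log confirms.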
[15:41:11] joal: https://grafana.wikimedia.org/d/000000538/druid?orgId=1&refresh=1m
[15:42:30] it happened also a week ago, I thought it was the switchover but the timing doesn't match
[15:43:46] anyway not a big deal
[15:48:26] joal: good news! (sort of?) img_metadata is getting fixed (T275268) fortunately it's currently ignored in https://github.com/wikimedia/analytics-refinery/blob/master/hive/mediawiki/history/create_mediawiki_image_table.hql anyway, but I wanted to make sure you saw it
[15:48:27] T275268: Address "image" table capacity problems by storing pdf/djvu text outside file metadata - https://phabricator.wikimedia.org/T275268
[15:48:44] joal: it's also being converted to an external storage pointer in the replicas; see https://lists.wikimedia.org/hyperkitty/list/cloud@lists.wikimedia.org/thread/CZOLC5IQWEHDH45DJILNQJBMML4VP65A/
[15:50:08] er, or rather parts of the metadata will be converted to ES references; so it may be possible to start sqooping image.img_metadata again eventually and by then it should all be JSON serialized instead of PHP serialized
[16:01:54] elukey: I need more precision about druid please :)
[16:02:55] joal: there is a jump in traffic from the metrics, not sure what it is about
[16:03:15] see broker metrics
[16:03:50] ah wait https://grafana.wikimedia.org/d/000000526/aqs?viewPanel=21&orgId=1
[16:04:13] so it is external traffic-related
[16:07:51] Analytics, Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Top edited pages list on enwiktionary contains nonexistent pages with titles made up of question marks - https://phabricator.wikimedia.org/T284623 (Milimetric) p:Triage→High a:Milimetric
[16:23:09] a-team: we have to patch and restart this job too, right?
[16:23:10] https://hue.wikimedia.org/hue/jobbrowser/#!id=0021724-201202074829419-oozie-oozi-C
[16:23:22] the wikidata weekly thing that looks for eqiad hourly data
[16:23:52] (I remember mforns had fixed it last time, but I don't remember if it was manual or he did something fancy)
[16:24:12] ??
[16:24:21] wanna discuss in da cave?
[16:25:05] milimetric: good call!
[16:28:55] elukey: will take a quick look - services didn't alert, or did they?
[16:30:59] joal: nono all good, I just checked after the switchover to see if anything was different, and noticed the bumps
[16:31:17] ack elukey - thanks for pointing :)
[16:54:20] (PS1) GoranSMilovanovic: T285752 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/702160
[16:54:28] (CR) GoranSMilovanovic: [V: +2 C: +2] T285752 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/702160 (owner: GoranSMilovanovic)
[17:10:39] heya bearloga - thanks for letting me know! creating a task now
[17:14:50] Analytics: Sqoop image metadata - https://phabricator.wikimedia.org/T285783 (JAllemandou)
[17:15:18] bearloga: --^ :)
[17:15:37] joal: woo! :D
[17:15:58] ottomata: would you please give me a minute after your lunch?
[17:17:00] joal: i got 15 minutes before interview, let's go
[17:17:07] to the cave!
[17:46:01] Analytics-Radar, ChangeProp, Event-Platform, Platform Engineering, and 3 others: Run EventBus tests in MediaWiki core CI - https://phabricator.wikimedia.org/T257583 (thcipriani) Let us know if there's additional guidance needed here
[18:10:21] (CR) Joal: "Last round before good for me (except for env-vars issue)" (7 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/701463 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[18:32:55] (CR) Joal: Add bin/gobblin wrapper and initial gobblin/ common properties files (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/701463 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[19:17:40] (PS2) Joal: Update webrequest for gobblin [analytics/refinery] - https://gerrit.wikimedia.org/r/702073 (https://phabricator.wikimedia.org/T271232)
[19:43:55] (CR) Ottomata: Add bin/gobblin wrapper and initial gobblin/ common properties files (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/701463 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[20:02:27] (CR) Joal: Add bin/gobblin wrapper and initial gobblin/ common properties files (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/701463 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[20:03:06] Ok - done for today - talk to you tomorrow folks
[21:14:00] laters joal!
[21:16:12] (CR) Ottomata: Add bin/gobblin wrapper and initial gobblin/ common properties files (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/701463 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[21:41:39] Analytics-Radar, Product-Analytics, Product-Data-Infrastructure, Language-Team (Language-2021-April-June), MW-1.37-notes (1.37.0-wmf.11; 2021-06-21): All events in the contenttranslationabusefilter data stream failing validation - https://phabricator.wikimedia.org/T283872 (nshahquinn-wmf) ...
[22:50:45] Doing some service restarts on an-test-coord1001 for the java versions
[22:51:11] !log sudo systemctl restart oozie on an-test-coord1001.eqiad.wmnet for T283067
[22:51:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:52:16] !log sudo systemctl restart presto-server on an-test-coord1001.eqiad.wmnet for T283067
[22:52:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:53:44] !log sudo systemctl restart hive-metastore on an-test-coord1001.eqiad.wmnet for T283067
[22:53:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:55:24] !log sudo systemctl restart hive-server2 on an-test-coord1001.eqiad.wmnet for T283067
[22:55:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log