[00:00:49] Going to do the service restarts on an-coord1002
[00:02:56] !log razzi@an-coord1002:~$ sudo systemctl restart hive-server2.service
[00:02:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[00:03:13] !log razzi@an-coord1002:~$ sudo systemctl restart hive-metastore.service
[00:03:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[00:03:55] Looks like oozie and presto have been restarted since the jars have been updated, skipping those
[00:04:01] fuse has old jars
[00:04:12] !log razzi@an-coord1002:~$ sudo umount /mnt/hdfs
[00:04:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[00:04:21] !log razzi@an-coord1002:~$ sudo mount -a
[00:04:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[00:04:35] I'll check back in for alarms but unless something goes wrong that's it for me for today!
[00:08:14] PROBLEM - Hive Server on an-coord1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[00:27:26] RECOVERY - Hive Server on an-coord1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[02:40:10] PROBLEM - Check unit status of refinery-sqoop-mediawiki-private on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-sqoop-mediawiki-private https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:17:31] 10Analytics, 10Article-Recommendation: Make endpoint for top wikis by number of articles - https://phabricator.wikimedia.org/T220673 (10Aklapper) a:05bmansurov→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26...
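The PROBLEM/RECOVERY pair above comes from a process check that counts running `java` processes whose arguments mention `org.apache.hive.service.server.HiveServer2` (CRITICAL at 0, OK at 1). A minimal Python sketch of that logic, not the actual Icinga plugin — the process list is passed in explicitly so the check can be shown without shelling out to `ps`:

```python
# Hedged sketch of the monitoring check seen above: count java processes
# whose args contain the HiveServer2 main class, and report CRITICAL when
# none are found. `process_lines` stands in for `ps -eo comm,args` output.
def check_hive_server2(process_lines,
                       pattern="org.apache.hive.service.server.HiveServer2"):
    """Each entry in process_lines is a 'comm args...' string."""
    matches = [p for p in process_lines
               if p.split()[0] == "java" and pattern in p]
    if matches:
        return f"PROCS OK: {len(matches)} process with command name java, args {pattern}"
    return f"PROCS CRITICAL: 0 processes with command name java, args {pattern}"
```

The ~20-minute gap between PROBLEM and RECOVERY is consistent with HiveServer2 simply taking a while to come back up after the restart.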
[05:21:58] 10Analytics-Radar, 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10Platform Team Initiatives (Modern Event Platform (TEC2)), and 2 others: ORES hook integration with EventBus - https://phabricator.wikimedia.org/T201869 (10Aklapper) a:05Pchelolo→03None Removing task assignee due to inactivity, a...
[05:24:00] 10Quarry, 10Regression: Bad resultset number case is not handled - https://phabricator.wikimedia.org/T218470 (10Aklapper) a:05Framawiki→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`...
[06:50:57] Good morning
[06:51:09] elukey: would you be around by any chance?
[06:52:32] joal: bonjour
[06:52:46] elukey: Hi! sorry to bother you that early
[06:52:47] I am adding graphs for the Namenode :)
[06:52:50] nono please
[06:52:57] Yay! Moar graphs :)
[06:53:48] elukey: sqoop-private failed and I'd like to restart it manually, to see if the failure could be related to DB usage
[06:54:07] elukey: my plan is: drop the already-sqooped data (this month's snapshot), and restart the service
[06:54:28] joal: yep I saw the alarm, makes sense to me
[06:54:45] joal: what do you mean by DB usage though?
[06:55:02] If the dbstore machines were overwhelmed, for instance
[06:55:38] ah ok, we can check from logs in that case
[06:55:41] or metrics
[06:55:57] does sqoop report timeouts or similar?
[06:56:13] I am asking since we should figure out what to watch when restarting
[06:56:20] otherwise we may end up in the same spot
[06:56:25] nah, we only know when an instance fails - logs were too verbose
[06:56:44] elukey: I can also run an instance of the previously failed job manually, and check
[06:59:30] joal: yeah +1 for the manual run, at least we'll know if it is an issue with the table being sqooped or if it is resource consumption
[06:59:41] ack elukey - doing that
[06:59:43] from the dbstores' metrics they don't seem under pressure
[07:02:06] joal: https://grafana-rw.wikimedia.org/d/000000585/hadoop?viewPanel=126&orgId=1 :)
[07:02:17] (not related but you'll like it)
[07:03:19] \o/! This chart will be very helpful to troubleshoot unexpected behavior
[07:04:08] there is also a new set of metrics that we could add to measure latencies
[07:05:23] ah and also the bigtop devs told me that there is a setting in the xmls to flip the locking type for the namenode
[07:05:27] from fair to something else
[07:05:47] woow - that would be neat!
[07:06:19] ok elukey - the failure seems to come from a schema change, which obviously doesn't apply to all projects :(
[07:07:01] ah snap :(
[07:07:38] will investigate
[07:09:37] lemme know if I can help
[07:10:10] sure elukey - actually the schemas are the same - it then seems to be a problem of schema interpretation by sqoop :(
[07:23:01] (03PS1) 10Joal: Fix sqoop cu_changes table adding explicit schema [analytics/refinery] - 10https://gerrit.wikimedia.org/r/702877
[07:31:54] elukey: running test sqoop with that patch --^
[07:33:58] ack!
[08:28:45] Morning all.
[08:29:11] Hi btullis
[08:34:50] (03PS2) 10Joal: Fix sqoop cu_changes table adding explicit schema [analytics/refinery] - 10https://gerrit.wikimedia.org/r/702877
[10:42:37] 10Analytics-Clusters, 10Analytics-Kanban, 10observability, 10Patch-For-Review, and 2 others: Setup Analytics team in VO/splunk oncall - https://phabricator.wikimedia.org/T273064 (10BTullis) I have requested a VO/Splunk account in T286028.
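The fix discussed above ("adding explicit schema") amounts to not letting Sqoop infer column types from each wiki's database, since inference can diverge after a schema change. A hedged Python sketch of building such a `sqoop import` invocation — the table and column names here are illustrative, not the actual refinery patch (that is https://gerrit.wikimedia.org/r/702877); `--columns` and `--map-column-java` are standard Sqoop 1 flags:

```python
# Illustrative sketch: assemble a `sqoop import` argv with an explicit
# column list and Java-side type mapping, so every wiki's table is read
# with the same schema instead of a per-database inferred one.
def sqoop_import_args(jdbc_url, table, columns, java_types, target_dir):
    """Build argv for a sqoop import with an explicit schema (hypothetical helper)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--columns", ",".join(columns),
        # Force the Java types so schema interpretation is identical everywhere.
        "--map-column-java", ",".join(f"{c}={t}" for c, t in java_types.items()),
        "--target-dir", target_dir,
    ]
```

The key point from the conversation is the last flag: with types pinned on the command line, a cosmetic schema difference on one wiki no longer changes how sqoop interprets the data.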
[12:20:13] (03CR) 10Joal: [V: 03+1] "Tested on cluster for this month data after failure." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/702877 (owner: 10Joal)
[12:21:03] Replacing failed data with successful data generated when testing --^
[12:21:47] !log Replacing failed data with successful data generated when testing https://gerrit.wikimedia.org/r/702877 - wmf_raw.mediawiki_private_cu_changes
[12:21:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:28:46] elukey: the test run ran successfully - ok for me to reset the timer?
[12:48:29] joal: sure!
[13:28:18] hi team!!
[13:29:47] hola Marcel
[13:29:54] que pasa?
[13:29:57] hola elukey
[13:29:58] :]
[13:30:12] tutto bene
[13:30:55] Hiya.
[13:32:21] hey btullis :]
[13:39:52] Hi mforns
[13:39:59] a-team: Hello! I'm interested in using the Druid web console (https://druid.apache.org/docs/latest/operations/druid-console.html) to query Druid, rather than using Superset's SQL Lab. Is it enabled? If so, how can I ssh-tunnel to it similar to the internal EventStreams web UI (https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To#In_production)?
[13:40:45] bearloga: there is the druid admin console - I don't know about druid-console
[13:40:49] bearloga: checking
[13:41:44] bearloga: The query aspect of the console is disabled
[13:42:00] bearloga: I think we did that with elukey as we weren't using it
[13:42:42] bearloga: similarly, the 'load data' aspect is disabled
[13:43:43] joal: ah, okay. would it be possible to enable the query aspect? sql lab has a forced limit of 10k on results. if not, I can try querying with native druid & curl
[13:44:31] bearloga: for getting large result sets, I'd go for curl querying
[13:44:38] bearloga: you can do it in sql!
[13:44:44] oh WORD?
[13:44:51] for sure
[13:45:25] joal: ah! okay I see https://druid.apache.org/docs/latest/tutorials/tutorial-query.html#query-sql-over-http now. will try that, thank you!
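SQL-over-HTTP, as linked above, is a POST of a small JSON body to the Druid broker's `/druid/v2/sql/` endpoint. A minimal Python sketch that builds the equivalent curl command — the broker hostname is a placeholder (not the real WMF broker), while the endpoint path and `{"query": ...}` body shape come from the Druid SQL API docs; `EXPLAIN PLAN FOR`, mentioned below, just prefixes the SQL:

```python
import json

# Sketch: build a curl argv that POSTs a SQL query to a Druid broker.
# The broker URL is a made-up placeholder for illustration.
def druid_sql_curl(sql, broker="http://druid-broker.example:8082", explain=False):
    """Return a curl command (as an argv list) for Druid SQL over HTTP."""
    if explain:
        # EXPLAIN PLAN FOR shows how the SQL translates to a native Druid query.
        sql = "EXPLAIN PLAN FOR " + sql
    payload = json.dumps({"query": sql, "resultFormat": "object"})
    return [
        "curl", "-X", "POST",
        "-H", "Content-Type: application/json",
        "-d", payload,
        broker + "/druid/v2/sql/",
    ]
```

Unlike SQL Lab, this path has no forced 10k-row limit on results, which is why joal recommends it for large result sets.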
[13:45:46] bearloga: the concern with SQL is that depending on the query, it is transposed to an internal druid query that is not super efficient - but for relatively simple queries it should work
[13:46:05] bearloga: more precisely: beware of group by :)
[13:47:39] !log Reset failed timer refinery-sqoop-mediawiki-private.service
[13:47:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:47:44] joal: thanks for the heads-up!
[13:48:14] np bearloga - Don't hesitate to ping me if you wish to discuss queries :)
[13:49:06] by the way bearloga - IIRC druid has the 'EXPLAIN PLAN FOR' function, allowing you to see how your SQL query is translated to the inner druid query
[13:50:10] And in druid the efficient queries are: timeseries, topn - the group-by ones are less efficient (they require the equivalent of a shuffle)
[13:51:12] hm… I'm planning on querying druid.pageview_hourly and keeping some dimensions but collapsing across others
[13:51:18] it sounds like that might be a problem?
[13:52:12] bearloga: you wish "select dim1, dim2, SUM(count) from pv where ... group by dim1, dim2" right?
[13:52:27] joal: yep!
[13:53:17] You should try - I just wish you to know that at some point it'll fail, and that will mean you'll have reached the max amount of memory a query can use :)
[13:53:50] ways to make it work: reduce time periods, reduce the number of dimensions
[13:54:59] bearloga: --^ and, actually, you should try your queries in superset-SQLLab - when they work and you're happy, you can go to curl :)
[13:55:28] RECOVERY - Check unit status of refinery-sqoop-mediawiki-private on an-launcher1002 is OK: OK: Status of the systemd unit refinery-sqoop-mediawiki-private https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:55:31] and finally, depending on the dimensions you wish to query, good ol' spark and the pageview or project tables are here for you :)
[13:55:36] bearloga: --^
[13:55:59] joal: that sounds reasonable!
kind of like focusing on 1 hour of webrequest/pageview_hourly data in hive and then writing a script that breaks up a date range into 1-hour chunks
[13:56:57] bearloga: if you're after 1h chunks, definitely use spark and make a loop! It'll be better :)
[13:58:05] bearloga: If you use 1 month of data, and group by hour or day, the whole month is shuffled to then be re-grouped - if you're not in a hurry, a script running hours or days at a time takes longer but uses a lot fewer resources overall :)
[13:58:10] joal: oh I was just using that as an analogy. focus on a smaller subset in each query and then stitch the results together, rather than 1 query with all the results
[13:58:26] as an example, bearloga :)
[13:58:57] joal: thanks so much for your help and advice!
[13:59:16] You're very welcome bearloga :) I hope it'll work
[15:36:36] (03CR) 10Mforns: [C: 03+2] "LGTM! Feel free to merge if tested!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/702877 (owner: 10Joal)
[16:24:01] 10Analytics, 10Analytics-General-or-Unknown: udp2log and/or demux.py filename corruption - https://phabricator.wikimedia.org/T64082 (10bd808) 05Open→03Declined I'm BOLD'ly closing this as declined with no activity in 7 years.
[19:50:03] ohia
[19:50:27] Am I correct in thinking that to test event logging on beta, https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/689152 would need to be merged?
[20:35:54] Excellent, checked in on an-coord1001 memory pressure since restarting the java processes yesterday, looks good (but still climbing slowly) https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-coord1001&var-datasource=thanos&var-cluster=analytics&from=now-2d&to=now
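The chunked-query approach joal recommends around 13:56-13:58 (query small time slices and stitch the results together, rather than one big query that shuffles a whole month) can be written as a small loop. A minimal Python sketch, with the per-chunk query left abstract since it could equally be a Druid SQL call or a Spark job:

```python
from datetime import datetime, timedelta

# Split a date range into hour-sized pieces, matching the "1-hour chunks"
# idea from the conversation above.
def hourly_chunks(start, end):
    """Yield (chunk_start, chunk_end) pairs covering [start, end) in 1-hour steps."""
    t = start
    while t < end:
        nxt = min(t + timedelta(hours=1), end)
        yield (t, nxt)
        t = nxt

def run_chunked(start, end, query_one_chunk):
    """Run query_one_chunk(lo, hi) per hour and concatenate the per-chunk rows."""
    results = []
    for lo, hi in hourly_chunks(start, end):
        results.extend(query_one_chunk(lo, hi))
    return results
```

As noted in the chat, this takes longer wall-clock than one large query, but each chunk stays well under the per-query memory ceiling and the cluster load is far smoother.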