[00:00:49] Going to do the service restarts on an-coord1002
[00:02:56] !log razzi@an-coord1002:~$ sudo systemctl restart hive-server2.service
[00:02:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[00:03:13] !log razzi@an-coord1002:~$ sudo systemctl restart hive-metastore.service
[00:03:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[00:03:55] Looks like oozie and presto have been restarted since the jars have been updated, skipping those
[00:04:01] fuse has old jars
[00:04:12] !log razzi@an-coord1002:~$ sudo umount /mnt/hdfs
[00:04:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[00:04:21] !log razzi@an-coord1002:~$ sudo mount -a
[00:04:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[00:04:35] I'll check back in for alarms but unless something goes wrong that's it for me for today!
[00:08:14] PROBLEM - Hive Server on an-coord1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[00:27:26] RECOVERY - Hive Server on an-coord1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[02:40:10] PROBLEM - Check unit status of refinery-sqoop-mediawiki-private on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-sqoop-mediawiki-private https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:17:31] 10Analytics, 10Article-Recommendation: Make endpoint for top wikis by number of articles - https://phabricator.wikimedia.org/T220673 (10Aklapper) a:05bmansurov→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26...
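The PROBLEM/RECOVERY pair above comes from a process check that counts running `java` processes whose arguments mention `org.apache.hive.service.server.HiveServer2` (CRITICAL at 0, OK at 1). A minimal Python sketch of that logic, not the actual Icinga plugin — the process list is passed in explicitly so the check can be shown without shelling out to `ps`:

```python
# Hedged sketch of the monitoring check seen above: count java processes
# whose args contain the HiveServer2 main class, and report CRITICAL when
# none are found. `process_lines` stands in for `ps -eo comm,args` output.
def check_hive_server2(process_lines,
                       pattern="org.apache.hive.service.server.HiveServer2"):
    """Each entry in process_lines is a 'comm args...' string."""
    matches = [p for p in process_lines
               if p.split()[0] == "java" and pattern in p]
    if matches:
        return f"PROCS OK: {len(matches)} process with command name java, args {pattern}"
    return f"PROCS CRITICAL: 0 processes with command name java, args {pattern}"
```

The ~20-minute gap between PROBLEM and RECOVERY is consistent with HiveServer2 simply taking a while to come back up after the restart.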
[05:21:58] 10Analytics-Radar, 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10Platform Team Initiatives (Modern Event Platform (TEC2)), and 2 others: ORES hook integration with EventBus - https://phabricator.wikimedia.org/T201869 (10Aklapper) a:05Pchelolo→03None Removing task assignee due to inactivity, a...
[05:24:00] 10Quarry, 10Regression: Bad resultset number case is not handled - https://phabricator.wikimedia.org/T218470 (10Aklapper) a:05Framawiki→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`...
[06:50:57] Good morning
[06:51:09] elukey: would you be around by any chance?
[06:52:32] joal: bonjour
[06:52:46] elukey: Hi! sorry to bother you that early
[06:52:47] I am adding graphs for the Namenode :)
[06:52:50] nono please
[06:52:57] Yay! Moar graphs :)
[06:53:48] elukey: sqoop-private failed and I'd like to restart it manually, to see if the failure could be related to DB usage
[06:54:07] elukey: my plan is: drop the already-sqooped data (this month's snapshot), and restart the service
[06:54:28] joal: yep I saw the alarm, makes sense to me
[06:54:45] joal: what do you mean by DB usage though?
[06:55:02] If the dbstore machines were overwhelmed, for instance
[06:55:38] ah ok, we can check from logs in that case
[06:55:41] or metrics
[06:55:57] does sqoop report timeouts or similar?
[06:56:13] I am asking since we should figure out what to watch when restarting
[06:56:20] otherwise we may end up in the same spot
[06:56:25] nah, we only know when an instance fails - logs were too verbose
[06:56:44] elukey: I can also run an instance of the previously failed job manually, and check
[06:59:30] joal: yeah +1 for the manual run, at least we'll know if it is an issue with the table being sqooped or if it is resource consumption
[06:59:41] ack elukey - doing that
[06:59:43] from the dbstores' metrics they don't seem under pressure
[07:02:06] joal: https://grafana-rw.wikimedia.org/d/000000585/hadoop?viewPanel=126&orgId=1 :)
[07:02:17] (not related but you'll like it)
[07:03:19] \o/! This chart will be very helpful to troubleshoot unexpected behavior
[07:04:08] there is also a new set of metrics that we could add to measure latencies
[07:05:23] ah and also the bigtop devs told me that there is a setting in the xmls to flip the locking type for the namenode
[07:05:27] from fair to something else
[07:05:47] woow - that would be neat!
[07:06:19] ok elukey - the failure seems to come from a schema change, which obviously doesn't apply to all projects :(
[07:07:01] ah snap :(
[07:07:38] will investigate
[07:09:37] lemme know if I can help
[07:10:10] sure elukey - actually the schemas are the same - it then seems to be a problem of schema interpretation by sqoop :(
[07:23:01] (03PS1) 10Joal: Fix sqoop cu_changes table adding explicit schema [analytics/refinery] - 10https://gerrit.wikimedia.org/r/702877
[07:31:54] elukey: running test sqoop with that patch --^
[07:33:58] ack!
[08:28:45] Morning all.
[08:29:11] Hi btullis
[08:34:50] (03PS2) 10Joal: Fix sqoop cu_changes table adding explicit schema [analytics/refinery] - 10https://gerrit.wikimedia.org/r/702877
[10:42:37] 10Analytics-Clusters, 10Analytics-Kanban, 10observability, 10Patch-For-Review, and 2 others: Setup Analytics team in VO/splunk oncall - https://phabricator.wikimedia.org/T273064 (10BTullis) I have requested a VO/Splunk account in T286028.
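The fix discussed above ("adding explicit schema") amounts to not letting Sqoop infer column types from each wiki's database, since inference can diverge after a schema change. A hedged Python sketch of building such a `sqoop import` invocation — the table and column names here are illustrative, not the actual refinery patch (that is https://gerrit.wikimedia.org/r/702877); `--columns` and `--map-column-java` are standard Sqoop 1 flags:

```python
# Illustrative sketch: assemble a `sqoop import` argv with an explicit
# column list and Java-side type mapping, so every wiki's table is read
# with the same schema instead of a per-database inferred one.
def sqoop_import_args(jdbc_url, table, columns, java_types, target_dir):
    """Build argv for a sqoop import with an explicit schema (hypothetical helper)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--columns", ",".join(columns),
        # Force the Java types so schema interpretation is identical everywhere.
        "--map-column-java", ",".join(f"{c}={t}" for c, t in java_types.items()),
        "--target-dir", target_dir,
    ]
```

The key point from the conversation is the last flag: with types pinned on the command line, a cosmetic schema difference on one wiki no longer changes how sqoop interprets the data.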
[12:20:13] (03CR) 10Joal: [V: 03+1] "Tested on cluster for this month data after failure." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/702877 (owner: 10Joal)
[12:21:03] Replacing failed data with successful data generated when testing --^
[12:21:47] !log Replacing failed data with successful data generated when testing https://gerrit.wikimedia.org/r/702877 - wmf_raw.mediawiki_private_cu_changes
[12:21:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:28:46] elukey: the test run ran successfully - ok for me to reset the timer?
[12:48:29] joal: sure!
[13:28:18] hi team!!
[13:29:47] hola Marcel
[13:29:54] que pasa?
[13:29:57] hola elukey
[13:29:58] :]
[13:30:12] tutto bene
[13:30:55] Hiya.
[13:32:21] hey btullis :]
[13:39:52] Hi mforns
[13:39:59] a-team: Hello! I'm interested in using the Druid web console (https://druid.apache.org/docs/latest/operations/druid-console.html) to query Druid, rather than using Superset's SQL Lab. Is it enabled? If so, how can I ssh-tunnel to it similar to the internal EventStreams web UI (https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To#In_production)?
[13:40:45] bearloga: there is the druid admin console - I don't know about druid-console
[13:40:49] bearloga: checking
[13:41:44] bearloga: The query aspect of the console is disabled
[13:42:00] bearloga: I think we did that with elukey as we weren't using it
[13:42:42] bearloga: similarly, the 'load data' aspect is disabled
[13:43:43] joal: ah, okay. would it be possible to enable the query aspect? sql lab has a forced limit of 10k on results. if not, I can try querying with native druid & curl
[13:44:31] bearloga: for getting large result sets, I'd go for curl querying
[13:44:38] bearloga: you can do it in sql!
[13:44:44] oh WORD?
[13:44:51] for sure
[13:45:25] joal: ah! okay I see https://druid.apache.org/docs/latest/tutorials/tutorial-query.html#query-sql-over-http now. will try that, thank you!
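SQL-over-HTTP, as linked above, is a POST of a small JSON body to the Druid broker's `/druid/v2/sql/` endpoint. A minimal Python sketch that builds the equivalent curl command — the broker hostname is a placeholder (not the real WMF broker), while the endpoint path and `{"query": ...}` body shape come from the Druid SQL API docs; `EXPLAIN PLAN FOR`, mentioned below, just prefixes the SQL:

```python
import json

# Sketch: build a curl argv that POSTs a SQL query to a Druid broker.
# The broker URL is a made-up placeholder for illustration.
def druid_sql_curl(sql, broker="http://druid-broker.example:8082", explain=False):
    """Return a curl command (as an argv list) for Druid SQL over HTTP."""
    if explain:
        # EXPLAIN PLAN FOR shows how the SQL translates to a native Druid query.
        sql = "EXPLAIN PLAN FOR " + sql
    payload = json.dumps({"query": sql, "resultFormat": "object"})
    return [
        "curl", "-X", "POST",
        "-H", "Content-Type: application/json",
        "-d", payload,
        broker + "/druid/v2/sql/",
    ]
```

Unlike SQL Lab, this path has no forced 10k-row limit on results, which is why joal recommends it for large result sets.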
[13:45:46] bearloga: the concern with SQL is that depending on the query, it is transposed to an internal druid query that is not super efficient - but for relatively simple queries it should work
[13:46:05] bearloga: more precisely: beware of group by :)
[13:47:39] !log Reset failed timer refinery-sqoop-mediawiki-private.service
[13:47:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:47:44] joal: thanks for the heads-up!
[13:48:14] np bearloga - Don't hesitate to ping me if you wish to discuss queries :)
[13:49:06] by the way bearloga - IIRC druid has the 'EXPLAIN PLAN FOR' function, allowing you to see how your SQL query is translated to the inner druid query
[13:50:10] And in druid the efficient queries are: timeseries, topn - the group-by ones are less efficient (they require the equivalent of a shuffle)
[13:51:12] hm… I'm planning on querying druid.pageview_hourly and keeping some dimensions but collapsing across others
[13:51:18] it sounds like that might be a problem?
[13:52:12] bearloga: you wish "select dim1, dim2, SUM(count) from pv where ... group by dim1, dim2" right?
[13:52:27] joal: yep!
[13:53:17] You should try - I just wish you to know that at some point it'll fail, and that will mean you'll have reached the max amount of memory a query can use :)
[13:53:50] ways to make it work: reduce time periods, reduce the number of dimensions
[13:54:59] bearloga: --^ and, actually, you should try your queries in superset-SQLLab - when they work and you're happy, you can go to curl :)
[13:55:28] RECOVERY - Check unit status of refinery-sqoop-mediawiki-private on an-launcher1002 is OK: OK: Status of the systemd unit refinery-sqoop-mediawiki-private https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:55:31] and finally, depending on the dimensions you wish to query, good ol' spark and the pageview or project tables are here for you :)
[13:55:36] bearloga: --^
[13:55:59] joal: that sounds reasonable!
kind of like focusing on 1 hour of webrequest/pageview_hourly data in hive and then writing a script that breaks up a date range into 1-hour chunks
[13:56:57] bearloga: if you're after 1h chunks, definitely use spark and make a loop! It'll be better :)
[13:58:05] bearloga: If you use 1 month of data, and group by hour or day, the whole month is shuffled to then be re-grouped - if you're not in a hurry, a script running hours or days at a time takes longer but uses a lot fewer resources overall :)
[13:58:10] joal: oh I was just using that as an analogy. focus on a smaller subset in each query and then stitch the results together, rather than 1 query with all the results
[13:58:26] as an example, bearloga :)
[13:58:57] joal: thanks so much for your help and advice!
[13:59:16] You're very welcome bearloga :) I hope it'll work
[15:36:36] (03CR) 10Mforns: [C: 03+2] "LGTM! Feel free to merge if tested!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/702877 (owner: 10Joal)
[16:24:01] 10Analytics, 10Analytics-General-or-Unknown: udp2log and/or demux.py filename corruption - https://phabricator.wikimedia.org/T64082 (10bd808) 05Open→03Declined I'm BOLD'ly closing this as declined with no activity in 7 years.
[19:50:03] ohia
[19:50:27] Am I correct in thinking that to test event logging on beta, https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/689152 would need to be merged?
[20:35:54] Excellent, checked in on an-coord1001 memory pressure since restarting the java processes yesterday, looks good (but still climbing slowly) https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-coord1001&var-datasource=thanos&var-cluster=analytics&from=now-2d&to=now
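The chunked-query approach joal recommends around 13:56-13:58 (query small time slices and stitch the results together, rather than one big query that shuffles a whole month) can be written as a small loop. A minimal Python sketch, with the per-chunk query left abstract since it could equally be a Druid SQL call or a Spark job:

```python
from datetime import datetime, timedelta

# Split a date range into hour-sized pieces, matching the "1-hour chunks"
# idea from the conversation above.
def hourly_chunks(start, end):
    """Yield (chunk_start, chunk_end) pairs covering [start, end) in 1-hour steps."""
    t = start
    while t < end:
        nxt = min(t + timedelta(hours=1), end)
        yield (t, nxt)
        t = nxt

def run_chunked(start, end, query_one_chunk):
    """Run query_one_chunk(lo, hi) per hour and concatenate the per-chunk rows."""
    results = []
    for lo, hi in hourly_chunks(start, end):
        results.extend(query_one_chunk(lo, hi))
    return results
```

As noted in the chat, this takes longer wall-clock than one large query, but each chunk stays well under the per-query memory ceiling and the cluster load is far smoother.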