[01:32:17] 10Quarry: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (10Certes) I reproduced the problem on enwiki_p, including https://quarry.wmcloud.org/query/61115 which should complete in milliseconds but claims to have run for more than a day.
[05:52:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[06:07:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[07:15:15] thanks joal :] yes, I thought too that the pageview jobs had succeeded; the mediarequest one you reran succeeded too! Now I've tried to rerun the remaining failed one once again - maybe there's just a problem with that specific day...
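The HiveServerHeapUsage alerts above are a simple threshold check on the Hive server's JVM heap usage. A minimal plain-Python sketch of that logic (the 80% threshold comes from the alert text; the byte values are made up, and this is not the actual Prometheus/alertmanager rule):

```python
def heap_alert(used_bytes: int, max_bytes: int, threshold: float = 0.80) -> str:
    """Return 'firing' when heap usage exceeds the threshold, else 'resolved'."""
    return "firing" if used_bytes / max_bytes > threshold else "resolved"

# Hypothetical readings for a 16 GB heap: 13 GB used is ~81%, so it fires.
print(heap_alert(13_000_000_000, 16_000_000_000))  # firing
print(heap_alert(10_000_000_000, 16_000_000_000))  # resolved
```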
[07:37:09] Good morning :) Thank you mforns for having launched the last one
[07:37:55] :]
[09:03:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[09:08:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[09:08:57] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[09:13:57] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[10:08:31] good morning folks :)
[10:08:36] I am watching https://grafana.wikimedia.org/d/000000418/cassandra?viewPanel=18&orgId=1&from=now-7d&to=now&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=aqs&var-keyspace=local_group_default_T_top_percountry&var-table=data&var-quantile=99p
[10:08:44] and there are some interesting things to notice
[10:09:11] - aqs-next generally seems to spend more GC time in young generations than aqs-"old"
[10:09:28] - during the next few days things got worse, even in the baseline
[10:09:54] hm - super interesting elukey - I also noticed there are differences in the way we load data - smaller batches
[10:10:21] I have ideas about the data-size difference as well - in a meeting now, will write after
[10:16:06] joal: ack! I am checking [1], and I see two instances in aqs-next with 15G+ of heap size used
[10:16:09] https://thanos.wikimedia.org/graph?g0.expr=jvm_memory_bytes_used%7Bcluster%3D%22aqs%22%2C%20area%3D%22heap%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=2d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[10:16:16] [1] --^
[10:16:23] can I restart them?
[10:16:38] elukey: checking
[10:16:46] 1010-a and 1015-a
[10:17:05] those are definitely thrashing
[10:18:30] ok - go ahead elukey, please restart them
[10:20:13] !log restart cassandra-a on aqs1015 (heap size used in full, high GC)
[10:20:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:29:48] Hi everybody !
[10:30:32] Hey! Here is Antoine! :) Hi aqu1 :)
[10:33:26] Hello all.
[10:39:06] !log restart cassandra-a on aqs1010 (heap size used in full, high GC)
[10:39:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:39:09] o/
[10:39:27] elukey: we still have a loading job running
[10:39:53] elukey: this could make the hosts busy at some spots
[10:41:16] joal: yep, all done, will not take more actions :)
[10:41:32] np elukey - just mentioning :)
[10:43:59] joal: all GC metrics are not healthy afaics, no more outliers
[10:45:52] elukey: I assume you meant metrics are healthy? or if they are not, we need more investigation
[10:46:00] elukey: thanks a lot for the checks :)
[10:47:31] yesyes sorry :)
[10:47:34] s/not//g
[10:47:45] :)
[10:53:53] loading job succeeded - we're back in a normal state from the cassandra-data perspective
[10:57:41] How do we feel about carrying on with the migration from 2->3 ?
[10:57:41] We have 2 instances (aqs1010-[a-b]) receiving AQS traffic. I'm about to make a CR to make the remaining aqs_next nodes available to pool, but do you think that we need to investigate this GC issue more before proceeding?
[10:58:03] btullis: haven't we depooled aqs1010 (both) ?
[10:58:14] yesterday
[10:58:30] Oh, sorry. I hadn't caught up with the scrollback.
[10:59:21] Yes, I see now.
[11:00:43] btullis: let's wait and see the metrics for a while, now that some hosts have been restarted
[11:02:05] joal: OK. I think I can still make the CR to make the new hosts available for pooling, right? It won't make them receive any traffic yet.
[11:02:24] for sure btullis, go ahead - thank you :)
[11:08:45] Team: Are you often using 1Password?
[11:14:56] aqu1: Personally, I use Bitwarden. I think that personal choice is fine for which password manager you use, as long as you do use one. There is some guidance here: https://office.wikimedia.org/wiki/Password_Management
[11:17:55] I'm having a look at https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/cassandra-3.11.11 to see if I can correlate any major bugs fixed between Cassandra 3.11.4 (our current version) and 3.11.11 (the newest 3.11 version)
[11:18:26] ack btullis - we should probably also let Eric know
[11:18:56] btullis: if you pool the new aqs nodes, remember to set them as "inactive" in conftool, otherwise I fear that some lvs/pybal alerts might trip (too many hosts down etc..)
[11:19:31] Thanks joal & btullis
[11:22:13] elukey: Thanks. I wasn't aware of that 'inactive' value for pooled, but I am now. Here's the CR: https://gerrit.wikimedia.org/r/c/operations/puppet/+/751408
[11:23:18] So I'd need to do: `sudo -i confctl select name=aqs1011.eqiad.wmnet set/pooled=inactive` for each host until we're ready to pool them again, right?
[11:24:02] I think that you can use a glob or a regex in confctl, but basically yes.. I don't recall if by default they are put into the inactive state; maybe you can merge and verify first with a "get"
[11:24:29] elukey: Thanks.
[12:27:19] 10Analytics-Clusters, 10Data-Engineering, 10Data-Engineering-Kanban, 10Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (10BTullis)
[12:29:23] I have created this ticket to investigate the Cassandra GC and high heap usage: https://phabricator.wikimedia.org/T298516 - eevans is subscribed.
[12:29:32] 10Analytics-Clusters, 10Data-Engineering, 10Data-Engineering-Kanban, 10Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (10BTullis) p:05Triage→03High
[12:32:08] elukey: When merged, the pooled value is automatically set to inactive. 👍
[12:32:12] https://www.irccloud.com/pastebin/qD5WXmh7/
[12:44:10] hi. I'm trying the example in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#PySpark_and_wmfdata but just see "PySpark executors will use /usr/lib/anaconda-wmf/bin/python3." as the output. What am I doing wrong?
[12:53:01] kostajh: did the cell finish running?
[12:53:17] kostajh: I think that's a normal, informational message. Are you doing something like `print(pandas_df)` to show your output as well? That caught me out.
[12:53:19] how can I tell?
[12:53:44] ha, no, `print()` would probably be useful :)
[12:54:30] Great! Did you get the output you were expecting now?
[12:54:58] btullis: I moved on to doing something else, but let me try again
[12:56:33] hmm, now I get "Empty DataFrame"
[12:57:14] that means the query didn't return any results
[12:57:44] I also see "You are using wmfdata v1.2, but v1.3 is available.", is that a problem?
[12:57:50] shouldn't be
[12:57:55] if you're running the snippet from wikitech w/o changes, that'd be because `year=2021 AND month=1 AND day=1 AND hour=0` has been purged from the event table
[12:58:27] aha
[12:58:32] that was it
[12:59:01] I updated the example on wikitech
[12:59:29] well, it will get purged in 90 days :))
[13:00:13] event_sanitized.mediawiki_page_create would have the unpurged data (potentially redacted, if any PII is logged in that schema)
[13:01:04] kostajh: Thanks for the update. You could add a `print()` statement as well, while you're at it. I got caught out by the same issue the first time I tried to run it. :-)
[13:01:15] I added that :)
[13:02:19] 👍
[13:02:20] are there some helper tools for writing queries? to explore schemas, for example? Or is it trial and error while referring to the fields found in the schemas/event/secondary repo?
[13:02:36] i always use `describe <table>;`
[13:03:18] kostajh: like this https://www.irccloud.com/pastebin/JMOTTXAj/
[13:03:34] Same here, but there is also this which might help a bit: https://schema.wikimedia.org/#!/
[13:03:46] thanks both
[13:04:43] kostajh: for more information, https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries is/was very useful for me when I started finding my own way around this part of the infra
[13:05:17] it's hive-centric, but a lot of the information can be extended to pyspark
[13:07:09] some datasets also have their description page on wikitech, sometimes with example queries (like https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#Usage_notes)
[13:12:30] thanks
[13:12:53] so, should `DESCRIBE event` work?
[13:13:25] this is the table I want to query: https://schema.wikimedia.org/repositories//secondary/jsonschema/analytics/mediawiki/welcomesurvey/interaction/current.yaml
[13:14:31] Event is a database name. You need to first find the DB table name
[13:14:42] One way to do that is `use event; show tables;`
[13:15:17] Then, either first `use event;` and then `describe <table>;`, or `describe event.<table>;` if you want it to work from anywhere
[13:15:39] k
[13:19:16] and you mean to do this in a python notebook, right? cause if I do `USE event; SHOW tables;` I get an error
[13:19:30] No, in a SQL context, sorry
[13:19:45] It's easiest to ssh into a stat host and run hive there
[13:20:00] But spark.run('...') should work too, in theory
[14:52:10] (03CR) 10Andrew Bogott: [C: 03+1] "If this still agrees with the state of the code, let's merge it before things diverge again :)" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711133 (https://phabricator.wikimedia.org/T288523) (owner: 10David Caro)
[15:04:31] (03CR) 10David Caro: [C: 04-1] db: Added a script to generate a DB schema from the models (031 comment) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711133 (https://phabricator.wikimedia.org/T288523) (owner: 10David Caro)
[15:05:38] 10Analytics-Clusters, 10Data-Engineering, 10Data-Engineering-Kanban, 10Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (10BTullis) I have made a modification to the JVM row of the [[https://grafana-rw.wikimedia.org/d/000000...
[15:08:43] I've updated the Cassandra dashboard in Grafana here: https://grafana-rw.wikimedia.org/d/000000418/cassandra?orgId=1 - It now shows the frequency of garbage collections, average duration, and percentage of time spent collecting. I think it will be useful in working out what's happening when they're thrashing in future.
[15:29:21] 10Analytics-Clusters, 10Data-Engineering, 10Data-Engineering-Kanban, 10Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (10BTullis) For instance: {F34907225,width=80%} We can clearly see here a correlation between pending co...
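The "percentage of time spent collecting" panel described above boils down to a rate over the JVM's cumulative GC-time counter (conceptually what Prometheus gives you with rate() over a metric such as jvm_gc_collection_seconds_sum). A rough plain-Python illustration of that computation, with made-up sample values rather than real Cassandra metrics:

```python
def gc_time_percentage(samples):
    """Given (timestamp_seconds, cumulative_gc_time_seconds) samples,
    return the percentage of wall-clock time spent in GC over the window."""
    (t0, gc0), (t1, gc1) = samples[0], samples[-1]
    wall = t1 - t0
    if wall <= 0:
        raise ValueError("need samples spanning a positive time interval")
    return 100.0 * (gc1 - gc0) / wall

# Hypothetical samples 60 s apart: 9 s of GC time accrued -> 15% of time in GC.
samples = [(0, 12.0), (60, 21.0)]
print(gc_time_percentage(samples))  # 15.0
```

A healthy node would sit in the low single digits here; a thrashing node like aqs1010-a/aqs1015-a above would show this climbing toward 100%.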
[15:40:47] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, and 2 others: Apparent latency warning in 90th centile of eventgate-logging-external - https://phabricator.wikimedia.org/T294911 (10BTullis) I have merged a change that will increase the threshold for alerting, so that the warnin...
[16:43:32] I see "PySpark executors will use /usr/lib/anaconda-wmf/bin/python3" every time I run a cell in a Jupyter notebook; is that expected? Is there a way to hide that noise?
[16:44:25] 10Data-Engineering, 10Data-Engineering-Kanban: Hive query failure in Jupyter notebook on stat1005 - https://phabricator.wikimedia.org/T297734 (10BTullis) I'd like to see if this is resolved or affected at all by the new versions of log4j that we will be deploying as part of the new hive build. See T297468 for...
[16:55:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[17:17:26] 10Data-Engineering, 10Data-Services, 10cloud-services-team (Kanban): Recreate views for globaluser table - https://phabricator.wikimedia.org/T298505 (10nskaggs) a:03razzi Razzi, I understand you got to help try one of these last month. If you're up for it, here's another one! Thanks for your help!
[17:17:31] 10Data-Engineering, 10Data-Services, 10cloud-services-team (Kanban): Recreate views for globaluser table - https://phabricator.wikimedia.org/T298505 (10nskaggs)
[17:20:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[17:30:16] 10Data-Engineering, 10Data-Services, 10cloud-services-team (Kanban): Recreate views for globaluser table - https://phabricator.wikimedia.org/T298505 (10razzi) Yeah I'll give it a go!
[18:01:26] 10Analytics-Clusters, 10Data-Engineering, 10Data-Engineering-Kanban, 10Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (10BTullis) We have paused the migration whilst we investigate occurrences of high JVM heap usage and prolonged...
[18:41:01] hello Analytics! It's that time of year again where I need to run the Hive query to get the topviews for the previous year. This time it seemed to error out:
[18:41:16] https://www.irccloud.com/pastebin/03ETQ97W/
[18:41:53] this happened when running the command at https://phabricator.wikimedia.org/P18390 . I did this after I authenticated with `kinit`
[18:42:00] any ideas?
[18:44:27] musikanimal: I think it's the same as we've seen here: https://phabricator.wikimedia.org/T297734
[18:44:51] I'm working on a solution, but it's not there yet. Sorry.
[18:46:24] okay, I will follow that task. Thanks!
[18:46:44] musikanimal: trying to run the query myself to see if it changes anything
[18:47:00] Then again, I'm not 100% sure. It seems to say GC overhead limit reached, which I've not seen before.
[18:47:28] musikanimal: The ranking is very costly, so I wouldn't be surprised that this makes the query fail and require some tuning
[18:47:32] The parquet log error might be a red herring.
[18:47:33] it's a very expensive query, that much I know, so there could be a performance-related thing too. But the output seemed to mostly complain about not being able to write to the file
[18:48:14] it ran fine in the past few years, but was always very slow
[18:49:02] I hope we can make this work 🤞 People really want their yearly topviews, hehe!
[18:49:48] I imagine it would be cheaper to run the query on a per-project basis, so if we have to we can make a script to loop through them all or something
[18:57:52] I got the same error after adding a `WHERE project = 'ee.wikipedia'` (very small project)
[18:58:41] musikanimal: then it's related to the hive issue btullis mentioned - I'm now trying to run the query with Spark
[19:01:06] ok thanks! if you're able to get me that same data in TSV format then that'd be amazing
[19:30:09] musikanimal: heya - I got your data :)
[19:30:20] \o/ you're amazing!
[19:30:31] musikanimal: stat1008:/home/joal/topviews2021-500.tsv
[19:30:59] musikanimal: I quickly checked the fr.wikipedia results, and I actually don't want to give them to you now :)
[19:31:22] haha not good?
[19:31:26] meh :)
[19:32:38] musikanimal: also, I'd love it if you could do a quick manual check of the data, to see if you spot anything obviously wrong (like orders of magnitude etc) - I'm not super familiar with it and wonder if I could have done something wrong
[19:33:12] will do!
[19:34:36] ok gone for tonight folks - see you tomorrow :)
[19:45:47] g'nite! don't let me keep you up, but there are a lot of questionable results, such as [[Bible]] being #1 for enwiki (which I think is definitely a false positive)
[19:47:06] but anyways thanks again for running the query for me! I'm going to remove the obvious false positives from the results and then deploy this today
[20:05:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[20:10:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[21:59:37] (03CR) 10Sharvaniharan: "@Jason @Ottomata Please let me know if there are any more changes needed on this?" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747967 (https://phabricator.wikimedia.org/T290920) (owner: 10Sharvaniharan)
[22:00:10] (03CR) 10Sharvaniharan: "@Jason @Ottomata Please let me know if there are any more changes needed on this?" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747226 (https://phabricator.wikimedia.org/T297818) (owner: 10Sharvaniharan)
[22:13:17] musikanimal: that's very interesting about the false positives in the list of top articles. Why would they be false positives, and is there a way to determine that through some property of the pageview further upstream and/or the query?
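On the topviews discussion above: the expensive part of the query is the per-project ranking, and the "run it on a per-project basis" idea amounts to computing a top-N per project. A plain-Python stand-in for that logic (made-up rows and page names; the real job runs as a Hive/Spark query over the pageview tables, not this code):

```python
from collections import defaultdict

def top_n_per_project(rows, n):
    """rows: iterable of (project, page, views) tuples.
    Returns {project: [(page, views), ...]} with each project's pages
    sorted by views descending and truncated to the top n."""
    by_project = defaultdict(list)
    for project, page, views in rows:
        by_project[project].append((page, views))
    return {
        proj: sorted(pages, key=lambda pv: pv[1], reverse=True)[:n]
        for proj, pages in by_project.items()
    }

# Hypothetical pageview rows, view counts invented for illustration.
rows = [
    ("en.wikipedia", "Main_Page", 900),
    ("en.wikipedia", "Bible", 500),
    ("en.wikipedia", "Cat", 300),
    ("ee.wikipedia", "Tallinn", 10),
]
print(top_n_per_project(rows, 2))
```

Partitioning the work this way (one small ranking per project, as in the `WHERE project = 'ee.wikipedia'` test above) keeps each sort small, which is the same effect a `RANK() OVER (PARTITION BY project ...)` window achieves inside one query.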