[00:20:48] RECOVERY - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:27:16] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[00:31:46] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:00:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[01:15:00] PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{refer
[01:15:00] ent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:15:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[01:17:08] RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:56:02] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:00:30] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:16:30] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:17:20] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:17:28] PROBLEM - aqs endpoints health on aqs1010 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:17:30] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:17:48] PROBLEM - aqs endpoints health on aqs1013 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:18:58] PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:33:38] RECOVERY - aqs endpoints health on aqs1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:34:36] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:34:48] RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:35:22] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:35:42] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:35:42] RECOVERY - aqs endpoints health on aqs1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[07:35:24] heya team, looking into cassandra and aqs alerts
[07:37:58] Good morning mforns - my guess is that it comes from the new druid snapshot
[07:38:15] And also: Happy new year team :)
[07:38:44] joal: ah, I was trying the hypothesis that yesterday's cassandra mediarequest job got stuck and it blocked resources for other jobs
[07:38:54] hehe, happy new year!!
[07:39:29] joal: the logs say: Not enough replicas available for query at consistency LOCAL_QUORUM (2 required but only 1 alive)
[07:39:38] mforns: could very well be! the fast recovery is what led me to the snapshot option, but I could very well be wrong :)
[07:40:03] I killed the stuck job like 20 mins ago
[07:40:07] Ah!!!
[07:40:10] makes sense
[07:40:38] But I tried to re-run the pageviews job and it continues to fail
[07:45:05] mwarf :(
[07:45:23] I'm gonna focus on that with you mforns
[07:47:44] joal: yesterday I was looking at the cassandra dashboard, and could not see anything weird..
[07:55:32] joal: hm, now the cassandra systems dashboard does not show basic metrics like load, cpu, network...
[08:02:52] by executing nodetool status and nodetool info on aqs1004 I see the -a instance has 6TB load while the -b instance has only 2.8TB
[08:03:59] mforns: this is probably expected due to snapshots - the new cassandra cluster hosts are: aqs101[0-6] IIRC
[08:08:16] oh I see
[08:09:21] joal: I have a physio appointment, will need to leave in 5 mins. and be back in +-35 mins.
[08:09:35] ack mforns
[08:57:42] joal: back
[09:04:42] joal: on aqs1010 the -b instance is not responding
[09:06:24] ah maybe there is no -b instance outside of aqs100[456]...
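(For reference: a minimal sketch of the checks described above. The nodetool-a / nodetool-b wrappers are the per-instance variants of nodetool used later in this log for the dual-instance cassandra hosts; the host name is taken from the conversation and the grep patterns are only illustrative.)

  # compare the two cassandra instances on a dual-instance host, e.g. aqs1004
  nodetool-a status   # ring view from the -a instance; "DN" means Down/Normal
  nodetool-b status   # should agree with -a when both instances are healthy
  nodetool-a info | grep -iE 'load|heap|percent repaired'   # per-instance load, heap and repair state
  nodetool-b info | grep -iE 'load|heap|percent repaired'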
[09:06:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[09:08:54] mforns: if you haven't seen, one of the an-workers went down last night
[09:09:31] RhinosF1: ah! didn't see, thanks!
[09:10:41] Np
[09:11:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[10:13:24] hello folks, I am around if needed
[10:13:30] I can check the two workers down
[10:20:18] heya elukey :]
[10:20:39] !log powercycle an-worker1120 (CPU soft lockup errors in mgmt console)
[10:20:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:22:36] !log powercycle an-worker1114 (CPU soft lockup errors in mgmt console)
[10:22:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:22:46] hola mforns :)
[10:22:53] heloooo
[10:23:03] both workers that were down are booting, kernel issues (sporadic, but it happens sometimes)
[10:26:22] ok
[10:27:16] mforns: for aqs1010, it is the only instance of the new cassandra cluster that we pooled
[10:27:25] we can depool it and debug it if you want
[10:30:03] elukey: how do you know aqs1010 is the one with problems?
[10:30:21] and thanks for restarting the an nodes :]
[10:31:58] mforns: on aqs1010, after reading aqs1010 from you, I tried `nodetool-b status` and it blocked for a while; `nodetool-a status` worked, indicating the -b instance as DN (down)
[10:32:36] ah ok, so they do have 2 instances in theory
[10:33:41] yep yep
[10:33:46] even in the new cluster
[10:34:30] !log depool aqs1010 (`sudo -i depool` on the node) to allow investigation of the cassandra -b instance
[10:34:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:34:44] mforns: aqs on aqs1010 depooled, feel free to work on it :)
[10:34:51] ok, then elukey I believe other nodes have similar problems, I tried 1012 and the -a instance is not responding there
[10:35:41] it could be something related to compactions etc.. in the new cluster (with cassandra 3 I mean)
[10:36:39] elukey: nodetool info says percent repaired is at 2.228174031670533E-7%
[10:37:24] and on some nodes heap memory is very close to the limit
[10:42:18] yeah I imagined something similar, maybe with cassandra 3 some heap settings need to be bumped to avoid thrashing while compacting/repairing/etc..
[10:43:01] hm - have we started a repair on the new nodes lately?
[10:46:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[10:46:18] (DruidSegmentsUnavailable) firing: More than 5 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[10:46:38] ok here we are, this is about the new druid snapshot
[10:54:12] joal: didn't know about any repair
[10:54:23] meh :(
[10:54:46] joal: should we ping one of the team's SREs?
[10:55:16] mforns: maybe not before we have something to ask them for?
[10:55:24] k
[10:56:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[10:56:18] (DruidSegmentsUnavailable) resolved: More than 5 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[10:58:27] currently the only node marked DN is aqs1010-b, which seems to be thrashing; maybe restarting cassandra-b there could improve things
[10:59:58] PROBLEM - aqs endpoints health on aqs1010 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[11:02:10] RECOVERY - aqs endpoints health on aqs1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[11:05:36] elukey: Have you taken action on restarting cassandra for aqs1010-b?
[11:06:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2021_12 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[11:06:18] (DruidSegmentsUnavailable) firing: More than 5 segments have been unavailable for mediawiki_history_reduced_2021_12 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[11:06:26] joal: nono but I can if you want
[11:09:39] ok, trying to restart
[11:10:22] I think I don't have the permissions
[11:10:58] can you do that elukey?
[11:11:02] mforns: what did you try?
[11:11:18] sudo systemctl restart cassandra-b
[11:12:10] elukey: ^
[11:12:29] mforns: ah interesting, the sudoers config mentions only `service` commands, I think you folks are not yet enabled to use systemctl (we'll fix it)
[11:12:43] mforns: try `service cassandra-b restart`
[11:12:47] (with sudo)
[11:13:52] ok
[11:14:19] elukey: same
[11:14:33] looking at logs: there are some errors on aqs1010-b, none on aqs1010-a - errors are all of type: Unexpected exception during request
[11:15:13] aha
[11:15:47] something else: all those error logs are from today - nothing before
[11:16:09] !log restart cassandra-b on aqs1010 (stuck thrashing)
[11:16:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:16:22] mforns: weird, trying to restart myself
[11:16:44] the command is taking a bit, I guess the old jvm doesn't go away easily
[11:17:15] thanks elukey
[11:19:42] done, it is currently bootstrapping
[11:19:50] ok
[11:20:12] all nodes up now (from nodetool's perspective)
[11:21:53] joal: cassandra-hourly-coord-local_group_default_T_pageviews_per_project_v2 is managing to succeed now, will re-run the failed hours
[11:22:04] please wait mforns :)
[11:22:10] oh, ok
[11:22:32] elukey: tailing logs of the restarted nodes I see a lot of: Unable to connect to aqs1015-b.eqiad.wmnet/10.64.48.69
[11:22:55] and only that host
[11:23:11] do we need to restart it?
[11:23:26] yeah, I now see 1015-b as DN
[11:23:33] WEIRD :(
[11:23:54] ok to restart? It is probably thrashing as well
[11:23:54] https://www.irccloud.com/pastebin/DH1ccOlH/
[11:24:37] please restart elukey
[11:26:15] !log restart cassandra-b on aqs1015 (instance not responding, probably thrashing)
[11:26:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:26:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2021_12 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[11:26:18] (DruidSegmentsUnavailable) resolved: More than 5 segments have been unavailable for mediawiki_history_reduced_2021_12 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[11:29:03] joal: all done
[11:29:08] (and instance bootstrapped)
[11:30:31] ack elukey - errors have stopped on my end (aqs1010-b)
[11:30:35] the cassandra instances have 16G of heap available, maybe they need a bit more on cassandra 3
[11:30:39] Will continue to monitor
[11:31:29] ack :)
[11:40:09] ah mforns, you are not in the 'aqs-admins' group, this is why you can't restart
[11:40:41] aha, thanks for looking, can you add me?
[11:53:47] mforns: I created https://gerrit.wikimedia.org/r/c/operations/puppet/+/751104
[11:53:48] mforns: logs look ok, pending compactions as well - I'm happy if you wish to restart failed hours :)
[11:53:52] but it will require some time probably
[11:54:00] joal: ok
[11:54:55] although, joal, I was monitoring page faults in grafana, and they have not gone down since the restarts...
[11:55:04] Arf :S
[11:56:15] mforns: page faults?
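(For reference: the depool-and-restart sequence applied above, condensed into one place. All commands are quoted from the log except the repool step, which is not shown there and is assumed to be the matching `pool` command; run on the affected host, e.g. aqs1010.)

  sudo -i depool                    # take the host out of the LVS pool before touching cassandra
  sudo service cassandra-b restart  # sudoers on these hosts allows `service`, not arbitrary systemctl
  nodetool-a status                 # confirm the -b instance is no longer reported as DN
  nodetool-b info                   # check that the restarted instance reports sane load/heap figures
  sudo -i pool                      # assumption: repool once the instance has settled (not shown in the log)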
[11:56:25] mforns: I'm not precise enough to understand page-faults correctly, but the graph over time (looking at 7 days) seems regular - maybe it's ok?
[11:56:27] elukey: the service uses [a-z] but for systemctl it's just [ab], should it not be consistent?
[11:57:00] elukey: yes, they started 1-2 days ago
[11:57:03] RhinosF1: we can probably get rid of the service bit entirely
[11:57:15] elukey: one way of doing it
[11:57:20] actually I looked at the wrong graph mforns (restbase instead of AQS) - my bad :S
[11:57:22] mforns: yeah, what I meant is: what graphs are you looking at?
[11:57:49] elukey: major page faults peak around end of year
[11:58:27] now that I have the correct graph mforns: high values are for the old cluster - I think they are read-related
[11:58:31] sorry, around jan 2nd 00:00
[11:59:17] joal: I selected the aqs10[10,11,12,13,14,15] nodes and still see page faults, am I looking in the wrong place?
[11:59:18] mforns: sure, but page faults can happen anytime, it seems a very low-level metric to look at, this is why I was asking
[11:59:30] why are we looking at those metrics?
[11:59:50] or is it another page-fault (cassandra-related?)
[11:59:53] mforns: add other servers, you'll see the numbers are a lot higher
[12:00:51] mforns: this is why I asked for the graph's link :) but I think it is only a consequence of something else going on
[12:01:18] the two JVMs that we restarted were clearly thrashing, did we check their GC and heap metrics? I suspect that the heap was full
[12:01:23] I see, it seems there are other time intervals where the page faults were high as well
[12:01:39] ok, will slowly restart jobs
[12:02:04] *rerun
[13:56:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[14:21:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[16:07:29] mforns: pageview and mediarequest jobs are still failing
[16:07:37] mforns: I think we should run them one by one
[16:08:08] !log Kill cassandra3-local_group_default_T_mediarequest_per_file-daily-2022-1-1
[16:08:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:29:00] joal: I ran them one by one, not sure what happened, but I cannot look now :/
[18:25:30] heya mforns - I just triple checked: the pageview-per-article job succeeded - will restart another one
[18:26:19] !log rerun cassandra-daily-wf-local_group_default_T_mediarequest_per_file-2022-1-1
[18:26:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:56:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[22:26:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
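(A footnote on the heap discussion above - 16G per instance, possibly tight for cassandra 3. A rough sketch of how heap/GC pressure can be checked per instance and where the heap size usually lives; the exact location depends on how the hosts are puppetized, so the file names and the 20G figure below are illustrative, not the real settings.)

  nodetool-b info | grep -i heap   # heap used vs. max for the suspect instance
  nodetool-b gcstats               # GC pause statistics since the previous invocation
  # Upstream cassandra 3.x normally reads the heap size from jvm.options (or MAX_HEAP_SIZE in
  # cassandra-env.sh); bumping the 16G mentioned above would look roughly like:
  #   -Xms20G
  #   -Xmx20G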