[00:20:48] RECOVERY - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:27:16] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[00:31:46] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:00:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[01:15:00] PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{refer
[01:15:00] ent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:15:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[01:17:08] RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:56:02] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:00:30] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:16:30] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:17:20] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:17:28] PROBLEM - aqs endpoints health on aqs1010 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:17:30] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:17:48] PROBLEM - aqs endpoints health on aqs1013 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:18:58] PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:33:38] RECOVERY - aqs endpoints health on aqs1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:34:36] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:34:48] RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:35:22] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:35:42] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:35:42] RECOVERY - aqs endpoints health on aqs1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[07:35:24] heya team, looking into cassandra and aqs alerts
[07:37:58] Good morning mforns - my guess is that it comes from the new druid snapshot
[07:38:15] And also: Happy new year team :)
[07:38:44] joal: ah, I was trying the hypothesis that yesterday's cassandra mediarequest job got stuck and it blocked resources for other jobs
[07:38:54] hehe, happy new year!!
[07:39:29] joal: the logs say: Not enough replicas available for query at consistency LOCAL_QUORUM (2 required but only 1 alive)
[07:39:38] mforns: could very well be! the fast recovery is what led me to the snapshot option, but I could very well be wrong :)
[07:40:03] I killed the stuck job like 20 mins ago
[07:40:07] Ah!!!
[07:40:10] makes sense
[07:40:38] But I tried to re-run the pageviews job and it continues to fail
[07:45:05] mwarf :(
[07:45:23] I'm gonna focus on that with you mforns
[07:47:44] joal: yesterday I was looking at the cassandra dashboard, and could not see anything weird..
[07:55:32] joal: hm, now the cassandra systems dashboard does not show basic metrics like load, cpu, network...
[08:02:52] by executing nodetool status and nodetool info on aqs1004 I see the -a instance has 6TB load while the -b instance has only 2.8TB
[08:03:59] mforns: this is probably expected due to snapshots - the new cassandra cluster hosts are: aqs101[0-6] IIRC
[08:08:16] oh I see
[08:09:21] joal: I have a physio appointment, will need to leave in 5 mins. and be back in +-35 mins.
[08:09:35] ack mforns
[08:57:42] joal: back
[09:04:42] joal: on aqs1010 the -b instance is not responding
[09:06:24] ah maybe there is no -b instance outside of aqs100[456]...
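(For reference: a minimal sketch of the checks described above. The nodetool-a / nodetool-b wrappers are the per-instance variants of nodetool used later in this log for the dual-instance cassandra hosts; the host name is taken from the conversation and the grep patterns are only illustrative.)

  # compare the two cassandra instances on a dual-instance host, e.g. aqs1004
  nodetool-a status   # ring view from the -a instance; "DN" means Down/Normal
  nodetool-b status   # should agree with -a when both instances are healthy
  nodetool-a info | grep -iE 'load|heap|percent repaired'   # per-instance load, heap and repair state
  nodetool-b info | grep -iE 'load|heap|percent repaired'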
[09:06:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[09:08:54] mforns: if you haven't seen, one of the an-workers went down last night
[09:09:31] RhinosF1: ah! didn't see, thanks!
[09:10:41] Np
[09:11:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[10:13:24] hello folks, I am around if needed
[10:13:30] I can check the two workers down
[10:20:18] heya elukey :]
[10:20:39] !log powercycle an-worker1120 (CPU soft lockup errors in mgmt console)
[10:20:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:22:36] !log powercycle an-worker1114 (CPU soft lockup errors in mgmt console)
[10:22:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:22:46] hola mforns :)
[10:22:53] heloooo
[10:23:03] both workers that were down are booting, kernel issues (sporadic, but it happens sometimes)
[10:26:22] ok
[10:27:16] mforns: for aqs1010, it is the only instance of the new cassandra cluster that we pooled
[10:27:25] we can depool it and debug it if you want
[10:30:03] elukey: how do you know aqs1010 is the one with problems?
[10:30:21] and thanks for restarting the an nodes :]
[10:31:58] mforns: on aqs1010, after reading aqs1010 from you, I tried `nodetool-b status` and it blocked for a while; `nodetool-a status` worked, indicating the -b instance as DN (down)
[10:32:36] ah ok, so they do have 2 instances in theory
[10:33:41] yep yep
[10:33:46] even in the new cluster
[10:34:30] !log depool aqs1010 (`sudo -i depool` on the node) to allow investigation of the cassandra -b instance
[10:34:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:34:44] mforns: aqs on aqs1010 depooled, feel free to work on it :)
[10:34:51] ok, then elukey I believe other nodes have similar problems, I tried 1012 and the -a instance is not responding there
[10:35:41] it could be something related to compactions etc.. in the new cluster (with cassandra 3 I mean)
[10:36:39] elukey: nodetool info says percent repaired is at 2.228174031670533E-7%
[10:37:24] and on some nodes heap memory is very close to the limit
[10:42:18] yeah I imagined something similar, maybe with cassandra 3 some heap settings need to be bumped to avoid thrashing while compacting/repairing/etc..
[10:43:01] hm - have we started a repair on the new nodes lately?
[10:46:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[10:46:18] (DruidSegmentsUnavailable) firing: More than 5 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[10:46:38] ok here we are, this is about the new druid snapshot
[10:54:12] joal: didn't know about any repair
[10:54:23] meh :(
[10:54:46] joal: should we ping one of the team's SREs?
[10:55:16] mforns: maybe not before we have something to ask them for?
[10:55:24] k
[10:56:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[10:56:18] (DruidSegmentsUnavailable) resolved: More than 5 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[10:58:27] currently the only node marked DN is aqs1010-b, which seems to be thrashing; maybe restarting cassandra-b there could improve things
[10:59:58] PROBLEM - aqs endpoints health on aqs1010 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[11:02:10] RECOVERY - aqs endpoints health on aqs1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[11:05:36] elukey: Have you taken action on restarting cassandra for aqs1010-b?
[11:06:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2021_12 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[11:06:18] (DruidSegmentsUnavailable) firing: More than 5 segments have been unavailable for mediawiki_history_reduced_2021_12 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[11:06:26] joal: nono but I can if you want
[11:09:39] ok, trying to restart
[11:10:22] I think I don't have the permissions
[11:10:58] can you do that elukey?
[11:11:02] mforns: what did you try?
[11:11:18] sudo systemctl restart cassandra-b
[11:12:10] elukey: ^
[11:12:29] mforns: ah interesting, the sudoers config mentions only `service` commands, I think you folks are not yet enabled to use systemctl (we'll fix it)
[11:12:43] mforns: try `service cassandra-b restart`
[11:12:47] (with sudo)
[11:13:52] ok
[11:14:19] elukey: same
[11:14:33] looking at logs: there are some errors on aqs1010-b, none on aqs1010-a - errors are all of type: Unexpected exception during request
[11:15:13] aha
[11:15:47] something else: all those error logs are from today - nothing before
[11:16:09] !log restart cassandra-b on aqs1010 (stuck thrashing)
[11:16:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:16:22] mforns: weird, trying to restart myself
[11:16:44] the command is taking a bit, I guess the old jvm doesn't go away easily
[11:17:15] thanks elukey
[11:19:42] done, it is currently bootstrapping
[11:19:50] ok
[11:20:12] all nodes up now (from nodetool's perspective)
[11:21:53] joal: cassandra-hourly-coord-local_group_default_T_pageviews_per_project_v2 is managing to succeed now, will re-run the failed hours
[11:22:04] please wait mforns :)
[11:22:10] oh, ok
[11:22:32] elukey: tailing logs of the restarted nodes I see a lot of: Unable to connect to aqs1015-b.eqiad.wmnet/10.64.48.69
[11:22:55] and only that host
[11:23:11] do we need to restart it?
[11:23:26] yeah, I now see 1015-b as DN
[11:23:33] WEIRD :(
[11:23:54] ok to restart? It is probably thrashing as well
[11:23:54] https://www.irccloud.com/pastebin/DH1ccOlH/
[11:24:37] please restart elukey
[11:26:15] !log restart cassandra-b on aqs1015 (instance not responding, probably thrashing)
[11:26:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:26:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2021_12 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[11:26:18] (DruidSegmentsUnavailable) resolved: More than 5 segments have been unavailable for mediawiki_history_reduced_2021_12 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[11:29:03] joal: all done
[11:29:08] (and instance bootstrapped)
[11:30:31] ack elukey - errors have stopped on my end (aqs1010-b)
[11:30:35] the cassandra instances have 16G of heap available, maybe they need a bit more on cassandra 3
[11:30:39] Will continue to monitor
[11:31:29] ack :)
[11:40:09] ah mforns, you are not in the 'aqs-admins' group, this is why you can't restart
[11:40:41] aha, thanks for looking, can you add me?
[11:53:47] mforns: I created https://gerrit.wikimedia.org/r/c/operations/puppet/+/751104
[11:53:48] mforns: logs look ok, pending compactions as well - I'm happy if you wish to restart failed hours :)
[11:53:52] but it will require some time probably
[11:54:00] joal: ok
[11:54:55] although, joal, I was monitoring page faults in grafana, and they have not gone down since the restarts...
[11:55:04] Arf :S
[11:56:15] mforns: page faults?
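(For reference: the depool-and-restart sequence applied above, condensed into one place. All commands are quoted from the log except the repool step, which is not shown there and is assumed to be the matching `pool` command; run on the affected host, e.g. aqs1010.)

  sudo -i depool                    # take the host out of the LVS pool before touching cassandra
  sudo service cassandra-b restart  # sudoers on these hosts allows `service`, not arbitrary systemctl
  nodetool-a status                 # confirm the -b instance is no longer reported as DN
  nodetool-b info                   # check that the restarted instance reports sane load/heap figures
  sudo -i pool                      # assumption: repool once the instance has settled (not shown in the log)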
[11:56:25] mforns: I'm not precise enough to understand page-faults correctly, but the graph over time (looking at 7 days) seems regular - maybe it's ok?
[11:56:27] elukey: the service uses [a-z] but for systemctl it's just [ab], should it not be consistent?
[11:57:00] elukey: yes, they started 1-2 days ago
[11:57:03] RhinosF1: we can probably get rid of the service bit entirely
[11:57:15] elukey: one way of doing it
[11:57:20] actually I looked at the wrong graph mforns (restbase instead of AQS) - my bad :S
[11:57:22] mforns: yeah, what I meant is: what graphs are you looking at?
[11:57:49] elukey: major page faults peak around end of year
[11:58:27] now that I have the correct graph mforns: high values are for the old cluster - I think they are read-related
[11:58:31] sorry, around jan 2nd 00:00
[11:59:17] joal: I selected the aqs10[10,11,12,13,14,15] nodes and still see page faults, am I looking in the wrong place?
[11:59:18] mforns: sure, but page faults can happen anytime, it seems a very low-level metric to look at, this is why I was asking
[11:59:30] why are we looking at those metrics?
[11:59:50] or is it another page-fault (cassandra-related?)
[11:59:53] mforns: add other servers, you'll see the numbers are a lot higher
[12:00:51] mforns: this is why I asked for the graph's link :) but I think it is only a consequence of something else going on
[12:01:18] the two JVMs that we restarted were clearly thrashing, did we check their GC and heap metrics? I suspect that the heap was full
[12:01:23] I see, it seems there are other time intervals where the page faults were high as well
[12:01:39] ok, will slowly restart jobs
[12:02:04] *rerun
[13:56:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[14:21:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[16:07:29] mforns: pageview and mediarequest jobs are still failing
[16:07:37] mforns: I think we should run them one by one
[16:08:08] !log Kill cassandra3-local_group_default_T_mediarequest_per_file-daily-2022-1-1
[16:08:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:29:00] joal: I ran them one by one, not sure what happened, but I cannot look now :/
[18:25:30] heya mforns - I just triple checked: the pageview-per-article job succeeded - will restart another one
[18:26:19] !log rerun cassandra-daily-wf-local_group_default_T_mediarequest_per_file-2022-1-1
[18:26:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:56:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[22:26:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
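(A footnote on the heap discussion above - 16G per instance, possibly tight for cassandra 3. A rough sketch of how heap/GC pressure can be checked per instance and where the heap size usually lives; the exact location depends on how the hosts are puppetized, so the file names and the 20G figure below are illustrative, not the real settings.)

  nodetool-b info | grep -i heap   # heap used vs. max for the suspect instance
  nodetool-b gcstats               # GC pause statistics since the previous invocation
  # Upstream cassandra 3.x normally reads the heap size from jvm.options (or MAX_HEAP_SIZE in
  # cassandra-env.sh); bumping the 16G mentioned above would look roughly like:
  #   -Xms20G
  #   -Xmx20G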