[00:03:50] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [00:03:50] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [00:59:50] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [01:10:07] (03CR) 10Ottomata: [C: 03+2] "Thank you!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/789183 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [01:10:31] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix typo [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/789183 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [01:12:27] (03CR) 10Ottomata: [V: 03+2 C: 03+2] "TY!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/789182 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [01:57:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [02:00:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [02:02:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. 
- https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [03:20:09] (03PS1) 10DLynch: DesktopUIActions/MobileUIActions: add pageToken field [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789324 (https://phabricator.wikimedia.org/T307640) [03:20:47] (03CR) 10jerkins-bot: [V: 04-1] DesktopUIActions/MobileUIActions: add pageToken field [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789324 (https://phabricator.wikimedia.org/T307640) (owner: 10DLynch) [03:30:13] (03PS2) 10DLynch: DesktopUIActions/MobileUIActions: add pageToken field [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789324 (https://phabricator.wikimedia.org/T307640) [03:33:37] (03CR) 10DLynch: "@Jdlrobson: adding you as a reviewer just so you can double-check this doesn't interfere with your usage of these schemas. (I don't think " [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789324 (https://phabricator.wikimedia.org/T307640) (owner: 10DLynch) [04:05:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [04:11:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [04:16:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [04:18:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [04:28:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [04:33:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [04:38:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [05:00:05] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [07:04:13] (03CR) 10Joal: "I added a bunch of comments as well. Let's synchronize with Dan on whether we wish to add a constant-timestamp parameter to the spark job." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/780749 (https://phabricator.wikimedia.org/T306177) (owner: 10Snwachukwu) [08:31:35] joal: Was this what you had in mind for gobblin? https://gerrit.wikimedia.org/r/c/operations/puppet/+/789560 [08:34:30] I'm also planning to stop oozie on an-coord1001 shortly. Any objections? [08:35:41] Hi btullis - sorry I'm late to answer [08:36:11] No worries :-) [08:36:27] btullis: I assume that your patch will absent all gobblin jobs [08:36:45] which is exactly what we want [08:36:52] Yes. By modifying the defined type temporarily: https://puppet-compiler.wmflabs.org/pcc-worker1001/35096/an-launcher1002.eqiad.wmnet/index.html [08:37:28] great [08:38:26] OK. I'll +2 it now and merge. Happy with my stopping oozie too? [08:39:26] btullis: I'd suggest waiting for gobblin to have stopped for some before stopping oozie [08:39:39] ack [08:43:24] OK, that's deployed. Checking for any remaining gobblin processes now... [08:51:00] The change inadvertently deleted old gobblin logs. I can get them back with `puppet filebucket` [08:51:29] hm, I don't think that's very important given nothing had failed [08:52:42] No, I agree. They're available for a while if we need them though. Right, I can't find any gobblin processes running on an-launcher1002. Shall I stop oozie? [08:53:01] btullis: I see no prod job running now from oozie on the cluster - I think we're safe to go [08:53:10] Thank you btullis for having waited ) [08:53:58] !log stopping oozie on an-coord1001 [08:54:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:00:05] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [09:00:06] About to restart an-coord1001 then. [09:00:31] !log restarting an-coord1001 [09:00:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:05:04] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:06:29] OK, host is back up and running. All services green in icinga. 
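For reference, the pre-checks described above (confirming Gobblin has drained before Oozie is stopped) amount to something like the following sketch; the unit and process names are assumptions about the puppetized setup rather than commands taken from this log:

```
# On an-launcher1002: confirm no Gobblin ingestion runs are still in flight
# (the jobs were absented in puppet, so only already-running pulls would remain).
pgrep -af gobblin || echo "no gobblin processes running"
systemctl list-timers --no-pager | grep -i gobblin || echo "no gobblin timers scheduled"

# On an-coord1001: stop the Oozie server once the cluster shows no prod Oozie jobs.
sudo systemctl stop oozie.service
systemctl status oozie.service --no-pager
```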
[09:10:16] PROBLEM - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:10:38] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:12:50] https://usercontent.irccloud-cdn.com/file/mxuIAOhe/image.png [09:12:56] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:13:21] I should just be able to restart these, right joal? No additional cleanup required beforehand? [09:13:42] yup btullis - should be all good [09:15:14] !log restarting failed eventlogging_to_druid_ services on an-launcher1002 [09:15:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:16:02] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:16:22] !log re-enabling gobblin jobs now [09:16:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:23:00] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:23:44] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:25:58] RECOVERY - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:26:16] OK, those are all coming back. Anything else we need to check? [09:27:51] btullis: I'm monitoring as well - the webrequest gobblin job has started, I'll check it finishes ok, then it's about potential emails [09:28:41] Actually btullis - have you restarted the hive-server? [09:32:43] (03PS1) 10Steven Sun: IPInfo: Add event source special_watchlist and update description [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) [09:33:10] (03CR) 10jerkins-bot: [V: 04-1] IPInfo: Add event source special_watchlist and update description [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) (owner: 10Steven Sun) [09:39:55] joal: I haven't restarted it, but it restarted normally as part of the reboot. Is there an issue?
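Recovering the failed timers above is usually just a matter of restarting the corresponding systemd units on an-launcher1002; a minimal sketch, assuming the unit names from the alerts map directly to service units:

```
# List anything that failed while the coordinator was down.
sudo systemctl --failed --no-legend | grep eventlogging_to_druid_ || echo "nothing failed"

# Restart the loaders named in the alerts; their timers then resume the normal schedule.
for unit in eventlogging_to_druid_navigationtiming_hourly \
            eventlogging_to_druid_network_flows_internal_hourly \
            eventlogging_to_druid_netflow_hourly \
            eventlogging_to_druid_prefupdate_hourly; do
    sudo systemctl reset-failed "${unit}.service" || true
    sudo systemctl restart "${unit}.service"
done
```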
[09:40:22] nono, I wanted to check about the memory-leak issue [09:40:47] And indeed it shows a reboot :) [09:40:56] Oh right, yes I think that will be OK for about the next 6 weeks or so :-) [09:40:58] At the time I asked the graph was not yet updated my bad [09:41:07] Great :) [09:41:50] I have the heap change for the namenodes ready to go, so when we're happy with hive I will run the `sre.hadoop.roll-restart-masters` cookbook to increase that value. [09:46:13] Ack btullis [09:52:19] It's looking OK, so I'll go ahead with that rolling restart of the hadoop masters. [09:53:38] !log roll-restarting hadoop masters to pick up new heap size [09:53:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:00:47] (03PS2) 10Steven Sun: IPInfo: Add event source special_watchlist and update description [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) [10:01:49] (03CR) 10jerkins-bot: [V: 04-1] IPInfo: Add event source special_watchlist and update description [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) (owner: 10Steven Sun) [10:02:24] btullis: I see the max bump for an-coord1001 in grafana - I also see an-coord1001 doing huge GC after the restart - we should wait a bit ) [10:03:53] OK, it's half way through running the cookbook, but it's at the stage of waiting 10 minutes before offering to fail back to an-coord1001 [10:04:40] (03PS3) 10Steven Sun: IPInfo: Add event source special_watchlist and update description [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) [10:20:04] joal: The GC has subsided back to normal. Are you happy for me to fail back to an-master1001? [10:25:53] I think it's OK, so I'm proceeding to fail back to an -master1001 now. [10:26:42] (03PS1) 10AGueyte: Add special_watchlist ipinfo_interaction schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789576 (https://phabricator.wikimedia.org/T307594) [10:30:17] (03PS2) 10AGueyte: Add special_watchlist ipinfo_interaction schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789576 (https://phabricator.wikimedia.org/T307594) [10:32:07] (03PS4) 10AGueyte: Add event source special_watchlist to ipinfo_interaction schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) (owner: 10Steven Sun) [10:32:56] 10Data-Engineering, 10Data-Engineering-Kanban, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): View 'centralauth_p.localuser' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T304733 (10OwenBlacker) Looks like stalktoy is working now, as a result — someone sh... [10:55:40] Hmm. The namenode on an-master1001 keeps stopping when I try to fail it back. Looking into the cause now. Maybe it is too much pressure on the RAM after all. [10:59:34] It doesn't seem to be RAM related. 
Error messages about quorum and epochs in `/var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log` [10:59:40] https://www.irccloud.com/pastebin/QRT3gu24/ [11:01:48] Similar error messages to those seen in this ticket: https://phabricator.wikimedia.org/T283733 [11:24:50] I am trying again with a larger value for `dfs_namenode_handler_count` (144 instead of 127) [11:29:14] hm [11:30:08] btullis: the GCtime and GCCount metrics for 1001 are not great [11:30:54] It's as if the namenode wouldn't manage to process the backlog of state [11:31:28] No, it has been restarted several times. There is a big period of GC after each start. Then it runs successfully until the failover occurs. Failover triggers a quit. [11:32:52] (I meant 'no it's not great' - as opposed to 'no that's not it') :-) [11:33:16] Ack! [11:35:12] It's running again now with a manually tweaked handler count. I'll wait until GC has settled down as much as possible, then I'll try another failover. [11:39:43] Failing over again now [11:43:13] Nope. Same result. [11:43:20] there are errors in the log [11:45:01] Do you want to jump in the batcave to discuss? I've started the namenode again on an-master1001. [11:45:47] sure [12:26:29] !log Regular analytics weekly train [analytics/refinery@cc4b2bd] [12:26:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:27:55] Hi aqu [12:28:00] Have you already started to deploy? [12:28:09] Nop [12:28:13] \o/ [12:28:25] I had not added the changes to the purge job I did [12:28:35] sorry for being late :S [12:28:57] let me check if the code has been merged [12:29:03] ok. This is my fault, deploy was supposed to be on Tuesday. [12:29:23] nono, I hadn't added my patches to the etherpad - my bad [12:30:06] ok - merging now and adding to the etherpad [12:30:27] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for dpeloy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/786448 (https://phabricator.wikimedia.org/T303988) (owner: 10Joal) [12:31:15] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/786452 (owner: 10Joal) [12:31:55] ok here we are aqu :) [12:31:59] Thank you! [12:35:44] You're welcome. [12:41:33] (03CR) 10Klein Muçi: "recheck" [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/788829 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [12:42:12] aqu: The other thing I mentioned yesterday to be done while in opsweek at the beginning of the month is: https://wikitech.wikimedia.org/wiki/Data_Engineering/Ops_week#Is_there_a_new_Mediawiki_History_snapshot_ready?_(beginning_of_the_month) [12:42:28] aqu: let's talk after your deploy :) [12:45:19] 10Analytics, 10EventStreams: EventStreams doesn't show the Wikistories-* streams - https://phabricator.wikimedia.org/T307679 (10SBisson) [12:50:25] (03CR) 10Klein Muçi: "recheck" [analytics/pageview-api] - 10https://gerrit.wikimedia.org/r/789181 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [12:57:05] ok btullis - I confirm the problem comes from zkfc not managing to make an-master1001 master, due to a timeout exception [12:57:30] The timeout is set for 60000 millis - this feels like a lot for the node not to respond :S [12:57:33] let's sync on that later [12:58:09] Great. I'm back now. I'll try increasing the number of parallel GC threads on an-master1001 and then attempt the failover again.
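For context, the failback attempts described here use the standard HDFS HA admin tooling; a rough sketch follows, where the NameNode service IDs are assumptions (the real values come from dfs.ha.namenodes.* in hdfs-site.xml) and the commands are run as the hdfs user (on a kerberized cluster a keytab/kinit is also needed):

```
# Check which NameNode is currently active / standby.
sudo -u hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
sudo -u hdfs hdfs haadmin -getServiceState an-master1002-eqiad-wmnet

# Graceful failback: ask zkfc to make an-master1001 active again.
sudo -u hdfs hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet

# Watch both sides for the quorum/epoch and timeout errors mentioned above.
tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log \
        /var/log/hadoop-hdfs/hadoop-hdfs-zkfc-an-master1002.log
```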
[12:58:45] Ack - I wonder if growing the timeout on the zkfc side wouldn't be a good approach too [12:59:20] Yes, but I'm nervous to touch anything on an-master1002 until we're back with an-master1001 as the active node. [12:59:36] yeah [12:59:58] Looking at low-level metrics, I wonder if the problem wouldn't be more on the network side [13:00:05] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [13:00:06] (03PS1) 10Luke Bowmaker: Image Suggestions Feedback Sanitized Hive Tables [analytics/refinery] - 10https://gerrit.wikimedia.org/r/789606 [13:01:05] mh, probably not - let's try your idea btullis [13:01:56] (03CR) 10Luke Bowmaker: "Hi Andrew," [analytics/refinery] - 10https://gerrit.wikimedia.org/r/789606 (owner: 10Luke Bowmaker) [13:03:28] I have set dfs_namenode_handler_count back to 127 and increased ParallelGCThreads to 25 [13:03:50] ack - let's see if it helps [13:03:56] monitoring logs [13:04:30] hadoop-hdfs-namenode restarted - waiting for GC to settle before attempting failback. [13:16:04] ottomata: sorry I didn't make it to the meeting, was focusing on documentation, please ping me if you need me there today. [13:16:33] interesting btullis - it feels that GC went "a bit" faster :) [13:16:50] maybe it's just me expecting it to be ) [13:17:22] joal: Yes, I agree. Almost ready to do a failback again. This time I'll tail `/var/log/hadoop-hdfs/hadoop-hdfs-zkfc-an-master1002.log` at the same time. [13:17:37] ack - doing so too :) [13:18:26] Pressing the button now [13:19:47] Nope. [13:21:13] hm [13:21:33] `Caused by: java.net.SocketTimeoutException: Call From an-master1001/10.64.5.26 to an-master1001.eqiad.wmnet:8040 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:37895 remote=an-master1001.eqiad.wmnet/10.64.5.26:8040]; For more details see: [13:21:33] http://wiki.apache.org/hadoop/SocketTimeout` [13:22:48] This time the namenode is still running on an-master1001 - it didn't quit. [13:26:07] This is weird [13:26:14] I could try running the failover command from an-master1002 - it probably won't help but might be worth a try? [13:26:52] it must be that the namenode cannot be contacted by zkfc but hasn't failed in terms of epochs... [13:26:56] weird [13:27:07] we can try again but I don't see why it would change [13:27:43] it worked! [13:27:48] zkfc is happy [13:27:51] It worked! Why? [13:27:57] I have no clue :SS [13:28:01] Almost immediately, too. [13:28:05] yeah [13:30:38] well ... I guess we can call it success, but without explanation :) [13:31:35] (03Abandoned) 10AGueyte: Add special_watchlist ipinfo_interaction schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789576 (https://phabricator.wikimedia.org/T307594) (owner: 10AGueyte) [13:31:35] Yeah, needs more testing and successful cookbook executions before I'll be happy. I guess we can restart the service on an-master1002 now.
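The tuning mentioned above (ParallelGCThreads, handler count, heap) is rendered by puppet, but in plain hadoop-env.sh terms it boils down to something like this sketch; the heap value below is a placeholder rather than the real production figure, and the 60000 ms timeout seen in the zkfc log most likely corresponds to one of the ha.failover-controller rpc-timeout settings in core-site.xml:

```
# Sketch of NameNode JVM options (normally templated by puppet into hadoop-env.sh).
export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} \
  -Xms64g -Xmx64g \
  -XX:ParallelGCThreads=25"   # 64g is a placeholder heap; GC threads as discussed above
```

dfs.namenode.handler.count (set back to 127 above) lives in hdfs-site.xml rather than in the JVM options.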
[13:31:45] works for me [13:32:07] hm, still quite some GCs on 1001 [13:34:32] (03CR) 10AGueyte: [C: 03+2] Add event source special_watchlist to ipinfo_interaction schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) (owner: 10Steven Sun) [13:35:05] (03Merged) 10jenkins-bot: Add event source special_watchlist to ipinfo_interaction schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) (owner: 10Steven Sun) [13:35:25] (03CR) 10Tchanders: [C: 03+1] "Thanks, looks good! Tested by opening a popup on Special:Watchlist and observing the correct event get sent to the dev server." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) (owner: 10Steven Sun) [13:36:48] (03CR) 10Tchanders: [C: 03+1] Add event source special_watchlist to ipinfo_interaction schema (032 comments) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) (owner: 10Steven Sun) [13:38:34] Restarted namenode service on an-master1002 [13:51:09] RECOVERY - Check unit status of mediawiki-history-drop-snapshot on an-launcher1002 is OK: OK: Status of the systemd unit mediawiki-history-drop-snapshot https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:52:57] (03CR) 10Klein Muçi: "recheck" [analytics/datahub] - 10https://gerrit.wikimedia.org/r/789170 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [13:53:46] (03CR) 10jerkins-bot: [V: 04-1] Fix typo [analytics/datahub] - 10https://gerrit.wikimedia.org/r/789170 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [13:54:36] (03CR) 10Klein Muçi: "recheck" [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/788840 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:00:38] 10Data-Engineering, 10Data-Engineering-Kanban: Increase Java heap for HDFS namenodes - https://phabricator.wikimedia.org/T307549 (10BTullis) This has been deployed, but we experienced some issues when attempting to fail back to an-master1001 after it had restarted. The following command failed multiple times... [14:04:20] aqu - I assume you've restarted the mediawiki-history-drop-snapshot, right? 
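A quick way to verify that (a sketch; assumes the job is a systemd timer/service pair on an-launcher1002 named as in the earlier RECOVERY notification):

```
# Did the drop-snapshot job run, and what did it delete?
systemctl status mediawiki-history-drop-snapshot.service --no-pager
systemctl list-timers 'mediawiki-history-drop-snapshot*' --no-pager
sudo journalctl -u mediawiki-history-drop-snapshot.service --since today | tail -n 50
```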
[14:05:36] I just checked the logs, and indeed, it has been restarted and removed successfully what was to delete :) [14:05:46] Gone for kids, back at standup time [14:08:21] (03CR) 10Klein Muçi: "recheck" [analytics/datahub] - 10https://gerrit.wikimedia.org/r/788832 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:09:40] (03CR) 10jerkins-bot: [V: 04-1] Fix typo [analytics/datahub] - 10https://gerrit.wikimedia.org/r/788832 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:09:49] (03CR) 10Klein Muçi: "recheck" [analytics/pageview-api] - 10https://gerrit.wikimedia.org/r/788741 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:10:28] (03CR) 10Klein Muçi: "recheck" [analytics/wikimetrics] - 10https://gerrit.wikimedia.org/r/788738 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:11:01] (03CR) 10jerkins-bot: [V: 04-1] Fix typo [analytics/wikimetrics] - 10https://gerrit.wikimedia.org/r/788738 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:16:27] (03CR) 10Klein Muçi: "recheck" [analytics/aggregator] - 10https://gerrit.wikimedia.org/r/787726 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:16:56] (03CR) 10jerkins-bot: [V: 04-1] Fix typo [analytics/aggregator] - 10https://gerrit.wikimedia.org/r/787726 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:17:54] (03CR) 10Klein Muçi: "recheck" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/787849 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:18:50] (03CR) 10Klein Muçi: "recheck" [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/787846 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:19:16] (03CR) 10Klein Muçi: "recheck" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/788612 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:19:57] joal: yes it's restarted [14:20:14] (03CR) 10Klein Muçi: "recheck" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/788736 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:21:14] (03CR) 10Klein Muçi: "recheck" [analytics/wikimetrics] - 10https://gerrit.wikimedia.org/r/788620 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:21:48] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Image Suggestions Feedback Sanitized Hive Tables [analytics/refinery] - 10https://gerrit.wikimedia.org/r/789606 (owner: 10Luke Bowmaker) [14:21:54] (03CR) 10Klein Muçi: "recheck" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/788853 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:22:18] (03CR) 10jerkins-bot: [V: 04-1] Fix typo [analytics/wikimetrics] - 10https://gerrit.wikimedia.org/r/788620 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:22:36] (03CR) 10Klein Muçi: "recheck" [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/788829 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:23:00] (03CR) 10jerkins-bot: [V: 04-1] Fix typo [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/788853 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:23:19] (03CR) 10Klein Muçi: "recheck" [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/788829 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [15:01:21] (03PS2) 10Snwachukwu: Create Hive Query to generate Wikidata CoEditors metrics [analytics/refinery] - 10https://gerrit.wikimedia.org/r/780749 
(https://phabricator.wikimedia.org/T306177) [15:15:21] (03CR) 10Snwachukwu: Create Hive Query to generate Wikidata CoEditors metrics (036 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/780749 (https://phabricator.wikimedia.org/T306177) (owner: 10Snwachukwu) [15:27:53] joal: What do you think? https://gerrit.wikimedia.org/r/c/operations/puppet/+/789631 [15:57:02] Heya - back [15:57:06] Hi aqu - reading now [15:57:49] aqu: +1ed - we need help from an SRE now [16:01:50] joal: aqu I will deploy it! [16:03:21] Thanks razzi :) [16:03:45] Thanks! [16:15:54] 10Data-Engineering, 10Data-Catalog, 10Product-Analytics: Propagate field descriptions from event schemas to metastore - https://phabricator.wikimedia.org/T307040 (10EChetty) [16:16:05] 10Data-Engineering, 10Data-Catalog: Integrate Superset with DataHub - https://phabricator.wikimedia.org/T306903 (10EChetty) [16:18:04] 10Analytics-Radar, 10SRE, 10Traffic-Icebox, 10User-jbond: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 - https://phabricator.wikimedia.org/T228533 (10Dzahn) I think this can be closed since it's in the past and superseded by T302864. [16:18:20] 10Analytics-Radar, 10SRE, 10Traffic-Icebox, 10User-jbond: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 - https://phabricator.wikimedia.org/T228533 (10Dzahn) [16:19:31] (03CR) 10Joal: "3 comments left, I also closed the one already fixed for readability :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/780749 (https://phabricator.wikimedia.org/T306177) (owner: 10Snwachukwu) [16:20:16] 10Data-Engineering: Migrate to MaxMind GeoIP2 - https://phabricator.wikimedia.org/T302989 (10Dzahn) [16:20:55] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, and 3 others: Maxmind: GeoIP Download Failed - https://phabricator.wikimedia.org/T302864 (10Dzahn) [16:21:56] 10Data-Engineering, 10Data-Catalog: Spike: Integrate Spark with DataHub - https://phabricator.wikimedia.org/T306896 (10EChetty) [16:22:20] 10Data-Engineering, 10Data-Catalog: Spike: Integrate Spark with DataHub - https://phabricator.wikimedia.org/T306896 (10EChetty) Making this into a spike for upcoming work. [16:22:29] 10Data-Engineering, 10Data-Catalog: Spike: Integrate Spark with DataHub - https://phabricator.wikimedia.org/T306896 (10EChetty) [16:26:25] 10Analytics, 10Data-Engineering, 10SRE: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10CDanis) How hard is option 1? I'm starting to think up use cases for NEL data like comparing the ratio of reports/time vs webrequests/time for a given... [16:28:47] 10Data-Engineering, 10Data-Catalog: Streamline CI for our fork of DataHub - https://phabricator.wikimedia.org/T303381 (10EChetty) [16:29:03] 10Data-Engineering, 10Data-Catalog: Streamline CI for our fork of DataHub - https://phabricator.wikimedia.org/T303381 (10EChetty) [16:47:12] (03PS2) 10Milimetric: Implement job to extract officewiki webrequests [analytics/refinery] - 10https://gerrit.wikimedia.org/r/779952 (https://phabricator.wikimedia.org/T306136) [16:47:39] btullis: ottomata I'm planning on doing the reboots for the remaining kafka clusters today: main-codfw and main-eqiad. jumbo-eqiad went just fine yesterday. [16:47:56] (03CR) 10Milimetric: Implement job to extract officewiki webrequests (033 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/779952 (https://phabricator.wikimedia.org/T306136) (owner: 10Milimetric) [16:48:27] razzi: Great. 
I didn't know if they were ours to do, but fine by me. [17:00:05] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [17:07:01] btullis: now that you've changed the heap, I guess the alert needs to be hcanged as well maybe? --^ [17:09:50] joal: Doing that now: https://gerrit.wikimedia.org/r/c/operations/alerts/+/789618 [17:16:17] That's merged. [17:22:18] thanks a lot btullis :) [17:28:50] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [17:34:50] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [17:35:20] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [17:40:20] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [18:30:28] ottomata: I'm trying to restart the production Airflow scheduler, which seems broken for the last 9 hours. [18:30:34] ottomata: can you help? [18:31:26] I only found the following service: wmf_auto_restart_airflow-scheduler@analytics [18:31:41] which I don't think is the scheduler itself, but just an auto-restart [18:32:07] maybe it is restarting, but it's failing. will look at the logs [18:35:27] btullis, joal: I imagine you're off already... I think since you restarted a couple machines this morning, Airflow scheduler stopped working.. [18:36:43] razzi: Maybe you're still there? [18:43:15] wow mforns :S [18:43:33] I'm super sorry we didn't check :S [18:44:12] mforns: the UI seems ok [18:44:41] joal: do you see the banner at the top too? [18:44:54] It says the scheduler is not running... [18:44:58] I do! I hadn't noticed it [18:45:02] ok [18:45:05] And the hourly days are blocked since 9 hours ago [18:45:06] that's bad :( [18:45:13] hourly * DAGs [18:45:19] Oh sorry mforns. I can still get to my computer. Hang on a sec... 
[18:45:28] I'm trying to find something in the logs [18:45:51] Sorry to bother guys, I'm not sure what to do [18:46:25] Seems there's some connection problems with the db? Can't connect to MySQL server on 'an-coord1001.eqiad.wmnet' [18:46:35] looking more, might be just a side effect [18:47:15] The db was definitely restarted, but looks like the scheduler didn't reconnect. I wonder if the other instances are equally affected. [18:48:44] https://usercontent.irccloud-cdn.com/file/MCESIyQW/image.png [18:49:08] So the scheduler service is still running, but the last thing in the logs is a stacktrace. [18:49:24] hmmm [18:49:35] that explains no alerts or slas... [18:49:54] wow - I dislike this type of failures :( [18:49:55] !log restarting airflow-scheduler@analytics service on an-launcher1002 [18:49:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:49:59] yeap [18:50:40] In addition to the fix, we should have a retro on at least having something outside airflow telling us that data doesn't flow [18:50:53] the warning at the top of the UI is gone, hourly jobs are running [18:51:00] thanks a lot ben for restarting [18:51:07] Logs look better. --^ Totally agree joal. We need better monitoring. [18:52:00] 10Data-Engineering, 10Data-Engineering-Kanban, 10Product-Analytics, 10Research: Update HDFS links tables as Mediawiki changes - https://phabricator.wikimedia.org/T304979 (10JAllemandou) Ping @EChetty - This task needs to be priositized by the team - the SQL change is already happening and will impact some... [18:52:11] Thanks a million mforns and btullis <3 [18:52:39] I have checked the airflow-scheduler service on an-airflow1001 and it looks fine. an-airflow1002 (research) looks borked in the same way. [18:52:49] Back to being gone ;) [18:52:55] It was Sandra who spotted it! [18:53:07] !log restarted airflow-scheduler@research.service on an-airflow1002 [18:53:09] talk to you tomorrow folks [18:53:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:53:18] thanks ben and joal for quickly assisting [18:53:25] bye joal [18:53:26] Kudos SandraEbele for finding that airflow was broken - Thanks a million for this [18:53:44] I didn't do anything though ;) [18:53:53] !log restarting airflow-scheduler@platform_eng.service on an-airflow1003 [18:53:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:57:05] 10Data-Engineering, 10Airflow: Improve monitoring for airflow-scheduler services - https://phabricator.wikimedia.org/T307739 (10BTullis) [18:57:23] Thanks btullis :] [18:57:27] --^ there is the first follow-up ticket :-) [18:57:39] let's discuss this in standup with the team [18:57:53] Cool. I'm also standing down for now. Catch you tomorrow. [18:57:59] bye btullis !!!
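As a starting point for T307739: the scheduler's systemd unit stayed "active" while the scheduler itself was wedged, so any new check needs to look at scheduler heartbeats instead of unit state. A minimal sketch, assuming Airflow 2.1+ and the per-instance service naming used above; the service user and webserver port shown are placeholders:

```
# Exits non-zero when no recent SchedulerJob heartbeat exists in the metadata DB.
sudo -u analytics airflow jobs check --job-type SchedulerJob --hostname "$(hostname -f)" \
  || sudo systemctl restart airflow-scheduler@analytics.service

# The webserver /health endpoint exposes the same heartbeat information.
curl -s http://localhost:8080/health | jq '.scheduler'
```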
[19:09:24] hey btullis mforns joal am here [19:09:32] i think the no alerts is partly my fault [19:09:35] i put this task in last week [19:09:36] https://phabricator.wikimedia.org/T307102 [19:09:39] but we haven't groomed [19:10:07] actually, not sure if we would have gotten an alert in this case, as the scheduler was still running [19:10:27] 10Data-Engineering, 10Airflow: Improve monitoring for airflow-scheduler services - https://phabricator.wikimedia.org/T307739 (10Ottomata) See also: {T307102} [19:12:36] heya ottomata yea, the scheduler had an active status [19:16:30] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix typo [analytics/refinery] - 10https://gerrit.wikimedia.org/r/789338 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [19:23:43] 10Analytics, 10Data-Engineering, 10SRE: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10Ottomata) Not hard at all, there is plenty of puppet to support it. Just need to run it somewhere. We currently colocate MirrorMaker on target cluste... [20:10:30] (03PS1) 10Razzi: Upgrade superset to 1.5.0 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/789683 (https://phabricator.wikimedia.org/T304972) [20:52:19] 10Data-Engineering, 10Data-Engineering-Kanban: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10Effeietsanders) @Mayakp.wiki I don't have access to the full report, but the public slides mention 2-4% (gl... [22:26:23] If anybody gets a notification about superset, I'm testing on the staging instance [22:28:39] (03PS2) 10Razzi: Upgrade superset to 1.5.0 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/789683 (https://phabricator.wikimedia.org/T304972) [22:33:19] 10Data-Engineering, 10Data-Engineering-Kanban, 10Superset, 10Patch-For-Review, 10Product-Analytics (Kanban): Upgrade Superset to 1.4.2 - https://phabricator.wikimedia.org/T304972 (10razzi) I tried out Superset 1.5 briefly but found it requires python 3.8, and an-tool1005 is currently running python 3.7.... [22:34:22] 10Analytics-Clusters, 10Data-Engineering, 10Data-Engineering-Kanban, 10Product-Analytics, and 2 others: Add superset-next.wikimedia.org domain for superset staging - https://phabricator.wikimedia.org/T275575 (10razzi) 05Open→03Resolved [22:34:24] 10Analytics-Clusters, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, and 2 others: Upgrade Superset to 1.3.1 or higher - https://phabricator.wikimedia.org/T288115 (10razzi) [22:38:13] (03Abandoned) 10Razzi: Upgrade superset to 1.5.0 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/789683 (https://phabricator.wikimedia.org/T304972) (owner: 10Razzi) [22:45:40] 10Data-Engineering, 10Data-Engineering-Kanban: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10Mayakp.wiki) Hi @Effeietsanders , sorry about the confusion but the typo was in the slides. Per T300164#76... 
[22:57:48] 10Data-Engineering, 10Product-Analytics (Kanban): Make analytics-product the owner of canonical_data - https://phabricator.wikimedia.org/T307749 (10nettrom_WMF) [23:07:31] 10Data-Engineering, 10Data-Engineering-Kanban: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10kzimmerman) @Effeietsanders I saw that you are collaborating with @MGerlach on Research. We don't yet have... [23:22:32] (03CR) 10Klein Muçi: "recheck" [analytics/gobblin] - 10https://gerrit.wikimedia.org/r/788856 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [23:22:54] (03CR) 10Klein Muçi: "recheck" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/788841 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi)