[00:03:50] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [00:03:50] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [00:59:50] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [01:10:07] (03CR) 10Ottomata: [C: 03+2] "Thank you!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/789183 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [01:10:31] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix typo [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/789183 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [01:12:27] (03CR) 10Ottomata: [V: 03+2 C: 03+2] "TY!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/789182 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [01:57:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [02:00:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [02:02:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. 
- https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [03:20:09] (03PS1) 10DLynch: DesktopUIActions/MobileUIActions: add pageToken field [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789324 (https://phabricator.wikimedia.org/T307640) [03:20:47] (03CR) 10jerkins-bot: [V: 04-1] DesktopUIActions/MobileUIActions: add pageToken field [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789324 (https://phabricator.wikimedia.org/T307640) (owner: 10DLynch) [03:30:13] (03PS2) 10DLynch: DesktopUIActions/MobileUIActions: add pageToken field [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789324 (https://phabricator.wikimedia.org/T307640) [03:33:37] (03CR) 10DLynch: "@Jdlrobson: adding you as a reviewer just so you can double-check this doesn't interfere with your usage of these schemas. (I don't think " [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789324 (https://phabricator.wikimedia.org/T307640) (owner: 10DLynch) [04:05:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [04:11:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [04:16:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [04:18:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [04:28:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [04:33:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [04:38:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [05:00:05] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [07:04:13] (03CR) 10Joal: "I added a bunch of comments as well. Let's synchronize with Dan on whether we wish to add a constant-timestamp parameter to the spark job." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/780749 (https://phabricator.wikimedia.org/T306177) (owner: 10Snwachukwu) [08:31:35] joal: Was this what you had in mind for gobblin? https://gerrit.wikimedia.org/r/c/operations/puppet/+/789560 [08:34:30] I'm also planning to stop oozie on an-coord1001 shortly. Any objections? [08:35:41] Hi btullis - sorry I'm late to answer [08:36:11] No worries :-) [08:36:27] btullis: I assume that your patch will absent all gobblin jobs [08:36:45] which is exactly what we want [08:36:52] Yes. By modifying the defined type temporarily: https://puppet-compiler.wmflabs.org/pcc-worker1001/35096/an-launcher1002.eqiad.wmnet/index.html [08:37:28] great [08:38:26] OK. I'll +2 it now and merge. Happy with my stopping oozie too? [08:39:26] btullis: I'd suggest waiting for gobblin to have stopped for some before stopping oozie [08:39:39] ack [08:43:24] OK, that's deployed. Checking for any remaining gobblin processes now... [08:51:00] The change inadvertently deleted old gobblin logs. I can get them back with `puppet filebucket` [08:51:29] hm, I don't think that's very important given nothing had failed [08:52:42] No, I agree. They're available for a while if we need them though. Right, I can't find any gobblin processes running on an-launcher1002. Shall I stop oozie? [08:53:01] btullis: I see no prod job running now from oozie on the cluster - I think we're safe to go [08:53:10] Thank you btullis for having waited ) [08:53:58] !log stopping oozie on an-coord1001 [08:54:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:00:05] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [09:00:06] About to restart an-coord1001 then. [09:00:31] !log restarting an-coord1001 [09:00:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:05:04] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:06:29] OK, host is back up and running. All services green in icinga. 
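For reference, the pre-checks described above (confirming Gobblin has drained before Oozie is stopped) amount to something like the following sketch; the unit and process names are assumptions about the puppetized setup rather than commands taken from this log:

```
# On an-launcher1002: confirm no Gobblin ingestion runs are still in flight
# (the jobs were absented in puppet, so only already-running pulls would remain).
pgrep -af gobblin || echo "no gobblin processes running"
systemctl list-timers --no-pager | grep -i gobblin || echo "no gobblin timers scheduled"

# On an-coord1001: stop the Oozie server once the cluster shows no prod Oozie jobs.
sudo systemctl stop oozie.service
systemctl status oozie.service --no-pager
```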
[09:10:16] PROBLEM - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:10:38] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:12:50] https://usercontent.irccloud-cdn.com/file/mxuIAOhe/image.png [09:12:56] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:13:21] I should just be able to restart these, right joal? No additional cleanup required beforehand? [09:13:42] yup btullis - should be all good [09:15:14] !log restarting failed eventlogging_to_druid_ services on an-launcher1002 [09:15:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:16:02] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:16:22] !log re-enabling gobblin jobs now [09:16:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:23:00] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:23:44] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:25:58] RECOVERY - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:26:16] OK, those are all coming back. Anything else we need to check? [09:27:51] btullis: I'm monitoring as well - the webrequest gobblin job has started, I'll check it finishes ok, then it's about potential emails [09:28:41] Actually btullis - have you restarted the hive-server? [09:32:43] (03PS1) 10Steven Sun: IPInfo: Add event source special_watchlist and update description [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) [09:33:10] (03CR) 10jerkins-bot: [V: 04-1] IPInfo: Add event source special_watchlist and update description [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) (owner: 10Steven Sun) [09:39:55] joal: I haven't restarted it, but it restarted normally as part of the reboot. Is there an issue?
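Recovering the failed timers above is usually just a matter of restarting the corresponding systemd units on an-launcher1002; a minimal sketch, assuming the unit names from the alerts map directly to service units:

```
# List anything that failed while the coordinator was down.
sudo systemctl --failed --no-legend | grep eventlogging_to_druid_ || echo "nothing failed"

# Restart the loaders named in the alerts; their timers then resume the normal schedule.
for unit in eventlogging_to_druid_navigationtiming_hourly \
            eventlogging_to_druid_network_flows_internal_hourly \
            eventlogging_to_druid_netflow_hourly \
            eventlogging_to_druid_prefupdate_hourly; do
    sudo systemctl reset-failed "${unit}.service" || true
    sudo systemctl restart "${unit}.service"
done
```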
[09:40:22] nono, I wanted to check about the memory-leak issue [09:40:47] And indeed it shows a reboot :) [09:40:56] Oh right, yes I think that will be OK for about the next 6 weeks or so :-) [09:40:58] At the time I asked the graph was not yet updated my bad [09:41:07] Great :) [09:41:50] I have the heap change for the namenodes ready to go, so when we're happy with hive I will run the `sre.hadoop.roll-restart-masters` cookbook to increase that value. [09:46:13] Ack btullis [09:52:19] It's looking OK, so I'll go ahead with that rolling restart of the hadoop masters. [09:53:38] !log roll-restarting hadoop masters to pick up new heap size [09:53:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:00:47] (03PS2) 10Steven Sun: IPInfo: Add event source special_watchlist and update description [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) [10:01:49] (03CR) 10jerkins-bot: [V: 04-1] IPInfo: Add event source special_watchlist and update description [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) (owner: 10Steven Sun) [10:02:24] btullis: I see the max bump for an-coord1001 in grafana - I also see an-coord1001 doing huge GC after the restart - we should wait a bit ) [10:03:53] OK, it's half way through running the cookbook, but it's at the stage of waiting 10 minutes before offering to fail back to an-coord1001 [10:04:40] (03PS3) 10Steven Sun: IPInfo: Add event source special_watchlist and update description [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) [10:20:04] joal: The GC has subsided back to normal. Are you happy for me to fail back to an-master1001? [10:25:53] I think it's OK, so I'm proceeding to fail back to an -master1001 now. [10:26:42] (03PS1) 10AGueyte: Add special_watchlist ipinfo_interaction schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789576 (https://phabricator.wikimedia.org/T307594) [10:30:17] (03PS2) 10AGueyte: Add special_watchlist ipinfo_interaction schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789576 (https://phabricator.wikimedia.org/T307594) [10:32:07] (03PS4) 10AGueyte: Add event source special_watchlist to ipinfo_interaction schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) (owner: 10Steven Sun) [10:32:56] 10Data-Engineering, 10Data-Engineering-Kanban, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): View 'centralauth_p.localuser' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T304733 (10OwenBlacker) Looks like stalktoy is working now, as a result — someone sh... [10:55:40] Hmm. The namenode on an-master1001 keeps stopping when I try to fail it back. Looking into the cause now. Maybe it is too much pressure on the RAM after all. [10:59:34] It doesn't seem to be RAM related. 
Error messages about quorum and epochs in `/var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log` [10:59:40] https://www.irccloud.com/pastebin/QRT3gu24/ [11:01:48] Similar error messages to those seen in this ticket: https://phabricator.wikimedia.org/T283733 [11:24:50] I am trying again with a larger value for `dfs_namenode_handler_count` (144 instead of 127) [11:29:14] hm [11:30:08] btullis: the GCtime and GCCount metrics for 1001 are not great [11:30:54] It's as if the namenode wouldn't manage to process the backlog of state [11:31:28] No, it has been restarted several times. There is a big period of GC after each start. Then it runs successfully until the failover occurs. Failover triggers a quit. [11:32:52] (I meant 'no it's not great' - as opposed to 'no that's not it') :-) [11:33:16] Ack! [11:35:12] It's running again now with a manually tweaked handler count. I'll wait until GC has settled down as much as possible, then I'll try another failover. [11:39:43] Failing over again now [11:43:13] Nope. Same result. [11:43:20] there are errors in the log [11:45:01] Do you want to jump in the batcave to discuss? I've started the namenode again on an-master1001. [11:45:47] sure [12:26:29] !log Regular analytics weekly train [analytics/refinery@cc4b2bd] [12:26:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:27:55] Hi aqu [12:28:00] Have you already started to deploy? [12:28:09] Nop [12:28:13] \o/ [12:28:25] I had not added the changes to the purge job I did [12:28:35] sorry for being late :S [12:28:57] let me check if the code has been merged [12:29:03] ok. This is my fault, deploy was supposed to be on Tuesday. [12:29:23] nono, I hadn't added my patches to the etherpad - my bad [12:30:06] ok - merging now and adding to the etherpad [12:30:27] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for dpeloy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/786448 (https://phabricator.wikimedia.org/T303988) (owner: 10Joal) [12:31:15] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/786452 (owner: 10Joal) [12:31:55] ok here we are aqu :) [12:31:59] Thank you! [12:35:44] You're welcome. [12:41:33] (03CR) 10Klein Muçi: "recheck" [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/788829 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [12:42:12] aqu: The other thing I mentioned yesterday to be done while in opsweek at the beginning of the month is: https://wikitech.wikimedia.org/wiki/Data_Engineering/Ops_week#Is_there_a_new_Mediawiki_History_snapshot_ready?_(beginning_of_the_month) [12:42:28] aqu: let's talk after your deploy :) [12:45:19] 10Analytics, 10EventStreams: EventStreams doesn't show the Wikistories-* streams - https://phabricator.wikimedia.org/T307679 (10SBisson) [12:50:25] (03CR) 10Klein Muçi: "recheck" [analytics/pageview-api] - 10https://gerrit.wikimedia.org/r/789181 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [12:57:05] ok btullis - I confirm the problem comes from zkfc not managing to make an-master1001 master, due to a timeout exception [12:57:30] The timeout is set for 60000 millis - this feels like a lot for the node not to respond :S [12:57:33] let's sync on that later [12:58:09] Great. I'm back now. I'll try increasing the number of parallel GC threads on an-master1001 and then attempt the failover again.
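For context, the failback attempts described here use the standard HDFS HA admin tooling; a rough sketch follows, where the NameNode service IDs are assumptions (the real values come from dfs.ha.namenodes.* in hdfs-site.xml) and the commands are run as the hdfs user (on a kerberized cluster a keytab/kinit is also needed):

```
# Check which NameNode is currently active / standby.
sudo -u hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
sudo -u hdfs hdfs haadmin -getServiceState an-master1002-eqiad-wmnet

# Graceful failback: ask zkfc to make an-master1001 active again.
sudo -u hdfs hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet

# Watch both sides for the quorum/epoch and timeout errors mentioned above.
tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log \
        /var/log/hadoop-hdfs/hadoop-hdfs-zkfc-an-master1002.log
```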
[12:58:45] Ack - I wonder if growing the timeout on the zkfc side wouldn't be a good approach too [12:59:20] Yes, but I'm nervous to touch anything on an-master1002 until we're back with an-master1001 as the active node. [12:59:36] yeah [12:59:58] Looking at low-level metrics, I wonder if the problem wouldn't be more on the network side [13:00:05] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [13:00:06] (03PS1) 10Luke Bowmaker: Image Suggestions Feedback Sanitized Hive Tables [analytics/refinery] - 10https://gerrit.wikimedia.org/r/789606 [13:01:05] mh, probably not - let's try your idea btullis [13:01:56] (03CR) 10Luke Bowmaker: "Hi Andrew," [analytics/refinery] - 10https://gerrit.wikimedia.org/r/789606 (owner: 10Luke Bowmaker) [13:03:28] I have set dfs_namenode_handler_count back to 127 and increased ParallelGCThreads to 25 [13:03:50] ack - let's see if it helps [13:03:56] monitoring logs [13:04:30] hadoop-hdfs-namenode restarted - waiting for GC to settle before attempting failback. [13:16:04] ottomata: sorry I didn't make it to the meeting, was focusing on documentation, please ping me if you need me there today. [13:16:33] interesting btullis - it feels that GC went "a bit" faster :) [13:16:50] maybe it's just me expecting it to be ) [13:17:22] joal: Yes, I agree. Almost ready to do a failback again. This time I'll tail `/var/log/hadoop-hdfs/hadoop-hdfs-zkfc-an-master1002.log` at the same time. [13:17:37] ack - doing so too :) [13:18:26] Pressing the button now [13:19:47] Nope. [13:21:13] hm [13:21:33] `Caused by: java.net.SocketTimeoutException: Call From an-master1001/10.64.5.26 to an-master1001.eqiad.wmnet:8040 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:37895 remote=an-master1001.eqiad.wmnet/10.64.5.26:8040]; For more details see: [13:21:33] http://wiki.apache.org/hadoop/SocketTimeout` [13:22:48] This time the namenode is still running on an-master1001 - it didn't quit. [13:26:07] This is weird [13:26:14] I could try running the failover command from an-master1002 - it probably won't help but might be worth a try? [13:26:52] it must be that the namenode cannot be contacted by zkfc but hasn't failed in terms of epochs... [13:26:56] weird [13:27:07] we can try again but I don't see why it would change [13:27:43] it worked! [13:27:48] zkfc is happy [13:27:51] It worked! Why? [13:27:57] I have no clue :SS [13:28:01] Almost immediately, too. [13:28:05] yeah [13:30:38] well ... I guess we can call it success, but without explanation :) [13:31:35] (03Abandoned) 10AGueyte: Add special_watchlist ipinfo_interaction schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789576 (https://phabricator.wikimedia.org/T307594) (owner: 10AGueyte) [13:31:35] Yeah, needs more testing and successful cookbook executions before I'll be happy. I guess we can restart the service on an-master1002 now.
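The tuning mentioned above (ParallelGCThreads, handler count, heap) is rendered by puppet, but in plain hadoop-env.sh terms it boils down to something like this sketch; the heap value below is a placeholder rather than the real production figure, and the 60000 ms timeout seen in the zkfc log most likely corresponds to one of the ha.failover-controller rpc-timeout settings in core-site.xml:

```
# Sketch of NameNode JVM options (normally templated by puppet into hadoop-env.sh).
export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} \
  -Xms64g -Xmx64g \
  -XX:ParallelGCThreads=25"   # 64g is a placeholder heap; GC threads as discussed above
```

dfs.namenode.handler.count (set back to 127 above) lives in hdfs-site.xml rather than in the JVM options.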
[13:31:45] works for me [13:32:07] hm, still quite some GCs on 1001 [13:34:32] (03CR) 10AGueyte: [C: 03+2] Add event source special_watchlist to ipinfo_interaction schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) (owner: 10Steven Sun) [13:35:05] (03Merged) 10jenkins-bot: Add event source special_watchlist to ipinfo_interaction schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) (owner: 10Steven Sun) [13:35:25] (03CR) 10Tchanders: [C: 03+1] "Thanks, looks good! Tested by opening a popup on Special:Watchlist and observing the correct event get sent to the dev server." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) (owner: 10Steven Sun) [13:36:48] (03CR) 10Tchanders: [C: 03+1] Add event source special_watchlist to ipinfo_interaction schema (032 comments) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/789568 (https://phabricator.wikimedia.org/T307594) (owner: 10Steven Sun) [13:38:34] Restarted namenode service on an-master1002 [13:51:09] RECOVERY - Check unit status of mediawiki-history-drop-snapshot on an-launcher1002 is OK: OK: Status of the systemd unit mediawiki-history-drop-snapshot https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:52:57] (03CR) 10Klein Muçi: "recheck" [analytics/datahub] - 10https://gerrit.wikimedia.org/r/789170 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [13:53:46] (03CR) 10jerkins-bot: [V: 04-1] Fix typo [analytics/datahub] - 10https://gerrit.wikimedia.org/r/789170 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [13:54:36] (03CR) 10Klein Muçi: "recheck" [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/788840 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:00:38] 10Data-Engineering, 10Data-Engineering-Kanban: Increase Java heap for HDFS namenodes - https://phabricator.wikimedia.org/T307549 (10BTullis) This has been deployed, but we experienced some issues when attempting to fail back to an-master1001 after it had restarted. The following command failed multiple times... [14:04:20] aqu - I assume you've restarted the mediawiki-history-drop-snapshot, right? 
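A quick way to verify that (a sketch; assumes the job is a systemd timer/service pair on an-launcher1002 named as in the earlier RECOVERY notification):

```
# Did the drop-snapshot job run, and what did it delete?
systemctl status mediawiki-history-drop-snapshot.service --no-pager
systemctl list-timers 'mediawiki-history-drop-snapshot*' --no-pager
sudo journalctl -u mediawiki-history-drop-snapshot.service --since today | tail -n 50
```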
[14:05:36] I just checked the logs, and indeed, it has been restarted and removed successfully what was to delete :) [14:05:46] Gone for kids, back at standup time [14:08:21] (03CR) 10Klein Muçi: "recheck" [analytics/datahub] - 10https://gerrit.wikimedia.org/r/788832 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:09:40] (03CR) 10jerkins-bot: [V: 04-1] Fix typo [analytics/datahub] - 10https://gerrit.wikimedia.org/r/788832 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:09:49] (03CR) 10Klein Muçi: "recheck" [analytics/pageview-api] - 10https://gerrit.wikimedia.org/r/788741 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:10:28] (03CR) 10Klein Muçi: "recheck" [analytics/wikimetrics] - 10https://gerrit.wikimedia.org/r/788738 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:11:01] (03CR) 10jerkins-bot: [V: 04-1] Fix typo [analytics/wikimetrics] - 10https://gerrit.wikimedia.org/r/788738 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:16:27] (03CR) 10Klein Muçi: "recheck" [analytics/aggregator] - 10https://gerrit.wikimedia.org/r/787726 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:16:56] (03CR) 10jerkins-bot: [V: 04-1] Fix typo [analytics/aggregator] - 10https://gerrit.wikimedia.org/r/787726 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:17:54] (03CR) 10Klein Muçi: "recheck" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/787849 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:18:50] (03CR) 10Klein Muçi: "recheck" [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/787846 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:19:16] (03CR) 10Klein Muçi: "recheck" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/788612 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:19:57] joal: yes it's restarted [14:20:14] (03CR) 10Klein Muçi: "recheck" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/788736 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:21:14] (03CR) 10Klein Muçi: "recheck" [analytics/wikimetrics] - 10https://gerrit.wikimedia.org/r/788620 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:21:48] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Image Suggestions Feedback Sanitized Hive Tables [analytics/refinery] - 10https://gerrit.wikimedia.org/r/789606 (owner: 10Luke Bowmaker) [14:21:54] (03CR) 10Klein Muçi: "recheck" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/788853 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:22:18] (03CR) 10jerkins-bot: [V: 04-1] Fix typo [analytics/wikimetrics] - 10https://gerrit.wikimedia.org/r/788620 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:22:36] (03CR) 10Klein Muçi: "recheck" [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/788829 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:23:00] (03CR) 10jerkins-bot: [V: 04-1] Fix typo [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/788853 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [14:23:19] (03CR) 10Klein Muçi: "recheck" [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/788829 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [15:01:21] (03PS2) 10Snwachukwu: Create Hive Query to generate Wikidata CoEditors metrics [analytics/refinery] - 10https://gerrit.wikimedia.org/r/780749 
(https://phabricator.wikimedia.org/T306177) [15:15:21] (03CR) 10Snwachukwu: Create Hive Query to generate Wikidata CoEditors metrics (036 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/780749 (https://phabricator.wikimedia.org/T306177) (owner: 10Snwachukwu) [15:27:53] joal: What do you think? https://gerrit.wikimedia.org/r/c/operations/puppet/+/789631 [15:57:02] Heya - back [15:57:06] Hi aqu - reading now [15:57:49] aqu: +1ed - we need help from an SRE now [16:01:50] joal: aqu I will deploy it! [16:03:21] Thanks razzi :) [16:03:45] Thanks! [16:15:54] 10Data-Engineering, 10Data-Catalog, 10Product-Analytics: Propagate field descriptions from event schemas to metastore - https://phabricator.wikimedia.org/T307040 (10EChetty) [16:16:05] 10Data-Engineering, 10Data-Catalog: Integrate Superset with DataHub - https://phabricator.wikimedia.org/T306903 (10EChetty) [16:18:04] 10Analytics-Radar, 10SRE, 10Traffic-Icebox, 10User-jbond: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 - https://phabricator.wikimedia.org/T228533 (10Dzahn) I think this can be closed since it's in the past and superseded by T302864. [16:18:20] 10Analytics-Radar, 10SRE, 10Traffic-Icebox, 10User-jbond: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 - https://phabricator.wikimedia.org/T228533 (10Dzahn) [16:19:31] (03CR) 10Joal: "3 comments left, I also closed the one already fixed for readability :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/780749 (https://phabricator.wikimedia.org/T306177) (owner: 10Snwachukwu) [16:20:16] 10Data-Engineering: Migrate to MaxMind GeoIP2 - https://phabricator.wikimedia.org/T302989 (10Dzahn) [16:20:55] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, and 3 others: Maxmind: GeoIP Download Failed - https://phabricator.wikimedia.org/T302864 (10Dzahn) [16:21:56] 10Data-Engineering, 10Data-Catalog: Spike: Integrate Spark with DataHub - https://phabricator.wikimedia.org/T306896 (10EChetty) [16:22:20] 10Data-Engineering, 10Data-Catalog: Spike: Integrate Spark with DataHub - https://phabricator.wikimedia.org/T306896 (10EChetty) Making this into a spike for upcoming work. [16:22:29] 10Data-Engineering, 10Data-Catalog: Spike: Integrate Spark with DataHub - https://phabricator.wikimedia.org/T306896 (10EChetty) [16:26:25] 10Analytics, 10Data-Engineering, 10SRE: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10CDanis) How hard is option 1? I'm starting to think up use cases for NEL data like comparing the ratio of reports/time vs webrequests/time for a given... [16:28:47] 10Data-Engineering, 10Data-Catalog: Streamline CI for our fork of DataHub - https://phabricator.wikimedia.org/T303381 (10EChetty) [16:29:03] 10Data-Engineering, 10Data-Catalog: Streamline CI for our fork of DataHub - https://phabricator.wikimedia.org/T303381 (10EChetty) [16:47:12] (03PS2) 10Milimetric: Implement job to extract officewiki webrequests [analytics/refinery] - 10https://gerrit.wikimedia.org/r/779952 (https://phabricator.wikimedia.org/T306136) [16:47:39] btullis: ottomata I'm planning on doing the reboots for the remaining kafka clusters today: main-codfw and main-eqiad. jumbo-eqiad went just fine yesterday. [16:47:56] (03CR) 10Milimetric: Implement job to extract officewiki webrequests (033 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/779952 (https://phabricator.wikimedia.org/T306136) (owner: 10Milimetric) [16:48:27] razzi: Great. 
I didn't know if they were ours to do, but fine by me. [17:00:05] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [17:07:01] btullis: now that you've changed the heap, I guess the alert needs to be hcanged as well maybe? --^ [17:09:50] joal: Doing that now: https://gerrit.wikimedia.org/r/c/operations/alerts/+/789618 [17:16:17] That's merged. [17:22:18] thanks a lot btullis :) [17:28:50] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [17:34:50] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [17:35:20] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [17:40:20] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [18:30:28] ottomata: I'm trying to restart the production Airflow scheduler, which seems broken for the last 9 hours. [18:30:34] ottomata: can you help? [18:31:26] I only found the following service: wmf_auto_restart_airflow-scheduler@analytics [18:31:41] which I don't think is the scheduler itself, but just an auto-restart [18:32:07] maybe it is restarting, but it's failing. will look at the logs [18:35:27] btullis, joal: I imagine you're off already... I think since you restarted a couple machines this morning, Airflow scheduler stopped working.. [18:36:43] razzi: Maybe you're still there? [18:43:15] wow mforns :S [18:43:33] I'm super sorry we didn't check :S [18:44:12] mforns: the UI seems ok [18:44:41] joal: do you see the banner at the top too? [18:44:54] It says the scheduler is not running... [18:44:58] I do! I hadn't noticed it [18:45:02] ok [18:45:05] And the hourly days are blocked since 9 hours ago [18:45:06] that's bad :( [18:45:13] hourly * DAGs [18:45:19] Oh sorry mforns. I can still get to my computer. Hang on a sec... 
[18:45:28] I'm trying to find something in the logs [18:45:51] Sorry to bother guys, I'm not sure what to do [18:46:25] Seems there's some connection problems with the db? Can't connect to MySQL server on 'an-coord1001.eqiad.wmnet' [18:46:35] looking more, might be just a side effect [18:47:15] The db was definitely restarted, but looks like the scheduler didn't reconnect. I wonder if the other instances are equally affected. [18:48:44] https://usercontent.irccloud-cdn.com/file/MCESIyQW/image.png [18:49:08] So the scheduler service is still running, but the last thing in the logs is a stacktrace. [18:49:24] hmmm [18:49:35] that explains no alerts or slas... [18:49:54] wow - I dislike this type of failures :( [18:49:55] !log restarting airflow-scheduler@analytics service on an-launcher1002 [18:49:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:49:59] yeap [18:50:40] In addition to the fix, we should have a retro on at least having something outside airflow telling us that data doesn't flow [18:50:53] the warning at the top of the UI is gone, hourly jobs are running [18:51:00] thanks a lot ben for restarting [18:51:07] Logs look better. --^ Totally agree joal. We need better monitoring. [18:52:00] 10Data-Engineering, 10Data-Engineering-Kanban, 10Product-Analytics, 10Research: Update HDFS links tables as Mediawiki changes - https://phabricator.wikimedia.org/T304979 (10JAllemandou) Ping @EChetty - This task needs to be priositized by the team - the SQL change is already happening and will impact some... [18:52:11] Thanks a million mforns and btullis <3 [18:52:39] I have checked the airflow-scheduler service on an-airflow1001 and it looks fine. an-airflow1002 (research) looks borked in the same way. [18:52:49] Back to being gone ;) [18:52:55] It was Sandra who spotted it! [18:53:07] !log restarted airflow-scheduler@research.service on an-airflow1002 [18:53:09] talk to you tomorrow folks [18:53:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:53:18] thanks ben and joal for quickly assisting [18:53:25] bye joal [18:53:26] Kudos SandraEbele for finding that airflow was broken - Thanks a million for this [18:53:44] I didn't do anything though ;) [18:53:53] !log restarting airflow-scheduler@platform_eng.service on an-airflow1003 [18:53:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:57:05] 10Data-Engineering, 10Airflow: Improve monitoring for airflow-scheduler services - https://phabricator.wikimedia.org/T307739 (10BTullis) [18:57:23] Thanks btullis :] [18:57:27] --^ there is the first follow-up ticket :-) [18:57:39] let's discuss this in standup with the team [18:57:53] Cool. I'm also standing down for now. Catch you tomorrow. [18:57:59] bye btullis !!!
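As a starting point for T307739: the scheduler's systemd unit stayed "active" while the scheduler itself was wedged, so any new check needs to look at scheduler heartbeats instead of unit state. A minimal sketch, assuming Airflow 2.1+ and the per-instance service naming used above; the service user and webserver port shown are placeholders:

```
# Exits non-zero when no recent SchedulerJob heartbeat exists in the metadata DB.
sudo -u analytics airflow jobs check --job-type SchedulerJob --hostname "$(hostname -f)" \
  || sudo systemctl restart airflow-scheduler@analytics.service

# The webserver /health endpoint exposes the same heartbeat information.
curl -s http://localhost:8080/health | jq '.scheduler'
```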
[19:09:24] hey btullis mforns joal am here [19:09:32] i think the no alerts is partly my fault [19:09:35] i put this task in last week [19:09:36] https://phabricator.wikimedia.org/T307102 [19:09:39] but we haven't groomed [19:10:07] actually, not sure if we would have gotten an alert in this case, as the scheduler was still running [19:10:27] 10Data-Engineering, 10Airflow: Improve monitoring for airflow-scheduler services - https://phabricator.wikimedia.org/T307739 (10Ottomata) See also: {T307102} [19:12:36] heya ottomata yea, the scheduler had an active status [19:16:30] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix typo [analytics/refinery] - 10https://gerrit.wikimedia.org/r/789338 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [19:23:43] 10Analytics, 10Data-Engineering, 10SRE: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10Ottomata) Not hard at all, there is plenty of puppet to support it. Just need to run it somewhere. We currently colocate MirrorMaker on target cluste... [20:10:30] (03PS1) 10Razzi: Upgrade superset to 1.5.0 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/789683 (https://phabricator.wikimedia.org/T304972) [20:52:19] 10Data-Engineering, 10Data-Engineering-Kanban: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10Effeietsanders) @Mayakp.wiki I don't have access to the full report, but the public slides mention 2-4% (gl... [22:26:23] If anybody gets a notification about superset, I'm testing on the staging instance [22:28:39] (03PS2) 10Razzi: Upgrade superset to 1.5.0 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/789683 (https://phabricator.wikimedia.org/T304972) [22:33:19] 10Data-Engineering, 10Data-Engineering-Kanban, 10Superset, 10Patch-For-Review, 10Product-Analytics (Kanban): Upgrade Superset to 1.4.2 - https://phabricator.wikimedia.org/T304972 (10razzi) I tried out Superset 1.5 briefly but found it requires python 3.8, and an-tool1005 is currently running python 3.7.... [22:34:22] 10Analytics-Clusters, 10Data-Engineering, 10Data-Engineering-Kanban, 10Product-Analytics, and 2 others: Add superset-next.wikimedia.org domain for superset staging - https://phabricator.wikimedia.org/T275575 (10razzi) 05Open→03Resolved [22:34:24] 10Analytics-Clusters, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, and 2 others: Upgrade Superset to 1.3.1 or higher - https://phabricator.wikimedia.org/T288115 (10razzi) [22:38:13] (03Abandoned) 10Razzi: Upgrade superset to 1.5.0 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/789683 (https://phabricator.wikimedia.org/T304972) (owner: 10Razzi) [22:45:40] 10Data-Engineering, 10Data-Engineering-Kanban: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10Mayakp.wiki) Hi @Effeietsanders , sorry about the confusion but the typo was in the slides. Per T300164#76... 
[22:57:48] 10Data-Engineering, 10Product-Analytics (Kanban): Make analytics-product the owner of canonical_data - https://phabricator.wikimedia.org/T307749 (10nettrom_WMF) [23:07:31] 10Data-Engineering, 10Data-Engineering-Kanban: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10kzimmerman) @Effeietsanders I saw that you are collaborating with @MGerlach on Research. We don't yet have... [23:22:32] (03CR) 10Klein Muçi: "recheck" [analytics/gobblin] - 10https://gerrit.wikimedia.org/r/788856 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [23:22:54] (03CR) 10Klein Muçi: "recheck" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/788841 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi)