[00:08:28] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:14:48] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:23:40] PROBLEM - Check unit status of eventlogging_to_druid_netflow_daily on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_daily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:54:23] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Review access change [analytics/reportupdater-queries] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/709646 (https://phabricator.wikimedia.org/T287578) (owner: 10Awight) [01:09:16] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [01:18:22] 10Analytics: Data Loss Check always shows false positives - https://phabricator.wikimedia.org/T288496 (10Milimetric) [02:10:10] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:30:07] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:01:05] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:06:47] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:12:53] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:00:15] RECOVERY - Check unit status of refinery-import-page-current-dumps on an-launcher1002 is OK: OK: Status of the systemd unit refinery-import-page-current-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:07:03] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:07:51] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:13:03] 
PROBLEM - Check unit status of refinery-import-page-current-dumps on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-import-page-current-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:37:06] 10Analytics-Clusters, 10Analytics-Kanban: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10elukey) Hey Ben great work! I couple of things to remember for the decom: 1) I am not sure if there is a way to force overlord/middlemanager to stop accepting indexation jobs, or if we... [06:37:37] btullis_: o/ sorry I was afk yesterday and I didn't see the code reviews, left some notes for the decom in the task, all good afaics :) [06:46:08] ah there are also a couple of failed indexation jobs on the new druid nodes [06:49:41] 10Analytics-Clusters, 10Analytics-Kanban: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10elukey) While checking the indexation failures I noticed: ` 2021-08-10T06:01:39,961 INFO org.apache.druid.indexing.overlord.ForkingTaskRunner: Exception caught during execution java.io.... [07:07:45] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:13:41] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:18:22] (03CR) 10Awight: "Thanks!" [analytics/reportupdater-queries] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/709646 (https://phabricator.wikimedia.org/T287578) (owner: 10Awight) [07:18:47] (03CR) 10Awight: [V: 03+2 C: 03+2] "Ready to go :-)" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578) (owner: 10Svantje Lilienthal) [07:40:10] 10Analytics, 10Dumps-Generation: xmldatadumps dumpstatus.json files only readable by root - https://phabricator.wikimedia.org/T287989 (10elukey) 05Resolved→03Open Hi Ariel! Sorry to re-open, but I found: ` elukey@dumpsdata1003:/data/xmldatadumps/public$ ls -l /data/xmldatadumps/public/wikidatawiki/2021080... [07:41:40] I reopened --^ since there is a file that causes the import jobs to fail [07:41:49] once fixed we should be able to re-run without problems [07:53:19] 10Analytics, 10Dumps-Generation: xmldatadumps dumpstatus.json files only readable by root - https://phabricator.wikimedia.org/T287989 (10ArielGlenn) >>! In T287989#7271424, @elukey wrote: > Hi Ariel! Sorry to re-open, but I found: > > ` > elukey@dumpsdata1003:/data/xmldatadumps/public$ ls -l /data/xmldatadump... [07:56:35] RECOVERY - Check unit status of refinery-import-page-current-dumps on an-launcher1002 is OK: OK: Status of the systemd unit refinery-import-page-current-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:58:55] 10Analytics, 10Dumps-Generation: xmldatadumps dumpstatus.json files only readable by root - https://phabricator.wikimedia.org/T287989 (10elukey) @ArielGlenn thanks a lot! There is one last little issue, namely that the `dumpstatus.json` file aforementioned now seems not to be valid json. I checked `/data/xmlda... 
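A quick way to reproduce the "not valid json" finding before re-running the import timers is to parse the file with the Python standard library; this is only a sketch, and the path below is a placeholder since the full one is truncated in the comment above:
```
# Sketch: check that dumpstatus.json parses; a 0-byte or truncated file will fail here too
python3 -m json.tool /data/xmldatadumps/public/<wiki>/<dump-date>/dumpstatus.json > /dev/null \
  && echo "valid JSON" \
  || echo "invalid JSON"
```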
[08:05:00] 10Analytics, 10Dumps-Generation: xmldatadumps dumpstatus.json files only readable by root - https://phabricator.wikimedia.org/T287989 (10ArielGlenn) >>! In T287989#7271463, @elukey wrote: > @ArielGlenn thanks a lot! There is one last little issue, namely that the `dumpstatus.json` file aforementioned now seems... [08:13:19] 10Analytics, 10Dumps-Generation: xmldatadumps dumpstatus.json files only readable by root - https://phabricator.wikimedia.org/T287989 (10ArielGlenn) After IRC discussion: waiting a few days is ok, in the meantime I have put a 0 byte file in its place. [08:17:01] RECOVERY - Check unit status of refinery-import-page-history-dumps on an-launcher1002 is OK: OK: Status of the systemd unit refinery-import-page-history-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:18:55] RECOVERY - Check unit status of refinery-import-siteinfo-dumps on an-launcher1002 is OK: OK: Status of the systemd unit refinery-import-siteinfo-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:27:34] (03CR) 10David Caro: [C: 03+2] docs: added docker compose link and minor rewording [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/709951 (owner: 10David Caro) [08:27:53] (03CR) 10David Caro: [V: 03+1 C: 03+2] docs: added docker compose link and minor rewording [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/709951 (owner: 10David Caro) [08:28:04] (03CR) 10David Caro: [V: 03+2 C: 03+2] docs: added docker compose link and minor rewording [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/709951 (owner: 10David Caro) [08:54:21] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) Thanks @elukey. I've addressed the ownership problem on an-druid1005 with: `btullis@an-druid1005:~$ sudo chown druid:druid /srv/druid/deep-storage /srv/dru... [08:55:31] elukey: Thanks for all of the help re the decom. Do you think I need to re-run anything because of the indexation failures? [08:56:12] btullis: o/ yes on an-launcher1002 there are a couple of failed timers, it should be sufficient to systemctl restart those [08:56:32] they will kick off new indexations (no guarantee that they will land on new nodes) [09:00:59] Thanks. It's the `.service` component that shows as having failed, but you're saying that a restart of the `.timer` component should kick off a new instance of each of the .service job, right? [09:01:03] https://usercontent.irccloud-cdn.com/file/NAZEO0N9/image.png [09:02:51] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:04:06] btullis: nono a restart of .service is fine [09:04:25] the .timer component doesn't need a restart, it is the .service that failed [09:04:41] in theory we could wait for the periodic run of the job to catch up [09:04:52] !log btullis@an-launcher1002:~$ sudo systemctl restart eventlogging_to_druid_netflow_daily.service [09:04:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:04:59] !log btullis@an-launcher1002:~$ sudo systemctl restart eventlogging_to_druid_prefupdate_hourly.service [09:05:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:05:17] Gotcha, thanks. 
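The fix discussed here, restarting the failed .service rather than the .timer, looks roughly like this on an-launcher1002 (unit names taken from the alerts above; a sketch, not a record of the exact commands that were run):
```
# Show which ingestion units actually failed (the .timer usually stays active)
sudo systemctl list-units --state=failed 'eventlogging_to_druid_*'
# Restarting the .service re-runs the indexation right away; the .timer needs no restart
sudo systemctl restart eventlogging_to_druid_prefupdate_hourly.service
sudo systemctl status eventlogging_to_druid_prefupdate_hourly.service --no-pager
```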
[09:05:35] perfect :) [09:13:43] RECOVERY - Check unit status of eventlogging_to_druid_netflow_daily on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_daily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:42:02] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) > 2 Turnilo and Superset are configured to target a specific host's broker (manually set in their configs). For Turnilo the host is listed in puppet, for... [10:04:44] Hi, is there anything in the analytics data lake about if a mw job got succeeded or failed, etc. [10:12:10] amir1: I might be able to help here. Can you be any more specific about what you mean by 'mw job' though? [10:13:45] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10elukey) >>! In T255148#7271658, @BTullis wrote: > > Am I right in thinking that we hardcode the brokers' addresses because we haven't got access to a load-balancer... [10:14:24] btullis: I want outcome of this event https://phabricator.wikimedia.org/T278924#7271651 [10:15:04] https://wikitech.wikimedia.org/wiki/Kafka_Job_Queue [10:27:22] I think that this data is stored in Hive. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Hive - but I'm struggling to find which table your `refreshLinks` event would go to. [10:41:18] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) This looks like the best way to prevent new jobs being added to a middle-manager, prior to decommission: https://druid.apache.org/docs/latest/operations/ro... [10:43:39] I'm at lunch, once I back, I dig [12:22:24] (03PS2) 10Kosta Harlan: link_suggestion_interaction: Add outdatedsuggestions_dialog interface [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/701376 (https://phabricator.wikimedia.org/T283109) [12:23:17] (03CR) 10jerkins-bot: [V: 04-1] link_suggestion_interaction: Add outdatedsuggestions_dialog interface [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/701376 (https://phabricator.wikimedia.org/T283109) (owner: 10Kosta Harlan) [12:53:59] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) I had put the following process down for the zookeeper switch. When deploying this change, we will need to: * Manually stop zookeeper on druid1001 * Manu... [13:31:19] Has anyone seen this before. I'm unable to get the zookeeper client to connect to the ensemble running on druid100[1-3] from themselves. [13:31:27] https://www.irccloud.com/pastebin/RbprfwEZ/ [13:32:25] I can connect if I use a zkCli from another host, such as an-conf1001. [13:32:29] https://www.irccloud.com/pastebin/6oBtBTxC/ [13:34:01] Maybe it's no big deal and will go away when I decommission these servers anyway. But it would help to have confidence in the switch-over if I can actually connect to them without coming in from outside. 
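When an interactive zkCli session on the node itself won't establish like this, the plain four-letter-word interface on port 2181 usually still answers and is often enough for a health check (standard ZooKeeper 3.4 commands, shown as a sketch):
```
# Liveness: should answer "imok"
echo ruok | nc localhost 2181
# Role, connection summary and mode (leader/follower)
echo stat | nc localhost 2181
# Full metrics, including zk_server_state and, on the leader, zk_synced_followers
echo mntr | nc localhost 2181
```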
[13:36:44] (03PS1) 10David Caro: db: Added a script to generate a DB schema from the models [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711133 (https://phabricator.wikimedia.org/T288523) [13:38:18] (03CR) 10jerkins-bot: [V: 04-1] db: Added a script to generate a DB schema from the models [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711133 (https://phabricator.wikimedia.org/T288523) (owner: 10David Caro) [13:42:11] Never mind, I can get the information I need with `echo mntr | nc localhost 2181` [13:43:33] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) As per [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/711120/comment/3997f42b_00e91954/ | comments ]] from @elukey on the change request, I'll upd... [13:43:46] (03PS2) 10David Caro: db: Added a script to generate a DB schema from the models [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711133 (https://phabricator.wikimedia.org/T288523) [13:43:52] (03PS1) 10David Caro: tox: Add python to the allowlist_externals [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711134 [13:47:15] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) I can use `echo mntr | nc localhost 2181` and check for `zk_synced_followers 2` to verify that the ensemble of three servers is healthy each time. Other f... [13:47:41] btullis: o/ [13:47:49] sorry I didn't see the msgs before [13:48:36] one thing to notice - during the procedure the cluster will go into a weird state, since the nodes will disagree about the nodes in it [13:49:07] for example, when an-druid1001 will be started, it will think that the cluster is composed by itself and druid100[2,3] [13:49:27] but druid100[2,3] will of course think that druid1001 is their other buddy [13:49:27] (03CR) 10David Caro: db: Added a script to generate a DB schema from the models (031 comment) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711133 (https://phabricator.wikimedia.org/T288523) (owner: 10David Caro) [13:49:38] and so on when druid1002 is restarted, etc.. [13:49:59] so the check about followers might lead to strange results, never checked it during this procedure [13:50:06] I usually use "ruok" or "stats" [13:52:16] Yes, I see what you mean. Did you see my question about rolling the druid workers between each zookeeper server shuffle as well? [13:54:12] btullis: yes I've seen it now, it is indeed needed but I'd do it after each zookeeper host swap, just to be sure [13:54:43] for the broker restart it is usually very quick, some queries may fail from turnilo/superset but it shouldn't be a big issue [13:55:43] for indexations we can probably think about stop puppet + druid-related timers on an-launcher1002, and also suspend oozie jobs that ships to druid [13:55:47] I see that from version 3.5 zookeeper has dynamic reconfiguration, allowing adding and removing servers from the ensemble. https://zookeeper.apache.org/doc/r3.6.3/zookeeperReconfig.html#sc_reconfig_modifying [13:55:49] (the hourly ones) [13:56:14] yeah but 3.5 is still not a stable release IIRC, I wish it was on debian [13:58:11] OK, so if I stop puppet and those hourly timers on an-launcher1002, then I shouldn't have to worry about disabling/re-enabling the middlemanagers during the restart, correct? [13:58:38] Just wait for any jobs other than the netflow one to finish. 
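The timer side of that quiescing step would look roughly like the following on an-launcher1002 (a sketch: the exact unit list should be taken from the host, and the Oozie coordinators are handled separately in Hue, as covered in the next messages):
```
# Sketch: pause Puppet and the Druid ingestion timers before the ZooKeeper swap (T255148)
sudo puppet agent --disable "druid zookeeper migration - T255148"
sudo systemctl stop 'eventlogging_to_druid_*_hourly.timer' 'eventlogging_to_druid_*_daily.timer'
# Make sure no ingestion .service is still mid-run before touching ZooKeeper
systemctl list-units --state=running 'eventlogging_to_druid_*'
```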
[13:59:28] exactly yes [13:59:46] you will also need to stop the ones in hue.wikimedia.org [14:00:37] basically from https://hue.wikimedia.org/hue/jobbrowser/#!schedules (remove any search filter present) [14:00:48] you'd need to select the coordinators mentioning druid hourly [14:00:57] tick them and "suspend" [14:01:00] just to be sure [14:01:13] (since there are two sources of periodic jobs, timers and oozie) [14:07:17] (03PS2) 10David Caro: tox: Add python to the allowlist_externals [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711134 [14:07:23] (03PS3) 10David Caro: db: Added a script to generate a DB schema from the models [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711133 (https://phabricator.wikimedia.org/T288523) [14:09:59] (03CR) 10David Caro: "The red warning is gone \o/" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711134 (owner: 10David Caro) [14:16:04] I'm seeing these four jobs from Hue. Does this look right? [14:16:09] https://www.irccloud.com/pastebin/SR1wlUOZ/ [14:18:49] Suspend is greyed out for me. [14:19:10] ah! you are not an admin [14:19:12] lemme fix it [14:20:00] done [14:20:35] As if by magic! Thanks, suspend button available. [14:20:44] one of the list has "KILLED" and it shouldn't be taken into consideration [14:20:48] only the orange/running ones [14:20:57] you are in the "Schedules" tab right [14:20:58] ? [14:21:01] not workflows etc.. [14:21:14] (03PS3) 10Kosta Harlan: link_suggestion_interaction: Add outdatedsuggestions_dialog interface [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/701376 (https://phabricator.wikimedia.org/T283109) [14:21:45] (03CR) 10jerkins-bot: [V: 04-1] link_suggestion_interaction: Add outdatedsuggestions_dialog interface [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/701376 (https://phabricator.wikimedia.org/T283109) (owner: 10Kosta Harlan) [14:22:17] I was having trouble searching for druid hourly jobs, so I just search for job names, then I can find them in the list on the schedules tab. Still finding my feet with Hue. [14:23:35] Hue is not great sadly, we tried to follow up with upstream to fix ui bugs but the experience was not great (namely: if we provided patches it was ok, otherwise no progresses) [14:23:53] so we basically just moved to python3 [14:24:05] (needed a repackaging of a non cloudera cdh version) [14:24:11] and we are waiting to move to Airflow [14:24:14] to drop hue :) [14:24:35] so if you are frustrated about Hue, it is all "normal", we are all in the same spot :) [14:25:54] * elukey bbiab [14:26:22] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) Updating the deployment plan again. - Disable puppet on druid100[1-3] and an-druid100[1-3] and an-launcher1002 - Disable the following four timers on an-l... [14:28:03] Cool. OK. If you're happy with the latest deployment plan on the ticket I can get on and start this this afternoon, or tomorrow morning. I haven't reached out to neteng about the wmf_netflow realtime job, but I don't expect it to fail completely. [14:46:01] (03PS4) 10Kosta Harlan: link_suggestion_interaction: Add outdatedsuggestions_dialog interface [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/701376 (https://phabricator.wikimedia.org/T283109) [14:48:37] btullis: +1 looks good, I'd add a verification step to make sure that no indexations are running etc.. 
but it is a minor nit [14:58:12] 10Quarry: quarry to python 3.7 - https://phabricator.wikimedia.org/T288528 (10mdipietro) [15:07:12] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) Puppet disabled on all affected hosts. Systemd timers disabled on an-launcher1002 Schedules disabled in Hue Zookeeper stopped and disabled on druid1001 `... [15:07:22] (03PS1) 10Michael DiPietro: upgrade quarry to python 3.7 [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711150 (https://phabricator.wikimedia.org/T288528) [15:21:19] Zookeeper restart not going terribly well. First node to restart cannot join existing quorum. `INFO [QuorumPeer[myid=1001]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@347] - Have smaller server identifier, so dropping the connection: (1002, 1001)` - All advice is to restart the current leader, but that would temporarily cause us to lose quorum. https://issues.apache.org/jira/browse/ZOOKEEPER-2938 [15:27:08] btullis: I can check on the node if you want [15:28:17] so ruok and stats seem to work [15:28:38] the election is not working, and I think it is due to the weird setup [15:28:49] I'd appreciate any ideas at this point. I was wondering about changing the `myid` parameter, so that it tries to join the ensemble with a different id. [15:28:57] nono 1001 is fine [15:29:10] I think that these errors will clear when the procedure is finished [15:29:27] it is just a fencing mechanism for election [15:29:38] I'd proceed with the leader for last though [15:29:44] (not sure which node it is) [15:30:09] It's druid1002 at the moment. I'm just concerned that it's not syncing. [15:30:12] https://www.irccloud.com/pastebin/rX0GJXob/ [15:31:23] in my opinion it is expected, it doesn't know anything about an-druid1001 [15:31:38] Interestingly, I'm not even sure why the `myid` parameters of `100[1-3]` work. From the docs: https://zookeeper.apache.org/doc/r3.4.14/zookeeperAdmin.html#sc_zkMulitServerSetup [15:31:49] > The myid file consists of a single line containing only the text of that machine's id. So myid of server 1 would contain the text "1" and nothing else. The id must be unique within the ensemble and should have a value between 1 and 255. [15:32:23] OK, well I'll continue then. [15:32:34] interesting, worth looking into it, didn't know it [15:32:52] anyway yes let's proceed, you'll surely see weird behavior until we restart druid1002 [15:35:23] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) We had an issue with an-druid1001 joining an existing ensemble. It might be OK as it is, but the advice we have found is that restarting the leader fixes... [15:48:06] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) ` btullis@druid1003:~$ sudo systemctl stop zookeeper btullis@druid1003:~$ sudo systemctl disable zookeeper zookeeper.service is not a native service, redir... [15:51:55] It looks like moving druid1003 to an-druid1003 has kicked the election off. [15:52:00] https://www.irccloud.com/pastebin/KvqipVNE/ [15:52:00] btullis: I think I made a mistake in suggesting the way forward to you, I just realized it now [15:52:14] But druid is a bit non-responsive. [15:52:32] Oh dear, shall we jump into the BC to discuss? [15:52:35] (03CR) 10Awight: "This change is ready for review."
[analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/711159 (https://phabricator.wikimedia.org/T287578) (owner: 10Awight) [15:52:46] so it was one swap at the time, I +1ed your last change but it was not ok [15:53:02] we needed to roll restart the other two zookeepers first [15:53:48] but we can quickly rollback, yes let's jump on bc [15:54:01] really sorry I was doing other things and I didn't think this carefully [15:54:07] Oh dear. Yes I see. I was worried about losing quorum since 1001 wasn't joining. [15:54:21] it is fine, but let's rollback the last change [15:54:28] and roll restart druid1002,3 [15:54:50] (and stop zookeeper on an-druid1003 [16:03:53] btullis: I think that zookeeper is fine now [16:04:10] so the next steps should be to update the configs on druid100[2,3] [16:04:15] and restart them one at the time [16:04:20] to pick up an-druid1001 [16:04:23] does it sound good? [16:07:33] https://www.irccloud.com/pastebin/z3xk4Av9/ [16:08:35] btullis: that looks good in theory, druid1002 doesn't know about an-druid1001 [16:08:38] yet [16:08:40] ^ I'm concerned about the above though. If we don't have a synced copy on an-druid1001 then aren't we going to get a loss of quorum. [16:46:07] I'm prevented from scheduling downtime by Icinga, so I had trouble running the sre.druid.roll-restart-workers cookbook. Have submitted a quick patch to Icinga config. [16:51:39] ahh right [16:52:57] Still didn't work. Do I need to restart the Icinga service after a change to `cgi.conf`? [16:53:01] `100.0% (1/1) of nodes failed to execute command 'bash -c 'echo -n.../rw/icinga.cmd '': alert1001.wikimedia.org` [17:02:33] in theory no, but puppet needs to run on the alert1001 node [17:10:15] !log sudo cookbook sre.druid.roll-restart-workers analytics (errored out) [17:10:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:11:48] mmm razzi what error did you get? [17:12:10] ah I see in #sre nice :) [17:12:11] ```----- OUTPUT of 'bash -c 'echo -n.../rw/icinga.cmd '' ----- [17:12:11] bash: -c: line 0: unexpected EOF while looking for matching `"' [17:12:11] bash: -c: line 1: syntax error: unexpected end of file [17:12:11] ``` [17:12:14] nothing on fire on the druid front [17:12:20] that was my concern :) [17:12:27] unfortunately the failing command is truncated [17:12:27] lovely error :D [17:12:44] yep yep Reuven is helping out [17:14:26] razzi: I'd be really grateful if you could take on the following steps to help me to finish up today please. [17:15:25] 1: run the `cookbook sre.druid.roll-restart-workers analytics` [17:15:58] https://www.irccloud.com/pastebin/UWvJGbwx/ [17:16:40] Reactivate the following jobs in Hue: [17:16:48] https://www.irccloud.com/pastebin/pJtI3lyW/ [17:25:23] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) We were blocked from running the sre.druid.roll-restart-workers cookbook by a bug, so we went ahead and re-enabled the timers and Hue jobs, given that Driu... 
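For the earlier concern about whether an-druid1001 is actually in sync, a per-host check over the same four-letter-word port is usually enough; the hostnames follow the discussion above and the `.eqiad.wmnet` suffix is an assumption:
```
# Sketch: expect one zk_server_state=leader reporting zk_synced_followers 2, and two followers
for h in an-druid1001 druid1002 druid1003; do
  echo "== $h =="
  echo mntr | nc "$h.eqiad.wmnet" 2181 | grep -E 'zk_server_state|zk_synced_followers'
done
```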
[17:27:00] !log resume the following schedules in hue: edit-hourly-druid-coord, pageview-druid-hourly-coord, webrequest-druid-hourly-coord [17:27:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:13:39] (03CR) 10Andrew Bogott: [C: 03+1] upgrade quarry to python 3.7 [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711150 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [19:12:34] (03CR) 10Bstorm: "Are we agreed we are abandoning pipenv, then? If we are not, this should include updates to the Pipfile and Pipfile.lock. If we are abando" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711150 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [19:23:29] (03CR) 10Bstorm: "Why is it ever running /usr/bin/python, though? That seems like an actual problem in the way tox is setting up the venv that might bite An" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711134 (owner: 10David Caro) [20:05:57] (03PS2) 10Michael DiPietro: upgrade quarry to python 3.7 [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711150 (https://phabricator.wikimedia.org/T288528) [20:50:08] (03PS1) 10Andrew Bogott: .gitreview: associate local 'buster' branch with gerrit 'buster' branch [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711198 [20:55:17] (03CR) 10Andrew Bogott: [C: 03+2] .gitreview: associate local 'buster' branch with gerrit 'buster' branch [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711198 (owner: 10Andrew Bogott) [20:57:02] (03Merged) 10jenkins-bot: .gitreview: associate local 'buster' branch with gerrit 'buster' branch [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711198 (owner: 10Andrew Bogott) [21:08:02] (03PS3) 10Michael DiPietro: upgrade quarry to python 3.7 [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711150 (https://phabricator.wikimedia.org/T288528) [21:08:57] (03CR) 10jerkins-bot: [V: 04-1] upgrade quarry to python 3.7 [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711150 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [21:18:49] (03PS1) 10Michael DiPietro: upgrade quarry to python 3.7 [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711208 (https://phabricator.wikimedia.org/T288528) [21:19:37] (03CR) 10jerkins-bot: [V: 04-1] upgrade quarry to python 3.7 [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711208 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [21:22:36] (03PS2) 10Michael DiPietro: upgrade quarry to python 3.7 [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711208 (https://phabricator.wikimedia.org/T288528) [21:22:57] (03Abandoned) 10Michael DiPietro: upgrade quarry to python 3.7 [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711150 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [21:23:03] (03CR) 10jerkins-bot: [V: 04-1] upgrade quarry to python 3.7 [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711208 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [21:49:20] (03PS1) 10Andrew Bogott: tox.ini: update to work with default buster tox version [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711211 [21:54:13] (03CR) 10Bstorm: [C: 03+1] "Probably don't even need to change the Pipfile since you removed tox-pipenv, but this would get it moving forward on the new changes." 
[analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711211 (owner: 10Andrew Bogott) [22:30:50] (03CR) 10Bstorm: [C: 03+1] "Found out why there's drift. The models using SQLAlchemy ORM were an afterthought https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711133 (https://phabricator.wikimedia.org/T288523) (owner: 10David Caro) [23:15:43] Stepping out to pick up a prescription