[00:08:28] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:14:48] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:23:40] PROBLEM - Check unit status of eventlogging_to_druid_netflow_daily on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_daily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:54:23] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Review access change [analytics/reportupdater-queries] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/709646 (https://phabricator.wikimedia.org/T287578) (owner: 10Awight) [01:09:16] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [01:18:22] 10Analytics: Data Loss Check always shows false positives - https://phabricator.wikimedia.org/T288496 (10Milimetric) [02:10:10] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:30:07] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:01:05] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:06:47] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:12:53] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:00:15] RECOVERY - Check unit status of refinery-import-page-current-dumps on an-launcher1002 is OK: OK: Status of the systemd unit refinery-import-page-current-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:07:03] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:07:51] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:13:03] 
PROBLEM - Check unit status of refinery-import-page-current-dumps on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-import-page-current-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:37:06] 10Analytics-Clusters, 10Analytics-Kanban: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10elukey) Hey Ben great work! I couple of things to remember for the decom: 1) I am not sure if there is a way to force overlord/middlemanager to stop accepting indexation jobs, or if we... [06:37:37] btullis_: o/ sorry I was afk yesterday and I didn't see the code reviews, left some notes for the decom in the task, all good afaics :) [06:46:08] ah there are also a couple of failed indexation jobs on the new druid nodes [06:49:41] 10Analytics-Clusters, 10Analytics-Kanban: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10elukey) While checking the indexation failures I noticed: ` 2021-08-10T06:01:39,961 INFO org.apache.druid.indexing.overlord.ForkingTaskRunner: Exception caught during execution java.io.... [07:07:45] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:13:41] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:18:22] (03CR) 10Awight: "Thanks!" [analytics/reportupdater-queries] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/709646 (https://phabricator.wikimedia.org/T287578) (owner: 10Awight) [07:18:47] (03CR) 10Awight: [V: 03+2 C: 03+2] "Ready to go :-)" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578) (owner: 10Svantje Lilienthal) [07:40:10] 10Analytics, 10Dumps-Generation: xmldatadumps dumpstatus.json files only readable by root - https://phabricator.wikimedia.org/T287989 (10elukey) 05Resolved→03Open Hi Ariel! Sorry to re-open, but I found: ` elukey@dumpsdata1003:/data/xmldatadumps/public$ ls -l /data/xmldatadumps/public/wikidatawiki/2021080... [07:41:40] I reopened --^ since there is a file that causes the import jobs to fail [07:41:49] once fixed we should be able to re-run without problems [07:53:19] 10Analytics, 10Dumps-Generation: xmldatadumps dumpstatus.json files only readable by root - https://phabricator.wikimedia.org/T287989 (10ArielGlenn) >>! In T287989#7271424, @elukey wrote: > Hi Ariel! Sorry to re-open, but I found: > > ` > elukey@dumpsdata1003:/data/xmldatadumps/public$ ls -l /data/xmldatadump... [07:56:35] RECOVERY - Check unit status of refinery-import-page-current-dumps on an-launcher1002 is OK: OK: Status of the systemd unit refinery-import-page-current-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:58:55] 10Analytics, 10Dumps-Generation: xmldatadumps dumpstatus.json files only readable by root - https://phabricator.wikimedia.org/T287989 (10elukey) @ArielGlenn thanks a lot! There is one last little issue, namely that the `dumpstatus.json` file aforementioned now seems not to be valid json. I checked `/data/xmlda... 
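A quick way to reproduce the "not valid json" finding before re-running the import timers is to parse the file with the Python standard library; this is only a sketch, and the path below is a placeholder since the full one is truncated in the comment above:
```
# Sketch: check that dumpstatus.json parses; a 0-byte or truncated file will fail here too
python3 -m json.tool /data/xmldatadumps/public/<wiki>/<dump-date>/dumpstatus.json > /dev/null \
  && echo "valid JSON" \
  || echo "invalid JSON"
```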
[08:05:00] 10Analytics, 10Dumps-Generation: xmldatadumps dumpstatus.json files only readable by root - https://phabricator.wikimedia.org/T287989 (10ArielGlenn) >>! In T287989#7271463, @elukey wrote: > @ArielGlenn thanks a lot! There is one last little issue, namely that the `dumpstatus.json` file aforementioned now seems... [08:13:19] 10Analytics, 10Dumps-Generation: xmldatadumps dumpstatus.json files only readable by root - https://phabricator.wikimedia.org/T287989 (10ArielGlenn) After IRC discussion: waiting a few days is ok, in the meantime I have put a 0 byte file in its place. [08:17:01] RECOVERY - Check unit status of refinery-import-page-history-dumps on an-launcher1002 is OK: OK: Status of the systemd unit refinery-import-page-history-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:18:55] RECOVERY - Check unit status of refinery-import-siteinfo-dumps on an-launcher1002 is OK: OK: Status of the systemd unit refinery-import-siteinfo-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:27:34] (03CR) 10David Caro: [C: 03+2] docs: added docker compose link and minor rewording [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/709951 (owner: 10David Caro) [08:27:53] (03CR) 10David Caro: [V: 03+1 C: 03+2] docs: added docker compose link and minor rewording [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/709951 (owner: 10David Caro) [08:28:04] (03CR) 10David Caro: [V: 03+2 C: 03+2] docs: added docker compose link and minor rewording [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/709951 (owner: 10David Caro) [08:54:21] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) Thanks @elukey. I've addressed the ownership problem on an-druid1005 with: `btullis@an-druid1005:~$ sudo chown druid:druid /srv/druid/deep-storage /srv/dru... [08:55:31] elukey: Thanks for all of the help re the decom. Do you think I need to re-run anything because of the indexation failures? [08:56:12] btullis: o/ yes on an-launcher1002 there are a couple of failed timers, it should be sufficient to systemctl restart those [08:56:32] they will kick off new indexations (no guarantee that they will land on new nodes) [09:00:59] Thanks. It's the `.service` component that shows as having failed, but you're saying that a restart of the `.timer` component should kick off a new instance of each of the .service job, right? [09:01:03] https://usercontent.irccloud-cdn.com/file/NAZEO0N9/image.png [09:02:51] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:04:06] btullis: nono a restart of .service is fine [09:04:25] the .timer component doesn't need a restart, it is the .service that failed [09:04:41] in theory we could wait for the periodic run of the job to catch up [09:04:52] !log btullis@an-launcher1002:~$ sudo systemctl restart eventlogging_to_druid_netflow_daily.service [09:04:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:04:59] !log btullis@an-launcher1002:~$ sudo systemctl restart eventlogging_to_druid_prefupdate_hourly.service [09:05:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:05:17] Gotcha, thanks. 
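The fix discussed here, restarting the failed .service rather than the .timer, looks roughly like this on an-launcher1002 (unit names taken from the alerts above; a sketch, not a record of the exact commands that were run):
```
# Show which ingestion units actually failed (the .timer usually stays active)
sudo systemctl list-units --state=failed 'eventlogging_to_druid_*'
# Restarting the .service re-runs the indexation right away; the .timer needs no restart
sudo systemctl restart eventlogging_to_druid_prefupdate_hourly.service
sudo systemctl status eventlogging_to_druid_prefupdate_hourly.service --no-pager
```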
[09:05:35] perfect :) [09:13:43] RECOVERY - Check unit status of eventlogging_to_druid_netflow_daily on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_daily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:42:02] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) > 2 Turnilo and Superset are configured to target a specific host's broker (manually set in their configs). For Turnilo the host is listed in puppet, for... [10:04:44] Hi, is there anything in the analytics data lake about if a mw job got succeeded or failed, etc. [10:12:10] amir1: I might be able to help here. Can you be any more specific about what you mean by 'mw job' though? [10:13:45] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10elukey) >>! In T255148#7271658, @BTullis wrote: > > Am I right in thinking that we hardcode the brokers' addresses because we haven't got access to a load-balancer... [10:14:24] btullis: I want outcome of this event https://phabricator.wikimedia.org/T278924#7271651 [10:15:04] https://wikitech.wikimedia.org/wiki/Kafka_Job_Queue [10:27:22] I think that this data is stored in Hive. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Hive - but I'm struggling to find which table your `refreshLinks` event would go to. [10:41:18] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) This looks like the best way to prevent new jobs being added to a middle-manager, prior to decommission: https://druid.apache.org/docs/latest/operations/ro... [10:43:39] I'm at lunch, once I back, I dig [12:22:24] (03PS2) 10Kosta Harlan: link_suggestion_interaction: Add outdatedsuggestions_dialog interface [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/701376 (https://phabricator.wikimedia.org/T283109) [12:23:17] (03CR) 10jerkins-bot: [V: 04-1] link_suggestion_interaction: Add outdatedsuggestions_dialog interface [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/701376 (https://phabricator.wikimedia.org/T283109) (owner: 10Kosta Harlan) [12:53:59] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) I had put the following process down for the zookeeper switch. When deploying this change, we will need to: * Manually stop zookeeper on druid1001 * Manu... [13:31:19] Has anyone seen this before. I'm unable to get the zookeeper client to connect to the ensemble running on druid100[1-3] from themselves. [13:31:27] https://www.irccloud.com/pastebin/RbprfwEZ/ [13:32:25] I can connect if I use a zkCli from another host, such as an-conf1001. [13:32:29] https://www.irccloud.com/pastebin/6oBtBTxC/ [13:34:01] Maybe it's no big deal and will go away when I decommission these servers anyway. But it would help to have confidence in the switch-over if I can actually connect to them without coming in from outside. 
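When an interactive zkCli session on the node itself won't establish like this, the plain four-letter-word interface on port 2181 usually still answers and is often enough for a health check (standard ZooKeeper 3.4 commands, shown as a sketch):
```
# Liveness: should answer "imok"
echo ruok | nc localhost 2181
# Role, connection summary and mode (leader/follower)
echo stat | nc localhost 2181
# Full metrics, including zk_server_state and, on the leader, zk_synced_followers
echo mntr | nc localhost 2181
```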
[13:36:44] (03PS1) 10David Caro: db: Added a script to generate a DB schema from the models [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711133 (https://phabricator.wikimedia.org/T288523) [13:38:18] (03CR) 10jerkins-bot: [V: 04-1] db: Added a script to generate a DB schema from the models [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711133 (https://phabricator.wikimedia.org/T288523) (owner: 10David Caro) [13:42:11] Never mind, I can get the information I need with `echo mntr | nc localhost 2181` [13:43:33] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) As per [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/711120/comment/3997f42b_00e91954/ | comments ]] from @elukey on the change request, I'll upd... [13:43:46] (03PS2) 10David Caro: db: Added a script to generate a DB schema from the models [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711133 (https://phabricator.wikimedia.org/T288523) [13:43:52] (03PS1) 10David Caro: tox: Add python to the allowlist_externals [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711134 [13:47:15] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) I can use `echo mntr | nc localhost 2181` and check for `zk_synced_followers 2` to verify that the ensemble of three servers is healthy each time. Other f... [13:47:41] btullis: o/ [13:47:49] sorry I didn't see the msgs before [13:48:36] one thing to notice - during the procedure the cluster will go into a weird state, since the nodes will disagree about the nodes in it [13:49:07] for example, when an-druid1001 will be started, it will think that the cluster is composed by itself and druid100[2,3] [13:49:27] but druid100[2,3] will of course think that druid1001 is their other buddy [13:49:27] (03CR) 10David Caro: db: Added a script to generate a DB schema from the models (031 comment) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711133 (https://phabricator.wikimedia.org/T288523) (owner: 10David Caro) [13:49:38] and so on when druid1002 is restarted, etc.. [13:49:59] so the check about followers might lead to strange results, never checked it during this procedure [13:50:06] I usually use "ruok" or "stats" [13:52:16] Yes, I see what you mean. Did you see my question about rolling the druid workers between each zookeeper server shuffle as well? [13:54:12] btullis: yes I've seen it now, it is indeed needed but I'd do it after each zookeeper host swap, just to be sure [13:54:43] for the broker restart it is usually very quick, some queries may fail from turnilo/superset but it shouldn't be a big issue [13:55:43] for indexations we can probably think about stop puppet + druid-related timers on an-launcher1002, and also suspend oozie jobs that ships to druid [13:55:47] I see that from version 3.5 zookeeper has dynamic reconfiguration, allowing adding and removing servers from the ensemble. https://zookeeper.apache.org/doc/r3.6.3/zookeeperReconfig.html#sc_reconfig_modifying [13:55:49] (the hourly ones) [13:56:14] yeah but 3.5 is still not a stable release IIRC, I wish it was on debian [13:58:11] OK, so if I stop puppet and those hourly timers on an-launcher1002, then I shouldn't have to worry about disabling/re-enabling the middlemanagers during the restart, correct? [13:58:38] Just wait for any jobs other than the netflow one to finish. 
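The timer side of that quiescing step would look roughly like the following on an-launcher1002 (a sketch: the exact unit list should be taken from the host, and the Oozie coordinators are handled separately in Hue, as covered in the next messages):
```
# Sketch: pause Puppet and the Druid ingestion timers before the ZooKeeper swap (T255148)
sudo puppet agent --disable "druid zookeeper migration - T255148"
sudo systemctl stop 'eventlogging_to_druid_*_hourly.timer' 'eventlogging_to_druid_*_daily.timer'
# Make sure no ingestion .service is still mid-run before touching ZooKeeper
systemctl list-units --state=running 'eventlogging_to_druid_*'
```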
[13:59:28] exactly yes [13:59:46] you will also need to stop the ones in hue.wikimedia.org [14:00:37] basically from https://hue.wikimedia.org/hue/jobbrowser/#!schedules (remove any search filter present) [14:00:48] you'd need to select the coordinators mentioning druid hourly [14:00:57] tick them and "suspend" [14:01:00] just to be sure [14:01:13] (since there are two sources of periodic jobs, timers and oozie) [14:07:17] (03PS2) 10David Caro: tox: Add python to the allowlist_externals [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711134 [14:07:23] (03PS3) 10David Caro: db: Added a script to generate a DB schema from the models [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711133 (https://phabricator.wikimedia.org/T288523) [14:09:59] (03CR) 10David Caro: "The red warning is gone \o/" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711134 (owner: 10David Caro) [14:16:04] I'm seeing these four jobs from Hue. Does this look right? [14:16:09] https://www.irccloud.com/pastebin/SR1wlUOZ/ [14:18:49] Suspend is greyed out for me. [14:19:10] ah! you are not an admin [14:19:12] lemme fix it [14:20:00] done [14:20:35] As if by magic! Thanks, suspend button available. [14:20:44] one of the list has "KILLED" and it shouldn't be taken into consideration [14:20:48] only the orange/running ones [14:20:57] you are in the "Schedules" tab right [14:20:58] ? [14:21:01] not workflows etc.. [14:21:14] (03PS3) 10Kosta Harlan: link_suggestion_interaction: Add outdatedsuggestions_dialog interface [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/701376 (https://phabricator.wikimedia.org/T283109) [14:21:45] (03CR) 10jerkins-bot: [V: 04-1] link_suggestion_interaction: Add outdatedsuggestions_dialog interface [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/701376 (https://phabricator.wikimedia.org/T283109) (owner: 10Kosta Harlan) [14:22:17] I was having trouble searching for druid hourly jobs, so I just search for job names, then I can find them in the list on the schedules tab. Still finding my feet with Hue. [14:23:35] Hue is not great sadly, we tried to follow up with upstream to fix ui bugs but the experience was not great (namely: if we provided patches it was ok, otherwise no progresses) [14:23:53] so we basically just moved to python3 [14:24:05] (needed a repackaging of a non cloudera cdh version) [14:24:11] and we are waiting to move to Airflow [14:24:14] to drop hue :) [14:24:35] so if you are frustrated about Hue, it is all "normal", we are all in the same spot :) [14:25:54] * elukey bbiab [14:26:22] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) Updating the deployment plan again. - Disable puppet on druid100[1-3] and an-druid100[1-3] and an-launcher1002 - Disable the following four timers on an-l... [14:28:03] Cool. OK. If you're happy with the latest deployment plan on the ticket I can get on and start this this afternoon, or tomorrow morning. I haven't reached out to neteng about the wmf_netflow realtime job, but I don't expect it to fail completely. [14:46:01] (03PS4) 10Kosta Harlan: link_suggestion_interaction: Add outdatedsuggestions_dialog interface [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/701376 (https://phabricator.wikimedia.org/T283109) [14:48:37] btullis: +1 looks good, I'd add a verification step to make sure that no indexations are running etc.. 
but it is a minor nit [14:58:12] 10Quarry: quarry to python 3.7 - https://phabricator.wikimedia.org/T288528 (10mdipietro) [15:07:12] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) Puppet disabled on all affected hosts. Systemd timers disabled on an-launcher1002 Schedules disabled in Hue Zookeeper stopped and disabled on druid1001 `... [15:07:22] (03PS1) 10Michael DiPietro: upgrade quarry to python 3.7 [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711150 (https://phabricator.wikimedia.org/T288528) [15:21:19] Zookeeper restart not going terribly well. First node to restart cannot join existing quorum. `INFO [QuorumPeer[myid=1001]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@347] - Have smaller server identifier, so dropping the connection: (1002, 1001)` - All advice is to restart the current leader, but that would temporarily cause us to lose quorum. https://issues.apache.org/jira/browse/ZOOKEEPER-2938 [15:27:08] btullis: I can check on the node if you want [15:28:17] so ruok and stats seem to work [15:28:38] the election is not working, and I think it is due to the weird setup [15:28:49] I'd appreciate any ideas at this point. I was wondering about changing the `myid` parameter, so that it tries to join the ensemble with a different id. [15:28:57] nono 1001 is fine [15:29:10] I think that these errors will clear when the procedure is finished [15:29:27] it is just a fencing mechanism for election [15:29:38] I'd proceed with the leader for last though [15:29:44] (not sure which node it is) [15:30:09] It's druid1002 at the moment. I'm just concerned that it's not syncing. [15:30:12] https://www.irccloud.com/pastebin/rX0GJXob/ [15:31:23] in my opinion it is expected, it doesn't know anything about an-druid1001 [15:31:38] Interestingly, I'm not even sure why the `myid` parameters of `100[1-3]` work. From the docs: https://zookeeper.apache.org/doc/r3.4.14/zookeeperAdmin.html#sc_zkMulitServerSetup [15:31:49] > The myid file consists of a single line containing only the text of that machine's id. So myid of server 1 would contain the text "1" and nothing else. The id must be unique within the ensemble and should have a value between 1 and 255. [15:32:23] OK, well I'll continue then. [15:32:34] interesting, worth looking into it, didn't know it [15:32:52] anyway yes let's proceed, you'll surely see weird behavior until we restart druid1002 [15:35:23] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) We had an issue with an-druid1001 joining an existing ensemble. It might be OK as it is, but the advice we have found is that restarting the leader fixes... [15:48:06] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) ` btullis@druid1003:~$ sudo systemctl stop zookeeper btullis@druid1003:~$ sudo systemctl disable zookeeper zookeeper.service is not a native service, redir... [15:51:55] It looks like moving druid1003 to an-druid1003 has kicked the election off. [15:52:00] https://www.irccloud.com/pastebin/KvqipVNE/ [15:52:00] btullis: I think I made a mistake in suggesting the way forward to you, I just realized it now [15:52:14] But druid is a bit non-responsive. [15:52:32] Oh dear, shall we jump into the BC to discuss? [15:52:35] (03CR) 10Awight: "This change is ready for review."
[analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/711159 (https://phabricator.wikimedia.org/T287578) (owner: 10Awight) [15:52:46] so it was one swap at the time, I +1ed your last change but it was not ok [15:53:02] we needed to roll restart the other two zookeepers first [15:53:48] but we can quickly rollback, yes let's jump on bc [15:54:01] really sorry I was doing other things and I didn't think this carefully [15:54:07] Oh dear. Yes I see. I was worried about losing quorum since 1001 wasn't joining. [15:54:21] it is fine, but let's rollback the last change [15:54:28] and roll restart druid1002,3 [15:54:50] (and stop zookeeper on an-druid1003 [16:03:53] btullis: I think that zookeeper is fine now [16:04:10] so the next steps should be to update the configs on druid100[2,3] [16:04:15] and restart them one at the time [16:04:20] to pick up an-druid1001 [16:04:23] does it sound good? [16:07:33] https://www.irccloud.com/pastebin/z3xk4Av9/ [16:08:35] btullis: that looks good in theory, druid1002 doesn't know about an-druid1001 [16:08:38] yet [16:08:40] ^ I'm concerned about the above though. If we don't have a synced copy on an-druid1001 then aren't we going to get a loss of quorum. [16:46:07] I'm prevented from scheduling downtime by Icinga, so I had trouble running the sre.druid.roll-restart-workers cookbook. Have submitted a quick patch to Icinga config. [16:51:39] ahh right [16:52:57] Still didn't work. Do I need to restart the Icinga service after a change to `cgi.conf`? [16:53:01] `100.0% (1/1) of nodes failed to execute command 'bash -c 'echo -n.../rw/icinga.cmd '': alert1001.wikimedia.org` [17:02:33] in theory no, but puppet needs to run on the alert1001 node [17:10:15] !log sudo cookbook sre.druid.roll-restart-workers analytics (errored out) [17:10:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:11:48] mmm razzi what error did you get? [17:12:10] ah I see in #sre nice :) [17:12:11] ```----- OUTPUT of 'bash -c 'echo -n.../rw/icinga.cmd '' ----- [17:12:11] bash: -c: line 0: unexpected EOF while looking for matching `"' [17:12:11] bash: -c: line 1: syntax error: unexpected end of file [17:12:11] ``` [17:12:14] nothing on fire on the druid front [17:12:20] that was my concern :) [17:12:27] unfortunately the failing command is truncated [17:12:27] lovely error :D [17:12:44] yep yep Reuven is helping out [17:14:26] razzi: I'd be really grateful if you could take on the following steps to help me to finish up today please. [17:15:25] 1: run the `cookbook sre.druid.roll-restart-workers analytics` [17:15:58] https://www.irccloud.com/pastebin/UWvJGbwx/ [17:16:40] Reactivate the following jobs in Hue: [17:16:48] https://www.irccloud.com/pastebin/pJtI3lyW/ [17:25:23] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) We were blocked from running the sre.druid.roll-restart-workers cookbook by a bug, so we went ahead and re-enabled the timers and Hue jobs, given that Driu... 
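For the earlier concern about whether an-druid1001 is actually in sync, a per-host check over the same four-letter-word port is usually enough; the hostnames follow the discussion above and the `.eqiad.wmnet` suffix is an assumption:
```
# Sketch: expect one zk_server_state=leader reporting zk_synced_followers 2, and two followers
for h in an-druid1001 druid1002 druid1003; do
  echo "== $h =="
  echo mntr | nc "$h.eqiad.wmnet" 2181 | grep -E 'zk_server_state|zk_synced_followers'
done
```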
[17:27:00] !log resume the following schedules in hue: edit-hourly-druid-coord, pageview-druid-hourly-coord, webrequest-druid-hourly-coord [17:27:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:13:39] (03CR) 10Andrew Bogott: [C: 03+1] upgrade quarry to python 3.7 [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711150 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [19:12:34] (03CR) 10Bstorm: "Are we agreed we are abandoning pipenv, then? If we are not, this should include updates to the Pipfile and Pipfile.lock. If we are abando" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711150 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [19:23:29] (03CR) 10Bstorm: "Why is it ever running /usr/bin/python, though? That seems like an actual problem in the way tox is setting up the venv that might bite An" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711134 (owner: 10David Caro) [20:05:57] (03PS2) 10Michael DiPietro: upgrade quarry to python 3.7 [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711150 (https://phabricator.wikimedia.org/T288528) [20:50:08] (03PS1) 10Andrew Bogott: .gitreview: associate local 'buster' branch with gerrit 'buster' branch [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711198 [20:55:17] (03CR) 10Andrew Bogott: [C: 03+2] .gitreview: associate local 'buster' branch with gerrit 'buster' branch [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711198 (owner: 10Andrew Bogott) [20:57:02] (03Merged) 10jenkins-bot: .gitreview: associate local 'buster' branch with gerrit 'buster' branch [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711198 (owner: 10Andrew Bogott) [21:08:02] (03PS3) 10Michael DiPietro: upgrade quarry to python 3.7 [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711150 (https://phabricator.wikimedia.org/T288528) [21:08:57] (03CR) 10jerkins-bot: [V: 04-1] upgrade quarry to python 3.7 [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711150 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [21:18:49] (03PS1) 10Michael DiPietro: upgrade quarry to python 3.7 [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711208 (https://phabricator.wikimedia.org/T288528) [21:19:37] (03CR) 10jerkins-bot: [V: 04-1] upgrade quarry to python 3.7 [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711208 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [21:22:36] (03PS2) 10Michael DiPietro: upgrade quarry to python 3.7 [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711208 (https://phabricator.wikimedia.org/T288528) [21:22:57] (03Abandoned) 10Michael DiPietro: upgrade quarry to python 3.7 [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711150 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [21:23:03] (03CR) 10jerkins-bot: [V: 04-1] upgrade quarry to python 3.7 [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711208 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [21:49:20] (03PS1) 10Andrew Bogott: tox.ini: update to work with default buster tox version [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711211 [21:54:13] (03CR) 10Bstorm: [C: 03+1] "Probably don't even need to change the Pipfile since you removed tox-pipenv, but this would get it moving forward on the new changes." 
[analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711211 (owner: 10Andrew Bogott) [22:30:50] (03CR) 10Bstorm: [C: 03+1] "Found out why there's drift. The models using SQLAlchemy ORM were an afterthought https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711133 (https://phabricator.wikimedia.org/T288523) (owner: 10David Caro) [23:15:43] Stepping out to pick up a prescription