[06:33:25] Plumber’s coming tomorrow morning, might be 30 mins late to weds meeting depending on when they show up
[11:04:28] Lunch
[11:06:13] lunch2
[13:03:26] greetings
[13:15:07] we need to reboot some of the wcqs servers today for https://phabricator.wikimedia.org/T303179 dcausse or gehel anything I need to know before I get started?
[13:23:32] inflatador: should be straightforward: schedule downtime in icinga, depool, reboot, check that the updater and the 2 blazegraph instances are running and that no alerts are triggered.
[13:23:46] The reboot cookbook probably takes care of icinga as well
[13:24:37] ACK, will get started shortly
[13:24:38] sre.wdqs.reboot
[13:24:53] o/
[13:24:53] it looks like https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/hosts/reboot-cluster.py takes care of everything, including checking that there are no alerts post reboot
[13:25:29] gehel: you could add a check for icinga optimal
[13:25:45] https://doc.wikimedia.org/spicerack/master/api/spicerack.icinga.html#spicerack.icinga.IcingaHosts.wait_for_optimal
[13:27:34] I think that sre.wdqs.reboot predates sre.hosts.reboot-cluster. The generic one is probably sufficient now
[13:27:49] yes, it could probably be rewritten
[13:27:55] good time to check if we should remove sre.wdqs.reboot
[13:28:28] are wdqs behind LVS?
[13:28:38] yes
[13:29:51] then the generic is not good enough
[13:30:08] ah no wait
[13:30:11] I'm misremembering
[13:30:46] yeah, actually there is a third level of reboot logic that could be used
[13:31:08] but yeah, for next time
[13:31:31] using https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/__init__.py#L382
[13:38:39] FYI I've added this to the docs
[13:38:39] https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Renaming/Deleting_a_cookbook
[13:40:12] ACK, I'm going to use the wdqs reboot cookbook ( https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/wdqs/reboot.py ) unless anyone has a better idea
[13:41:50] sorry, between the sre.wdqs.reboot one and the generic sre.hosts.reboot-cluster I don't have enough context to say which one is better
[13:45:00] That's OK, the question was more directed at my team. WCQS and WDQS are running the same stack, so I assume it will work
[13:45:36] damn those look-alike names, I read wdqs instead of wcqs the whole time
[13:45:39] sorry
[13:46:21] inflatador: hopefully yes, some services have changed (wcqs-blazegraph instead of wdqs-blazegraph) but I hope that for a reboot it's not explicitly called
[13:46:58] looks like the cookbook can handle that: https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/wdqs/reboot.py#L24
[13:47:08] oh indeed
[13:47:47] not sure why we explicitly stop blazegraph and not let systemd do it
[13:48:32] it's in the comment, my bad: (to ensure they are not killed by systemd if taking too long)
[13:59:48] that sounds like something that should be fixed at the systemd unit level, not via a cookbook
[14:00:38] when you issue a reboot, after a while systemd kills everything IIRC; it's standard in db-land too to gracefully stop the unit before issuing a reboot/shutdown
[14:01:05] when you have units that take a long time to stop
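As an aside on the reboot discussion above: the per-host sequence the cookbooks implement is roughly downtime, depool, stop Blazegraph explicitly (so systemd's shutdown timeout doesn't kill it), reboot, wait for Icinga to be optimal, repool. Below is a minimal spicerack-style sketch of that flow, not the actual sre.wdqs.reboot cookbook; the unit name, reason text, and durations are assumptions for illustration.

```python
from datetime import datetime, timedelta


def reboot_wcqs_host(spicerack, fqdn, blazegraph_unit="wcqs-blazegraph"):
    """Reboot a single WCQS host: downtime, depool, stop Blazegraph, reboot, repool."""
    remote_host = spicerack.remote().query(fqdn)
    icinga_hosts = spicerack.icinga_hosts(remote_host.hosts)
    reason = spicerack.admin_reason("Rebooting WCQS hosts")  # task ID omitted here

    # Schedule Icinga downtime for the duration of the reboot.
    with icinga_hosts.downtimed(reason, duration=timedelta(minutes=30)):
        # Depool from LVS and stop Blazegraph ourselves, so a slow shutdown
        # isn't cut short by systemd's stop timeout during the reboot.
        remote_host.run_sync("depool", f"systemctl stop {blazegraph_unit}")

        reboot_time = datetime.utcnow()
        remote_host.reboot()
        remote_host.wait_reboot_since(reboot_time)

        # Wait for all Icinga checks on the host to go green, then repool.
        icinga_hosts.wait_for_optimal()
        remote_host.run_sync("pool")
```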
[14:03:13] wcqs1001.eqiad.wmnet rebooted successfully via cookbook. It did throw an alert for port 9999 but that cleared in ~2m
[14:19:38] cookbook bombed out on wcqs1002.eqiad.wmnet and did not repool, but services are healthy so I manually repooled
[16:21:31] workout, back in ~30
[16:21:52] (also for those playing along, the wcqs reboots are done)
[17:12:33] and back
[17:43:47] lunch, back in ~30
[17:44:14] dinner
[18:16:47] back
[18:29:11] ryankemper or anyone else, are you able to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/763481 and give a +1 if everything looks good?
[18:31:21] inflatador: seems plausible to me, afaik everything keys off that
[18:31:58] ebernhardson gotcha, thanks for reviewing
[19:40:36] anyone have a rough estimate on how long it takes to drain an elastic node in prod?
[19:40:52] inflatador: actually, draining everything may never happen due to shard allocation :(
[19:41:01] inflatador: oh, but this is cloudelastic... hmm, checking
[19:43:36] inflatador: tbh, i dunno :P the time will mostly be dictated by the largest indices on the node and how long they take to copy somewhere else. This is probably mostly limited by the cluster.indices.recovery.max_bytes_per_sec value in cluster settings
[19:44:29] on cloudelastic we put it at 512MB/s, prod is probably at 80MB/s but should be increased
[19:44:51] yeah, the cookbook isn't working because it wants the cluster to go back to green before moving on. My thought is that that can't happen because shards won't move from newer to older versions of ES, and restarting the service makes ES want to shuffle shards around? If that is a false assumption let me know
[19:45:58] we are only draining old instances though, so hopefully the shards are mostly moving from old->old or old->new, but the heuristic may also try to shuffle other things around
[19:49:17] I stopped the cookbook after it hung for ~15m w/ the cluster in yellow. Then I manually upgraded a second host, and the cluster went back to green after a few mins
[19:49:22] inflatador: poking at cloudelastic, it seems mostly happy right now. I see it's partially through the reboot but not complete, but the cluster is also reporting green
[19:49:25] ahh, ok that's why :)
[19:49:31] not sure if that proves anything though ;)
[19:50:03] with the cluster stuck in yellow the most common thing to look at is allocation reasons, lemme find that api
[19:51:06] I suppose two things: there is `localhost:9200/_cat/recovery?active_only=true` which will list all shards currently being moved in the cluster, those might need to complete. And then there is https://www.elastic.co/guide/en/elasticsearch/reference/6.5/cluster-allocation-explain.html for when the cluster is refusing to allocate things
[19:51:21] the allocation explain is, like too many things, tedious to read and interpret, but the information is usually there
[19:51:53] yeah, that's still very helpful, thanks
[19:55:23] OK, I'm running the cookbook again and watching the health, things are actually relocating and the unassigned shared number is dropping
[19:55:47] errr..."things" and "shared" should read "shard"
[19:57:37] I suppose another view I tend to reference when looking over the cluster is the `cluster overview dashboard`, which is a generic sre thing: https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=cloudelastic&var-instance=All&var-datasource=thanos
[19:57:48] just seeing things like big spikes in network at least means it's doing something :)
[19:59:32] * inflatador bookmarks
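A quick sketch of the checks mentioned above for a cluster stuck in yellow: list in-flight recoveries, ask allocation explain why a shard is unassigned, and raise the recovery throttle. This is just an illustration using plain requests against localhost:9200; the 512mb figure is the cloudelastic value quoted above, and the throttle is the dynamic setting indices.recovery.max_bytes_per_sec applied via _cluster/settings.

```python
import requests

ES = "http://localhost:9200"


def active_recoveries():
    """Shards currently being copied between nodes; these must finish before green."""
    resp = requests.get(f"{ES}/_cat/recovery", params={"active_only": "true", "v": "true"})
    return resp.text


def explain_unassigned():
    """Ask the cluster why it refuses to allocate a shard (verbose, but the answer is usually in there)."""
    return requests.get(f"{ES}/_cluster/allocation/explain").json()


def raise_recovery_throttle(value="512mb"):
    """Bump the shard-recovery bandwidth limit; it's a dynamic setting, no restart needed."""
    body = {"transient": {"indices.recovery.max_bytes_per_sec": value}}
    return requests.put(f"{ES}/_cluster/settings", json=body).json()
```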
[21:14:43] ebernhardson: you around? wanted to chat about the elastic upgrade stuff
[21:14:51] ryankemper: sure
[21:15:01] https://meet.google.com/uaa-neiz-opr
[21:15:18] I think one point missed in the backlog above is that elastic will not allow replica shards to be on a host with a lower version # than the primary shard
[21:16:02] https://phabricator.wikimedia.org/T301955#7794845
[21:41:45] https://www.elastic.co/guide/en/elasticsearch/reference/7.17/rolling-upgrades.html
[21:51:50] Ticket for the workaround for the cluster upgrade getting stuck after the first host: https://phabricator.wikimedia.org/T304570
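For reference, the rolling-upgrade procedure in the Elastic docs linked above works around that version constraint by restricting allocation to primaries while each node is restarted, then re-enabling it and waiting for green before moving on. A hedged Python sketch of that dance, assuming localhost:9200 and a hypothetical do_upgrade_and_restart() helper that is not defined here:

```python
import requests

ES = "http://localhost:9200"


def set_allocation(value):
    """Restrict ("primaries") or restore (None -> default "all") shard allocation."""
    body = {"persistent": {"cluster.routing.allocation.enable": value}}
    requests.put(f"{ES}/_cluster/settings", json=body).raise_for_status()


def upgrade_one_node(do_upgrade_and_restart):
    set_allocation("primaries")      # park replicas while the node is down
    try:
        do_upgrade_and_restart()     # hypothetical: package upgrade + service restart
    finally:
        set_allocation(None)         # null resets the setting so replicas can move again
    # Block until the cluster is green before touching the next node.
    requests.get(f"{ES}/_cluster/health", params={"wait_for_status": "green", "timeout": "30m"})
```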