[07:28:11] hey folks, I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138634 for a failure in fetch_external_cloud_vendors_nets
[07:28:22] if you have time lemme know what you think about it
[07:28:33] tested on puppetserver2004 (that was failing), all working fine
[08:48:48] thanks for the review Emperor!
[09:16:22] we got several alerts related to confd not being happy: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DConfdResourceFailed
[09:18:03] https://phabricator.wikimedia.org/rONED4d2d69d4af2352fe0aabca7e16df602488b90f30 seems to be the culprit
[09:18:14] I've seen it other times, IIRC it was related to host renames
[09:18:23] yep
[09:18:23] exactly yes
[09:18:41] config-master1001 confd-lint-wrap: failed linting '/usr/local/bin/pybal-eval-check /srv/config-master/pybal/codfw/.search2914144687' with 1 (0.06804943084716797s) [invalid]: { 'host': 'elastic2064.codfw.wmnet', 'weight':10, 'enabled': True } [Errno -2] Name or service not know
[09:19:42] so if the old fqdn is still in another pybal config, the rename will trigger a problem
[09:20:05] pybal doesn't have realservers in its configuration
[09:20:10] those are retrieved from etcd
[09:22:26] sure ok, I explained it the wrong way - what do you suggest then?
[09:22:51] Remove the host from the conftool puppet config, and then rename?
[09:23:39] elukey: sure.. a proper decommission of the host
[09:24:04] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138687
[09:24:12] vgutierrez: full decom? But does rename work then?
[09:24:42] or decom in the sense of "pybal decom"? :D
[09:24:46] yes sorry
[09:24:56] from pybal's PoV it's a decom + adding a new host
[09:25:06] makes sense yes
[09:25:11] Cc: inflatador: ---^
[09:25:31] +1ed
[09:29:50] same issue now with elastic2094.codfw.wmnet
[09:30:36] I am wondering if all the recent renames for elastic have the same issue
[09:32:47] they are applying the conftool changes in batches
[09:32:48] see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137782/1/conftool-data/node/codfw.yaml
[09:33:07] but leaving confd in this state is not right IMHO
[09:34:47] all right, so there is only a tiny backlog
[09:35:13] that's https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138693
[09:35:36] yes, totally agree, but I think that they didn't see the confd failures; from now on it should be fine (after today's ping etc..)
[09:36:01] thanks
[09:39:04] elastic2095.codfw.wmnet too..
[09:39:11] let's see if I can get a complete list of impacted hosts
[09:42:40] this https://phabricator.wikimedia.org/source/netbox-exported-dns/history/master/ helps :D
[09:47:03] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138695
[09:47:11] sorry for flooding you with the CRs /o\
[09:56:01] done! I was afk, sorry
[09:56:29] elukey: your p99 latency SLI isn't looking good
[09:56:35] O:)
[09:56:35] it is not a problem, on the contrary thanks for the clean up! Although if you want to work with me it is fine, you don't have to find excuses!
[09:56:43] :D :D
[09:56:52] yes yes, terrible, not only the p99 :D
[10:01:10] did I miss elastic2095? :( sniff
[10:03:45] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138704
[10:21:01] alerts cleared after that :)
[11:10:08] post+1
[13:14:57] sorry for the trouble with elastic hosts, everyone. We will remove hosts from conftool before reimaging in the future. vgutierrez if there are any other suggestions please let me know
[13:23:34] inflatador: np! It was just an FYI for the next time; confd wasn't happy but nothing was really broken!
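For context on the lint failure quoted at 09:18:41: confd renders pybal pool files from the realserver objects in etcd, and the lint step fails as soon as an entry's hostname stops resolving, which is exactly what a rename does. Below is a minimal sketch of that kind of check, assuming (as the quoted error output suggests) one Python dict literal per realserver per line; `lint_pool_file` is an illustrative name, and the real pybal-eval-check may differ in detail.

```python
#!/usr/bin/env python3
"""Minimal sketch of a pybal pool-file lint pass (not the actual
pybal-eval-check): parse each realserver entry and verify its host
still resolves."""
import ast
import socket
import sys


def lint_pool_file(path):
    """Yield (lineno, host, error) for every realserver whose fqdn
    no longer resolves, e.g. after a host rename."""
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            # One entry per line, e.g.:
            # { 'host': 'elastic2064.codfw.wmnet', 'weight': 10, 'enabled': True }
            server = ast.literal_eval(line)
            try:
                socket.getaddrinfo(server['host'], None)
            except socket.gaierror as err:
                # This is the "[Errno -2] Name or service not known"
                # seen in the confd-lint-wrap alert above.
                yield lineno, server['host'], err


if __name__ == '__main__':
    failed = False
    for lineno, host, err in lint_pool_file(sys.argv[1]):
        print(f'line {lineno}: {host}: {err}', file=sys.stderr)
        failed = True
    sys.exit(1 if failed else 0)
```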
[13:30:26] sukhe: hi, any specific thing to keep in mind when merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138342?
[13:33:09] taavi: not really; I tend to disable puppet on A:dnsbox and test on one host out of an abundance of caution. gdnsd should reload itself, and just check that it looks good before moving on to other hosts
[13:33:20] and then the geo-maps change which you already did should follow that
[13:33:40] (which I am looking at now)
[13:33:53] thanks!
[14:24:20] elukey if you're still around and have time to review, this patch'll remove all the remaining elastic hosts from conftool: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138804
[14:26:22] done!
[14:29:58] thanks! re: your comment, elastic/opensearch passes everything to the master anyway, so we can lose that capacity.
[14:30:58] or rather, we can lose hosts from the LVS pools without really losing their capacity
[14:32:25] super then
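The takeaway at 13:14:57 — remove hosts from conftool before reimaging or renaming — lends itself to a pre-merge sanity check. The sketch below walks a conftool-data node file using the dc → cluster → fqdn → [services] layout visible in the change linked at 09:32:48, and flags any fqdn that no longer resolves; `stale_hosts` is a hypothetical helper, not an existing WMF tool, and the assumed layout should be checked against the actual file.

```python
#!/usr/bin/env python3
"""Illustrative pre-merge check (not an existing tool): flag conftool-data
hosts whose fqdn no longer resolves, i.e. hosts renamed in DNS but not yet
removed from conftool."""
import socket
import sys

import yaml  # PyYAML


def stale_hosts(path):
    """Assumed layout, per conftool-data/node/codfw.yaml:
    dc -> cluster -> fqdn -> [services]."""
    with open(path) as f:
        data = yaml.safe_load(f)
    for dc, clusters in data.items():
        for cluster, hosts in clusters.items():
            for fqdn in hosts:
                try:
                    socket.getaddrinfo(fqdn, None)
                except socket.gaierror:
                    yield dc, cluster, fqdn


if __name__ == '__main__':
    stale = list(stale_hosts(sys.argv[1]))
    for dc, cluster, fqdn in stale:
        print(f'{dc}/{cluster}: {fqdn} does not resolve')
    sys.exit(1 if stale else 0)
```

Run against the file touched by the batched rename change, a check like this would likely have flagged elastic2064, elastic2094, and elastic2095 before confd started alerting.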