[09:20:42] jbond: there is an alert on config masters https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3Dconfig-master.wikimedia.org%20requires%20authentication, is that related to anything you were doing recently?
[09:21:23] effie: yes i'll take a look thanks
[09:21:29] thank you!
[09:40:45] jbond: it seems that the /usr/local/lib/nagios/plugins/check_puppetrun fails to load the puppet report in case puppet failed (when it tries), as it uses safe_load but the report passes a class `--- !ruby/object:Puppet::Transaction::Report` and it's not allowed (`Tried to load unspecified class: Puppet::Transaction::Report (Psych::DisallowedClass)`), is that something you are aware of and looking into, or should I create a task and give it a
[09:40:45] look?
[09:42:29] dcaro: could it be this https://phabricator.wikimedia.org/T337951
[09:42:46] oh yes, looks like it :)
[09:43:16] dcaro: can you add the report it's failing on to that task
[09:43:26] 👍
[09:43:33] cheers
[12:32:22] Can someone please restart Apache on lists1001 as the LE cert is flapping between old & new and very close to expiry
[12:44:56] RhinosF1: done
[12:45:52] XioNoX: sukhe: FYI i have sent an update to sre.hosts.reimage to again use the microservice port, now with a hopeful fix for the issue you were seeing. please ping if you see more issues
[12:46:05] the update: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/949179
[12:48:12] klausman: thanks, https://phabricator.wikimedia.org/T293826 is the task of interest
[13:10:43] jbond: thanks! looking shortly
[13:14:34] jbond: cool, fix makes sense to me
[13:15:09] yeah as long as we are searching for the title, it should work so +1 from me as well
[13:16:05] great :)
[13:24:30] hi all i'm planning to move config-master from the puppetmasters to dedicated VMs. AFAIK the only thing that uses config-master is puppet-merge. the other use cases are simply informational.
but i wanted to confirm so if you are aware of anything else that relies on config-master please speak up
[13:24:55] Amir1: possibly https://config-master.wikimedia.org/pools.json ?
[13:32:30] jbond: Amir.1 is out this week
[13:33:04] Emperor: ack thanks
[13:43:01] XioNoX: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/949100
[13:43:06] going to merge this, if that's fine?
[13:43:32] sukhe: yep
[13:43:38] let me know if there is any issue
[13:43:51] maybe some policies are missing and the homer run will fail
[13:43:54] ok thanks
[13:44:03] still running this on cr*-esams* correct?
[13:44:29] sukhe: asw1-b*27-esams*
[13:44:35] :) makes sense
[13:45:16] moritzm: We've both got puppet-merges ready to go. Feel free to merge mine.
[13:48:05] ack, done, both merged now
[13:48:20] Thanks.
[13:49:12] XioNoX: looks good I think
[13:49:12] 185.15.59.2 64605 6 3 0 0 1:17 Establ inet.0: 2/2/2/0
[13:50:37] sukhe: niiiiice!
[14:02:17] XioNoX: anycast_neighbors is fine, however, lvs_neighbors (both LVS hosts) are not it seems
[14:02:20] 10.80.0.2 64600 0 0 0 0 14:53 Active
[14:02:23] 10.80.0.3 64600 0 0 0 0 14:53 Active
[14:03:03] this is lvs3008 and 3010 in asw1-bw27-esams
[14:03:25] sukhe: there must be something in puppet like we did for the anycast hosts
[14:03:39] on my phone for like 30min
[14:03:46] np, I will take a peek in the meantime!
[14:05:45] to define the routers
[14:11:27] XioNoX: found it
[14:11:33] patching it
[14:28:21] XioNoX: https://gerrit.wikimedia.org/r/c/operations/puppet/+/949529
[14:30:12] herron: are you around?
[14:30:35] effie: in a meeting atm
[14:30:48] should be free in ~30
[14:31:05] I will have some time then, please ping
[14:45:09] sukhe: https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/pybal.pp#L30
[14:45:46] XioNoX: yep, good catch
[14:45:52] will grep for other such IPs as well
[14:46:02] so probably the above patch + this then
[14:47:24] effie: hey ready when you are
[14:48:52] sukhe: you can grep for "91.198.174." there are other mentions around
[14:48:53] generally speaking, the root certs in the running container of tegola are the same as the ones we have everywhere
[14:50:28] on the other hand, we have a new build with newer go lang, though so far it does not appear to be the issue
[14:51:38] XioNoX: yeah, decomm is going to be on this one :)
[14:51:39] herron: would it be possible to do something like a test rollout on codfw (which is depooled)?
[14:52:20] we could have a go tomorrow the earliest possible for you
[14:53:02] so we could get some more data around the problem, unless it automagically disappears with the new build
[14:54:54] hmm, yes possibly, thinking about how best to partially deploy the new cert config to thanos-fe for testing
[14:55:13] I don't think the certs have fqdn currently, maybe we can add that
[14:56:03] I could use 1 host as well, and test directly against that one on codfw, if that could work
[14:57:34] the other thing I found a bit odd is
[14:57:34] https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=thanos-fe2004&var-datasource=thanos&var-cluster=thanos&from=1692019683231&to=1692025729636&viewPanel=20
[14:58:32] ok, sure I'll work on a patch for rolling cfssl out to just one codfw host and we could go from there
[14:58:42] those errors appear when the host is trying to connect to a closed port, not sure how it fits this problem exactly
[14:59:03] yeah IIRC we didn't see any cert validation errors specifically, although the issue started and ended with the cfssl
switchover/revert
[14:59:04] herron: why not use the same patch, disable puppet on all prom hosts, run puppet in codfw
[14:59:16] test and if everything is fine continue with the roll out, otherwise revert
[14:59:33] might be easier than trying to hack the puppet manifest to support changing just one node
[14:59:40] we could do that although I was thinking testing may take some time, but if it's quick sure
[15:00:00] * jbond is not sure how long it would take
[15:00:29] jbond: it is highly likely that we will roll back
[15:01:54] but sure, we could try it
[15:02:12] herron: what is the earliest possible tomorrow we could have a go?
[15:02:33] in the meantime nemo-yiannis and I will deploy on codfw the new pods
[15:02:59] would something around 9 or 10a eastern work?
[15:02:59] worst case scenario, we get more info without production hammering it
[15:03:28] herron: what is that in GMT?
[15:03:46] 2pm
[15:04:21] I have a meeting but right after it, if you are available
[15:04:34] sure, works for me
[15:05:52] herron: unless you can do 13:00 GMT, which is an hour before
[15:06:33] effie: sure that works too
[15:07:38] awesome!
[15:07:43] thanks!
[15:10:23] for sure, and one question: how will tegola be configured to connect to test against the single codfw thanos-swift node running cfssl certs? point the config directly to fqdn?
[15:14:16] overriding /etc/hosts probably would be straightforward but I'm wondering if we'll need to permit outbound connection to the different ip address of the test host as well
[15:14:31] herron: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/changes/40/949540/1/helmfile.d/services/tegola-vector-tiles/values.yaml#40
[15:14:58] we could directly point to a host, given that it will provide a valid cert
[15:15:17] but we could do codfw again, it has no live traffic atm
[15:15:51] what we didn't try on Monday was to restart all tegola pods
[15:20:56] gotcha ok thanks
[15:23:51] cheers
[15:49:13] XioNoX: with the decom stuff and removing the old network ranges and such
[15:49:18] I guess for the traffic side of it we will do that
[15:49:27] but while we are at it, do you want us to also do it for other stuff?
[15:53:09] XioNoX: also I see some changes when running homer on mr* like
[15:53:10] - description "Core: msw1-bw27-esams:1 {#changeme_knams1}";
[15:53:10] + description "Core: msw1-by27-esams:48 {#30417}";
[15:53:11] + mtu 9192;
[15:53:13] guessing these are fine
[17:21:19] inflatador: you should be good to go on the VM stuff today, DNS issues resolved
[17:21:51] sukhe thanks for getting back, will give it a shot
[19:06:38] denisse: you can abandon duplicate patches
[19:08:08] @RhinosF1: I'm aware of it, thank you. :)
[19:09:30] But it shouldn't be necessary if the first patch was merged.
[19:11:48] Gerrit / phab have no way of telling an exact dupe
[19:12:02] You'll need to mark the unneeded one abandoned