[09:20:57] I believe cp hosts are overloading centrallog with:
[09:21:05] Nov 22 09:19:55 cp3066 haproxykafka[1334378]: {"level":"error","TopicPartition":{"Topic":"webrequest_frontend_text","Partition":0,"Offset":588,"Metadata":null,"Error":{}},"error":"Broker: Topic authorization failed","time":"2024-11-22T09:19:55Z","message":"Failed to publish message"}
[09:21:32] ^ does that ring a bell to someone?
[09:22:06] we are going to run out of log space soon
[09:22:39] effie: you were working on kafka?
[09:22:51] or maybe this is traffic, sukhe?
[09:23:26] this started 17:40 yesterday
[09:26:53] looking at SAL, timing-wise this matches very well with "adding acls to kafka-jumbo cluster (T380373)"
[09:26:54] T380373: Allow TLS authenticated client to write on new topics - https://phabricator.wikimedia.org/T380373
[09:26:58] ^ fabfur
[09:28:10] jynus: yes, but that was 2 days ago
[09:28:50] sorry, I was just trying to get help from someone, according to moritzm may be I.F.
[09:59:45] I am going to create a UBN task
[09:59:51] sorry ooo today, I'll look asap
[10:00:12] moritzm: who can take care instead of fabfur?
[10:01:36] I will start escalating to the managers
[10:04:01] I'll take care
[10:06:29] vgutierrez: https://phabricator.wikimedia.org/T380570
[10:08:23] the only urgent part is the logs, the rest can wait
[10:13:44] thank you jynus
[10:14:21] thanks to vgutierrez who is doing the work!
[10:15:20] sorry for all this
[10:24:39] jynus: log flooding should've been stopped already
[10:24:49] it did
[10:25:36] to overcome the existing logs, how do you want to go about them: can I transfer them elsewhere so you don't lose the valid events?
[10:25:52] the haproxykafka ones? please discard them
[10:26:27] sadly all the syslogs go to the same file
[10:26:44] that is why I wanted to keep them rather than filtering them, which will take too much time
[10:27:07] I guess technically they are on the local hosts
[10:27:30] so can I remove the remote syslogs for the last 24 hours?
[10:27:41] only the syslog, not the webrequests?
[10:27:58] from centralauth, or do you want me to make a copy first?
[10:28:12] we got them too [sadly]
[10:28:16] -rw-r----- 1 root adm 29G Nov 22 10:27 daemon.log
[10:28:16] -rw-r----- 1 root adm 33G Nov 22 00:00 daemon.log.1
[10:28:21] yeah
[10:28:42] so is deletion acceptable?
[10:29:12] and one can go to local ones if needed (?)
[10:31:49] I am compressing and rotating while you give me some options, vgutierrez
[10:32:21] jynus: you can easily delete per host on centrallogs?
[10:32:36] per host yes, it is separated by that, but not per service
[10:32:46] jynus: kill them with fire then
[10:33:10] we can keep them on the cp hosts if they are needed for some other reaons
[10:33:12] *reasons
[10:33:12] for context: https://phabricator.wikimedia.org/P71117
[10:33:34] thanks, will do, and only touch the centrallog ones
[10:33:45] which was what was causing the immediate issues
[10:34:15] will update on ticket when done and leave it open in case you want to repurpose for other work
[10:36:37] (e.g. checking why it was erroring out, as it was probably not intended)
[13:44:24] jynus: back, sorry for all the mess today, let me know if I can help
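A rough sketch of the sizing check being discussed above, i.e. how much of each cp host's daemon.log on centrallog is haproxykafka noise before it gets discarded. The per-host /srv/syslog/<host>/ layout and the grep pattern are assumptions, not the actual centrallog paths:

    # Run on the centrallog host; adjust the directory glob to the real rsyslog layout.
    for d in /srv/syslog/cp*; do
      host=$(basename "$d")
      total=$(wc -l < "$d/daemon.log")
      noise=$(grep -c 'haproxykafka.*Topic authorization failed' "$d/daemon.log")
      printf '%s: %s of %s daemon.log lines are haproxykafka publish errors\n' "$host" "$noise" "$total"
    done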
[13:46:17] nothing else from oncall perspective. I assigned https://phabricator.wikimedia.org/T380570 to you but coordinate with vgutierrez on next steps
[13:46:31] thanks
[13:46:48] if it is being handled by your team at T380583
[13:46:49] T380583: Avoid logging errors per produced message - https://phabricator.wikimedia.org/T380583
[13:46:53] you can resolve it
[13:47:03] resolve my report, I mean
[14:34:24] hi folks - I could do with a hand on a k8s deploy. we deployed push-notifications yesterday, and we want to revert that because logspam. i deployed the revert to staging without issue, but the deploy to codfw is timing out. i don't know the service and it's my third time trying to deploy... help? :/
[14:34:38] https://www.irccloud.com/pastebin/3l6Gzx7J/
[14:35:33] ihurbain: I'll take a look
[14:35:50] claime: thank you :))
[14:36:13] (i still have a second deploy command running, in case that's relevant)
[14:36:38] (i'm waiting for it to timeout, i don't know if i can kill it)
[14:37:06] let it run through
[14:37:41] 3m53s Warning FailedCreatePodSandBox pod/push-notifications-main-5954f99fcb-f22pj (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "93422201921138c255500204dccd63bbbdd8922b127e55fc0199ce3691b9fa3d": plugin type="calico" failed (add): failed to request IPv4 addresses: Assigned 0 out of 1 requested IPv4
[14:37:43] addresses; No more free affine blocks and strict affinity enabled
[14:37:51] we seem to have run out of IPs wth
[14:38:00] mmmmmmmmh
[14:38:08] that's sad
[14:38:14] :(
[14:38:57] (command has returned btw) (in the same way, unsurprisingly)
[14:42:16] i think jayme has dealt with something similar in the past
[14:42:51] 👀
[14:43:04] ah...ETOOMANYNODES
[14:43:09] aaah
[14:43:21] damn...I totally forgot that one. We need to stop adding nodes claime :)
[14:43:23] I can cordon all the parse nodes since they were refreshed
[14:43:41] that will take a minute but they're supposed to be decom'd anyways
[14:43:53] that won't help
[14:44:02] they need to be removed completely?
[14:44:08] we need to cordon every node that does not have an ippool assigned
[14:44:12] or that, yes
[14:45:22] or sync with Cathal on https://phabricator.wikimedia.org/T375845
[14:46:22] Maybe that's not a friday thing (actually fixing the blocks)
[14:46:28] but that's a bit more complicated as it requires some helm chart patching as well I suppose
[14:46:34] yeah
[14:46:38] yeah, it's not - you're right
[14:46:55] I'm gonna go ahead and start decommissioning the parse nodes
[14:47:42] Is there something to get the ipblocks reassigned or does calico do that automatically afterwards?
[14:48:10] they get freed automatically when you delete the node object in the k8s api
[14:48:19] should we split them up or something?
[14:49:03] I'm starting the depool but yeah, take bottom 10, I'll take top, task is T380473
[14:49:05] T380473: Decommission parse20[01-20] - https://phabricator.wikimedia.org/T380473
[14:49:24] claime: ack
[14:50:17] crap we should have cordoned the nodes with missing blocks first
[14:50:26] it'll try to schedule pods there
[14:53:04] claime: I can try to figure that out first, if you're not already doing so
[14:54:49] I'm not
[14:55:08] ok
[15:01:10] claime: done
[15:01:16] <3
[15:01:26] I'm preparing the patch to remove the hosts from puppet
[15:01:47] do you need extra hands?
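The cordon/drain step claime describes above, as a sketch; the node-name pattern and drain flags are illustrative rather than the exact cookbook invocation (--delete-emptydir-data is --delete-local-data on older kubectl):

    # Stop new pods landing on the parse nodes, then move existing workloads off them.
    for n in $(kubectl get nodes -o name | grep -E 'parse20(0[1-9]|1[0-9]|20)\.codfw\.wmnet'); do
      kubectl cordon "$n"
      kubectl drain "$n" --ignore-daemonsets --delete-emptydir-data --timeout=5m
    done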
[15:02:10] cdanis: this is the list of hosts I've cordoned/drained https://paste.debian.net/1336432/
[15:02:52] cdanis: I think we're good - thanks 🤗
[15:03:11] the list of hosts should not have pinged you, sorry
[15:03:28] np!
[15:06:50] jayme: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094466
[15:06:54] looking
[15:07:10] (the depool cookbook is still running)
[15:09:18] <_joe_> One friday thing we could do is to add an alert on this, I guess
[15:09:19] jayme: iirc merging this and running puppet on masters should remove them from the k8s api right?
[15:09:44] or I can just kubectl delete node I guess
[15:09:57] _joe_: I was just looking around what we already have in grafana for calico, trying to find something relevant
[15:10:06] claime: the latter - but stop puppet and kubelet on all of them first
[15:12:19] cdanis: I think I checked last time and could not find a proper signal - apart from https://alerts.wikimedia.org/?q=%40state%3Dactive&q=team%3Dsre&q=alertname%3DKubernetesAPILatency
[15:12:50] because the metrics do not track block allocation...or not the available blocks. One of the values was missing IIRC
[15:14:21] jayme: nodes deleted
[15:14:38] claime: ack, I'll double check and uncordon the new ones
[15:14:59] I'll restart the push-notifications deploy cc ihurbain
[15:15:17] thanks claime <3
[15:15:44] (and thanks all for dealing with that!)
[15:16:27] that's what you get when you defer all your decoms to after you've put all the refreshes in production :(
[15:16:51] ok, it deployed fine in codfw ihurbain, you can go ahead with eqiad if it wasn't done yet
[15:16:58] thanks :)
[15:17:42] jayme: could we write our own exporter that watches the CRDs?
[15:17:58] it would be really nice to have a mechanism to never put a k8s cluster into this state
[15:18:22] cdanis: I wanted to double check whether the next calico version has extended metrics...something is in the back of my head
[15:18:37] ah fair enough
[15:18:44] also gotta make sure it's in core ;)
[15:18:53] 🙈
[15:19:24] jayme: I'll start running decom cookbooks
[15:19:25] but fwiw, in the meanwhile ... I think it would be reasonable to hack up something quickly with kubectl and jq writing a node_exporter textfile
[15:19:41] (on the masters)
[15:20:07] maybe kube-state-metrics has a custom query thing as well...
[15:21:19] hm, not obviously
[15:22:41] actually the correct way to deal with this (IMHO) is something like the node-problem-detector - marking the node as notready in case it has no ipblock assigned
[15:25:22] oh the decom cookbook takes a list of hosts
[15:25:31] btw: if it is handy for impact estimation: I snagged the output of `kubectl get events -A -o json | jq -c '.items[]' | grep calico` in deploy2002:/home/cdanis/codfw-calico-events.2024-11-24.jsonl
[15:26:09] that is one signal that exists, at least
[15:26:11] jayme: I'll run the cookbook for all remaining parse nodes in one go, don't worry about taking the bottom 10
[15:26:16] might as well do it all at once
[15:26:39] claime: ok. I'm trying to find out why the blocks are not re-assigned then :-|
[15:26:46] :/
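The removal order jayme describes above (stop puppet and kubelet on the drained hosts first, then delete the node objects so calico can free their block affinities), sketched with a cumin host expression and the disable-puppet wrapper as assumptions about the local tooling:

    # Disable config management and stop the kubelet on the drained parse hosts first...
    sudo cumin 'parse[2001-2020].codfw.wmnet' 'disable-puppet "T380473 decom" && systemctl stop kubelet'
    # ...then remove the node objects; calico releases their IP block affinities afterwards.
    for n in $(seq -w 2001 2020); do
      kubectl delete node "parse${n}.codfw.wmnet"
    done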
[15:36:34] jayme: when you have a minute can you post your recipe for finding nodes without ipblocks?
[15:38:13] cdanis: I just compared kubectl get ipamblocks.crd.projectcalico.org with the list of nodes
[15:40:25] cdanis: blockaffinities.crd.projectcalico.org sorry
[15:40:30] aha thank you
[15:40:35] I hadn't found that one yet
[15:41:08] https://wikitech.wikimedia.org/wiki/Calico#IPAM :p
[15:48:37] cdanis: but there is a better way
[15:49:15] 👀
[15:50:05] the ipamblocks.crd.projectcalico.org objects have a spec.affinity field which references the node
[15:50:21] maybe that's more fancy
[15:50:40] but I could not find a reference from node to block || affinity
[15:52:32] ip blocks have been re-assigned now and I did uncordon the nodes
[15:52:38] claime: ^
[15:53:07] jayme: did you have to force something, or was it just a matter of waiting for a bit?
[15:53:11] It was just me being impatient I suppose...it takes a while for the auto cleanup to happen
[15:53:15] hehe
[15:53:29] yeah...I tried to force it for a block and that broke assignment
[15:54:05] because deleting the blockaffinity object is not enough - one has to delete that and the block object as well
[15:54:45] or calico-kube-controller will go into a retry spiral because there's a reference to the nonexisting node still in the ipamblock
[15:57:57] >:[
[16:00:36] lol
[16:47:37] https://logstash.wikimedia.org/goto/c6b669ae3bc3a1c015ec6df6a2d62be4 lol what's going on with calico on mw2359
[16:56:24] jayme: I found a better way
[17:08:57] cdanis: 👀
[17:10:10] I think mw2359 is because there's the active typha instance running
[17:13:29] ah https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094489 , hehe
[17:13:31] clever
[17:15:22] I ran PCC on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094490 on top of that, and it does indeed fail: https://puppet-compiler.wmflabs.org/output/1094490/5001/wikikube-ctrl2003.codfw.wmnet/index.html
[17:16:19] downside is that we don't regularly run PCC for changes like adding N new nodes :/
[17:16:39] but that's on us I suppose
[17:16:47] jayme: well, it will still make puppet fail on the master, ahaha
[17:16:56] ah, yes. lol
[17:17:10] but...we could also just alert based on # of nodes
[17:17:11] there is also *probably* a way to make CI check this
[17:18:29] jhathaway: do you know offhand of any easy ways to validate some hiera keys against Puppet types?
[17:18:33] because adding them does no harm initially as they are cordoned when they join. uncordoning them is what we should not do then
[17:18:58] sure, that also seems good for now
[17:19:30] cdanis: i.e. validating them before compiling?
[17:19:50] yeah
[17:20:16] unfortunately not, I cooked up some prototypes with json schema, but never finished the work
[17:20:21] it is an unfortunate gap
[17:21:12] how hard is it to kludge up for one key and one type definition 😅
[17:21:31] https://gerrit.wikimedia.org/r/c/operations/puppet/+/979469
[17:21:53] neat
[17:21:55] ty
[17:22:41] longer discussion here about tradeoffs, https://phabricator.wikimedia.org/T352604
[17:23:39] in conjunction with redhat's yaml lsp it was pretty nifty
[17:28:59] jayme: I don't think it's necessarily true that it's the uncordon'ing that is the triggering step -- as soon as the nodes exist in the cluster, they'll be trying to run daemonsets, meaning they'll be trying to allocate pod IPs
[17:29:50] and then all you need is for another not-cordoned node to reboot or something
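Putting jayme's blockaffinities comparison together with the kubectl+jq node_exporter textfile idea from earlier, a minimal sketch; the spec.node field, the metric name, and the textfile path are assumptions to verify against the CRD and the local node_exporter setup:

    # Nodes that exist in the cluster but have no calico block affinity, plus a metric for alerting.
    nodes=$(kubectl get nodes -o json | jq -r '.items[].metadata.name' | sort)
    with_block=$(kubectl get blockaffinities.crd.projectcalico.org -o json | jq -r '.items[].spec.node' | sort -u)
    missing=$(comm -23 <(printf '%s\n' "$nodes") <(printf '%s\n' "$with_block"))
    printf '%s\n' "$missing" | sed '/^$/d'    # these nodes cannot hand out pod IPs
    count=$(printf '%s\n' "$missing" | sed '/^$/d' | wc -l)
    echo "calico_nodes_without_block_affinity ${count}" > /var/lib/prometheus/node.d/calico_block_affinity.prom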
[17:58:15] okay, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094489 updated (thanks jhathaway!) and now such a patch would fail CI: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094490
[17:58:39] https://i.imgur.com/Ez1KPXP.png
[17:59:39] nice!
[18:07:14] \o/
[20:04:45] "WME is planning to do a full ingestion of commons metadata starting mid next week..." <-- not a fan of the timing there 😳
[20:06:19] yeah. thankfully it's just HEAD for the actual files
[20:06:49] sukhe: what could go wrong?
[20:07:25] * sukhe asks Murphy
[20:54:06] Small patch to fix PCC messages if y'all have time to look https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094530
[20:55:41] oops, comment is in the wrong place. Fixing...
[20:58:47] nm, got a review
[21:45:21] is anyone available to help me w/the conftool alert https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed&q=name%3D_srv_config-master_pybal_codfw_wdqs-internal-scholarly.toml ? Alert happened after merging this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1088383 .
[21:49:10] from what I can tell, `/srv/config-master/pools.json` on config-master2001 looks correct, but I can still see `ERROR 100: Key not found (/conftool/v1/pools/eqiad/wdqs-internal-main` in `/var/log/confd.log`
[21:53:14] let's see
[21:54:48] sukhe ACK, thanks and sorry to burden you with this on a Friday afternoon ;(
[21:55:02] the key certainly doesn't exist yeah
[21:57:27] inflatador: when you merged this on puppetserver, did you notice something in the output?
[21:57:31] Damn. Is that because the LVS service is still in `service_setup` ?
[21:58:08] have you merged other patches as well? other than this
[21:59:32] sukhe Y to both. Here is the output from puppet-merge https://phabricator.wikimedia.org/P71118
[21:59:52] we can back out of that patch
[22:00:22] the only reason I merged it is because our earlier patch to add roles was causing PCC failures (for everyone)
[22:00:25] so, is this not in eqiad?
[22:00:43] codfw
[22:01:02] right, so what I mean is that the conftool-data/node entries are only present in codfw?
[22:01:18] I see what you mean
[22:01:25] i.e., https://gerrit.wikimedia.org/r/c/operations/puppet/+/1088383
[22:01:28] so far, it seems it was only added to codfw
[22:01:32] inflatador: ^
[22:01:32] swfrench-wmf correct, although the codfw templates are failing as well
[22:01:45] https://grafana.wikimedia.org/goto/kCbxEpnHg?orgId=1
[22:02:39] I can remove the eqiad key if you think it would help...but since we were getting CODFW failures as well I didn't think it would help on its own
[22:05:01] ah
[22:05:13] so
[22:05:39] what's the name of the service?
[22:05:51] in conftool, you say:
[22:05:51] wdqs-internal-main: [eqiad, codfw]
[22:05:52] wdqs-internal-scholarly: [eqiad, codfw]
[22:06:04] in the node data though,
[22:06:11] wdqs-internal-main:
[22:06:11] wdqs2018.codfw.wmnet: [wdqs-main]
[22:06:11] wdqs2019.codfw.wmnet: [wdqs-main]
[22:06:11] wdqs2020.codfw.wmnet: [wdqs-main]
[22:06:11] wdqs-internal-scholarly:
[22:06:14] wdqs2026.codfw.wmnet: [wdqs-scholarly]
[22:06:16] wdqs2027.codfw.wmnet: [wdqs-scholarly]
[22:07:20] sukhe good catch, sounds like we need to fix the node data then, and also remove eqiad from conftool-data/discovery/services.yaml ?
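For later reference, a quick triage sketch for a ConfdResourceFailed alert like this one: pull the missing key out of the confd log and ask etcd what actually exists there (the etcd endpoint is the one sukhe uses further down; treat it as illustrative):

    # On the affected config-master host:
    key=$(grep -oP 'Key not found \(\K[^) ]+' /var/log/confd.log | tail -1)
    echo "confd is looking for: ${key}"
    etcdctl -C https://conf1007.eqiad.wmnet:4001 ls "${key}" \
      || etcdctl -C https://conf1007.eqiad.wmnet:4001 ls "${key%/*}"   # show what does exist one level up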
[22:08:16] inflatador: I think so but to clarify what I mean further:
[22:08:29] if the names of the services are wdqs-internal-main and wdqs-internal-scholarly
[22:08:44] then that goes in the service.yaml file, which is correct
[22:08:54] what doesn't seem to be correct though is the etcd data
[22:09:05] this then results in the incorrect key error you are seeing
[22:09:23] so unless I am mistaken, what the above should be, for example:
[22:09:24] + wdqs-internal-scholarly:
[22:09:24] + wdqs2026.codfw.wmnet: [wdqs-scholarly]
[22:09:24] + wdqs2027.codfw.wmnet: [wdqs-scholarly]
[22:10:11] this I thing is not correct, unless you actually wanted wdqs-scholarly here
[22:10:14] *think
[22:10:28] ACK, I think I understand. Writing a patch now
[22:11:13] * inflatador wonders if running PCC against config-master would've caught this
[22:11:17] tricky, not sure
[22:11:28] the etcd key layout part is distinct from PCC
[22:13:05] so, there are kind of two issues going on at the moment leading to the confd errors
[22:13:35] one is the mismatch between the lvs.conftool section of the service catalog entries vs. what's in conftool-data
[22:13:51] the second part is that no conftool-data objects exist at all for eqiad
[22:14:16] ACK. I **think** I address both in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094536 , but LMK what you think
[22:15:06] and yeah, as sukhe suspected, PCC doesn't catch this type of error
[22:15:37] inflatador: try this, I think we should be OK, because now the service name matches
[22:15:43] unless swfrench-wmf disagrees
[22:15:52] so, just backing up a sec - what are you trying to achieve here? are these services supposed to exist only in codfw?
[22:17:18] if we look at the service definition, then no
[22:17:31] swfrench-wmf we are deploying a net-new service, which does only exist in CODFW. I merged a patch this morning which was supposed to just add the new service's puppet role
[22:17:52] inflatador: also, if these two patches had just been merged by themselves, then we probably wouldn't be seeing this error now. but since the service definition bit was also merged, this is where the error came from
[22:18:05] But it activated some config that caused all PCC runs to fail (not just ours)
[22:18:09] and hence that's why you see pybal complaining
[22:18:18] well, not pybal itself to be clear but the confd pybal error
[22:18:40] inflatador: if you mean: Could not find service wdqs-internal-main in service::catalog"
[22:19:13] sukhe confirmed
[22:19:23] yeah it's quite the mess tbh
[22:19:48] yeah, sorry...I did not intend to do anything but merge some role-related stuff today
[22:20:21] OK, merging the new patch now, let's see if it helps
[22:20:28] alright, I revise my statement from before - there seem to be two things wrong:
[22:20:28] 1. the service-catalog entries contain listings for eqiad when they should not (this replaces "there are no conftool-data entries now")
[22:20:28] 2. the service-catalog entries have lvs.conftool sections that do not match conftool-data/node
[22:21:05] the puppet failures you were getting
[22:21:09] are related to this bit:
[22:21:09] profile::lvs::realserver::pools:
[22:21:09] wdqs-internal-main:
[22:21:09] services:
[22:21:09] - wdqs-blazegraph
[22:21:12] - nginx
[22:21:26] this happens later in the process and hence the problem
[22:21:52] ryankemper ^^ I think this was the stuff we were talking about commenting out yesterday?
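One way to see the mismatch swfrench-wmf describes directly in conftool, sketched here with selector values that are assumptions about the exact object layout:

    # What objects did conftool-data actually create for the new clusters? Their
    # service tag will read wdqs-main / wdqs-scholarly, while the catalog's
    # lvs.conftool.service says wdqs-internal-*, which is the key confd cannot find.
    confctl select 'dc=codfw,cluster=wdqs-internal-main' get
    confctl select 'dc=codfw,cluster=wdqs-internal-scholarly' get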
[22:22:44] OK, I merged the latest patch and I'm dumping the pools on config-master now
[22:23:30] so, you are going to find that that patch does not fix the confd issues
[22:23:43] in order to do that, see above
[22:24:19] swfrench-wmf ACK, will get to work on that now
[22:25:45] so now at least, you still need to do what swfrench says but at least in codfw:
[22:26:08] inflatador: oof yes
[22:26:08] sukhe@cumin2002:~$ etcdctl -C https://conf1007.eqiad.wmnet:4001 ls /conftool/v1/pools/codfw/wdqs-internal-main/
[22:26:11] /conftool/v1/pools/codfw/wdqs-internal-main/wdqs-main
[22:26:13] /conftool/v1/pools/codfw/wdqs-internal-main/wdqs-internal-main
[22:26:31] folks don't get me wrong please but please follow the order as it is in the future, it will save you a lot of pain :)
[22:26:36] my bad on that one, didn’t think the hiera would have an effect without a profile included to use it
[22:27:33] FWIW i would vote to just roll back rather than forward here and do it properly come monday
[22:27:53] yes
[22:28:01] but you will have to ensure that the roll back is also clean in this case
[22:28:28] and to be honest, I am a bit split on what is fully wrong but yeah, that's where the order bit comes in
[22:28:46] sukhe yeah, it was bad judgment on my part. I saw that profile stuff causing problems on Thursday and fixed it, but we had a regression yesterday that was not immediately apparent
[22:29:15] inflatador: so now at least what swfrench-wmf is saying, and swfrench-wmf correct me if I am wrong
[22:29:17] +wdqs-internal-main: [eqiad, codfw]
[22:29:20] +wdqs-internal-scholarly: [eqiad, codfw]
[22:29:23] add this to discovery/services.yaml
[22:29:34] including the relevant entries in conftool-data/node/eqiad.yaml
[22:29:42] or simply revert the patch and see if everything clears up and start on Monday
[22:29:46] that is probably the best idea IMO
[22:31:11] sukhe: I think it's the opposite, in that the service will only exist in codfw, so they do not need to add conftool-data/node/eqiad.yaml entries. but yeah, +1 to just reverting the service::catalog patch and trying again next week with a fixed version
[22:31:18] OK, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094539 is up and that should fix the conftool alerts
[22:31:40] inflatador: thanks! looking
[22:31:48] otherwise I think we will have to revert 3 patches
[22:32:03] the service catalog specifies eqiad though
[22:32:18] ok, so I see above that it should only exist in codfw, ok, so yeah that needs to be fixed as well
[22:33:11] sukhe understood, fixing that now
[22:33:40] inflatador: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094539 seems like a good thing to do, yeah. I don't _think_ it'll fix the confd alerts on configmaster, though.
[22:34:47] swfrench-wmf just pushed an update that also removes references to eqiad from hieradata/common/service.yaml as sukhe suggested. Let me know if this is OK. Otherwise I'll start the rollback process
[22:36:10] sukhe I will defer to you 100% on this, if you are more comfortable with us rolling back I am happy to do so
[22:36:28] inflatador: hard to say honestly, it's a bit messy and yeah, I am not convinced this will fix all errors
[22:36:32] inflatador: one more thing you'll need to fix if you're going to get it all. lvs.conftool.service is wrong
[22:37:02] you need to make them match what you changed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094536
[22:37:18] (i.e., it looks like cluster == service in conftool-data/node)
[22:38:01] ACK, fixing
[22:38:02] inflatador: I would very much vote for a rollback of all patches at this point, and given the time
[22:38:27] and then on Monday, follow the order exactly as in https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service
[22:38:30] including merging any patch
[22:38:35] you can prepare the patches but don't merge them
[22:40:10] simply roll back in the order you merged them and ensure that the alerts clear yup
[22:40:13] *up
[22:40:53] sorry I have to go and prepare dinner now. but just revert at this stage and things should clear up.
[22:41:07] ^^ following rollback as recommended by sukhe. Sorry to bother everyone on Friday afternoon
[22:41:17] and in case the confd alerts don't, defer to swfrench-wmf if he is around but IMO don't delete any key in etcd at all, obviously. let it be there
[22:41:21] good luck guys
[22:41:32] have a good evening, sukhe
[22:41:49] +1 to rolling back to the last known good state :)
[22:47:04] inflatador: ping me when you have your stack of reverts if you'd like review on them
[22:48:36] swfrench-wmf I'm gonna go ahead and roll back, I will definitely need your help confirming that things look good from etcd when I'm finished
[22:49:34] sounds good, I'll be around. thanks for going the rollback route, even if it's a bit of a hassle given the number of layers!
[23:02:24] swfrench-wmf OK, I **believe** we should be back to good data with the merging of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094550 , but I'm still seeing confd errors on config-master2001. Let me know if I missed something, I'm still checking Puppet as well
[23:02:47] inflatador: ack, looking
[23:03:39] swfrench-wmf hmm, I take that back. Will still need to remove the realserver stuff from the profiles, 1 sec
[23:04:29] inflatador: I will be back in 20’ to help out w any final cleanup
[23:05:28] inflatafor: so, the confd errors on config-master hosts should be good now that puppet has run (I see no errors in the journal since 23:01)
[23:05:36] inflatador: heh, typing ^
[23:06:37] 1001 probably just needs a puppet-agent run (2001 is fine now)
[23:07:46] aaaand 1001 is fine now post-puppet
[23:07:48] ACK, we have one more thing to revert, as the realserver profile stuff is breaking PCC for everyone
[23:07:56] yes plz :)
[23:11:22] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094554 does not want to rebase due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094468 , working on it
[23:13:27] inflatador: if it's a hassle to resolve the conflicts, I think it should be fine to just leave that as-is and roll forward your change to comment out the profile::lvs::realserver::pools that are causing PCC problems
[23:15:10] swfrench-wmf ACK, I think that's faster. Small change to that patch...
[23:18:55] as long as a roll-forward of the patch to comment out the realserver pools has a clean PCC run, then (if I understand the sequence of events and the rationale for the stub service catalog entry correctly) I think that puts us in a good state to pause
[23:19:42] I agree, running PCC against https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094539 now
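A post-rollback spot check along the lines of what swfrench-wmf describes, with the confd unit name and time window as assumptions:

    # On each config-master host: any confd errors since the revert was merged?
    journalctl -u confd --since '23:01' | grep -ci error || echo 'no confd errors since 23:01'
    # And the pybal pool files confd renders should no longer include the half-configured services.
    ls /srv/config-master/pybal/codfw/ | grep -i wdqs || true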
[23:21:19] just to confirm, that has a stale base, right? (i.e., the service catalog entries are already gone?)
[23:21:37] (ah, yeah, confirmed those are already gone at head - just got confused by the diff)
[23:22:09] Y, we rolled that back
[23:22:24] hmm, that shouldn't be in the patch then
[23:23:09] oh yeah, exactly what you said ;P
[23:23:29] rebased
[23:35:14] inflatador: so, I'm seeing a lot of puppet failures complaining about the removal of wdqs-internal-main from the service catalog - were these added to envoy?
[23:35:48] swfrench-wmf good catch, that patch needs to be reverted as well
[23:36:19] got it, yeah I see those are from If3a8709717ede52f282f2b9f9a7b3d35246dc8ff
[23:37:44] thanks!
[23:38:20] ACK, reverting back
[23:39:47] OK, merged/puppet-merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094570
[23:40:57] awesome, and indeed PCC looks happy on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094539
[23:43:00] thanks for the +1, I just merged/puppet-merged
[23:43:03] aaand puppet-reported failed node count is going doing
[23:43:13] *down
[23:43:27] * swfrench-wmf cannot words terribly well today
[23:44:13] It **is** Friday. Thanks for hanging with me
[23:44:58] I think we are in a good state now, but LMK if I need to check anything else
[23:47:20] I mean, if puppet's happier (both live and PCC) and the alerts have cleared (seems like they have), then I think we're good as far as I can tell :)
[23:47:29] thanks for wrangling this back into a good state
[23:48:21] It's the least I could do. And next time I try to roll forward on a Friday...OK, no. There will NOT be a next time
[23:56:53] Thanks again all, have a wonderful weekend
[23:57:26] have a good weekend as well!