[07:10:16] <_joe_> !incidents
[07:10:16] You're not allowed to perform this action.
[07:10:31] <_joe_> uhhh something went *very wrong* I'd say
[07:11:33] <_joe_> and the biggest issue is I can't repro this problem on my computer
[07:46:45] https://www.irccloud.com/pastebin/dAkc6UAu/
[08:01:43] <_joe_> Amir1: lol stop bragging
[08:01:58] <_joe_> Amir1: try to surf wikipedia logged out
[08:02:18] <_joe_> it must be disturbingly fast - it was in Amsterdam when I tried on purpose
[08:05:24] Yeah. Gonna try it 😍😍
[08:28:45] https://kubernetes.io/blog/2023/08/15/pkgs-k8s-io-introduction/ k8s switching to community-owned deb repos (also rpm).
[08:30:32] <_joe_> klausman: if the quality of those debian packages is what I remember from older releases, no thanks :P
[08:30:38] <_joe_> but yeah, we should take a look
[08:30:41] <_joe_> !incidents
[08:30:41] 3951 (ACKED) [12x] ProbeDown sre (probes/service esams)
[08:30:56] <_joe_> godog: can you try as well? ^^
[08:31:28] !incidents
[08:31:29] 3951 (ACKED) [12x] ProbeDown sre (probes/service esams)
[08:31:31] \o/
[08:31:34] <_joe_> yep, it worked
[08:31:35] I doubt the quality of the package contents will change much in the short term. This is more for their CI/publishing machinery having fewer human steps in it.
[08:31:39] thank you _joe_ for the fix
[08:32:13] <_joe_> oh it was my fault all along
[10:17:16] XioNoX: for the DNS changes, https://gerrit.wikimedia.org/r/c/operations/dns/+/949930/
[10:17:43] sukhe: for puppet https://gerrit.wikimedia.org/r/c/operations/puppet/+/949934
[10:17:57] I wasn't sure if we are removing 2620:0:862::/48 completely and thus the failing CI is for missing PTRs for IPs assigned on Netbox, such as ae1-103.cr2-esams.wikimedia.org, so I will leave that to you :)
[10:18:20] as in, I don't want to delete stuff such as ae1-103.cr2-esams.wikimedia.org above, you are the better judge of that!
[10:18:40] XioNoX: ha nice, thank you! I thought I will do that next
[10:18:53] reviewing yours then
[10:19:04] and you can check and improve mine
[10:20:05] sukhe: I think we should keep templates/2.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa and templates/174.198.91.in-addr.arpa but with no records
[10:20:48] XioNoX: ok. why do you prefer that out of curiosity?
[10:22:27] sukhe: no strong feelings, but makes it easier to re-use it in the future
[10:23:03] if you think it's cleaner to remove I'm ok too
[10:23:09] ah, I didn't know plans for reuse
[10:23:47] sukhe: me neither
[10:23:53] ha
[10:23:59] ok putting it back
[10:24:04] more like to make it easier in the future, but maybe it's not worth it
[10:24:49] yeah I am fine with that, no strong opinions there. we have commented some stuff today in the Traffic hosts, just because it's easier to update vs add from scratch
[10:25:15] revising the commit and then I will have you check it for the missing PTRs
[10:25:43] sukhe: the ones failing CI need to be removed from netbox, then run the netbox.dns cookbook
[10:26:01] XioNoX: happy to do that as long as you can give me your +1 that it's fine to remove them
[10:26:59] sukhe: don't delete the IPs for now but you can remove the DNS name
[10:27:03] (from netbox)
[10:27:07] got it
[10:38:30] /tmp/dns-check.rfb81ysh/zones/netbox/0.20.10.in-addr.arpa
[10:38:40] not found, which makes sense, when pushing the netbox changes
[10:39:49] but as I scroll back up, it seems like it did remove 0.20.10.in-addr.arpa
[10:42:47] ah
[10:42:47] $INCLUDE netbox/0.20.10.in-addr.arpa
[10:43:04] this include is failing then in the dns repo
[10:43:24] you need to remove it
[10:43:25] going to remove this then and push this change first, then retry
[10:43:26] yep
[10:43:28] yep
[10:43:30] cool
[10:50:11] XioNoX: https://gerrit.wikimedia.org/r/c/operations/dns/+/949938, quick review thank you
[10:52:35] sukhe: I think you can fully empty templates/2.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa ? (except the header)
[10:52:56] yeah
[10:53:03] in the other commit that I will update
[10:53:10] in this one, I just wanted to get a working authdns-update
[10:53:29] hence just excluding this include and then will revise the other commit to empty that and the /24
[10:53:36] oh ok
[10:53:48] then lgtm :)
[10:53:49] thanks
[11:08:46] 07:07:40 error: Zone '2.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa.' has no SOA record ha
[11:08:57] which makes sense but we will see what we can exclude then
[11:28:15] herron: when you are online, ping me, I will be around
[12:00:42] Is that the whole /48?
[12:01:05] If it’s delegated by the RIR we should probably keep an empty zone, return the NS/SOA records for it
[12:18:06] topranks: thanks, that's what we did. https://gerrit.wikimedia.org/r/c/operations/dns/+/949930
[12:20:47] and that is the whole /48 yes
[12:33:39] https://puppet-compiler.wmflabs.org/output/949934/42915/ the more eyes we have on this the better, especially the failed ones to know if it's related or not
[12:33:47] for https://gerrit.wikimedia.org/r/c/operations/puppet/+/949934
[12:47:25] XioNoX: I can confirm that the failures on an-worker1137, flink-zk2001, and idp-test1002 are unrelated, though I understand that they're not the two failures you're most interested in.
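For reference, the two reverse zone names discussed above follow mechanically from the prefixes: reverse the octets of the IPv4 /24, or the hex nibbles of the IPv6 /48. A minimal sketch, assuming only the Python standard library; `reverse_zone` is a hypothetical helper name and not something from the dns repo, and it does not cover the classless names such as 0-25.174.198.91.in-addr.arpa.

```python
import ipaddress


def reverse_zone(prefix: str) -> str:
    """Return the reverse-DNS zone name for a prefix on a label boundary.

    Hypothetical helper for illustration. IPv4 prefixes must end on an
    octet boundary (multiple of 8 bits), IPv6 prefixes on a nibble
    boundary (multiple of 4 bits).
    """
    net = ipaddress.ip_network(prefix)
    if net.version == 4:
        if net.prefixlen % 8:
            raise ValueError("IPv4 prefix must end on an octet boundary")
        # Keep only the octets covered by the prefix, then reverse them.
        octets = str(net.network_address).split(".")[: net.prefixlen // 8]
        return ".".join(reversed(octets)) + ".in-addr.arpa"
    if net.prefixlen % 4:
        raise ValueError("IPv6 prefix must end on a nibble boundary")
    # Fully expanded address without colons is 32 hex nibbles.
    nibbles = net.network_address.exploded.replace(":", "")[: net.prefixlen // 4]
    return ".".join(reversed(nibbles)) + ".ip6.arpa"


print(reverse_zone("91.198.174.0/24"))   # 174.198.91.in-addr.arpa
print(reverse_zone("2620:0:862::/48"))   # 2.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa
```

Both results match the template file names mentioned in the log (templates/174.198.91.in-addr.arpa and templates/2.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa).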
[12:51:26] sukhe: I deleted a bunch of old esams IPs and the cookbook now returns "FileNotFoundError: [Errno 2] No such file or directory: '/tmp/dns-check.l3vdte02/zones/netbox/0.21.10.in-addr.arpa'"
[12:51:33] I think I have to remove the include
[12:54:21] effie: hey
[12:54:28] oh
[12:54:46] XioNoX: we can do one thing
[12:55:06] we can push an independent change for templates/174.198.91.in-addr.arpa and templates/2.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa
[12:55:35] and then merge the geo-resources change when we are ready
[12:55:40] https://gerrit.wikimedia.org/r/c/operations/dns/+/949930 basically split this up
[12:55:41] sukhe: or you can remove the include of "netbox/0.21.10.in-addr.arpa" in https://gerrit.wikimedia.org/r/c/operations/dns/+/949930
[12:56:17] deploy it, then the cookbook should run fine
[12:57:15] ok, let me patch that up then
[12:58:38] btullis: thanks!
[13:01:01] TIL: thisisunsafe https://stackoverflow.com/a/49130998
[13:01:09] XioNoX: one more missing include
[13:01:12] 0-25.174.198.91.in-addr.arpa
[13:01:16] updating this as well
[13:02:29] sukhe: it's already taken care of in https://gerrit.wikimedia.org/r/c/operations/dns/+/949930
[13:02:35] the 0-25.174.198.91.in-addr.arpa
[13:02:50] right, but we won't be merging this till later
[13:03:05] ok
[13:03:06] https://gerrit.wikimedia.org/r/c/operations/dns/+/949972/ this is failing right now
[13:07:17] sukhe: "$INCLUDE netbox/128-28.174.198.91.in-addr.arpa" can probably be removed too, it's not in netbox anymore
[13:08:27] jbond: some of them comments scare me
[13:08:53] The whole point of HSTS preload is to make it clear under no circumstances should this ever use http
[13:09:01] XioNoX: have we removed everything under 91.198.174.0/24 and the /48 from netbox?
[13:09:28] The dev environment needs another domain. They are cheap now.
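The recurring failure in this part of the log is a `$INCLUDE netbox/...` line that points at a file Netbox no longer generates. A rough sketch of a pre-flight check for that follows, assuming includes are resolved relative to a single rendered zones directory (as the /tmp/dns-check.*/zones/netbox/ paths suggest). `dangling_includes` and the demo file layout are hypothetical and not part of the dns repo or the netbox.dns cookbook.

```python
import re
import tempfile
from pathlib import Path

# Matches lines such as: $INCLUDE netbox/0.20.10.in-addr.arpa
INCLUDE_RE = re.compile(r"^\$INCLUDE\s+(\S+)", re.MULTILINE)


def dangling_includes(zones_dir: Path) -> list[tuple[str, str]]:
    """Return (zone file name, include target) pairs whose target is missing.

    Hypothetical helper: only top-level files in zones_dir are scanned, and
    include targets are assumed to be relative to zones_dir.
    """
    missing = []
    for zone_file in sorted(p for p in zones_dir.iterdir() if p.is_file()):
        for target in INCLUDE_RE.findall(zone_file.read_text()):
            if not (zones_dir / target).is_file():
                missing.append((zone_file.name, target))
    return missing


if __name__ == "__main__":
    # Throwaway layout invented for the demo; file names are illustrative.
    with tempfile.TemporaryDirectory() as tmp:
        zones = Path(tmp)
        (zones / "netbox").mkdir()
        (zones / "netbox" / "0.21.10.in-addr.arpa").write_text("; generated\n")
        (zones / "10.in-addr.arpa").write_text(
            "$INCLUDE netbox/0.20.10.in-addr.arpa\n"
            "$INCLUDE netbox/0.21.10.in-addr.arpa\n"
        )
        for zone, target in dangling_includes(zones):
            print(f"{zone}: missing include {target}")
```

Running a check like this before pushing would list every stale include in one pass, rather than surfacing them one CI failure at a time.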
[13:09:45] sukhe: not all the IPs, but the dns names can go away
[13:10:09] they're still used on some P2P transport links that we need to migrate
[13:10:22] sukhe: that's what's left: https://netbox.wikimedia.org/ipam/prefixes/82/ip-addresses/
[13:10:45] cool thanks
[13:11:12] so if I do a commit that just empties out the PTRs for the /24 and /48 above
[13:11:19] that should get you a working netbox run then
[13:11:24] sukhe: yep
[13:11:28] patching
[13:12:03] sukhe: basically a mix of https://gerrit.wikimedia.org/r/c/operations/dns/+/949972/ and the two .arpa files from https://gerrit.wikimedia.org/r/c/operations/dns/+/949930
[13:13:10] yep
[13:16:07] XioNoX: https://gerrit.wikimedia.org/r/c/operations/dns/+/949975/
[13:17:26] ouch, why is this failing
[13:17:32] $INCLUDE netbox/248-29.59.15.185.in-addr.arpa
[13:17:58] herron: here, sorry for being late
[13:18:20] effie: hey no worries
[13:18:35] herron: we are not using codfw, if we could roll out there, I can do some manual testing
[13:19:06] sounds good?
[13:19:14] effie: sounds good, could I bug you to throw a +1/any notes on https://gerrit.wikimedia.org/r/c/operations/puppet/+/948125 before I merge that?
[13:20:14] XioNoX: did we delete some IPs under 185.15.59.0/24
[13:20:17] or the DNS names, rather
[13:20:24] herron: I will, shall we disable puppet on the servers if not already?
[13:21:09] effie: sure, done
[13:21:17] awesome!
[13:22:24] sukhe: ah right... tilaaa
[13:22:32] effie: great, thank you, I'll get started deploying to codfw now, ready?
[13:22:43] yes
[13:22:47] ok, doing
[13:23:59] sukhe: pushed a new PS https://gerrit.wikimedia.org/r/c/operations/dns/+/949975/ waiting for ci
[13:24:20] ok makes sense :)
[13:25:40] XioNoX: the v6!
[13:26:28] sukhe: that never ends!
[13:26:33] pushed a new PS
[13:27:36] sukhe: finally!
[13:28:29] where is AI when you need it to generate and mess around with v6 PTRs :)
[13:28:32] +1
[13:29:03] herron: has puppet run on all codfw hosts?
[13:29:15] effie: just wrapping up now
[13:29:18] cool
[13:30:12] effie: ok, deployed in codfw
[13:30:31] how is esams work going? I was off the first part of the week
[13:32:38] going well enough yep no major issues :)
[13:32:44] light at the end of the tunnel now
[13:33:12] sukhe: host text-lb.esams.wikimedia.org
[13:33:12] text-lb.esams.wikimedia.org has address 185.15.59.224
[13:33:13] text-lb.esams.wikimedia.org has IPv6 address 2a02:ec80:300:ed1a::1
[13:34:31] herron: the manual test says ok, I will attempt to pool codfw temporarily, see how things behave
[13:35:13] effie: ack ok
[13:36:21] XioNoX: wait, where is that from?
[13:36:25] oh Netbox ha
[13:36:27] nice
[13:37:47] topranks: nice to hear!
[13:42:36] XioNoX: looking at the Puppet patch now
[13:42:40] guess that's the last of it?
[13:42:55] herron: same problem
[13:42:57] sukhe: should be
[13:43:06] I am looking
[13:43:13] ok, moving to dc-ops for another related topic
[13:44:06] effie: gotcha, so same issue with updated cert bundle and after restarting?
[13:45:45] herron: the tegola container had the correct certificates on monday too, the updated wmf-certificates packages didn't include anything new
[13:46:14] I am rolling restarting the pods on codfw, as it is something we didn't try last time
[13:46:23] effie: got it, ok sgtm
[13:51:23] inflatador: want me to merge 'Start Blazegraph from systemd unit, without runBlazegraph.sh' ?
[13:52:21] andrewbogott yes, thanks
[13:52:28] done
[13:59:14] herron: no go for the time being, is it ok to pick this up in an hour?
[13:59:38] in the meantime I have a tcpdump I will look at
[13:59:53] effie: fine by me, although I think there are some open icinga alerts for maps
[14:00:09] herron: they will clear out, codfw is depooled again
[14:00:18] effie: kk
[14:01:34] thanks
[14:03:34] Am I imagining this, or do we have some kind of utility on hosts that can create a pastebin somewhere?
[14:04:09] inflatador: phaste
[14:04:20] jbond excellent, thanks
[14:04:26] np
[15:19:15] herron: I am still looking at the dump
[15:19:34] effie: ok
[15:20:15] we have 2 options, either leave things as they are and I can continue debugging tomorrow with codfw
[15:20:26] or revert fully
[15:20:34] and try again similar time
[15:21:18] and a 3rd one, where we do some puppet fiddling to only roll out on codfw, though we can't leave things like that during the weekend
[15:21:28] in case we need to depool eqiad and pool codfw
[15:22:55] effie: gotcha, IMO we should revert when finished today and can always re-deploy to codfw for followup tests
[15:23:12] since things get cranky about long disabled puppet agents
[15:23:42] out of curiosity does a manual connection with curl to the endpoint work?
[15:24:07] yes, but in that case I believe, since it fails to fetch from cache, it generates and returns
[15:24:45] so what we actually see is, tegola being overwhelmed trying to generate what it can't fetch from its cache (swift)
[15:25:04] I will verify this with the devs, but it is very plausible
[15:25:37] on the other hand, my tcpdump does not look great
[16:05:21] herron: please roll back, we will pick this up tomorrow I am afraid
[16:05:45] effie: sure no worries, will do
[16:05:59] any promising leads so far?
[16:06:38] sukhe: I'm back and I see that we got jbond's blessing on https://gerrit.wikimedia.org/r/c/operations/puppet/+/949934
[16:06:47] so far it looks like it is the application itself, but no solid leads yet
[16:06:59] I will post on phab tomorrow a few notes
[16:08:33] ok, sounds good, I'll revert and prep the followup patch shortly. feel free to ping when ready to have another look
[16:08:56] XioNoX: let's do it!
[16:09:17] XioNoX: probably best to let the oncall folks know
[16:10:16] arnoldokoth, jhathaway: please hold tight, we're merging this https://gerrit.wikimedia.org/r/c/operations/puppet/+/949934
[16:10:40] woohoo!
[16:11:11] Haha. Sure.
[16:13:29] merged
[16:13:38] XioNoX: this is the day you get to run cumin '*'
[16:15:13] sukhe: "OK to proceed on 2068 hosts? Enter the number of affected hosts to confirm or "q" to quit: 2068"
[16:15:17] ha
[16:16:12] I used this https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed
[16:16:21] without the " --failed-only"
[16:54:52] XioNoX: I forced an agent run on alert1001
[16:55:04] waiting, alert1001 takes a long time
[16:55:23] alright
[16:56:32] sukhe: if that's the only issue with that huge patch I think we're good
[16:56:49] yeahhh
[16:56:58] yep :)
[16:57:01] we are good! worked
[16:57:07] thanks cwhite!
[16:57:13] XioNoX: clean run, hosts removed
[16:57:17] was going to say https://puppetboard.wikimedia.org/report/alert1001.wikimedia.org/ad86f9c944407a161150f7b45da675fbc2498def
[16:57:18] yep
[16:57:20] thanks a lot!
[16:57:27] <3
[16:57:36] <3
[17:05:24] NOICE
[17:48:39] Am i right in reading that helm doesn't define what happens between charts and releases, that we needed to invent our own middle piece? Like, helm defines charts, many releases can be made from one chart, but I'm not finding helm's documentation on how separate releases from the same chart are defined/variables provided/etc
[17:58:04] arnoldokoth: jhathaway: I have an alert in place but if you get a page for ncredir, I will handle it (nothing to worry about, esams is depooled so no traffic)
[18:05:47] ebernhardson: Is deployment-charts/helmfile.d/services/* what you are looking for? That's where we keep config files that feed values to a chart at deploy time. Multiple services may depend on the same chart and apply different values. As an example there are 5 'shellbox*' services that use the shellbox chart.
[18:06:45] bd808: ahh, that makes sense. Apparently helmfile is third party and not part of helm
[18:06:58] would be why i kept going back and forth over helm docs and finding nothing :)
[18:07:32] yeah, helmfile is basically a helper for making the right `helm` cli call I think
[18:08:45] sukhe: thanks
[18:32:22] sukhe: that's quite the CR# https://gerrit.wikimedia.org/r/950000
[18:32:38] ha yeah! we were talking about it
[18:32:47] XioNoX: going to buy a lottery ticket today, while we are at it
[18:32:55] :)
[21:02:01] Has anyone gotten stuck with an unresponsive DRAC after reimage cookbook failure? I'm guessing the host is in the preboot/PXE env. Anyone know if it's possible to SSH in to that env?