[08:11:28] volans, godog - hello hello
[08:11:38] * volans runs away
[08:11:53] so I'd like to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/904667, the idea is to move kafka-main1001's kafka tls cert to pki
[08:12:43] in theory all clients support it, in practice I may have missed one or two, since it is not super easy to figure out who pulls from kafka and how
[08:13:07] LGTM, thank you for the heads up elukey
[08:13:11] so the idea is to monitor the tls connections with tshark after the move, and see if any handshake failures happen
[08:13:21] so you just want us to ignore any page and assign it to you? deal :-P
[08:14:36] sort of :)
[08:14:44] jokes aside, ok if I proceed?
[08:14:51] SGTM
[08:14:56] yeah +1
[08:15:00] super thanks :)
[08:25:15] ok folks 1001 restarted, the cluster is recovering
[08:25:59] \o/
[08:26:51] now let's see if I missed any client
[08:27:30] hashar: thirdparty for buster seems broken due to an expired GPG key for thirdparty/ci
[08:28:25] vgutierrez: o/ I see that purged complained a little about 1001 being restarted, but I am on cp1075 and I don't see more screaming errors etc.. if you see anything weird lemme know
[08:29:17] elukey: as long as lag doesn't increase we should be fine
[08:29:30] librdkafka seems to be quite conservative regarding retries
[08:31:40] I am reasonably sure that the new bundle works fine, but we'll see
[08:42:13] so far nothing seems to be failing handshakes
[08:44:20] volans: maybe after a long time we'll be able to remove https://github.com/wikimedia/operations-software-spicerack/blob/master/spicerack/kafka.py#L59
[08:44:43] that would be great :D
[08:46:08] mmm now I have some doubts about the line afterwards
[08:46:10] /etc/ssl/certs/ca-certificates.crt
[08:46:21] does it contain the PKI root's cert as well?
[08:47:48] seems so
[08:47:50] okok :)
[08:48:04] it should contain Wikimedia_Internal_Root_CA.pem right?
[08:48:19] yes exactly
[09:20:12] in the meantime, kafka-jumbo runs on PKI now :)
[09:25:07] yay
[09:31:25] great work elukey
[09:31:53] <3
[09:33:22] hmmm https://pkg.jenkins.io/debian-stable/ still shows the expired key /o\ hashar
[09:58:45] vgutierrez: looks like that got noticed a couple weeks ago and the key reached its expiration after three years of service https://github.com/jenkins-infra/helpdesk/issues/3457#issuecomment-1481403634
[09:59:35] https://github.com/jenkins-infra/helpdesk/issues/3457#issuecomment-1490809505
[09:59:38] /o\
[10:00:33] ah and here is the doc https://www.jenkins.io/blog/2023/03/27/repository-signing-keys-changing/ :]
[10:01:05] our Puppet needs to feed reprepro with the key at https://pkg.jenkins.io/debian/jenkins.io-2023.key
[10:01:33] but is that one already being used?
[10:01:42] or is that gonna be used beginning tomorrow?
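
For reference, a minimal sketch of the kind of client-side check implied by the kafka TLS move discussed above: open a TLS connection to the broker and confirm that the new PKI-issued cert validates against the host's default CA bundle (/etc/ssl/certs/ca-certificates.crt, which per the discussion should carry Wikimedia_Internal_Root_CA.pem). The broker FQDN and port below are assumptions, not taken from the log:

    import socket
    import ssl

    BROKER = "kafka-main1001.eqiad.wmnet"   # assumed FQDN
    PORT = 9093                             # assumed TLS listener port

    # create_default_context() loads the system CA bundle, i.e.
    # /etc/ssl/certs/ca-certificates.crt on Debian; the handshake fails
    # with SSLCertVerificationError if the PKI root is not trusted.
    ctx = ssl.create_default_context()
    with socket.create_connection((BROKER, PORT), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=BROKER) as tls:
            issuer = dict(item[0] for item in tls.getpeercert()["issuer"])
            print("handshake OK, issuer CN:", issuer.get("commonName"))
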
[10:02:43] I guess it will be for future releases only, unless they go with re-signing all the packages they previously generated
[10:03:11] "The LTS line of Jenkins will use the new key this Wednesday 05 of April with the new Jenkins LTS patch release"
[10:03:28] from https://github.com/jenkinsci/packaging/issues/383#issuecomment-1494031348
[12:26:05] fyi, I'm half boldly merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/903686 feel free to send patches if incorrect
[13:32:41] XioNoX: happy to deploy https://gerrit.wikimedia.org/r/c/operations/homer/public/+/827950 whenever you'd like :)
[13:33:02] sukhe: cool, let's do it
[13:35:44] sukhe: going to do drmrs Anycast6 first, so realistically going to only apply to durum6001 (as doh6001 doesn't advertise any v6 prefixes)
[13:36:17] is this a monitoring issue or something expected? https://grafana.wikimedia.org/goto/Uge1t4YVz?orgId=1
[13:36:36] XioNoX: doh6001 advertises 2001:67c:930::1/128
[13:36:48] dns6001 doesn't, unless I am reading it wrong!
[13:37:01] but yeah, we should definitely do durum6001 anyway
[13:37:47] sukhe: in drmrs doh6001 isn't sending any prefix to the router
[13:37:52] well, to the switch
[13:38:13] https://www.irccloud.com/pastebin/eDNDE9Hp/
[13:38:14] that's interesting! same config as durum6001 though
[13:38:49] not true for doh6002 as well?
[13:39:47] same for doh6002
[13:40:09] ok definitely need to check this then
[13:40:28] because config-wise, there is nothing different between durum6001 and doh6001
[13:40:34] yeah, looking
[13:42:47] checking a few things here as well
[13:46:48] sukhe: I think I found the issue...
[13:47:17] * sukhe all ears
[13:48:03] the filter is missing the v6 term
[13:49:04] some parts of drmrs got missed in Homer
[13:49:05] XioNoX: sorry, which filter though?
[13:49:36] one sec
[13:53:35] nah looks like it's not it
[13:53:54] probably not a filter issue as it says "Received prefixes: 0" and this is before filtering
[13:55:30] the missing filter needs to be cleaned up but it's not the root cause here
[13:56:23] sukhe: wild guess but can we bounce bird on doh6001 ?
[13:57:23] yeah I guess, just a sec, I am confirming v6 in other places too to rule out a non-drmrs-specific issue
[13:57:31] It claims to be properly exporting prefixes there, but the other side claims it's not receiving anything https://www.irccloud.com/pastebin/BpnYy8FF/
[13:57:53] (I have a meeting in 3min)
[13:57:59] we can resume after
[13:58:30] sure
[14:00:19] heads up to oncallers I'm beginning the failover of alert1001 to alert2001 now
[14:05:05] XioNoX: ping me when you are back, not urgent, thanks
[14:22:28] sukhe: I'm back
[14:26:11] XioNoX: ok, I definitely have v6 connectivity in eqiad
[14:26:21] so I guess that leaves durum being the outlier in some way
[14:26:48] sukhe: you mean doh6001 ?
[14:27:09] yeah. you can't see any prefixes for doh6002 as well right?
[14:27:26] indeed
[14:28:10] can you try now for doh6001?
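
A rough sketch of a host-side check for the bird symptom above (session up, but the switch reports "Received prefixes: 0"): parse `birdc show protocols all` on the anycast host and flag BGP protocols that export zero routes. The exact birdc output layout differs between bird 1.6 and 2.x, so the parsing here is an assumption, not the actual check used:

    import re
    import subprocess

    # Needs to run on the anycast host itself, with enough privileges to
    # talk to the bird control socket.
    out = subprocess.run(["birdc", "show", "protocols", "all"],
                         capture_output=True, text=True, check=True).stdout

    proto = None
    for line in out.splitlines():
        if line and not line[0].isspace():
            # Protocol header line: "name  proto  table  state  since  info"
            cols = line.split()
            proto = cols[0] if len(cols) > 1 and cols[1] == "BGP" else None
        elif proto and "Routes:" in line:
            exported = re.search(r"(\d+) exported", line)
            if exported and int(exported.group(1)) == 0:
                print(f"WARNING: {proto} is up but exporting 0 routes")
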
[14:28:18] but durum6001 is fine
[14:28:27] sukhe: it's good now :)
[14:28:34] well, a simple bird restart then
[14:28:43] which is weird because everything else looked fine :]
[14:28:54] yeah that's not great
[14:29:00] yep
[14:29:54] there is nothing at all in the bird log to indicate a failure on the host
[14:29:59] nothing in the kernel logs either
[14:30:05] and v4 was fine
[14:30:30] sukhe: the only thing we can do is monitor the # of accepted prefixes on the router side
[14:30:51] or https://phabricator.wikimedia.org/T311618
[14:31:11] yeah we definitely need better monitoring
[14:31:28] # of accepted prefixes, as in checking the number of expected prefixes and seeing if they match up?
[14:31:30] for the former, there is https://phabricator.wikimedia.org/T333210
[14:34:13] XioNoX: still can't see anything :)
[14:34:21] so I guess this can be a nice project, improving monitoring
[14:34:31] it certainly is critical enough given the number of services depending on anycast now
[14:34:39] yeah for sure
[14:34:42] recdns, wikidough, centrallog, durum and now even WMCS it seems?
[14:35:20] do you still want to proceed with the remove local-as patch? I am fine with it given we have resolved the issues
[14:35:28] (issues with doh600[12])
[14:36:06] sukhe: I also had a LibreNMS alert staged, https://librenms.wikimedia.org/alert-rules see "BGP Not accepting any prefix"
[14:36:24] not sure why I didn't complete it, maybe because it was too noisy
[14:37:22] very useful IMO. do you want to try turning it on again? as long as it doesn't page I think we can experiment
[14:37:30] given the otherwise limited visibility into this
[14:37:34] yeah I'm doing it wrong, it alerts on all the BGP peers :)
[14:38:46] I'd need to dig more
[14:39:31] happy to help with this. I will go over the tickets and see if I can add this to the OKR work as well
[14:39:54] the reason I am interested is because at some stage, we need to bring in authdns too and announce via BGP instead
[14:40:01] yeah for sure
[14:40:04] instead of doing what we are doing right now, with manual routing configs :>
[14:40:32] and the reason for not doing that is basically this, bird being unstable in some cases and our limited visibility into it
[14:40:40] the librenms one would only be a workaround
[14:40:50] I mean temporary solution
[14:40:55] yeah at least something
[14:41:23] like bird on the host was awfully quiet about this and still advertising the correct prefixes
[14:41:41] or well, anycast was but yeah
[14:41:45] anycast-hc
[14:50:37] sukhe: I'm getting somewhere playing with the LibreNMS MySQL DB
[14:50:58] :D
[14:51:36] sukhe: 31 anycast peers that we don't receive or accept anything from
[14:52:03] 31?
[14:57:07] sukhe: for example doh4001 ....
[14:57:22] for v4
[14:59:01] sukhe: https://phabricator.wikimedia.org/P46008
[14:59:51] so doh4001 and doh1001
[14:59:55] for v4
[15:00:36] ;; NSID: 646F6831303032 "doh1002"
[15:00:45] curl https://abc.check.wikimedia-dns.org/check
[15:00:45] {"wikidough": true, "service": "dot", "site": "eqiad", "ipv": "ipv4"}
[15:01:11] I am now wondering if the recent reimages to bullseye had anything to do with this
[15:01:16] what exactly though, I am not sure
[15:01:28] I added the sql query I used as a comment on the paste https://phabricator.wikimedia.org/P46008#187157
[15:01:42] thanks
[15:01:51] I guess this is as good a case for monitoring as any
[15:02:12] are any on this list false positives?
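
For what it's worth, the check endpoint quoted above can also be polled programmatically; a minimal sketch that reports which site answers from the current vantage point (the "abc" label is simply copied from the curl in the log and may need to be unique per request in practice):

    import json
    import urllib.request

    # Label copied from the curl quoted above; assumption, not a documented value.
    URL = "https://abc.check.wikimedia-dns.org/check"

    with urllib.request.urlopen(URL, timeout=5) as resp:
        data = json.load(resp)

    # Example response from the log:
    # {"wikidough": true, "service": "dot", "site": "eqiad", "ipv": "ipv4"}
    print("wikidough={wikidough} service={service} site={site} ipv={ipv}".format(**data))
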
[15:02:48] well directly querying the hosts works fine as expected
[15:03:04] it's the anycasted address that is the issue
[15:03:10] yeah
[15:03:32] I mean, if I alert on that query, would there be false positives?
[15:03:34] I am definitely hitting doh1002 from multiple places that should go to eqiad
[15:04:28] but never doh1001?
[15:05:43] now hitting doh1001 after the bird restart :]
[15:05:49] ;; NSID: 646F6831303031 "doh1001"
[15:06:03] I mean not that we should hit doh1001 all the time but yeah
[15:07:14] if you restart bird, can you keep a "broken" one?
[15:07:22] just so I can make sure alerting works as expected
[15:07:36] sure yep
[15:07:42] v6 is broken on doh2002
[15:08:16] I mean thanks to anycast, we fail over to the next site
[15:08:20] but that's hardly ideal
[15:09:25] yay to anycast :)
[15:10:52] yep!
[15:11:08] ok I am doing a rolling restart of bird on doh* except doh2002
[15:12:26] cool, I'm down to 7 unique peer IPs
[15:14:10] XioNoX: I guess the other good thing is that no other anycast services seem to have been affected
[15:14:22] which is good and also bad in the sense of what makes A:wikidough different
[15:14:38] you mentioned a recent bullseye upgrade?
[15:14:44] yeah but that's true for durum too...
[15:14:44] did the dns hosts do the same?
[15:14:46] same time
[15:14:52] and DNS hosts, yep
[15:15:17] durum has v6 prefixes announced as well
[15:15:20] dns, just v4
[15:15:59] I think at least for the time being while we figure out the best solutions to improve monitoring, +1 to the LibreNMS checks
[15:16:12] I can keep an eye out and then silence them through this week if required (no paging, so we should be good)
[15:17:36] I added a check, waiting to see if it works as expected
[15:17:44] nice, thanks!
[15:17:56] I misread, I thought you were still split about it
[15:19:01] XioNoX: run the query again please, we should just have doh2002 now
[15:19:18] '2001:df2:e500:1:103:102:166:14'
[15:19:18] '2620:0:860:2:208:80:153:38'
[15:19:18] '2620:0:863:1:198:35:26:6'
[15:19:21] still showing up
[15:21:52] so doh5001, doh4002
[15:21:58] 2002 is expected
[15:22:37] bird did restart
[15:24:21] https://librenms.wikimedia.org/alerts is not going as planned :)
[15:24:55] the other thing though is that while I can directly connect to the host to verify v6 connectivity
[15:25:01] you can expand by clicking the "+" that shows up on hover
[15:25:11] there isn't really a way to do so with anycast without having multiple vantage points
[15:25:16] unless you know of a better one :)
[15:25:40] sukhe: that's https://phabricator.wikimedia.org/T311618 :)
[15:26:23] oh right, but I meant as of now!
[15:26:35] you're correct
[15:27:11] there is nothing else on the router side that gives more info?
[15:27:49] sukhe@cr4-ulsfo> show bgp summary | match 2620:0:863:1:198:35:26:6
[15:27:49] 2620:0:863:1:198:35:26:6 64605 29 25 0 7 11:37 Establ
[15:27:57] should it show established here?
[15:28:01] this is doh4001
[15:28:23] show bgp neighbor | match prefixes
[15:28:26] er wait
[15:28:32] "Received prefixes"
[15:28:56] 2620:0:863:1:198:35:26:6 is doh4002
[15:28:57] checking prefixes
[15:29:22] sukhe@cr4-ulsfo> show bgp neighbor | match 2620:0:863:1:198:35:26:6
[15:29:22] Peer: 2620:0:863:1:198:35:26:6+179 AS 64605 Local: 2620:0:863:ffff::2+62360 AS 14907
[15:29:28] this looks fine or am I misreading?
[15:29:39] ha
[15:30:18] I was running it incorrectly, I see the output now
[15:30:45] Active prefixes: 1
[15:31:11] looks fine now?
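
A small sketch of the router-side idea discussed above (flag Established anycast peers whose accepted-prefix count is 0), here just parsing a saved `show bgp neighbor` dump rather than talking to the router; real monitoring would pull these counters via SNMP or the LibreNMS DB, as in the alert rule being built in the log:

    import re
    import sys

    # Usage: python3 check_bgp_prefixes.py <file containing `show bgp neighbor` output>
    text = open(sys.argv[1]).read()
    for block in text.split("Peer: ")[1:]:          # one block per neighbor
        peer = block.split()[0].split("+")[0]       # drop the +port suffix
        state = re.search(r"State:\s+(\S+)", block)
        accepted = [int(n) for n in re.findall(r"Accepted prefixes:\s+(\d+)", block)]
        if state and state.group(1) == "Established" and accepted and sum(accepted) == 0:
            print(f"{peer}: Established but 0 accepted prefixes")
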
[15:31:51] yep, and I think I fixed the alert rule by going full mysql instead of their "helper"
[15:32:31] give it 5min to clear if it's good
[15:32:38] ok
[15:35:17] sukhe: https://librenms.wikimedia.org/alerts now looks good
[15:35:58] sukhe: you can fix 2002 now
[15:36:05] ok let's try
[15:36:09] alerting should go to AM like the others
[15:36:27] so we're not/less blind now
[15:36:28] nice
[15:36:40] still not happy about what happened here though :>
[15:36:51] of course
[15:36:53] as in, what caused this and why just on doh
[15:36:53] me neither
[15:37:09] there is nothing specific in the configuration
[15:37:57] XioNoX: let's see if it happens again I guess
[15:38:10] do you know how I can schedule a recheck for the alert on LibreNMS?
[15:38:21] like I wanted to check if doh4002 has cleared up
[15:38:28] sukhe: it runs every 5min
[15:38:33] it will clear on its own
[15:38:34] oh so just automatic
[15:38:36] ok yeah that's fine
[15:38:48] thanks for helping debug this
[15:38:59] I will spend some more time after lunch + the SRE meeting to see
[15:39:08] not that I expect to find something since all the obvious ones check out :P
[15:56:46] sukhe: fyi 2002 is still alerting
[15:59:51] XioNoX: sorry, seems like I just did 4002 and not 2002
[15:59:54] should be clearing up
[16:03:42] confirmed
[16:04:03] thanks
[18:51:35] this simple but effective firefox addon: "close same domain tabs". Now it's like... you have 1000 tabs open, you click a single time in one Gerrit tab, and boom, 200 Gerrit tabs closed at once, down to 800. https://addons.mozilla.org/en-US/firefox/addon/close-same-domain-tabs/
[23:01:58] not on call anymore, it's just the bot and the topic update again
[23:02:02] goes afk
[23:02:22] that's also why you see 4 people in the topic
[23:02:42] nothing to hand over