[06:18:06] \o came across something weird while adding IPs for `wdqs-internal-main` and `wdqs-internal-scholarly` in `eqiad` and `codfw`
[06:18:44] I saw `.92` and `.93` available as VIPs in the netbox UI so I chose those, added via UI and netbox synced
[06:19:03] However now that I'm going to add the DNS patch for the A-records, it looks like those are currently being used by `mw-parsoid`, at least as far as the A-records are concerned
[06:19:31] eg in `templates/wmnet` I see `mw-parsoid 1H IN A 10.2.1.92` but I was just about to add a line like `wdqs-internal-main 1H IN A 10.2.1.92`
[06:20:42] Same deal for the PTR records, I see lines like `92 1H IN PTR mw-parsoid.svc.codfw.wmnet.`
[06:21:42] Holding off on creating the DNS patch for that reason but I could use some help sorting this out tomorrow :)
[06:35:33] I found https://gerrit.wikimedia.org/r/c/operations/puppet/+/1004152/5/hieradata/common/service.yaml ; I think if I understand the process correctly (big if) it may be just that when `mw-parsoid` was set up the step to add the VIP in the netbox UI was missed. I'm going to operate under that assumption for now and use `.93` and `.94` instead so there's no longer a collision
[06:48:50] Okay, did the above. Here's the corresponding DNS patch for my new change (which has been synced).
There shouldn't be a collision any longer, but if I'm understanding the process correctly we will still want to circle back and add the `mw-parsoid` entry into the netbox UI
[06:51:02] forgot to link patch: https://gerrit.wikimedia.org/r/c/operations/dns/+/1100010
[08:19:09] FIRING: [8x] LVSHighCPU: The host lvs5005:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5005 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[08:24:09] RESOLVED: [8x] LVSHighCPU: The host lvs5005:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5005 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[10:16:36] Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10375008 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti7004.magru.wmnet with OS boo...
[10:26:06] Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10375032 (Volans) FYI I have aborted the last reimage execution that was at the last step waiting for use input for the netbox-hiera int...
[13:19:42] netops, Infrastructure-Foundations, SRE: Add QoS markings to profile HDFS analytics traffic - https://phabricator.wikimedia.org/T381389 (cmooney) NEW p:Triage→Medium
[13:20:20] netops, Infrastructure-Foundations, SRE: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10375639 (cmooney)
[13:21:19] netops, Infrastructure-Foundations, SRE: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10375643 (cmooney)
[13:39:15] Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10375702 (RobH) >>! In T380307#10375032, @Volans wrote: > FYI I have aborted the last reimage execution that was at the last step waitin...
[15:13:40] sukhe ryankemper just a heads-up, I updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/1097541 to include eqiad as we got 2 hosts from dc ops last week
[15:14:00] I'm working my way thru the rest of the patch chain to add eqiad as needed
[15:17:58] inflatador: happy to look after this meeting
[15:29:20] ACK, no rush...just want to be prepared for the turn-up call
[15:39:18] inflatador: just one realserver per service?
[15:39:58] (in eqiad, that's it)
[15:47:03] vgutierrez Yeah, this is a net-new internal service that won't be used for awhile...we have a few more hosts that should be handled over within the next quarter so it hopefully won't be single hosts for too long
[16:12:58] netops, Data-Platform-SRE, Infrastructure-Foundations, SRE: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10376400 (Gehel)
[16:42:38] ryankemper: inflatador: note that for the service IP allocation in Netbox (which has already been done by you)
[16:42:44] you also need a manual patch in the DNS repo
[16:42:56] see https://wikitech.wikimedia.org/wiki/LVS#DNS_changes_(svc_zone_only)
[16:44:23] otherwise for example, wdqs-internal-main.svc.eqiad.wmnet will return an NXDOMAIN (which it is currently)
[16:44:57] netops, Data-Platform-SRE, Infrastructure-Foundations: Enable QoS for Hadoop to Presto traffic - https://phabricator.wikimedia.org/T381412#10376533 (CDanis) Thanks Xabriel! In practice I think we'll be adding the QoS marking bits on all traffic transmitted from an-worker* with source port 50010, whi...
[16:46:39] sukhe: I think they have already sent https://gerrit.wikimedia.org/r/c/operations/dns/+/1100010 for that
[16:47:11] oh yeah, I definitely missed that one. sorry folks. (thanks volans)
[16:47:25] +1, thanks volans
[16:47:57] np :)
[16:49:27] ryankemper sukhe I created a checklist for the maintenance today. Still WIP but hopefully it helps us keep track of where we're at during the maintainance itself https://etherpad.wikimedia.org/p/internal-graph-split-lvs
[16:49:42] feel free to add/remove anything
[16:49:44] thanks
[16:57:29] netops, Data-Platform-SRE, Infrastructure-Foundations: Enable QoS for Hadoop to Presto traffic - https://phabricator.wikimedia.org/T381412#10376658 (BTullis) @cmooney has already created {T381389} which might cover this, I think.
Or maybe they should be parent->child tickets of each other. I won't c...
[16:57:41] inflatador: is there a DNS discovery record patch?
[16:58:31] Traffic, Data-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review: Rollout haproxykafka on all hosts - https://phabricator.wikimedia.org/T378578#10376675 (Fabfur)
[16:59:03] for adding the discovery record to template/wmnet and utils/mock
[16:59:16] also will be in a meeting for the next 30 mins but we will start at 13:00 ET as planned
[17:00:20] netops, Data-Platform-SRE, Infrastructure-Foundations: Enable QoS for Hadoop to Presto traffic - https://phabricator.wikimedia.org/T381412#10376686 (CDanis) →Duplicate dup:T381389
[17:00:32] netops, Data-Platform-SRE, Infrastructure-Foundations, SRE: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10376692 (CDanis) I think we need an-worker* source port 50010, which I am pretty sure is just the dataplane of HDFS and not the metada...
[17:02:48] netops, Data-Platform-SRE, Infrastructure-Foundations, SRE: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10376688 (CDanis)
[17:48:40] ryankemper ^^ see above, is there a DNS discovery patch? I don't see one attached to T379334
[17:48:40] T379334: Create DNS records for wdqs-internal-main and wdqs-internal-scholarly - https://phabricator.wikimedia.org/T379334
[17:53:42] * inflatador starts work on the discovery patch
[17:54:02] netops, Data-Platform-SRE, Infrastructure-Foundations, SRE: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10376882 (cmooney) >>! In T381389#10376688, @CDanis wrote: > I think we need an-worker* source port 50010, which I am pretty sure is ju...
[17:59:37] inflatador: running late but back in 6 mins.
and yes we need that patch
[18:02:36] ACK, patch is up at https://gerrit.wikimedia.org/r/c/operations/dns/+/1100165
[18:03:28] looking
[18:04:37] cool, I am up in Meet but going to grab coffee really quick
[18:07:00] I am around, when you guys start, just ping me
[18:17:39] ack. found some issue with a couple backends so working on fixing those before we get started. might be 30'
[18:17:46] ok
[18:24:06] Traffic, Patch-For-Review, User-notice: Remove RSA certificates and use only ECDSA certificates - https://phabricator.wikimedia.org/T370837#10377070 (BCornwall) In progress→Resolved Mail sent to wikitech-l (`Message-Id: `) confirming removal.
[19:21:22] sukhe: I think we'll need to push back, running into a scap issue on the 2 eqiad hosts that I haven't figured out yet. I thought we'd already ran test queries on the eqiad ones but I guess we only did codfw
[19:21:33] ryankemper: totally fair. happy to pick this up tomorrow
[19:21:42] it's best not to get into deploying a new service this way
[19:21:48] agreed :)
[19:22:06] sukhe: What's a good time to block off tomorrow? same time work?
[19:22:07] ryankemper: inflatador: ping me tomorrow when you are ready
[19:22:08] yep
[19:22:25] the earlier the better in a way, so whenever you come online ryankemper
[19:23:34] makes sense. we'll hit you up tomorrow
[19:24:52] sukhe: while I have you here, not sure if you saw the irc backlog from last-night wrt the VIP allocation stuff, but is my assumption that `mw-parsoid` is missing an entry in the netbox UI for its VIP `10.2.2.92/32` correct?
if so I can reach out to the appropriate team
[19:26:21] ryankemper: I didn't see the backlog, sorry
[19:26:47] and yeah, it's missing indeed
[19:26:55] let's see who added it in the dns repo and we can tag them
[19:27:31] added in 17d02986db7670960122aea85acc964e4d4a7ad3
[19:27:49] sukhe: it would be akosiaris I believe
[19:27:51] yep
[19:29:10] akosiaris: quick TLDR, `mw-parsoid` has PTR and A records for `10.2.2.92/32` and `10.2.1.92/32` but is missing an entry in the netbox UI, see https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_to_manually_allocate_a_special_purpose_IP_address_in_Netbox
[19:29:11] ryankemper: I am going to ping the folks in the serviceops channel to see if this was intentional or something (missing the entry); I doubt it but will check
[19:29:31] can you paste the above in general in the serviceops channel? too late for Alex but someone else might see it
[19:29:41] got it
[19:29:44] thanks
[19:30:24] this makes me wonder if we should have some sort of a check for this since it's easy to miss this step and you really don't get an alert
[19:31:46] isn't there already an alert that says "netbox does not match DNS"
[19:32:27] mutante: I think that's mostly if you have changes that are not merged with the DNS cookbook but nothing that really compares two IP addresses.
[19:32:39] ah, yea
[19:32:56] this problem of duplication is due to https://phabricator.wikimedia.org/T270071
[19:33:02] so for now we have to define the IP in both places
[19:34:38] maybe you can add to "authdns-update" to say "hey, human, did you remember adding it in netbox? yes/no to continue" :p
[19:35:09] :]
[19:36:02] I mean there should be some familiar at some point in the process (CI, actually merge) but it is perhaps better to alert after someone makes a change
[19:41:10] ryankemper: can you share how/where you discovered this?
I know you were allocating a /32 and that you might have picked up the one that should have been mw-parsoid
[19:41:13] but when did you actually notice it?
[19:42:09] picking up on mutante's point earlier and thinking of putting an alert for this
[19:42:37] not urgent, you can reply later
[19:42:52] sukhe: mutante: I allocated the IPs and ran the netbox cookbook, then went to make the DNS patch for the A & PTR records and discovered the collision at that point. so if i'd forgotten to do the dns patch I never would have noticed
[19:44:17] ok thanks
[19:44:45] discovered the collision: manually or through something that failed?
[19:46:13] ah , yours is 93 and 94. so the duplication check would have fired anyway here
[19:46:24] I was like why didn't CI fail? it should have for two same IP addresses. ok.
[19:46:38] s/would have/would not have
[19:46:41] sukhe: no, it would have fired. I initially allocated 92 and 93
[19:46:49] then when I caught the collision I bumped mine up by 1 to work around it
[19:46:51] and did CI fail?
[19:47:28] CI for what, DNS repo? if so I caught the issue before actually uploading the patch so not sure
[19:47:29] ok, I am assuming you mean Netbox
[19:47:31] we could trivially check that though
[19:47:58] sukhe: sorry assuming I meant netbox for which sentence of mine?
[19:48:04] 14:46:41 < ryankemper> sukhe: no, it would have fired. I initially allocated 92 and 93
[19:48:07] this I think
[19:48:12] ah yes
[19:48:36] cool, thanks. adds up then
[19:48:37] basically: i allocated 92 and 93 via netbox ui, no issues there since no entry for parsoid. then I ran the netbox sync cookbook, no issues detected since the diffs just showed this:
[19:48:44] https://www.irccloud.com/pastebin/kHyloi4G/
[19:48:57] yeah
[19:49:59] I am curious about what the netbox sync cookbook actually does though. like let's say I didn't notice the collision, but didn't merge the dns patch just only did the netbox sync side of things.
would something have eventually broken in mw-parsoid?
[19:50:57] in case you didn't merge the DNS patch in ops/dns.git, then no, nothing would have broken. and if you tried to merge a duplicate IP, CI should have picked it up but I am confirming it in https://gerrit.wikimedia.org/r/c/operations/dns/+/1100187
[19:51:26] actually this will work, no, wrong check
[19:53:44] 14:53:33 E001|GLOBAL_DUPLICATE: Global duplicate records found:
[19:53:57] yeah so this fails but this doesn't cover what you encountered
[19:54:03] ok, I will think a bit more about this
[19:55:10] your change had a clean CI run https://gerrit.wikimedia.org/r/c/operations/dns/+/1100010
[19:55:38] and even if you had 91 here, it would still have worked. that's what we want to prevent
[20:05:10] this question will expose my ignorance but what do zonefiles actually do? i.e. setting aside the dns repo for a second why would there not be an issue if the netbox sync script has pushed `+92 1H IN PTR wdqs-internal-main.svc.codfw.wmnet.` to `1.2.10.in-addr.arpa`
[20:25:21] in this case you mean or in general? as in the relation? at least in this case, there isn't any include for the PTRs that Netbox is generating above, so whatever is the source of truth is what is in 10.in-addr.arpa in ops/dns.git
[20:25:54] this is only specific for this case though (the SVC records)
[20:26:49] otherwise, we have a bunch of includes in various places in ops/dns.git and those are tightly coupled with Netbox
[20:28:21] so to answer your question: no, no issue with netbox syncing it, because that's just specific to what is in Netbox and whatever you define there, it will push to that repo.
[20:28:47] and ops/dns.git is oblivious to some bits of this so it won't complain as well. the story changes though if ops/dns.git is including bits from Netbox
[22:32:29] ah, I didn't realize there was a parallel discussion about mw-parsoid over here. FYI, I've backfilled the 92 VIPs in netbox and run the cookbook.
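
Editor's sketch of the kind of check discussed above, one that compares service A records in the DNS repo against the IPs allocated in Netbox. This is only an illustration under stated assumptions, not an existing Wikimedia check. The function names `parse_a_records` and `compare` are hypothetical, the Netbox side is represented as a plain `{ip: name}` dictionary rather than a real Netbox API call, and the zone parsing only handles simple one-line records of the form quoted in the log (`mw-parsoid 1H IN A 10.2.1.92`).

```python
import re

# Matches simple one-line A records such as: "mw-parsoid 1H IN A 10.2.1.92"
A_RECORD = re.compile(
    r"^(?P<name>\S+)\s+\S+\s+IN\s+A\s+(?P<ip>\d+\.\d+\.\d+\.\d+)\s*$"
)


def parse_a_records(zone_text):
    """Return {ip: [names]} for every simple A record found in zone_text."""
    records = {}
    for line in zone_text.splitlines():
        line = line.split(";", 1)[0].strip()  # drop zone-file comments
        match = A_RECORD.match(line)
        if match:
            records.setdefault(match["ip"], []).append(match["name"])
    return records


def compare(dns_records, netbox_allocations):
    """Compare DNS A records with a {ip: name} mapping exported from Netbox.

    Returns two dicts:
      - IPs that have a DNS record but no Netbox allocation
      - IPs allocated in Netbox under a name that DNS does not use
    """
    missing_in_netbox = {
        ip: names
        for ip, names in dns_records.items()
        if ip not in netbox_allocations
    }
    name_mismatch = {
        ip: (names, netbox_allocations[ip])
        for ip, names in dns_records.items()
        if ip in netbox_allocations and netbox_allocations[ip] not in names
    }
    return missing_in_netbox, name_mismatch


if __name__ == "__main__":
    # Hypothetical inputs modelled on the records quoted in the log.
    zone = (
        "mw-parsoid 1H IN A 10.2.1.92\n"
        "wdqs-internal-main 1H IN A 10.2.1.93\n"
    )
    netbox = {"10.2.1.93": "wdqs-internal-main"}  # mw-parsoid entry missing
    missing, mismatch = compare(parse_a_records(zone), netbox)
    print("in DNS but not in Netbox:", missing)
    print("name differs between DNS and Netbox:", mismatch)
```

With inputs like these, the first dictionary would flag the `mw-parsoid` address as present in DNS but unallocated in Netbox, which is the gap that allowed `.92` to appear free in the Netbox UI. The second dictionary would catch the opposite situation, where Netbox already holds the address under a different name. How the Netbox mapping is actually exported, and whether such a check belongs in CI or in alerting, is left open here, as it was in the discussion.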