[08:59:49] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster
[09:13:20] netops, Infrastructure-Foundations: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (ayounsi) p:Triage→High
[09:14:46] netops, Infrastructure-Foundations: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (ayounsi)
[09:40:05] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster executed with errors: - gan...
[09:47:42] Traffic, SRE: Varnish packages installed from the wrong component on host reimage - https://phabricator.wikimedia.org/T295120 (ema)
[09:49:21] Traffic, SRE: Varnish packages installed from the wrong component on host reimage - https://phabricator.wikimedia.org/T295120 (ema)
[09:50:54] Traffic, SRE: Varnish packages installed from the wrong component on host reimage - https://phabricator.wikimedia.org/T295120 (MoritzMuehlenhoff) The puppet code lacks a "priority => 1002", if you want to override "main" (which also has priority=1001). See the comments in the apt::package_from_component...
[09:54:18] Traffic, SRE: Varnish packages installed from the wrong component on host reimage - https://phabricator.wikimedia.org/T295120 (ema) p:Triage→Medium
[09:56:07] Traffic, SRE: Varnish packages installed from the wrong component on host reimage - https://phabricator.wikimedia.org/T295120 (ema) >>! In T295120#7484351, @MoritzMuehlenhoff wrote: > The puppet code lacks a "priority => 1002", if you want to override "main" (which also has priority=1001). See the commen...
[10:21:07] Traffic, SRE, Patch-For-Review: Varnish packages installed from the wrong component on host reimage - https://phabricator.wikimedia.org/T295120 (ema) Open→Resolved
[10:29:14] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster
[11:09:27] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster executed with errors: - gan...
[12:10:19] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster
[12:50:34] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster executed with errors: - gan...
[13:00:40] we spent some time with mmandere troubleshooting the drmrs servers re-imaging
[13:01:43] tldr: 1/ the servers BIOS config is not set, I did it for ganeti6001 but all the others will need it (otherwise they can't boot on the proper NIC)
[13:02:11] 2/ the switches are not sending DHCP option 82 as they're configured to
[13:03:36] they're sending it but don't let me customize it, I'll need to consult with volans|off to work around it if I can't make the switch cooperate
[13:05:07] hi :)
[13:05:42] on (1) - what is it we need to fix everywhere? The Ctrl+S setting on the broadcom NIC firmware to turn on booting there?
[13:06:19] (or something in the regular bios settings?)
[13:08:34] bblack, mmandere, forwarded you an email with the instruction, and what I did on top of them
[13:09:59] ok
[13:10:37] yeah I guess, we'll have to confirm all the settings on all of them, since I donno what state they were originally in.
[13:10:52] (probably should double-check all the other settings too, even the ones that don't affect netboot)
[13:11:48] I agree
[13:12:15] XioNoX: re option 82, the switch sends it but misses what customization? the device-name prefix, or is it not putting the port info in, or?
[13:12:56] bblack: it's sending "IRB-irb.621:et-0/0/13.0"
[13:13:13] while we're expecting "hostname:interface:vlan-name"
[13:13:36] so with the former we can still workaround our way to make it work
[13:13:52] hmmm
[13:14:08] (as 621 is the vlan ID and it's only on 1 switch)
[13:14:14] but so far the switch is ignoring any configuration option
[13:15:09] can I take a peek at it?
[13:15:37] bblack: for sure, the device is in a reboot loop and keeps trying DHCP
[13:16:01] so you can check `install1003:~$ sudo tcpdump -n host 185.15.58.131 -vvv`
[13:16:30] yeah I mean the juniper config, peeking now
[13:17:09] https://www.juniper.net/documentation/us/en/software/junos/security-services/topics/ref/statement/option-82-edit-vlans.html
[13:17:19] on the switch there are two potential locations for option-82 settings: `forwarding-options dhcp-relay relay-option-82` and the usual `vlans private1-b12-drmrs forwarding-options dhcp-security option-82`
[13:17:27] so there's 3 settings for the 3 parts it looks like
[13:17:29] I tried with only one or only the other without change so far
[13:17:58] yeah I guess you'll have to have at least some info in the per-vlan one, for the vlan string
[13:18:01] we only use the circuit-id part
[13:18:45] bblack: DHCPD expects this: host-identifier option agent.circuit-id "asw1-b12-drmrs:et-0/0/13.0:private1-b12-drmrs";
[13:19:12] does it work in other DCs correctly?
[13:19:20] yep
[13:19:28] hmmm
[13:19:40] but here it's a different setup, the switches are routers too
[13:19:50] so that's what might confuse it
[13:20:00] or at least confuse me :)
[13:20:14] yeah
[13:20:20] like the relay overrides the option 82 from the switch port
[13:20:27] the IRB-irb.621 is the routing/bridging instance or whatever
[13:21:10] yep it's the switch L3 interface on the private-b12 vlan
[13:21:22] so the host's gateway
[13:23:59] yeah looking at switches in other DCs, they're all configured that same way
[13:24:22] just option-82 { circuit-id { prefix { host-name; } } }
[13:24:28] yep
[13:24:39] is it just a default that it puts the vlan name at the end?
[13:24:49] and "prefix" is what puts the asw name at the front in theory?
[13:26:24] exactly
[13:26:32] "If DHCP option 82 is enabled on the switch, the circuit ID is supplied by default in the format interface-name:vlan-name"
[13:27:44] yeah the forwarding + 82 thing must be in play here...
[13:29:38] but both have the "host-name" option, so regardless of what takes precedence it should be there
[13:30:36] yeah
[13:31:14] the docs are pretty unclear heh
[13:32:00] you could try setting all of the options
[13:32:00] bblack: there is also https://www.juniper.net/documentation/us/en/software/junos/subscriber-mgmt-sessions/topics/topic-map/dhcp-option-82-using.html#id-using-dhcp-relay-agent-option-82-information :)
[13:32:13] that's gonna burn down the DC
[13:32:15] :)
[13:32:41] :) - I mean as in prefix {host-name} + use-interface-description + use-vlan-id
[13:32:57] maybe the defaults are just different in this scenario or needs some other config options combo to get the same result
[13:33:17] yeah, I'll keep digging around it
[13:34:11] hmmm yeah, the relay-level one might be it
[13:38:38] there is also the risk that it's a stuck process on the switch, and the config is correct :)
[13:41:21] yeah
[13:41:30] I was gonna say, by all the docs this seems a bit baffling
[13:42:06] I would guess maybe get rid of the vlan-level config (just the relay-level config) and try to restart whatever process handles dhcp stuff
[13:43:33] I'm wondering if the server didn't fully give up too, I'm not seeing anything coming through install1003
[13:43:50] yeah possible
[13:44:11] I'm gonna go jump on the server console and see what it's doing and double-check various bios settings are what I think they'd be
[13:44:13] yeah that was the initial config, only at the relay level
[13:49:29] it's dhcping again now
[13:49:49] I think it had timed out. I checked bios on the way through, looks pretty sane.
[13:51:05] ah yeah it times out pretty quickly and requires an F-key press to retry, in current config
[13:51:12] maybe there's a setting to make it loop better
[13:53:11] but I'm watching it retry dhcp now, and nothing coming up on install1003 sniff
[13:53:32] so yeah, dhcpd on qfx might be dead, I donno
[13:55:55] (restarting dhcp-service on b12)
[13:56:50] eh, I just did that too :)
[13:58:01] ah so my earlier tcpdump was faulty
[13:58:13] the source IP is the private-vlan gw IP of the b12 switch
[13:58:32] so that changed
[13:58:38] it used to be the loopback
[13:58:39] 10.136.0.1.67 > 208.80.154.32.67: [udp sum ok] BOOTP/DHCP, Request .....
[13:58:57] but still:
[13:58:58] Circuit-ID SubOption 1, length 23: IRB-irb.621:et-0/0/13.0
[13:59:59] added "use-interface-description device" to see if it changes something
[14:01:04] I removed the vlans stanza
[14:01:42] no effect I think, but let me see one more in case we just missed the timing
[14:01:59] yeah no change
[14:02:32] ok
[14:02:42] can't even ask JTAC...
[14:02:43] I'm gonna experiment with some random relay-option-82 things
[14:02:50] because it's odd that nothing's having any impact
[14:03:03] bblack: ok, all yours, got some errands to run anyway. Ping me if needed
[14:04:42] ack, thanks
[14:23:43] so far, nothing I do anywhere in the related relay/82 or vlan/82 stuff ever affects the circuit-id of the packet
[14:24:27] I've even tried a few more-exotic bits like:
[14:24:29] {master:0}[edit forwarding-options dhcp-relay]
[14:24:29] bblack@asw1-b12-drmrs# show
[14:24:29] overrides { user-defined-option-82 asdfasdf;
[14:24:29] }
[14:24:57] ditto for trying various output options at the vlan level
[14:25:06] it's like nothing is paying attention to any of these settings :P
[14:48:01] (I did get the source address back to the loopback IP though, with: forwarding-options dhcp-relay overrides relay-source lo0.0;)
[15:57:13] !!
Circuit-ID SubOption 1, length 45: asw1-b12-drmrs:et-0/0/13.0:private1-b12-drmrs
[15:57:36] I made more changes than I had to though, I think I could get it back a little closer to the original
[15:57:51] (on overall config setup for dhcp-related things I mean)
[15:58:38] and it didn't cause PXE to actually work yet, but I did get that string to show up in the sniff on install1003
[15:58:50] (by removing the forward-only option)
[16:07:18] even with that fixed - you can only successfully set the "circuit-id prefix host-name" option stuff in the global "forwarding-options dhcp-relay" - either for the whole thing or per-group within there. Setting it at the per-vlan level like the other sites' switches doesn't do anything useful.
[16:07:47] other than that tidbit, the main thing was just turning off "forward-only" I think
[16:08:08] now, to debug why this still isn't resulting in a good response packet from install1003! :)
[16:11:40] ---
[16:11:55] probably unrelated to current woes, but while staring at things, I noticed a probable typo:
[16:12:09] "authoratative" in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/install_server/templates/automation.conf.erb#15
[16:18:00] anyways
[16:18:20] of course, the reason for no reply is that the automation stuff is ephemeral and there's no reimage running right now
[16:30:25] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster
[16:30:35] trying reimage again, using mmandere's existing tmux session
[16:33:24] seems to be getting somewhere, it got through tftp and I think is launching installer now
[16:52:23] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster executed with errors: - gan...
[17:02:05] so it got the OS installed, but reimage script failed a bit further down during the icinga parts
[17:02:14] which, looking at the icinga reload:
[17:02:15] Nov 05 17:01:16 alert1001 icinga[11489]: Error: Could not find any hostgroup matching 'asw1-b12-drmrs.wikimedia.org' (config file '/etc/icinga/obje
[17:04:34] which is obviously somehow related to the processing of modules/netops/manifests/monitoring.pp
[17:04:50] probably something to do with the routerless setup there and the "parents" bits, etc
[17:06:25] bblack: you probably need to add the asw:s to https://gerrit.wikimedia.org/g/operations/puppet/+/2ad64c08006b7fbce3aa27f5195b7d3463fc9775/hieradata/common/monitoring.yaml#5
[17:07:06] ah, they are there, but for some reason without wikimedia.org
[17:07:33] the new cloudsws are listed based on their fqdn instead of just the short hostname too
[17:09:11] yeah
[17:10:03] I'm not sure which is "correct", but either way I think the hieradata/ vs modules/ have to match up on whether they're using short or fq names
[17:11:58] I have vague memories of someone saying something about newer JunOS versions using fqdns instead of hostnames, but I don't at all remember the context or where to look it up
[17:13:00] yeah, I'll take a stab anyways, I bet just changing hieradata to fqdn works
[17:16:12] that's https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/fd82d95fa033248fcd52e931d839712bb83fa0ec
[17:16:54] "Starting on JunOS 18.x the reported sysname via lldp is the FQDN instead of the hostname."
[17:17:09] or that's what the committer said ;P
[17:23:01] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster
[17:57:18] bblack: good job!
[17:59:44] indeed more recent Junos versions need the FQDN... https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1383295 for people with Juniper account. tldr, a customer thought it was a bug and asked juniper to "fix" it
[18:00:58] [14/15, retrying in 42.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal..check' raised: Not all services are recovered: ganeti6001:Device not healthy -SMART-,EDAC syslog messages,Filesystem available is greater than filesystem size,Memory correctable errors -EDAC-
[18:01:04] is what it's about to fail on now
[18:01:08] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster completed: - ganeti6001 (**...
[18:01:48] ah that didn't fail it - if that times out, it just doesn't remove the downtime but still succeeds
[18:02:06] there's 4x icinga checks failing for lack of prometheus in drmrs (chicken and egg, need ganeti clusters first)
[18:02:48] XioNoX: yeah so I merged that patch for the FQDN issue
[18:02:58] yep, saw it
[18:04:01] XioNoX: TL;DR on the switch dhcp-relay issue is: configuring the prefix host-name in the per-vlan settings doesn't do anything (in this config), but it can be set at the dhcp-delay level, and the forward-only option needed removing
[18:04:12] s/delay/relay/
[18:04:38] ok, so the key was the "forward-only"
[18:04:52] why? we will never know!
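For reference, the two Circuit-ID strings captured on install1003 above can be told apart mechanically. What follows is only a minimal Python sketch, assuming the expected shape is exactly "hostname:interface:vlan-name" as quoted earlier in the log and that none of the three fields contains a colon of its own. `CircuitId` and `parse_circuit_id` are hypothetical names made up for illustration; they are not part of dhcpd, Junos, or Spicerack.

```python
from typing import NamedTuple, Optional


class CircuitId(NamedTuple):
    """The three fields of the expected 'hostname:interface:vlan-name' string."""
    hostname: str
    interface: str
    vlan: str


def parse_circuit_id(raw: str) -> Optional[CircuitId]:
    """Return the parsed fields, or None if the string is not three
    non-empty colon-separated parts (assumption: no field contains ':')."""
    parts = raw.split(":")
    if len(parts) != 3 or not all(parts):
        return None
    return CircuitId(*parts)


if __name__ == "__main__":
    # The two strings seen in the tcpdump output in this log.
    for raw in (
        "IRB-irb.621:et-0/0/13.0",
        "asw1-b12-drmrs:et-0/0/13.0:private1-b12-drmrs",
    ):
        print(len(raw), raw, "->", parse_circuit_id(raw))
```

The IRB-based string only has two fields, so it is rejected, while the string seen after removing forward-only splits into switch name, port and vlan name.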
[18:04:58] yeah, which doesn't make a ton of sense based on the documentation. they even have examples of setting 82 + forward-only
[18:05:52] I suspect forward-only really means something like "don't muck with the packet, just forward what the host sent", and thus disables option 82 customization.
[18:06:24] but clearly that's not really the truth either, because the switch was still editing option 82 (as observed at install1003), it just was doing it with that irb-based string instead
[18:06:37] either that, or...
[18:08:12] I was going to say: maybe that original irb.IRB option-82 string was actually coming from the host NIC (and maybe it was learning the juniper iface/port from LLDP and then reflecting that back in its own option 82)
[18:08:41] which almost makes sense, except the lldp info the host gets from the switch doesn't seem to include the IRB interface name
[18:09:08] so that had to be coming from the switch, when added to DHCP
[18:11:40] anyway :)
[18:16:03] there's 3 more to do, but I'll leave that for Marc to pick up. And then we need to figure out the ganeti cluster layout
[18:16:24] I gotta refresh my brain on "cluster" vs "group" and how to arrange for split vlans in this new layout
[18:17:00] (I'm not yet sure if it's two separate "clusters", or still all one cluster but two "groups" for the two racks/vlans)
[18:20:20] surely eqiad would serve as a guide here, since it has nodes in every row/vlan
[18:21:05] bblack: and the terminology is different from Netbox as well... https://phabricator.wikimedia.org/T262446#7272106
[18:22:27] ok
[18:22:38] so I think in Ganeti terms, we need a drmrs cluster, and two b12/b13 groups
[18:22:47] so in ganeti terms, we probably want a single drmrs "cluster" with a group-per-rack
[18:23:24] and in netbox terms, the groups will be called cluster, and the cluster will be called a cluster group
[18:24:23] the only thing about this arrangement that doesn't jive in my head now, is:
[18:24:59] a ganeti "cluster" has a single API endpoint like ganeti01.svc.eqiad.wmnet, and in that eqiad case, the IP of that svc hostname is directly from the row C private vlan space.
[18:25:12] so... if we lose row C we lose the ganeti API endpoint?
[18:25:25] (for all the rows, I mean)
[18:26:24] (and does that matter for basic monitoring or functionality? or does it just impact some reporting that can be down for a bit?)
[18:27:36] anyways, we can sort it out monday. I'm not bringing up more hosts now anyways, I don't want to jump ahead while Marc's not online.
[18:27:52] bblack: I think it's mostly Netbox sync, some more details in that thread https://phabricator.wikimedia.org/T270071#6689398
[18:28:12] there is now one live host in the insetup role to shell to at least :)
[18:29:34] welcome to France!
[18:30:21] Congrats on a first live drmrs server
[19:57:02] netops, Infrastructure-Foundations, SRE: Move management routers ssh port - https://phabricator.wikimedia.org/T277438 (RobH)
[20:17:30] Traffic, SRE, Patch-For-Review: Decomission malmok.wikimedia.org - https://phabricator.wikimedia.org/T286480 (Dzahn) >>! In T286480#7221256, @ops-monitoring-bot wrote: > Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2) confirmed this is not in netbox and not in DNS rep...
[20:20:20] Traffic, SRE, Patch-For-Review: Decomission malmok.wikimedia.org - https://phabricator.wikimedia.org/T286480 (Dzahn) I also don't see this host in debmonitor. It seems all done here besides the 2 entries in DHCP/installserver?
[20:24:13] Traffic, SRE, Patch-For-Review: Decomission malmok.wikimedia.org - https://phabricator.wikimedia.org/T286480 (ssingh) >>! In T286480#7486550, @Dzahn wrote: > I also don't see this host in debmonitor. It seems all done here besides the 2 entries in DHCP/installserver? I think that should be it. Last...
[20:25:16] Traffic, SRE, Patch-For-Review: Decomission malmok.wikimedia.org - https://phabricator.wikimedia.org/T286480 (Dzahn) Open→Resolved ACK!:) Also not in Icinga, so it's gone from puppet db.