[06:41:57] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:46:57] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:51:57] (EdgeTrafficDrop) firing: 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:56:57] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[08:13:24] mmandere: good morning! let me know when you're around to troubleshoot that DHCP
[08:19:45] I've merged the new DHCP cookbook and I'm ready to test it, I guess ganeti6004 would be a good candidate
[08:34:57] alright, should be fixed
[08:35:11] https://www.irccloud.com/pastebin/WjQoIa8E/
[08:35:27] yay
[08:35:45] should I exit the dhcp cookbook and run a full reimage?
[08:36:18] or just stop and then let #traffic take over? :D
[08:37:22] the server replies to the relay with "208.80.154.32.67 > 10.136.1.1.67" (which was already permitted)
[08:37:22] and the relay re-writes the packet to be "208.80.154.32.67 > 255.255.255.255.68" (so it can send it back as broadcast on the proper vlan)
[08:37:22] The catch is that the now re-written packet goes through the loopback filter again (the filter is only applied on input), so it was discarded there
[08:38:30] the why is lost in Juniper's internals, but that's how the dhcp relay stuff works
[08:38:48] volans: up to you, no preference
[08:41:45] ack, I'm reimaging ganeti6004
[08:41:55] with buster as was done previously
[08:43:20] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by volans@cumin1001 for host ganeti6004.drmrs.wmnet with OS buster
[08:52:00] \o/ Host up (Debian installer)
[08:53:24] nice!
[08:59:55] going through as normal so far
[09:22:12] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by volans@cumin1001 for host ganeti6004.drmrs.wmnet with OS buster completed: - ganeti6004 (**WA...
[09:22:48] all good!
[10:01:51] XioNoX: just catching up - nice find on that loopback filter!
[10:02:07] welcome back!
[10:02:17] and great work all, nice to return to some good news like that!
[10:56:15] 10netops, 10Infrastructure-Foundations, 10SRE: Rebuild Routinator (rpki) VMs with larger disk - https://phabricator.wikimedia.org/T292503 (10cmooney) A security update is now available which means we need to upgrade again: https://www.nlnetlabs.nl/news/2021/Nov/09/routinator-0.10.2-released/ I'll dig into...
[14:03:25] XioNoX: volans: thanks for sorting out the dhcp bits! :)
[14:12:48] yw, but I did very little
[16:12:52] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host dns6001.wikimedia.org with OS buster
[16:14:51] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host dns6002.wikimedia.org with OS buster
[16:36:37] XioNoX: committed on drmrs b12 for the reply change made on b13, I think (because dns6001 was timing out)
[16:36:40] bblack@asw1-b12-drmrs# show|compare
[16:36:43] [edit firewall family inet filter loopback4 term allow_dhcp_reply4 from]
[16:36:46] - destination-port 67;
[16:36:48] + destination-port [ 67 68 ];
[16:37:24] I thought I pushed it there too, but maybe I forgot to confirm it and it auto-rolled back
[16:38:24] seems to be working now! :)
[16:39:27] on a more-general note about the reimage cookbook
[16:39:40] shoot
[16:39:47] on these remote/slower hosts (when we don't have a local install node, and maybe even when we do it's generally slower at edge sites)
[16:39:59] the second set of spam about the checks is constantly something like:
[16:40:00] [72/120, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Uptime for dns6002.wikimedia.org higher than threshold: 1091.94 > 958.26
[16:40:04] [73/120, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Uptime for dns6002.wikimedia.org higher than threshold: 1105.38 > 971.71
[16:40:26] yes
[16:40:34] I'm not really sure what's going on there. 10s retry, uptime moves (in this case above) by a different amount than the threshold does
[16:40:45] but at the base of it, it seems like there's a constant offset from a slower earlier stage
[16:41:01] it's polling for the reboot after d-i finishes and reboots the host
[16:41:17] and it checks that uptime < now() - start_time
[16:41:52] because we can still ssh into the host while d-i is going on
[16:42:04] yeah but there's a constant divergence there of ~133 seconds, which is why we fail the threshold
[16:42:17] yes, and that makes total sense
[16:42:22] what is the "threshold"? I assume that it's an assumption of where uptime should be, but in this case we were too slow
[16:42:27] no
[16:42:33] let me recap
[16:42:43] we reboot into PXE and start polling
[16:43:02] when we can ssh into d-i, we check that it is the d-i environment and get the current timestamp
[16:43:15] then we start polling for the uptime until we find a new reboot
[16:44:00] so those 133s are just the delta between d-i boot time and when the cookbook was able to ssh into it
[16:44:25] and it took that timestamp as "start", and now looks for a reboot after that
[16:44:26] so the threshold is just to detect rollover from a new reboot?
[16:44:30] yes
[16:44:38] because d-i does auto-reboot the host upon completion
[16:44:47] ok maybe I'll adjust the point of my contention then :)
[16:44:59] the message is unclear, I get that, happy to change it :D
[16:45:24] it seems like a lot of faily-sounding repetitive spam that's scary, when things are going "normally". Since they're raised exceptions and complaining about some state/numbers that are in fact expected to be that way.
[16:45:38] that is basically running:
[16:45:39] https://doc.wikimedia.org/spicerack/master/api/spicerack.remote.html#spicerack.remote.RemoteHosts.wait_reboot_since
[16:46:56] there is also another aspect, that is we could check less frequently, and I can add context to that too
[16:47:34] yeah or if line-by-line output is what we have to work with, it's ok to spam this much for progress, but maybe the messaging needs to indicate it's normal progress instead of sounding like an exceptional state
[16:48:19] (even more pie-in-the-sky would be to have the 120 retries or whatever just occupy a single output line with some ansi/vt100 spinner/counter or whatever, but then that gets in the way of logging and screen output capture, etc)
[16:49:05] (yes, at some point I wanted to do something similar to what docker does when downloading images)
[16:49:10] but yeah if I put myself in the shoes of someone hitting this for the first time.... the way it is now, it always sounds like something is wrong, when everything's probably fine.
[16:49:44] ack, thanks for the feedback, will improve the wording
[16:50:11] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host dns6002.wikimedia.org with OS buster executed with errors: - dns600...
[16:55:21] hmmm the 6002 failure - it seemed like it was in the installer for a while before failing
[16:55:47] do we save some kind of installer logs/output anywhere when the installer step itself fails?
[16:56:23] there's not one in /var/log/spicerack/sre/hosts/reimage/ for this particular one, I guess because it didn't get far enough
[16:56:38] bblack: did it start the first puppet run? or fail way earlier?
[16:57:07] I think it probably failed in the installer itself somehow
[16:57:36] then no, connect to console is your only help, we don't screen-record it :D
[16:57:44] ok :)
[16:58:07] although I am curious how it detected/reacted
[17:00:44] oh it reached the timeout end
[17:00:54] 2021-11-10 16:49:32,963 bblack 9069 [WARNING decorators.py:157 in wrapper] [119/120, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHos
[17:00:57] ts.wait_reboot_since' raised: Uptime for dns6002.wikimedia.org higher than threshold: 1724.0 > 1590.34
[17:01:12] so basically it gave up after the 120 retries.
technically the installation might be successful now, just unpuppeted
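(For reference, a rough sketch of the reboot-detection loop described above. This is not the actual spicerack implementation - see the doc link for that - and the get_uptime SSH helper is hypothetical; it only illustrates why each retry prints "uptime ... higher than threshold ...".)

    # Record a start timestamp once we can reach d-i, then poll until the host's
    # uptime drops below the time elapsed since that timestamp (i.e. it rebooted).
    import subprocess
    import time
    from datetime import datetime, timezone

    def get_uptime(host: str) -> float:
        """Hypothetical helper: read the first field of /proc/uptime over SSH."""
        out = subprocess.check_output(["ssh", host, "cat", "/proc/uptime"], text=True)
        return float(out.split()[0])

    def wait_reboot_since(host: str, start: datetime, tries: int = 120, delay: float = 10.0) -> None:
        for attempt in range(1, tries + 1):
            threshold = (datetime.now(timezone.utc) - start).total_seconds()
            uptime = get_uptime(host)
            if uptime < threshold:
                return  # uptime reset below the elapsed time: the host has rebooted
            print(f"[{attempt}/{tries}, retrying in {delay:.2f}s] "
                  f"Uptime for {host} higher than threshold: {uptime:.2f} > {threshold:.2f}")
            time.sleep(delay)
        raise RuntimeError(f"{host} did not reboot within {tries * delay:.0f}s")

Because "start" is only taken when the cookbook first manages to ssh into d-i, any delay before that first ssh shows up as the constant ~133s offset discussed above.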
[17:01:32] checking console as advised :)
[17:01:44] yes, so far it has always been much quicker
[17:01:51] and I was asked to reduce the timeout to fail early
[17:02:09] but maybe drmrs is so slow with the eqiad routing that it actually didn't fail yet
[17:02:23] in that case I'll just raise the timeout
[17:02:34] * volans would love to have a more native way in python to customize a decorator
[17:02:44] yeah in general, even in our existing edge DCs, everything about installs is generally slower
[17:02:47] (agent runs, too)
[17:04:22] drmrs is probably even more so, without local infra yet
[17:06:36] well the console doesn't work well in the installer (known, some bios setting)
[17:06:54] but I can log in via the new_install key remotely, and see that the installer is still running and doing things
[17:07:13] 5497 root 2580 S {92updates} /bin/sh /usr/lib/apt-setup/generators/92updates /target/tmp/fileJIys0h
[17:07:24] ^ at least, I assume stuff like that in ps means it's still doing things
[17:07:28] the console was working fine for the one host I tested
[17:07:35] IIRC
[17:07:40] yeah I've had mixed luck, even though they're all the same settings
[17:07:48] :/
[17:07:59] something about the redirect-after-boot setting and related bits, I dunno
[17:08:21] maybe they require a reboot/reset of the racadm?
[17:08:25] maybe just need to run back through them all and set them back the "right" way, but the "wrong" way worked better while debugging bios/nic settings
[17:08:31] and yeah, maybe that too
[17:08:59] if the d-i completes correctly and reboots into the new host, you can re-run the reimage with the --no-pxe option
[17:09:02] to "resume"
[17:09:04] from there
[17:09:53] ok, thanks!
[17:10:31] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host dns6001.wikimedia.org with OS buster executed with errors: - dns600...
[17:10:33] looks like dns6001 is now reaching the same basic fate, so at least it's probably not some new host-specific corner case to debug :)
[17:11:19] * volans wonders why all the first reimages went by ok
[17:11:37] XioNoX: anything that might have slowed things down so much with the recent routing/settings?
[17:11:40] [related rant: why in 2021 can't we ship full-feature bash and util-linux (ps, etc) in the installer shell? it's not *that* much extra download :P]
[17:12:03] [+1000 to that!]
[17:12:03] volans: the first 4 reimages we did were all ganeti hosts in the private network, and with role(insetup)
[17:12:17] these are public-subnet and in real roles, with a different partman layout too
[17:12:20] the role doesn't change d-i
[17:12:21] not sure what else might differ
[17:12:34] the partman part might, although you'd think the time diff would be minimal
[17:12:35] dns don't even have much disk
[17:12:50] for cp hosts I might have that doubt
[17:14:14] so far 6002 is still stuck in the same apt script it was in when I started looking, and I don't see df output changing at all either
[17:14:18] so it might actually be "stuck"
[17:14:46] maybe some missing firewall settings somewhere for a public-network host (vs the private ganetis before) to fetch something from mirrors.wm.o or something like that...
[17:15:19] what resolv.conf does it have?
[17:15:30] if the default domain is wikimedia.org
[17:15:39] apt points to DYNA disc-apt
[17:15:54] search wikimedia.org
[17:15:54] nameserver 10.3.0.1
[17:16:31] the step it's on is:
[17:16:42] {92updates} /bin/sh /usr/lib/apt-setup/generators/92updates /target/tmp/fileJIys0h
[17:16:47] and that file has contents:
[17:16:54] # buster-updates, previously known as 'volatile'
[17:16:55] deb http://mirrors.wikimedia.org/debian/ buster-updates main contrib non-free
[17:17:08] so that's why I'm guessing maybe it can't fetch from whatever mirrors resolves to
[17:17:27] it can ping it
[17:17:34] can you curl http://mirrors.wikimedia.org/debian
[17:17:35] ?
[17:17:44] I get a 301 Moved Permanently from ganeti6001
[17:18:41] I get a 200 OK with a filesystem index
[17:18:42] the right url is with a trailing /, but it's a quick example to check whether connecting works or not
[17:18:46] (from dns6002)
[17:19:00] Index of /debian/
[17:19:00] ../
[17:19:00] dists/                                             09-Oct-2021 10:07                   -
[17:19:03] [...]
[17:19:08] 	 yep same
[17:19:10] 	 so that seems to work
[17:19:17] 	 so the traffic works, yeah
[17:19:22] 	 I wonder what it's doing :P
[17:23:15] 	 trawling the installer logs
[17:23:30] 	 which are succinct and informative, of course :)
[17:23:45] 	 :D
[17:24:14] 	 you sure it's not stuck with a d-i window asking for user input?
[17:24:39] 	 it might be, but I don't think I have working linux console on the virtual-serial right now
[17:30:22] 	 bblack: have you seen this?
[17:30:22] 	 E: Release file for http://mirrors.wikimedia.org/debian/dists/buster-updates/InRelease is not valid yet (invalid for another 2h 29min 34s). Updates for this repository will not be applied.
[17:31:52] 	 interestingly the host has a screwed clock: Wed Nov 10 12:30:44 UTC 2021
[17:32:21] 	 but maybe ntp come way later in the process
[17:32:48] 	 yeah I was noticing the timestamps are off by hours
[17:33:18] 	 that could be another hwinstall/dcops-level thing that got skipped over in this case - maybe they set the TZ and/or set the clock from bios to get things more closely within a reasonable offset
[17:33:31] 	 NTP probably won't step-adjust far enough to fix an issue like that immediately
[17:33:32] 	 it's like the clock was set to EST local time but is being read with a UTC TZ
[17:34:07] 	 yeah so that's probably it
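(A quick way to sanity-check that theory - a hedged sketch, not an existing tool: the URL is copied from the apt source pasted above, and the 300-second tolerance is an arbitrary example value - is to compare the local clock against the Date header the mirror returns:)

    # Compare the local clock with the mirror's HTTP Date header; a large skew
    # explains apt's "Release file ... is not valid yet (invalid for another ...)".
    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime
    from urllib.request import urlopen

    def clock_skew_seconds(url: str = "http://mirrors.wikimedia.org/debian/") -> float:
        with urlopen(url) as resp:
            server_time = parsedate_to_datetime(resp.headers["Date"])
        return (datetime.now(timezone.utc) - server_time).total_seconds()

    skew = clock_skew_seconds()
    print(f"local clock is off by {skew:+.0f}s")
    if skew < -300:
        print("clock is behind the mirror; apt will treat fresh Release files as 'not valid yet'")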
[17:37:40] 	 I think I should probably set the console setting the "correct" way from wikitech too, to make debugging easier
[17:38:03] 	 so yeah, I'll do another round of "bios access to all the things in drmrs" and see if I can correct that, and get the clocks reasonably-close enough at the bios level
[17:38:09] 	 and then go again! :)
[17:41:53] 	 ack
[18:29:53] 	 the linux console output mystery continues, but hopefully the timestamp fixes will help!
[18:30:17] 	 I have 9 consoles open now in parallel (the 4x ganeti that have linux installed, and also the 3x lvs + 2x dns that still need reimaging)
[18:30:45] 	 confirmed all are identical on the important settings (everything related to bootup, serial console, the basics about processor settings, etc)
[18:31:40] 	 after all recent changes were saved to bios, they were all put through a sequence of: "racadm serveraction powerdown; racadm racreset; [... wait for idrac to come back, ssh back into idrac ...]; racadm serveraction powerup; console com2"
[18:32:06] 	 and the result for the 4x ganetis was that ganeti6001 had a normal console output (showed all the kernel boot stuff and a login prompt)
[18:32:58] 	 ganeti600[234] all came up into the runtime linux install, but the console is blank after the "Loading initial ramdisk...", never get a login prompt or any kernel output (but again, they're online, it's just a serial console output issue)
[18:33:32] 	 and the rest (dns+lvs) - they're presumably back to bootlooping on disk/PXE attempts now without an installer running, but when they get to the boot stage they stop producing any visible console output too :P
[18:33:47] 	 what a mess
[18:34:16] 	 it might also just be missing bios/firmware updates to get past some known bugs or whatever, too
[18:34:21] 	 I didn't even try to update any of that
[18:35:04] 	 will try again imaging dns6 with --new and see if they make it past the apt timestamp issue :)
[18:37:29] 	 ack
[18:37:38] 	 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host dns6001.wikimedia.org with OS buster
[18:37:46] 	 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host dns6002.wikimedia.org with OS buster
[18:38:11] 	 if you need a more in-depth comparison of the bios/idrac settings bblack, I can run the redfish dump, as I have some half-baked code that will be part of the bios settings automation
[18:38:42] 	 for now, if we can just get the installs pushed through, we can always look at this later
[18:38:57] 	 (I assume dcops will probably look again, when they go to do bios/firmware updates at some point later)
[18:41:06] 	 it'd be nice if we had options, from a vendor like dell, to have a different BIOS that was tailored to larger-scale linux datacenter installations.
[18:41:21] 	 it sure seems like we wait through and wade through a lot of legacy crap that doesn't matter to us, just to get hosts booted and installed :P
[18:41:48] 	 maybe some open source firmware meant for this use-case, or something
[18:47:07] 	 I wonder if coreboot.org works on any of the Dell servers
[18:47:12] 	 the serial thing could be something else I don't understand yet, too
[18:47:36] 	 like, maybe the BIOS settings are *now* correct, but they need to be correct at imaging time because it determines some runtime config, indirectly
[18:48:25] 	 mutante: yeah that's kinda what I was thinking, but there's a lot I don't know and haven't looked into
[18:48:49] 	 "Dell Is Exploring The Use Of Coreboot, At Least Internally"  "23 March 2017 " :p
[18:48:59] 	 https://www.phoronix.com/scan.php?page=news_item&px=Dell-Coreboot-Confidential
[18:49:11] 	 like: does it support features we need (remote ssh console for low level remote ops, APIs/IPMI/whatever for automations, etc) + hardware support for all the models we use, and add-in cards, etc.
[18:49:43] 	 it's kind of hard to forge your own path on this stuff without vendor support
[18:50:16] 	 yea, it does sound like a really large effort
[18:50:59] 	 still, you'd think the market for rack servers used exclusively with Linux would be big enough that it would warrant a split
[18:51:59] 	 e.g. Dell might sell all their rack hardware in two variants that just differ by some firmware/bios/idrac bits: one that's a linux-only box and optimized for our kind of use-case (and probably implicitly also works well for FreeBSD and the like), and one that's more legacy/generic and supports Windows and other alternate/older OSes.
[19:12:35] 	 made it to first puppet runs on both so far, so that's better! :)
[19:13:29] 	 :) we see that Icinga picked up dns6001 and dns6002 in their IPv6 form
[19:30:08] 	 yeah puppet is failing and ipv6 appears to be only configured with the link-local, not the real ipv6
[19:30:12] 	 something amiss there
[19:30:52] 	 for that matter, the ganeti600x are in the same state, it just didn't muck up their puppetization in the basic role(insetup)
[19:31:00] 	 (ipv6 is set to link local only)
[19:31:04] 	 hmmmm
[19:31:49] 	 e/n/i has:
[19:31:52] 	 	pre-up /sbin/ip token set ::185:15:58:5 dev ens3f0np0
[19:31:52] 	 	up ip addr add fe80::185:15:58:5/64 dev ens3f0np0
[19:32:12] 	 whereas a [very] similar eqiad host shows:
[19:32:12] 	    pre-up /sbin/ip token set ::208:80:154:10 dev ens2f0np0
[19:32:12] 	    up ip addr add 2620:0:861:1:208:80:154:10/64 dev ens2f0np0
[19:33:09] 	 now I'm trying to remember all the contortions on the sources of installtime/runtime truth for this
[19:33:24] 	 I know puppet does do some augeas editing of /e/n/i, I think the installer does too
[19:33:43] 	 it could be there's no RA on these vlans advertising the v6 prefix for the token-based approach to happen during install?
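(To illustrate the token mechanism being discussed - a sketch only, with the prefix and token values copied from the eqiad example above: the host pins a fixed interface identifier with "ip token set", and the network half is supposed to come from the /64 advertised in RAs, so without an RA the host never forms the global address.)

    # How "ip token" composes with an RA-learned /64: the global address is the
    # advertised prefix OR'd with the fixed token (interface identifier).
    from ipaddress import IPv6Address, IPv6Network

    def tokenized_address(prefix: IPv6Network, token: IPv6Address) -> IPv6Address:
        return IPv6Address(int(prefix.network_address) | int(token))

    prefix = IPv6Network("2620:0:861:1::/64")   # /64 from the eqiad example above
    token = IPv6Address("::208:80:154:10")      # from "ip token set" in e/n/i
    print(tokenized_address(prefix, token))     # -> 2620:0:861:1:208:80:154:10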
[19:35:30] 	 bblack: did the reimage complete?
[19:35:43] 	 well, it's waiting for the final confirmation of a successful agent run
[19:35:53] 	 but it's not happening, the agent is failing runs because of the bad IPv6 config
[19:36:11] 	 failing as a full failure, or just puppet failures while the runs keep going?
[19:36:22] 	 the installer stuff looks at the SLAAC IPv6 the installer picks up, and uses it to initially configure e/n/i
[19:36:36] 	 the agent run ends in failure (I can log in and run it manually)
[19:36:59] 	 the cookbooks are sitting on:
[19:37:00] 	 [7/60, retrying in 210.00s] Attempt to run 'spicerack.puppet.PuppetHosts.wait_since' raised: Unable to find a successful Puppet run
[19:37:03] 	 Caused by: Cumin execution failed (exit_code=2)
[19:37:22] 	 but the ipv6 issue is real, and isn't cookbook-related I don't think
[19:37:43] 	 the seed of the issue is that there was no correctly-prefixed SLAAC for the installer to initially pick up
[19:38:00] 	 probably because we don't have IPv6 RAs working on our L3 switches to hand out that info to the hosts
[19:38:32] 	 I could hack e/n/i and reboot them to fix it for now
[19:38:38] 	 but need to address the root problem too
[19:39:37] 	 dinner's ready over here, I have to step out, sorry
[19:40:55] 	 np!
[19:41:16] 	 for now, I'm manually fixing dns600[12] with "ip addr [del|add]" CLI just to make the agent run succeed
[19:46:59] 	 unrelated side-rant: when did the standard "date" command start outputting AM/PM-formed times when the host has a UTC TZ set? :P
[19:47:09] 	 it's annoying, it used to always use a 24-hour form
[19:47:53] 	 I guess it's just a change of the default output format-spec
[19:52:05] 	 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host dns6002.wikimedia.org with OS buster completed: - dns6002 (**WARN**...
[19:52:08] 	 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host dns6001.wikimedia.org with OS buster completed: - dns6001 (**WARN**...
[20:03:27] 	 XioNoX: for whenever you're next able to look at drmrs stuff: ntp packets aren't reaching bidirectionally between dns600[12] and the other ntp servers of the network.  Probably some ACL issue.
[20:03:39] 	 (ferm rules look ok)
[20:04:12] 	 (and I mean other servers globally: they need bidirectional reachability with the core sites' ntps)
[20:07:41] 	 volans: something to peek at and confirm later: when the dns6 boxes finished up the cookbook (successfully) - they made some netbox edits automagically, which included editing the fixed virtual IPs for ns0.wikimedia.org and similar
[20:08:15] 	 some entries in the cookbook output from that look like:
[20:08:16] 	 [default] 208.80.153.231/32: already have PTR ns1.wikimedia.org
[20:08:19] 	 [default] 91.198.174.239/32: already have PTR ns2.wikimedia.org 
[20:08:27] 	 and then you can see the diff in netbox:
[20:08:29] 	 https://netbox.wikimedia.org/extras/changelog/?request_id=586ee416-75ff-40fc-96a9-62a10db2df2b
[20:08:44] 	 here's ns0 in particular as an example:
[20:08:46] 	 https://netbox.wikimedia.org/extras/changelog/68989/
[20:09:07] 	 not sure what if any pragmatic impact that has, but it seemed odd to me that individual hosts sharing such a virtual IP are causing changes to those objects on reimage
[20:29:44] 	 bblack: at the end of the reimage the script runs the Netbox script to import data from puppetdb for the given host
[20:29:47] 	 https://netbox.wikimedia.org/extras/scripts/interface_automation.ImportPuppetDB/
[20:29:58] 	 to replace the ##PRIMARY## interface with the real ones as seen by the OS
[20:30:05] 	 and exported to PuppetDB
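(Roughly, that import works by asking PuppetDB what interfaces the freshly imaged host actually reports. A hedged sketch of the idea follows; the PuppetDB hostname/port and the use of the "networking" fact here are illustrative assumptions, not the actual interface_automation code:)

    # Fetch the 'networking' fact for a host from PuppetDB's v4 query API and list
    # the interfaces the OS reports, i.e. the data used to replace the ##PRIMARY##
    # placeholder interface in Netbox.
    import json
    from urllib.request import urlopen

    PUPPETDB = "http://puppetdb.example.wmnet:8080"   # placeholder endpoint

    def os_interfaces(certname: str) -> dict:
        url = f"{PUPPETDB}/pdb/query/v4/nodes/{certname}/facts/networking"
        with urlopen(url) as resp:
            facts = json.load(resp)
        return facts[0]["value"]["interfaces"] if facts else {}

    for name, data in os_interfaces("dns6001.wikimedia.org").items():
        print(name, data.get("ip"), data.get("ip6"))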
[20:30:20] 	 I'm looking at the diff
[20:31:29] 	 yeah the interface-name ones (and creating the secondary interface), I get
[20:31:36] 	 VIPs shouldn't be attached to the hosts in Netbox in general, as they can float anytime from one host to another, not sure if this is the case
[20:31:43] 	 but the ns[012a].wikimedia.org address updates seem weird
[20:31:59] 	 those VIPs aren't even really "floating", they're anycast and completely disassociated in that sense
[20:32:22] 	 (at least they're logically anycast.  in practice, custom router rules pin ns[012] to specific hosts)
[20:32:36] 	 but either way, this all seems related to the dns update alert now
[20:33:00] 	 I ran the sre.dns.netbox 'test' as instructed by wikitech
[20:33:02] 	 it gave:
[20:33:04] 	 2021-11-10 20:27:32,308 [INFO] Generating DNS records                                                                                              
[20:33:06] 	 and interestingly enough it fails
[20:33:07] 	 2021-11-10 20:27:32,606 [ERROR] Failed to run                                                                                                      
[20:33:08] 	 this is new
[20:33:10] 	 Traceback (most recent call last):
[20:33:13] 	   File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 689, in main
[20:33:16] 	     batch_status, ret_code = run_commit(args, config, tmpdir)
[20:33:19] 	   File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 595, in run_commit
[20:33:22] 	     records.generate()
[20:33:25] 	   File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 386, in generate
[20:33:28] 	     hostname, zone, zone_name = self._split_dns_name(address)
[20:33:31] 	   File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 475, in _split_dns_name
[20:33:34] 	     key=attrgetter('prefixlen')
[20:33:37] 	 ValueError: max() arg is an empty sequence
[20:33:39] 	 so yeah
[20:33:49] 	 not sure if that was induced by the same ns[012a] weirdness, or something else about the dns6 hosts/interfaces
[20:34:20] 	 yeah weird, checking
[20:44:36] 	 manual-fixed the live ipv6 config and the e/n/i files on dns6* + ganeti6* (the only ones imaged so far), so the ipv6 weirdness should be gone for now
[20:44:45] 	 the "real" fix is probably RA config on the switches, for future ones
[20:45:17] 	 btw ssh-ing into the new hosts takes a long time for me, stuck for a while before it goes on
[20:45:40] 	 probably the v6 thing, it was *just* fixed
[20:45:46] 	 I don't have v6
[20:45:50] 	 oh hmmm
[20:45:54] 	 v6 from bast->host?
[20:45:55] 	 but could be, let me retry
[20:46:15] 	 the two lines it gets stuck are
[20:46:16] 	 debug1: Remote: /etc/ssh/userkeys/volans:1: key options: agent-forwarding port-forwarding pty user-rc x11-forwarding
[20:46:33] 	 # 30s pass by and then proceed into
[20:46:40] 	 debug3: send packet: type 80
[20:46:41] 	 debug3: receive packet: type 82
[20:46:56] 	 which bast do you use to reach them, and which host are you trying?
[20:47:30] 	 bast3005 and dns6002
[20:48:54] 	 yeah going via bast3005, I'm seeing a similar long delay
[20:50:03] 	 hmm yeah that's strange, I get the same delays now for dns600x via either that or the bast I was using before (2002)
[20:50:08] 	 I didn't notice any problem earlier
[20:58:04] 	 anyways, the ssh delays are likely ipv6-related
[20:58:08] 	 root@dns6001:~# ping -6 bast2002.wikimedia.org
[20:58:08] 	 connect: Network is unreachable
[20:58:15] 	 whereas ipv4 pings work fine
[20:58:28] 	 for the DNS stuff I think it might be some transitional code that was needed when we migrated to the netbox-based data, I'm double checking things
[20:58:36] 	 so in addition to basic RAs for IPv6 on the drmrs vlans, we're probably missing some ipv6 routing in general
[20:59:15] 	 maybe it delays more now *because* I fixed the ipv6 on the host side :)
[20:59:55] 	 lol
[21:00:26] 	 the default route is missing for ipv6 on the hosts too
[21:00:35] 	 the entry that usually looks something like:
[21:00:41] 	 default via fe80::1 dev ens2f0np0 proto ra metric 1024 expires 596sec hoplimit 64 pref medium
[21:00:58] 	 I think that comes from the RA as well, so maybe that's still the only real ipv6 problem
[21:02:02] 	 (RA in all my writing above, for anyone reading, is Router Advertisement from https://en.wikipedia.org/wiki/Neighbor_Discovery_Protocol )
[21:05:12] 	 poking at the switches a little, to see if I can solve the RA issue myself easily or not
[21:08:41] 	 volans: so the alert cleared, did the netbox+dns thing go away somehow?
[21:09:04] 	 bblack: which alert?
[21:09:13] 	 the uncommitted dns one shouldn't yet
[21:13:10] 	 oh I guess it's flapping
[21:13:33] 	 (see last few lines in -ops)
[21:13:37] 	 I'm fixing the data in netbox manually and testing changes to the script
[21:14:03] 	 ah the systemd unit
[21:16:09] 	 ok data in netbox is correct now
[21:17:55] 	 what was it that was off?
[21:19:24] 	 the import script from puppetdb did assign the DNS Name to the ns0/1/2 and nsa records
[21:19:29] 	 that are marked to be kept manual
[21:19:54] 	 I need to do a patch to the script so that it doesn't try to do that again
[21:20:17] 	 but I'll do that tomorrow as I need to check side effects
[21:20:47] 	 ok
[21:21:23] 	 I don't think it should be triggered by other reimages
[21:21:27] 	 as those are specific to the dns hosts
[21:21:40] 	 and probably hitting a corner case
[21:23:13] 	 in the sense I know the 1 line I need to remove and I checked that does what it should in this case, but I need to make sure it still does all that's needed in all other cases :)
[21:23:27] 	 :)
[21:24:21] 	 I'll send the fix in the morning
[21:24:24] 	 sorry for the trouble
[21:24:32] 	 so I added the ipv6 RA stuff and it seems to work (might be another minute before every host auto-recovers to working state now)
[21:25:28] 	 XioNoX: the RA config I set up is probably less-than-ideal, but "works".  I added the basic RA parts, but it ends up advertising the router IP as e.g. fe80::cafe:6a02:6d2d:3800 instead of fe80::1.  Probably needs some other bits :)
[21:25:31] 	 I can confirm quick ssh time
[21:25:42] 	 thx
[21:25:47] 	 stepping out for a bit
[21:25:54] 	 cya, thanks for all the help!
[21:25:56] 	 (or maybe until tomorrow :D )
[21:26:00] 	 np at all
[21:26:47] 	 as long as I'm logged into the switches again, taking a peek to see if there's an easy ACL fix for ntp traffic
[22:09:26] 	 I ended up adding an allow_ntp6 term to loopback6 thinking it was an ipv6+ntp issue (since I saw allow_ntp6 in some other routers)
[22:09:53] 	 I don't think that was actually a/the problem with ntp.  Eventually I just had to restart the core sites' ntp daemons to get it peering right.
[22:10:17] 	 either those ntpd had cached old dns info, or after so long with no response (a couple weeks) they really gave up or got in a stuck state about the peer relationship.
[22:10:24] 	 in any case, ntp peering seems to work now!