[06:51:57] (EdgeTrafficDrop) firing: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:56:57] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:01:57] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:06:57] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[09:03:42] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6002.drmrs.wmnet with OS buster
[09:04:57] mmandere: good morning, I'm back, I've slightly follow what happened with dns6001 (behind the scenes ;) )
[09:05:11] did someone fix all the BIOS setup for the other hosts?
[09:06:09] volans: good morning and welcome back :)
[09:06:39] thanks
[09:06:54] Yes, bblack did fix the Bios on all dns, lvs, ganeti and cp servers yesterday
[09:07:20] I am trying to reimage the second ganeti and see if all is well
[09:08:04] ah :/ I was hoping to have one left untouched for the BIOS automation project, ok thanks for th einfo
[09:12:08] :( understood. Let's see how this current reimage will run
[09:13:17] don't worry, no problem, I'll pick any of the new servers coming in codfw/eqiad
[09:15:00] Great 👍
[10:01:21] netops, Cloud-VPS, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): cr-codfw: set up static route for 185.15.57.8/30 - https://phabricator.wikimedia.org/T295288 (ayounsi) Open→Resolved Good catch! Added. ` ayounsi@bast1003:~$ ping -c1 virt.cloudgw.codfw1dev.wikimediaclo...
[10:22:48] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6002.drmrs.wmnet with OS buster completed: - ganeti6002 (**...
[10:37:27] ^ ganeti6002 is up
[10:37:49] yay
[10:38:49] I'm ready to start advertising drmrs public space through esams btw
[10:43:39] XioNox: ack
[10:44:52] * volans looking at the pending alerts in icinga with mmandere for ganeti6002
[11:27:41] ^ all looks good(with no prometheus preinstalled in drmrs, that's expected, notifications are also disabled for now, so it will be quiet )... confirmed with volans :)
[11:29:26] XioNox: We'll also reach out to you with bblack he'll help with guiding on next steps from our end
[11:32:25] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6003.drmrs.wmnet with OS buster
[11:48:16] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5006.eqsin.wmnet with OS buster
[11:51:06] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez)
[11:55:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp5006:9331 is unreachable - https://alerts.wikimedia.org
[11:57:05] mmandere: sorry, spicerack doesn't yet support alertmanager and as such can't yet silence those ^^^
[11:59:36] ^^ cp5006 is being reimaged
[12:00:06] but it looks like it's having some issues on the reboot into PXE
[12:00:07] sigh
[12:01:42] apparently it's loading debian-installer/amd64/initrd.gz.
[12:02:57] yeah.. no issues.. just eqsin being slow :)
[12:05:45] yep I know is being reimaged, was just stating why this alert has not been silenced
[12:06:05] after I've spent some time with Marc explaining how we downtime things automatically in Icinga :D
[12:06:10] so that could have been confusing :)
[12:10:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp5006:9331 is unreachable - https://alerts.wikimedia.org
[12:12:37] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6003.drmrs.wmnet with OS buster executed with errors: - gan...
[12:18:09] volans: understood
[12:21:13] volans: we ganeti6003 has failing to reimage, the cookbook timed out while waiting the server to reboot
[12:23:02] mmandere: from the logs it didn't even reboot into the debian-installer, it smells BIOS mis-config
[12:23:12] that didn't allow it to reboot into PXE correctly
[12:25:51] volans: misssed that... which logs was checking reimage-extended and reimage
[12:33:50] yes, in those, or also the console
[12:33:53] it said "- Host rebooted via IPMI"
[12:33:55] and then failed
[12:34:00] before getting to the debian installer
[12:34:13] I made a note to myself to make that a bit more explicit on what step failed when it fails that way
[12:43:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp5006:9331 is unreachable - https://alerts.wikimedia.org
[13:02:56] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6003.drmrs.wmnet with OS buster
[13:07:32] volans: ack
[13:09:34] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6003.drmrs.wmnet with OS buster executed with errors: - gan...
[13:15:11] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6003.drmrs.wmnet with OS buster
[13:38:04] XioNoX: on drmrs b12 switch, we don't get dhcp forwarding to install1003 anymore (at all, apparently). b13 still seems to be fine. dhcp-relay config same on both, I suspect it's something with the firewall/routing changes for the esams-public route.
[13:38:50] yeah it
[13:38:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp5006:9331 is unreachable - https://alerts.wikimedia.org
[13:39:12] it's most likely because of the loopback filters
[13:39:35] bblack: does the request get to install1003?
[13:39:54] no, I've been running tcpdump there and seeing nothing while a host is trying PXE DHCP
[13:40:21] we just had a host on b13 work fine before, but the b12 host doesn't get through
[13:40:50] weird, as I pushed the same filters on both
[13:40:52] let me check
[13:41:41] I'm off the switches
[13:42:49] of course, the timing could be an issue with our test info too (maybe b13 doesn't work now either, but the last b13 attempt was a couple hours ago before some config change there)
[13:43:03] we can try another b13
[13:43:11] bblack, XioNoX: in general, would it be useful to have a cookbook that setup the DHCP for a host, stops there and waits for the user to tell when removing it?
[13:43:32] maybe for debugging, but probably not for general use
[13:43:53] bblack: I disabled the v4 filter on b12 to check, can you try again?
[13:44:11] yes, it just suddenly made it through the DHCP step on the console!
[13:44:19] it's booting installer now :)
[13:45:08] I burned up 3/4 of the reimage's reboot max timeout already though, so it might still fail for that reason. it's a race now :)
[13:45:25] lol
[13:46:44] initrd.gz seems to be taking forever, but that might be normal
[13:47:03] but at least it's sitting at that spot insstead of the DHCP spinner
[13:47:15] ah it got it finally
[13:47:46] bblack, that's the two terms, applied on all the core routers, and b12/b13, not sure at first right why b12 would not work https://www.irccloud.com/pastebin/AhYsBhME/
[13:48:51] does wikimedia4 have both drmrs public nets, in the other sites' routers?
[13:48:54] XioNoX: when did you apply those?
[13:49:04] maybe was after the last successful reimage
[13:49:10] 0.0.0.0/32?
[13:49:26] oh, yeah, maybe
[13:49:31] that's odd, shouldn't it be /0 ?
[13:49:36] volans: in the afternoon
[13:49:48] morning I mean :)
[13:50:07] bblack: nah that's correct, broadcast requests are from 0/32
[13:50:07] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp5006.eqsin.wmnet with OS buster c...
[13:50:44] the first term is from the server to the relay, the 2nd one is to allow the reply from install1003 to the router
[13:51:25] I'd be interested to know if that rely uses the same ports as the "old" MX relay
[13:51:27] relay
[13:51:33] XioNoX: shouldn't the second term have source-port 68 and not 67?
[13:51:50] yeah I was staring at that too
[13:52:23] wikipedia says "UDP port number 67 is the destination port of a server, and UDP port number 68 is used by the client."
[13:52:42] that's the correct one for the MXs at least
[13:52:42] yeah well anyone can edit wikipedia though
[13:52:43] so the second one should have destination-port 68
[13:52:58] it's not client to server, it's server to relay
[13:53:28] but if something could have changed between junos version it could be it
[13:53:38] bblack: do you have other re-imaging planned for b12?
[13:53:49] also, if the global settings are similarly problematic elsewhere, probably would've affected vgutierrez reimage in eqsin that seems to be working
[13:54:26] yeah I haven't changed the filters in the other sites, so it's a specificity to drmrs
[13:55:04] XioNoX: we have one queued up for b13 to try (ganeti6004). Or we can re-reimage the other working b12 host (ganeti6001, which was imaged back on Friday).
[13:55:29] 6002 on b13 worked earlier today some hours back, and 6003 on b12 is the one we just got past with the loopback filter change you made
[13:56:05] we can always just re-reimage machines if we need to test more
[13:56:40] indeed my reimage (cp5006) went as expected
[13:56:57] yeah there is 0 change in eqsin
[13:57:07] bblack: let me know before doing ganeti6004 and I'll have a look
[13:57:42] we have our meeting for the next ~hour, probably after that
[13:58:47] ok!
[14:01:23] listening on the ganeti6004 interface I'm seeing "0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from e4:3d:1a:14:7b:10"
[14:03:27] so that's correct
[14:03:30] and 185.15.58.1.67 > 208.80.154.32.67: [udp sum ok] BOOTP/DHCP, Request from e4:3d:1a:71:20:d0
[14:03:58] on install1003 (from b12), so ports are correct too
[14:04:59] ok
[14:05:33] yeah most of them that aren't already-imaged should be in PXE bootloops at this point, probably (I fixed all their bios/firmware settings on all 25 hosts yesterday)
[14:06:01] that's all I need
[14:06:33] and yeah when I re-enable the filter I don't see dhcp on install1003 from b12, so now to figure out what's missing :)
[14:19:48] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6003.drmrs.wmnet with OS buster completed: - ganeti6003 (**...
[14:25:47] such a mystery, so far looks like a Junos bug
[14:26:16] if I log all the discarded traffic nothing relevant shows up
[14:28:29] maybe it's the lack of "forward-only" on the dhcp-relay doesn't work with the other new changes?
[14:29:11] bblack: it works on b13, which make it even more weird
[14:37:50] I'm wondering if the filter doesn't block silently some internal traffic, eg from loopback to loopback
[14:38:09] it's called "optimization" :-P
[14:41:27] I almost want to reboot it...
[14:43:39] is something about the esams-route setup on b12 making it send the dhcp relay traffic over the wrong link somewhere (or sending it via-esams to eqiad where it's getting stuck in some other filter)?
[14:45:54] nah it's not a routing issue as disabling the loopback filter clears the issue
[14:46:07] or setting a default permit instead of the default deny
[14:53:37] there it is!
[14:53:42] 14:53:03 loopback4-lo0.0-i D irb.621 UDP 0.0.0.0 255.255.255.255
[14:54:21] Time of Log: 2021-11-09 14:53:14 UTC, Filter: loopback4-lo0.0-i, Filter action: discard, Name of interface: irb.621
[14:54:22] Name of protocol: UDP, Packet Length: 644, Source address: 0.0.0.0:68, Destination address: 255.255.255.255:68
[14:55:06] so there is something the re-writes the destination port to 68
[14:55:52] doesn't show up in syslog, but show up in the lower level firewall log buffer
[14:57:43] and now it works again!
[14:57:54] https://www.irccloud.com/pastebin/w8kfiX2p/
[14:59:47] thanks!
[15:09:01] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6004.drmrs.wmnet with OS buster
[15:12:25] bblack: I'm going to advertise drmrs publicly if that's ok with you :)
[15:12:32] yes, please :)
[15:12:56] once that's done, can the public-net instances get functional routing if we try to image one?
[15:14:36] bblack: yep, it will be quite asymetrical, but it should work
[15:14:46] outbound will be through drmrs, but inbound through esams
[15:15:51] ok, works for me :)
[15:20:49] bblack: it's live, 185.15.58.131 can be pinged
[15:21:11] \o/
[15:21:35] I'm going to roll the change to all the CRs
[15:21:52] awesome! thank you!
[15:49:16] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6004.drmrs.wmnet with OS buster executed with errors: - gan...
[15:50:04] mmandere: I see reimage failed again on IPMI step?
[15:50:11] well on booting to installer I mean
[15:50:28] last time I ended up doing a racreset + powercycle and then it worked, I don't think I had to make any "real" changes
[15:51:48] bblack: let me give it a try
[15:51:56] ok
[15:52:30] mmandere: basically what I did: do the mgmt serial connection to the racadm>> prompt
[15:52:33] then:
[15:52:39] racadm serveraction powerdown
[15:52:42] racadm racreset
[15:53:07] [wait several minutes - it will disconnect your ssh when it really resets, and it will be a few minutes before the ssh works again for you]
[15:53:11] then once you're back in:
[15:53:16] racadm powerup
[15:53:21] console com2
[15:53:38] [wait to see it making progress on a fresh bootup]
[15:53:44] [re-launch image script]
[15:55:24] bblack: understood... Executing that now
[15:57:39] oh I mistyped the power commands above. The powerup / powerdown commands need the serveraction keyword, too.
[15:57:47] I guess I just missed the one on powerup
[15:57:53] it's "racadm serveraction powerup"
[16:01:28] Traffic, Observability-Metrics, SRE, Patch-For-Review, User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (colewhite)
[16:07:35] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6004.drmrs.wmnet with OS buster
[16:08:30] if you keep the mgmt "console com2" open during the reimage attempt, it can help see what goes wrong (or what succeeds), sometimes
[16:09:54] got it
[16:13:15] looks like it failed a DHCP attempt
[16:13:25] [I'm watching console too]
[16:14:22] I can see the DHCP request arriving at install1003
[16:14:31] [...]Circuit-ID SubOption 1, length 45: asw1-b13-drmrs:et-0/0/13.0:private1-b13-drmrs
[16:14:36] which is the port in netbox for ganeti6004
[16:14:52] maybe the response is not arriving now
[16:20:07] I donno, I don't see the response packet even on intsall1003, but maybe I need to adjust my tcpdump filters
[16:25:02] hmmm I think my filters are right, I think the dhcp server now isn't responding in this case
[16:25:54] the temporary automation file with the port info looks correct on-disk though
[16:26:04] you may have to ask netops to add the "dhcp-helper" ACL in routers to make that work in the new VLAN?
[16:26:20] rememembers from installing install servers
[16:26:29] the req has: Circuit-ID SubOption 1, length 45: asw1-b13-drmrs:et-0/0/13.0:private1-b13-drmrs
[16:26:41] and the automation output has:
[16:26:42] host ganeti6004 { host-identifier option agent.circuit-id "asw1-b13-drmrs:et-0/0/13.0:public1-b13-drmrs"; fixed-address 10.136.1.16;
[16:26:45] which seems to match
[16:27:10] mutante: we do have the dhcp-relay stuff set, but there have been other recent changes! :)
[16:27:20] ah,ACK:)
[16:27:40] still, I don't get why I can't see the OFFER response side coming out in install1003 tcpdump
[16:27:43] hmmm
[16:28:33] bblack: wrong vlan
[16:28:35] log file shows it with no leases
[16:28:35] host-identifier option agent.circuit-id "asw1-b13-drmrs:et-0/0/13.0:public1-b13-drmrs";
[16:28:45] oh, yeah
[16:28:53] I just audited those, too, I thought
[16:29:07] checking again!
[16:29:47] for context I did just cat automation/ttyS1-115200/ganeti6004.conf in /etc/dhcp on install1003
[16:30:15] and https://netbox.wikimedia.org/dcim/interfaces/22481/ is indeed tagged as public
[16:33:21] oh I thought you meant the error was on the switch side
[16:33:22] got it
[16:33:33] the switch is correct (I fixed one port in an audit earlier)
[16:34:47] ah, even the host side is correct, just the switch side in netbox
[16:35:01] is the switch side in netbox automated (and maybe last updated before my earlier fix)?
[16:35:07] or do I just manually edit that?
[16:35:42] I guess nothing auto-populates *into* netbox, it all flows the other way
[16:36:12] either way, fixed in netbox
[16:36:39] mmandere: you'll have to ctrl+C and retry the reimage of 6004
[16:37:28] volans: is there a certain script to manually re-run (for quicker timing or whatever) after editing the vlan of an interface, before reimaging?
[16:38:15] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6004.drmrs.wmnet with OS buster executed with errors: - gan...
[16:38:30] for the earlier questions, yes ideall fix netbox first, run homer and get the switches updated
[16:38:42] "run homer" is the step?
[16:38:42] for the last one I'm not sure what you mean\
[16:39:03] volans: so the port in question was in the wrong vlan earlier today, both on the switch and in netbox
[16:39:18] I fixed the switch config manually earlier today, so it's on the private vlan as it should be
[16:39:20] drmrs switches are not yet managed by homer/automation
[16:39:34] just now, I edited netbox to also set the port correctly to private there.
[16:39:53] but do I need to run some script to make sure, before I run the reimage script again, that the reimager will populate the new data to dhcp?
[16:40:06] no, it getst the data from netbox API on the fly
[16:40:11] ok, thanks!
[16:40:29] mmandere: so yeah, good to go, try running again
[16:41:08] ok
[16:41:32] * volans should probably make some flow diagram of automation data at some point
[16:41:40] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6004.drmrs.wmnet with OS buster
[16:41:49] [the reason this one switch port is in the wrong vlan, is it's the host/port we renamed from bast6001 to ganeti6004 before. We changed the host-side addressing+vlans, but never fixed the switch side back then]
[16:44:59] seeing it send DHCPOFFER packets in response on install1003, so maybe it will work now :)
[16:46:11] seems to be sending them repeatedly though, and no progress on console. *now* we might have a return routing problem
[16:46:59] lol
[16:47:30] anything that can go wrong, will :P
[16:48:40] trying a powercycle while the imager is waiting, maybe just to kick the tires in case
[16:50:10] XioNoX: I don't think dhcp reply from install1003 is reaching b13 hosts, but still looking for other causes
[16:50:25] (possibly b12 too, we haven't tried a fresh dhcp there since the temporary fix let that one through)
[16:50:39] the request is reaching install1003, and install1003 is generating a response packet
[16:50:57] (EdgeTrafficDrop) firing: 67% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org
[16:51:05] but DHCP spinner on the b13 host isn't getting anywhere
[16:53:46] preseed is from apt* vs install*, btw, in case actual DHCP works but it does not get to installer
[16:54:17] yeah DHCP doesn't even get the IP, so far
[16:55:19] mmandere: it's going to fail in any case, can stop the reimage for now
[16:55:39] maybe try dns6001? the routing for it should work now, and it's on the opposite switch, so might not have the same issue if we're lucky
[16:55:40] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6004.drmrs.wmnet with OS buster executed with errors: - gan...
[16:55:57] (EdgeTrafficDrop) resolved: 67% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org
[16:57:39] mmandere: (or if it's getting late for you, I can take over and do more debugging and fixups on this)
[17:05:18] bblack: that's ok. I'll watch out for your notes on the same
[17:06:49] ok :)
[17:08:10] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host dns6001.wikimedia.org with OS buster
[17:10:59] same result
[17:11:29] so both switches seem to have the same problem now: dhcp req makes it to install1003, install1003 generates response packet (seen in tcpdump on install1003), but doesn't reach the end-host in drmrs
[17:13:49] I have to step away, I can look at it a bit when I'm back or tomorrow morning
[17:26:17] ok :)
[17:30:58] I've sent https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/737753 with a quick cookbook (mostly just the pieces from the reimage one for now) to do DHCP debugging
[17:31:18] some things might be generalized/factored out but I thouthg would be simpler for now to get something ready in few minutes
[17:31:21] bblack: ^^^
[17:36:47] nice
[17:37:13] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host dns6001.wikimedia.org with OS buster executed with errors: - dns6...
[17:39:13] if you want to test it right away we can safely merge and in case of issues I'll fix it
[19:04:54] o/ hey folks, can someone help with purging the cache on a specific url? https://wikipedia.org/.well-known/assetlinks.json
[19:05:17] it's currently a 301 redirect, but has been updated to point directly to that file.
[19:11:26] dbrant: done
[19:11:57] Reedy: whooo, many thanks!