[06:52:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:57:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[09:14:00] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3065.esams.wmnet with OS buster
[09:27:24] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (Volans) I have a question regarding the Ganeti setup, what will be the final clustering? I'm asking in particular to update the spicerack config for ganeti: http...
[09:52:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp3065:9331 is unreachable - https://alerts.wikimedia.org
[09:55:50] ^^ cp3065 is being reimaged
[10:12:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp3065:9331 is unreachable - https://alerts.wikimedia.org
[10:18:29] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3065.esams.wmnet with OS buster c...
[10:19:52] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez)
[10:38:04] netops, Infrastructure-Foundations, SRE: Rebuild Routinator (rpki) VMs with larger disk - https://phabricator.wikimedia.org/T292503 (MoritzMuehlenhoff) >>! In T292503#7495257, @cmooney wrote: > A security update is now available which means we need to upgrade again: > > https://www.nlnetlabs.nl/news...
[14:49:57] (EdgeTrafficDrop) firing: 67% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org
[14:54:57] (EdgeTrafficDrop) resolved: 67% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org
[15:06:57] Traffic, SRE, Patch-For-Review: Wikidough: Support EDNS(0) Padding: RFC 7830 and RFC 8467 - https://phabricator.wikimedia.org/T274431 (ssingh) Responses are being padded to 468 bytes, as expected and per the RFC: ` kdig @185.71.138.138 +tls-ca +tls-host=wikimedia-dns.org wikipedia.org ;; TLS session...
[15:08:18] Traffic, SRE, Patch-For-Review: Wikidough: Support EDNS(0) Padding: RFC 7830 and RFC 8467 - https://phabricator.wikimedia.org/T274431 (ssingh) An example of an incorrectly padded response (and as we reported to dnsdist developers): ` kdig @185.71.138.138 +tls-ca +tls-host=wikimedia-dns.org example....
[15:18:35] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster
[15:44:18] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster executed with errors: - cp6001...
[16:15:19] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster executed with errors: - cp6001...
[16:24:17] bblack: testing the fix for the netbox import script from puppetdb that caused issues yesterday, I noticed that dns1002 has an IP "inet 208.80.154.254/32 scope global lo:legacy" that in Netbox doesn't have any name, neither automatic nor manual, is that expected?
[16:24:21] https://netbox.wikimedia.org/ipam/ip-addresses/4168/
[16:26:27] in total there are 5 IPs in Netbox that are marked as VIP, ACTIVE and have neither a DNS Name nor a comment (cc XioNoX, topranks FYI)
[16:27:11] easy to spot in the first block of this list: https://netbox.wikimedia.org/ipam/ip-addresses/?role=vip&sort=dns_name
[16:38:28] Looking at the server itself, all the IPs assigned to the lo interface have a "Keep Manual" description, so I'd guess this one should have one also, and it was missed.
[16:54:29] volans: Looking at the others, all seem to be unused / unrouted (and probably safe to delete from netbox).
[16:54:35] 208.80.153.83/32 - Unused / Not Routed
[16:54:35] 2620:0:861:107:10:64:48:1384/128 - From private1-d-eqiad range, ND for this IP fails, numbering doesn't match our usual scheme in last nibble.
[16:54:35] 10.0.5.3/32 - Not routed on network
[16:55:22] The exception is 208.80.153.254/32. This is statically routed on CRs in codfw to dns2002.
[16:55:35] none of them have active DNS records (dig -x)
[16:55:44] It is configured on lo interface on dns2002, so similar case to the first one you caught.
[16:57:56] yep. I also grepped the zone files to catch any forward entry pointing to them and nothing there.
[16:58:58] I think traffic needs to fix 208.80.154.254 and 208.80.153.254, either adding the DNS name or setting to manual and creating a static entry for those IPs.
[16:59:07] And the other 3 can probably just be deleted from Netbox.
[17:00:37] SGTM if all agree :)
[17:46:35] FYI I've sent https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/738274 for the DNS Name issue we had yesterday
[18:24:30] I'm not really here today (US holiday) - but I'd be wary of removing any addresses that exist on a host like dns1002
[18:24:39] they probably exist for a reason, even if there's an accounting issue
[18:25:39] at a quick glance at some of the numbers above:
[18:26:16] 208.80.15[34].254 - these are legacy recdns IPs. We've moved on from these numbers, but we left them manually-routed in eqiad/codfw because there were 1-2 more trailing cases of misconfigured hosts/software still using them.
[18:26:32] (I think one was some kind of ancient irc relay box)
[18:27:21] the others, I don't recognize off the bat.
[18:28:15] 10.0.5.3/32 - this might have been something leftover from the initial efforts that ended up making our 10.3.0.1 recdns anycast? some earlier idea that left a lingering loopback addr in place? not sure
[18:28:45] but it's hard to say about any of them without some digging!
[18:30:27] that oddball ipv6 ending in :1384 isn't related to the dns boxes. Seems to have been accidentally-created on sretest1001 instance ~1y ago: https://netbox.wikimedia.org/extras/changelog/?request_id=a455aecf-555c-400f-be21-88f376f4c477
[18:31:28] 208.80.153.83/32 is puzzling, as there's no changelog entry describing its creation, either.
[18:33:25] [and on a higher-level note - it's dawning on me from various things yesterday now that sometimes IP address and/or interface records in netbox are auto-updated/created based on the results of what happens in a reimage? does it also effectively scan them from puppetdb at runtime for netbox import? does this create some odd logical inconsistency with the idea of netbox as the single source of
[18:33:31] truth?]
[18:34:56] The 10.0.5.3/32 one seems to be related to some cloudweb2001-dev instance some time back: https://netbox.wikimedia.org/extras/changelog/?request_id=e8d5f33a-e423-4e50-bfe1-68c226339e47
[18:35:06] * bblack retreats
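
Editor's sketch, not from the channel: the audit discussed above (IPs marked VIP and ACTIVE with neither a DNS name nor a comment, then a reverse lookup with dig -x) could be approximated offline along these lines. The `IPRecord` class, its field names, and the sample rows are assumptions made for illustration. They do not reflect the real Netbox data model or API, and the 192.0.2.10 row is invented. Only the Python standard library is used.

```python
from dataclasses import dataclass
import ipaddress


@dataclass
class IPRecord:
    """Hypothetical, simplified stand-in for an exported IP address row."""
    address: str           # CIDR notation, e.g. "10.0.5.3/32"
    role: str              # e.g. "vip"
    status: str            # e.g. "active"
    dns_name: str = ""
    description: str = ""


def unnamed_vips(records):
    """Return active VIP records that have neither a DNS name nor a description."""
    return [
        r for r in records
        if r.role == "vip"
        and r.status == "active"
        and not r.dns_name.strip()
        and not r.description.strip()
    ]


def reverse_name(address):
    """Build the PTR name that a reverse lookup such as `dig -x` would query."""
    return ipaddress.ip_interface(address).ip.reverse_pointer


# Illustrative data only: two addresses quoted in the log plus one invented row.
SAMPLE = [
    IPRecord("208.80.154.254/32", "vip", "active"),
    IPRecord("192.0.2.10/32", "vip", "active", dns_name="named.example.invalid"),
    IPRecord("10.0.5.3/32", "vip", "active"),
]

if __name__ == "__main__":
    for rec in unnamed_vips(SAMPLE):
        print(rec.address, reverse_name(rec.address))
```

This only flags candidates. As noted in the discussion, an address can still be configured on a host or statically routed, so a hit here is a prompt for digging rather than grounds for deletion.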