[08:00:15] Traffic, MediaWiki-Core-HTTP-Cache, MediaWiki-REST-API, Performance-Team, and 2 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (daniel) p:Triage→High
[08:01:57] (HAProxyEdgeTrafficDrop) firing: 66% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:06:57] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:18:57] Traffic, MediaWiki-Core-HTTP-Cache, MediaWiki-REST-API, Performance-Team, and 2 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (daniel) @BBlack do you have thoughts on this?
[10:42:57] (HAProxyEdgeTrafficDrop) firing: 64% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:57:57] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[11:36:57] (HAProxyEdgeTrafficDrop) firing: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[11:41:57] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[12:12:57] (HAProxyEdgeTrafficDrop) firing: 66% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[12:22:57] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[12:27:19] Traffic, Beta-Cluster-Infrastructure, SRE, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (dom_walden) Happening again. Nothing in the apache2.log on mediawiki12 since 11:55 (UTC?)
[12:27:40] Traffic, Beta-Cluster-Infrastructure, SRE, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (dom_walden) >>! In T302699#7952842, @dom_walden wrote: > Happening again. Nothing in the apache2.log on mediawiki...
[13:48:57] (HAProxyEdgeTrafficDrop) firing: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[13:53:57] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:03:57] (HAProxyEdgeTrafficDrop) firing: 58% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:08:57] (HAProxyEdgeTrafficDrop) firing: (3) 58% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:13:57] (HAProxyEdgeTrafficDrop) resolved: (3) 63% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:25:26] (HAProxyEdgeTrafficDrop) firing: (3) 65% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:30:26] (HAProxyEdgeTrafficDrop) resolved: (2) 65% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:37:30] kindly seeking a reviewer for https://gerrit.wikimedia.org/r/c/operations/dns/+/793728 (adding zone validator ignore comments for "duplicate" names by design)
[15:55:30] Traffic, SRE, observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (Vgutierrez) that's intended, every time that acme-chief fetches fresh OCSP stapling responses it issues a reload of apache2 >>! In T293826#7446839, @Legoktm wrote...
[16:32:16] volans: yeah there's a lot of philosophical debate we could have about that check and the different kinds of "hostname" etc... I'm not sure if you really want to get at the deep parts, or just looking for an expedient way to move forward :)
[16:33:41] I mean, there's fundamentally nothing wrong with these "duplicates" in general, and the check's way of looking at the world comes from it being about machines and provisioning, not about public aliasing usages, etc.
[16:34:26] as a pragmatic matter though, marking these is an easy way to get past all this for now
[16:35:08] but then there's also the aesthetic issue that it's kind of ugly to tack these big comments on these cases like they're something horrible to be dealt with later when they're not
[16:37:09] on the other other hand - donate-lb is a particularly poor example, and they could probably be replaced with CNAME -> dyna like so many others, too...
[16:37:21] if it's the only case that trips this, maybe we should just change the records
[16:37:56] well, replaced with something anyways
[16:38:33] I guess right now, text-lb are considered the canonical aliases over in netbox
[16:38:52] it shouldn't fundamentally hurt anything to make these CNAMEs to the respective text-lb at least, though?
[16:39:28] (maybe should check with FR first though)
[16:40:42] but then if we remove our last remaining example of needing this escape-hatch, it will cause avoidance of the pattern (of duplicating A-record IPs when it may be warranted) in future cases (because it trips CI), which might not be a good thing.
[16:42:19] there's some kind of clash at the root of all this, where zone_validator is more about provisioning and IPAM, and as most such things move to netbox, little if any of what's left in manual zonefiles should want zone_validator's rules (which aren't about DNS legality, but about avoiding provisioning mistakes)
[17:25:33] Traffic, SRE, Wikimedia-Incident: All wikis down: error 503 (resolved, follow-up pending) - https://phabricator.wikimedia.org/T308940 (AlexisJazz) >>! In T308940#7951736, @Dzahn wrote: > https://wikitech.wikimedia.org/wiki/Incidents/2022-05-21_-_varnish_cache_busting "A flood of API traffic from an...
[17:39:36] Traffic, DC-Ops, SRE, ops-drmrs: hw troubleshooting: cp6006 b2 dimm issue - https://phabricator.wikimedia.org/T309123 (RobH)
[17:39:58] Heya traffic folks, I just filed https://phabricator.wikimedia.org/T309123 noticing cp6006 has a memory warn and wants a reboot to test
[17:40:10] all good if I do a normal depool+maint in icinga to do so?
[17:40:33] in about 15 min not immediate, going to finish fixing power draw in drmrs first.
[17:40:36] bblack: ^ ?
[17:42:15] if now isnt good it can wait, its a warn not an error so it may correct itself after reboot or just fail the dimm entirely.
[17:50:59] Traffic, ops-drmrs: dns6002 https idrac produces 400 error - https://phabricator.wikimedia.org/T309124 (RobH)
[17:57:23] robh: no it's fine, go ahead, just depool first :)
[17:57:48] cool, thx
[17:58:41] help
[17:58:45] bah, wrong window!
[17:59:55] Traffic, ops-drmrs: dns6002 https idrac produces 400 error - https://phabricator.wikimedia.org/T309124 (RobH) Open→Resolved fixed the power via the idrac ssh cli
[18:07:49] Traffic, DC-Ops, SRE, ops-drmrs: hw troubleshooting: cp6006 b2 dimm issue - https://phabricator.wikimedia.org/T309123 (RobH) It fixed itself with reboot ` Normal,Tue 24 May 2022 18:06:22,The self-heal operation successfully completed at DIMM DIMM_B2., Normal,Tue 24 May 2022 18:06:22,The self-h...
[18:08:36] Traffic, DC-Ops, SRE, ops-drmrs: hw troubleshooting: cp6006 b2 dimm issue - https://phabricator.wikimedia.org/T309123 (RobH)
[18:44:38] Traffic, MediaWiki-Core-HTTP-Cache, MediaWiki-REST-API, SRE, and 2 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (Krinkle)
[18:52:57] (HAProxyEdgeTrafficDrop) firing: 33% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[18:57:57] (HAProxyEdgeTrafficDrop) resolved: 41% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[19:08:46] bblack: late replies to your earlier comments.
[19:09:36] sorry, it was way too much rambling :)
[19:09:42] 1) the reported "issue" for donate-lb is currently at a warning level, so doesn't make CI fail but one of the few exceptions that would make that check pass completely and at that point we could potentially consider it to become an error
[19:10:20] it's complicated by the fact that donate-lb is a bad example of a pattern we might legitimately use otherwise in some cases
[19:11:09] currently we have 29 (including that patch) wmf-zone-validator-ignore comments
[19:11:57] with those we'll have left a bunch of warnings all related to the wikimediacloud.org zonefile, that we could decide to ignore completely or handle in a different way
[19:12:32] in general I agree with your rambling, zone validator was and is mostly to verify the IPAM side of our DNS
[19:12:41] and the fact that we mix things in the wikimedia.org zonefile doesn't help
[19:13:07] yeah I keep thinking we can manage some transition of legacy wikimedia.org to split off parts of it into other new zones, some of which we already hold
[19:13:25] but I think that's probably a very long road, and it might still leave some corner cases :)
[19:13:33] yeah for sure
[19:13:54] side question... but I guess I know your answer
[19:14:22] split private zones off to some other daemon?
[19:14:32] I've now disabled MISSING_ASSET_TAG,MISSING_MGMT_FOR_NAME,TOO_FEW_MGMT_NAMES checks globally
[19:14:57] because right now zone-validator doesn't know if a host is a VM or not
[19:15:08] right
[19:15:12] in the past days of manual things we were adding the ; VM on the ganeti cluster ... comment
[19:15:28] could we make netbox do the same with some custom field?
[19:15:28] ofc I could put back something from the auto-generation but seems a lame solution
[19:15:43] the generation script knows already the VMs
[19:15:47] right
[19:16:05] but seems lame to add a comment for all those records, just to make a validator script happy
[19:16:15] I think the real question at the heart of this is:
[19:16:16] part/most of those checks are basically already in netbox reports
[19:16:25] and I want to check exactly if we're checking the same things
[19:16:41] so probably we're good already with netbox reports checking netbox data coherence
[19:16:46] more than the resulting dns records
[19:16:48] for the non-netbox (manual ops/dns repo) data, what checks is zone_validator executing that are useful and don't duplicate validation the server itself would do, and don't have fairly routine exceptions?
[19:16:53] at the same time, there could be a bug in the generated data
[19:17:20] because we could just have the validator automatically ignore anything from the root zonefiles and only flag things in the includes, too
[19:17:58] eh... that presents some issues
[19:18:21] it would still *parse* them, just automatically ignore anything it would flag from lines of the outermost zonefiles.
[19:18:24] for some records we do have the direct in the ops/dns repo and the PTR in the automatic one, or vice-versa
[19:18:31] ah ok
[19:19:00] I'd be curious about those cases though, that seems weird
[19:19:05] the list of rules checked are basically described here
[19:19:05] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/utils/zone_validator.py#32
[19:19:12] yeah mostly WMCS-related
[19:19:27] some 208... IPs that are in a zonefile delegated to openstack
[19:19:28] IIRC
[19:19:39] ok
[19:20:00] I mean, we could subdelegate the reverse to them as well, to get it out of there.
[19:20:11] sorry gitiles doesn't highlight that file, but you actually prefer it like this :D
[19:20:25] no because the prefix delegated is another one
[19:20:30] those are legacy IPs in our space :D
[19:20:59] you can delegate partial subnets for reverse, I mean
[19:21:02] it's a bit of a mess IIUIC, but I might be wrong, I think that netops and wmcs are working to fix some of those cases
[19:21:35] https://datatracker.ietf.org/doc/html/rfc2317 <- in case that helps
[19:21:57] basically even if a zonefile is e.g. naturally for a /24, you can sub-delegate from one authdns to another at e.g. a /28 boundary, with that technique
[19:22:08] nice!
[19:22:21] I'll relay the info
[19:24:17] but I agree we could revisit what the zone validator is doing and evaluate how much of that makes sense for non-netbox data and maybe skip those
[19:24:39] or make those warnings and elevate all things to error for the netbox data
[19:24:45] something along those lines
[19:25:56] yeah
[19:26:16] the one that stands out for being useful is finding mismatched PTRs from bad manual commits on both sides
[19:26:35] gdnsd used to do that as well, but stopped in more recent versions for esoteric irrelevant reasons
[19:27:04] but it's easy to make a mistake in a manually-managed pair of records for an IPv6 especially, or by missing the trailing dot on the name at the end of the PTR
[19:27:45] but things like duplicates, are also normal/natural in some cases for manual general-case DNS
[19:27:48] * volans has to step out for dinner in a couple of minutes
[19:27:54] np!
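(Editor's illustration of the mismatched-PTR check discussed at 19:26-19:27. This is a minimal sketch, not zone_validator's actual code: the function names, the record tuples, and the example.org names are all hypothetical. It checks from the PTR side, so several names sharing one IP, like the "duplicates" above, are not flagged as long as the PTR target is one of them.)

```python
import ipaddress


def reverse_name(ip):
    """Return the fully-qualified reverse-lookup owner name for an IPv4/IPv6 address."""
    return ipaddress.ip_address(ip).reverse_pointer + "."


def find_ptr_problems(forward, ptrs):
    """Flag PTR records that are malformed or have no matching forward record.

    forward: iterable of (fqdn, ip) pairs taken from A/AAAA records (hypothetical input shape).
    ptrs:    iterable of (reverse_owner_name, target) pairs taken from PTR records.
    Returns a list of (reverse_owner_name, target, problem) tuples.
    """
    # Every (reverse name, forward name) pair that the A/AAAA data justifies.
    known = {(reverse_name(ip).lower(), fqdn.lower()) for fqdn, ip in forward}
    problems = []
    for owner, target in ptrs:
        if not target.endswith("."):
            # A relative target gets the zone origin appended by the server,
            # which is almost never what was intended for a PTR.
            problems.append((owner, target, "missing trailing dot"))
        elif (owner.lower(), target.lower()) not in known:
            problems.append((owner, target, "no matching forward record"))
    return problems


forward = [
    ("text-lb.example.org.", "192.0.2.1"),
    ("donate-lb.example.org.", "192.0.2.1"),  # intentional duplicate IP
    ("host1.example.org.", "2001:db8::1"),
]
ptrs = [
    ("1.2.0.192.in-addr.arpa.", "text-lb.example.org."),
    (reverse_name("2001:db8::1"), "host1.example.org"),  # trailing dot forgotten
    ("2.2.0.192.in-addr.arpa.", "gone.example.org."),  # stale PTR
]
for problem in find_ptr_problems(forward, ptrs):
    print(problem)
```

(In this sketch the shared 192.0.2.1 address passes, while the IPv6 PTR without a trailing dot and the stale 192.0.2.2 PTR are reported.)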
[19:28:01] the global duplicate is useful too I guess
[19:28:26] if someone adds a line without checking if already there, but less probable
[19:28:38] (we had some though in the old manual data at the start IIRC)
[19:29:36] I'm open to keep discussing (at another time) in what direction we should push the zone-validator
[19:29:45] yeah
[20:39:57] (HAProxyEdgeTrafficDrop) firing: 47% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[20:44:57] (HAProxyEdgeTrafficDrop) resolved: 58% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[22:11:57] (HAProxyEdgeTrafficDrop) firing: 58% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[22:16:57] (HAProxyEdgeTrafficDrop) resolved: 66% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[22:21:56] (HAProxyEdgeTrafficDrop) firing: (6) 62% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[22:26:56] (HAProxyEdgeTrafficDrop) resolved: (6) 62% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop