[02:43:15] Traffic, RESTBase, RESTBase-API, SRE: REST API not returning latest page when queried title is a redirect - https://phabricator.wikimedia.org/T346579 (Brycehughes) Ah ok. Thanks for checking. I suppose this can just sit open for a bit. I have a workaround, it just involves me hitting the API 2-3...
[08:11:59] netops, Infrastructure-Foundations, SRE: Juniper RA receive bug CVE-2023-28981 - https://phabricator.wikimedia.org/T334916 (ayounsi) Open→Resolved a:ayounsi Deployed
[09:01:30] netops, Infrastructure-Foundations, SRE: Juniper RA receive bug CVE-2023-28981 - https://phabricator.wikimedia.org/T334916 (ayounsi) This might need to be rolled back the day we start doing BGP unnumbered between spine and leaf as it seems to rely on it: https://www.theasciiconstruct.com/post/junos-b...
[10:52:34] Traffic, SRE, SRE-swift-storage, Thumbor: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334 (MatthewVernon) > An interesting data point (that I didn't see directly in the other ticket, at least in a quick scan!) would be some idea of the curve of "i...
[11:05:34] Traffic, SRE: Add README and build-specific Dockerfile to purged - https://phabricator.wikimedia.org/T347021 (LSobanski)
[11:06:54] hi, could someone please have a look at https://gerrit.wikimedia.org/r/c/operations/dns/+/959182 (swap eqiad and codfw in geodns defaults due to the switchover)?
[12:42:13] kamila_: trying to understand this a bit and possible I am missing the context so forgive me: but we are going to be repooling eqiad for DNS this week, correct?
[12:42:51] sukhe: yes, but eqiad is slower on the application level now that it's not primary
[12:44:05] (but I am not sure whether it's actually relevant for most traffic)
[12:45:08] I guess what I am trying to parse is that given that in admin_state, "geoip/generic-map/eqiad => DOWN" is set, the default map is already "codfw"
[12:45:23] yes but we will be removing that when we repool eqiad
[12:45:26] but I guess when you repool eqiad...
[12:45:33] yep
[12:49:38] kamila_: +1ed, thanks
[12:49:48] thank you sukhe!
[12:55:19] I'm curious if the default mapping is actually used in practice
[12:56:17] volans: agreed, that's the catch all
[12:56:41] my understanding is only if there is no explicit override in place
[12:57:16] the stuff like in the "nets" block (line 281) is more important
[12:57:17] exactly, but don't we have overrides for all internal networks and external countries?
[12:57:25] I also don't remember if we have done this for the previous switchovers
[12:57:52] we've never been on codfw for more than a month, so I think we didn't
[12:58:17] that's fair
[12:58:48] netops, Infrastructure-Foundations, SRE: Juniper RA receive bug CVE-2023-28981 - https://phabricator.wikimedia.org/T334916 (cmooney) Hmm yeah good point. We can probably upgrade devices to a release with the fix in it before then.
[13:01:53] I do not know whether we need to change this, and if we do then we should probably also reorder eqiad and codfw in not just the default
[13:03:21] the switchover docs (https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Phase_9_-_Post_read-only) say to do that, but I do not claim to understand the exact reasons for that
[13:06:22] if the docs are wrong, that'd be good to know :D
[13:06:24] to be clear, it totally makes sense to me, it's not "wrong", it's just that I think it's not actually used much as we do have more specific mappings that should match pretty much all traffic
[13:06:34] yes, I figured
[13:06:56] (notably I don't see South America in the mapping, so that's probably a big chunk of the world that uses the default?)
[13:07:10] kamila_: correct, that's also where the default comes in
[13:07:10] for the non-mapped traffic that uses the default, sending it to codfw is as good as eqiad as we don't know from where it's coming, but without other knowledge makes sense to send it to the primary
[13:07:48] wut? TIL we didn't map SA
[13:07:55] well I don't see it there
[13:08:05] I was sure we had all major continents/countries mapped
[13:08:09] my bad
[13:08:15] sorry for giving you more work? :D
[13:08:32] not me specifically :D
[13:08:45] you plural :D
[13:09:35] then yes we surely need your patch!
[13:09:44] ouch XD
[13:11:09] but... assuming that the default is worth changing, is it also worth it to change the order for not-defaults? sure, inside NA closer is better, but say for Asia or Oceania the primary/secondary difference might be more relevant than the geography, if that difference is indeed relevant at all?
[13:14:14] no, those are ordered based on latency
[13:14:21] and that doesn't change when the primary changes ;)
[13:15:06] we send traffic to the closest (in latency) PoP (ideally, some mappings are more guessed than others, but we now have the data to do more precise measurements, and started to use them)
[13:15:14] OK, that makes sense
[13:15:40] but then it's not obvious to me that the default is worth changing
[13:15:49] the only mappings that we could consider changing are the internal ones (the network based ones)
[13:16:08] hi :)
[13:16:14] o/
[13:16:23] so this "switch the default" came out of a ticket between alex & I back when he was planning this switch
[13:16:49] initially I had thought we might want to move /more/ things off eqiad as primary, but later realized that probably wasn't the right thing to do, either.
[13:17:41] so yeah, we're just switching the default, which basically only affects things that either (a) we've never bothered to measure + tune for a better list and/or (b) we have no idea where the client is (some IPs have no useful info in MMDB, in many cases because they're anonymous proxies or satellite-based service, etc)
[13:18:18] it can be merged anytime during this window, logically-speaking, as it has no real effect while eqiad is still marked down.
[13:19:38] (and as riccardo mentioned - our normal mappings that we do have, are all latency-based rather than load-based. At least in theory, although a lot of cases are more guesstimations than driven by accurate data!)
[13:19:58] probenet will fix that over time :)
[13:22:03] bblack: that makes sense, thanks
[13:22:51] but how much traffic uses the default anyway?
[13:23:11] we don't really know, but it's reasonable to expect that it's not a large percentage
[13:23:17] yeah
[13:23:30] I mean, it clearly does no harm
[13:23:47] but I am wondering if it is non-negligibly good :D
[13:24:14] (but in case I've already burned up the time we should be spending on it given the small impact, feel free to tell me to go away :D)
[13:24:31] it's more about logical consistency than anything else. the reason eqiad's at the front of the default list when we have no better idea where to send someone, is because it's the primary/write datacenter. All other things about latency or geography being equal, that's the best choice.
[13:24:50] if we're gonna switch primaries every 6 months, then that logic should naturally flip every 6 months.
[13:25:04] okay, that is a good reason
[13:25:08] thank you!
[13:53:28] netops, Infrastructure-Foundations, SRE: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (Jhancock.wm) @cmooney I haven't received it yet. I checked with the dock to make sure it hasn't arrived and we weren't notified but no luck. Is there a tracking number for the package?
[13:59:43] Traffic: Implement VTC tests for PURGE requests - https://phabricator.wikimedia.org/T347297 (Fabfur)
[14:01:08] Traffic, SRE: Implement VTC tests for PURGE requests - https://phabricator.wikimedia.org/T347297 (Fabfur)
[15:33:12] Traffic, SRE, Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (BCornwall) Resolved→In progress
[15:33:24] Traffic, SRE, Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (BCornwall) @Vgutierrez Thanks for your patch fixing thread_pool_max; IIRC @bblack had advised the flat 12000 max threads due to the arbitrary nature of the processorcount. Is this patch to...
[15:56:40] Traffic, SRE, Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (BBlack) To clarify and expand on my position about this thread count parameter (which is really just a side-issue related to this ticket, which is fundamentally complete): 1. Varnish's thr...
[16:30:26] netops, Infrastructure-Foundations: Move cr1-esams<->cr2-esams link to QSFP port - https://phabricator.wikimedia.org/T347323 (ayounsi)
[16:34:16] netops, Infrastructure-Foundations: Move cr1-esams<->cr2-esams link to QSFP port - https://phabricator.wikimedia.org/T347323 (cmooney) There's no free QSFP port on cr1-esams, which was the reason we had to use the 3x10G. We probably need to channelize et-0/0/2 on cr2-esams and use breakout cables if we...
[19:43:19] Traffic, WMF-Legal, Patch-For-Review, Privacy: Add no-transform to Cache-Control header - https://phabricator.wikimedia.org/T218618 (BCornwall) @bblack seems to agree that this header belongs. @Vgutierrez, do you still have reservations of this addition?