[09:30:42] 06Traffic, 06Security-Team, 10WMF-General-or-Unknown, 07ContentSecurityPolicy, 13Patch-Needs-Improvement: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618#9766553 (10TheDJ) Scheduled this for the May 9th puppet window. [09:35:36] 10netops, 06Infrastructure-Foundations: mr1-eqsin performance issue - https://phabricator.wikimedia.org/T362522#9766569 (10cmooney) p:05High→03Medium [09:49:16] 10netops, 06Infrastructure-Foundations, 06SRE: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092 (10cmooney) 03NEW p:05Triage→03Medium [09:50:59] 10netops, 06Infrastructure-Foundations, 06SRE: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092#9766638 (10cmooney) [09:55:38] 10netops, 06Infrastructure-Foundations, 06SRE: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092#9766653 (10ayounsi) Both Junos 22.2R3-Sx and Junos 22.4R3 are the latest recommended releases. fyi, I went with 22.4R3 in magru. [10:16:30] 10netops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095 (10cmooney) 03NEW p:05Triage→03Medium [10:16:52] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9766721 (10cmooney) [10:16:53] 10netops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9766720 (10cmooney) [10:25:35] 10netops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097 (10cmooney) 03NEW p:05Triage→03Medium [10:25:48] 10netops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9766766 (10cmooney) [10:25:49] 10netops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766767 (10cmooney) [10:28:05] 10netops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766769 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b27eb80b-98ee-43fb-8026-b02b3e00b5d4) set by cmooney@cumin1002 for 14 days, 0:00:00 on 3 host(s) and their... [10:35:36] 10netops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE, 13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766810 (10cmooney) Device has been removed from LibreNMS now. I also downtimed it for 2 weeks just in case I mess up the order of anything.
[10:44:52] 10netops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE, 13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766846 (10cmooney) [10:50:23] 10netops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE, 13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766856 (10cmooney) [10:58:58] 06Traffic, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:magru VM tracking task - https://phabricator.wikimedia.org/T364016#9766894 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host durum7002.magru.wmnet with OS bookworm [11:11:53] 06Traffic, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:magru VM tracking task - https://phabricator.wikimedia.org/T364016#9766933 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host doh7002.wikimedia.org with OS bookworm [11:13:10] 10netops, 06Infrastructure-Foundations, 06SRE: Adjust IBGP route-reflector spine/leaf automation to support separate client clusters - https://phabricator.wikimedia.org/T364103 (10cmooney) 03NEW p:05Triage→03Medium [11:39:09] 06Traffic, 06Infrastructure-Foundations: Q4:magru VM tracking task - https://phabricator.wikimedia.org/T364016#9767079 (10MoritzMuehlenhoff) [11:45:14] 06Traffic, 06Infrastructure-Foundations: Q4:magru VM tracking task - https://phabricator.wikimedia.org/T364016#9767112 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host durum7002.magru.wmnet with OS bookworm completed: - durum7002 (**PASS**) - Removed from Puppet... [11:51:44] 10netops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE, 13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767123 (10cmooney) [12:02:38] 06Traffic, 06Infrastructure-Foundations: Q4:magru VM tracking task - https://phabricator.wikimedia.org/T364016#9767181 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host doh7002.wikimedia.org with OS bookworm completed: - doh7002 (**PASS**) - Removed from Puppet a... [12:07:50] 10netops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE, 13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767210 (10cmooney) [12:57:20] https://phabricator.wikimedia.org/F49974214 [12:59:55] 06Traffic, 06Infrastructure-Foundations, 06SRE: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9767450 (10CDanis) magru is a clear win for: UY, CL, AR, BR, PY It's better for some but not all users in: BO, PE {F49974214} [13:02:19] cdanis: very interesting thanks! [13:02:44] fabfur: npnp very happy to do so, let me know if any further breakdowns would be interesting (like breaking Brazil into subdivisions) [13:04:05] cdanis: I'll be interested in a before/after once we have full connectivity and peering :) [13:04:17] XioNoX: sure thing, it's quite an easy query to run [13:04:26] but that's perfect for like 80/20% [13:05:00] cdanis: thanks! PE is probably the most interesting in a way [13:07:16] re: subdivisions of BR, seems like magru is a pretty clear win there, but I think you meant if we want to do a progressive rollout and for that?
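[Editor's note: the per-country comparison cdanis describes is roughly the following shape of query. This is a minimal sketch only, assuming a hypothetical probe-results dataframe with client_country, dc, and latency_ms columns; the actual notebook's schema is not shown in the log. The "clear win" rule mirrors the conservative criterion discussed later in this log (one distribution strictly better than the other at both the median and p75).]

```python
# Sketch of the per-country latency comparison discussed above.
# Hypothetical schema: one row per probe measurement, with the client's
# geocoded country, the DC probed, and the observed latency.
import pandas as pd

def compare_dcs(df: pd.DataFrame, dc_a: str = "magru", dc_b: str = "eqiad") -> pd.DataFrame:
    """Per-country median and p75 latency for two DCs, plus sample sizes."""
    subset = df[df["dc"].isin([dc_a, dc_b])]
    stats = (
        subset.groupby(["client_country", "dc"])["latency_ms"]
        .agg(median="median", p75=lambda s: s.quantile(0.75), n="count")
        .unstack("dc")
    )
    # Call a DC a "clear win" only when it is better at both the median
    # and the p75 -- i.e. one distribution looks strictly better.
    stats["clear_win"] = (
        (stats[("median", dc_a)] < stats[("median", dc_b)])
        & (stats[("p75", dc_a)] < stats[("p75", dc_b)])
    )
    return stats

# e.g. compare_dcs(probes).loc[["UY", "CL", "AR", "BR", "PY", "BO", "PE"]]
```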
[13:07:45] sukhe: yeah, or just if you were curious [13:10:15] depends on how much work it is for you to run this script :) [13:10:24] if not too much, might be interesting to see [13:12:30] it is not at all [13:12:38] https://phabricator.wikimedia.org/F49974214#7084 [13:12:44] is an excerpt from my notebook [13:13:44] ah nice! [13:23:18] sukhe: https://phabricator.wikimedia.org/F49977482 [13:24:28] 06Traffic, 06Infrastructure-Foundations: Q4:magru VM tracking task - https://phabricator.wikimedia.org/T364016#9767555 (10ssingh) [13:26:34] cdanis: 👍 [13:26:56] if you would like anything tweaked about the plots let me know too :) [13:28:24] 06Traffic: Disable Chrome Private Prefetch - https://phabricator.wikimedia.org/T364126 (10OSefu-WMF) 03NEW [13:29:06] a bit surprised about Paraiba here? [13:32:44] 06Traffic: Disable Chrome Private Prefetch - https://phabricator.wikimedia.org/T364126#9767623 (10OSefu-WMF) [13:36:48] 06Traffic: Disable Chrome Private Prefetch - https://phabricator.wikimedia.org/T364126#9767648 (10OSefu-WMF) [13:37:10] 06Traffic: Disable Chrome Private Prefetch Proxy - https://phabricator.wikimedia.org/T364126#9767651 (10OSefu-WMF) [13:37:35] 06Traffic: Disable Chrome Private Prefetch Proxy - https://phabricator.wikimedia.org/T364126#9767653 (10OSefu-WMF) [13:40:02] 10netops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767654 (10cmooney) [13:41:07] 10netops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767659 (10cmooney) a:03Papaul @papaul I think this one is ready to be moved to rack D1 now. [13:41:31] 10netops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767661 (10cmooney) [14:04:57] 10netops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE, 13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767770 (10cmooney) [14:05:03] sukhe: surprised how? [14:05:30] sorry, I should have specified. surprised given I didn't expect eqiad to be as good as it is in the graph [14:05:56] for Paraiba magru looks better [14:07:00] I'd say magru looks better for all the BR subdivisions [14:08:14] 06Traffic, 06Movement-Insights: Disable Chrome Private Prefetch Proxy - https://phabricator.wikimedia.org/T364126#9767780 (10OSefu-WMF) [14:08:33] yeah [14:14:52] some of the data is small sample size, and also has some weirdnesses [14:15:22] like I don't believe the Amapa ulsfo numbers for a second [14:22:17] yeah. I guess it still helps with some of the countries we were split on or on the fence about, for example [14:23:49] I'm very interested in how things shift for BO and PE as we add more transits and also get more data [14:24:39] the PE sample size is quite large (~3k) and pretty convincing that eqiad is actually better most of the time [14:24:51] but magru is better for at least some users [14:25:26] maybe subdivision for PE then? :) [14:25:31] potentially [14:25:46] we've also talked about things like per-asn or per-ipblock [14:27:18] yeah, subdivision isn't sufficient to tease out what's going on [14:28:03] 1746 datapoints (more than half) are from Lima Province and that distribution looks similar to the overall one [14:47:03] 06Traffic, 06Movement-Insights: Disable Chrome Private Prefetch Proxy - https://phabricator.wikimedia.org/T364126#9767992 (10KOfori) p:05Triage→03Medium Thanks, @OSefu-WMF. Received.
We would usually not enable anything impactful going into a weekend. This will be prioritized early next week. [14:51:22] 06Traffic, 06SRE, 10Data Products (Data Products Sprint 13): Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577#9768015 (10VirginiaPoundstone) [14:51:27] 06Traffic, 06Data Products, 06Data-Engineering, 10Observability-Logging, 13Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117#9768016 (10VirginiaPoundstone) [14:52:31] 06Traffic, 06Data Products, 06SRE: Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577#9768036 (10VirginiaPoundstone) [14:53:04] 06Traffic, 06Data Products, 06SRE: Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577#9768033 (10VirginiaPoundstone) Once https://phabricator.wikimedia.org/T351117 is complete, this may need a spike to check if the issue persists. [15:55:44] 06Traffic: Release tcp-mss-clamper for bullseye - https://phabricator.wikimedia.org/T357258#9768363 (10Vgutierrez) 05Open→03Resolved [16:01:10] 06Traffic, 06Infrastructure-Foundations, 06SRE: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9768394 (10CDanis) Oh, and I think magru is a win for SV as well. [16:03:24] cdanis: in the long term (future projects), I'd love to be able to export probenet orderings as per-ipblock [16:04:05] and have some fancy python script suck that up automagically once a week or whatever, and damp it over time, and default missing bits based on maxmind, etc... and spit out a whole global netblock=>dclist for us. [16:04:23] but, definitely a real project, not a quick hack :) [16:04:52] bblack: totally doable, the analysis scripts have access to the full client IP plus geocoded data [16:04:59] nice [16:05:12] we do need to get ISP/ASN joined in there somewhere, I think that's just a matter of editing some config in analytics [16:05:21] going into the weeds a bit: I was really thinking of two kinds of maxmind-based defaulting [16:05:56] but! i'm not sure how good per-ipblock will actually be, until we start explicitly monitoring or recording the user ip > resolver ip mapping [16:06:02] s/monitoring/modeling/ [16:06:03] 1) Obviously, if we have whole large blocks/ASNs missing in our data, some default based on something like our geo-maps + maxmind just to fill in the blanks. [16:06:59] 2) But also - if MaxMind tells us a certain part of the space is a /16 with all the same location info, and we only get probes for X% of the /24s within it, but all with the same results, we can use the structural info from maxmind to infer it applies to the whole /16. [16:08:13] for the user-vs-resolver part: obviously we already have edns-client-subnet for cases where it helps. we've always had the user-vs-resolver problem for many cases though, even with the country-based geodns mapping we do today. [16:08:43] it's an orthogonal problem in my mind, that we try to attack by doing http-level alt-svc routing using the same input data (probenet-derived) as the geoip map. [16:08:58] I mostly agree except that I do think, as you get finer-grained, you have a lot more opportunities to make mistakes [16:09:13] and yeah totally agreed re: alt-svc is the way to step past it [16:09:43] maybe. I have to think about that. [16:10:09] I mean, what we have now is fine-grained in some sense.
all the little networks are in maxmind, we're only paying attention to the country/region field attached to them. [16:11:02] but it can potentially solve some edge cases to get more-fine-grained as well (where we're taking an average guess on a whole country, but maybe there are two major ISPs within it that get different results due to how they peer) [16:11:20] that is also true yeah [16:11:31] and we have some evidence that that kind of thing happens all the time [16:13:11] mapping the user<->resolver is interesting too, but I don't have any great ideas for easily doing that. it would be useful as research input to tell us how prevalent the mismatches are. [16:13:31] we could probably invent a way based on some custom JS and custom DNS queries, etc [16:13:37] but probably complicated :) [16:13:47] yeah indeed, and we are running client code to do these probes now [16:13:54] but complicated :) [16:14:18] anyway I agree we should definitely go ahead with per-ipblock probenetting [16:14:55] another thing we might have to do is to start having a variable sampling rate, instead of just uniform [16:15:02] yeah, that and an automated pipeline combine, I think, for a pretty nice step-change, but we'll need $time to attack it. [16:15:06] yeah [16:15:52] maybe varying the sample rate based on observed traffic levels from $networks? oversample the missing ones, undersample the very popular ones? [16:15:56] that kind of thing? [16:16:33] yeah exactly [16:17:07] or to oversample based on both (lack of) popularity and variance seen in the data [16:17:14] hmmm yeah [16:17:47] there's a lot of bikesheddy logic/math decisions to get into on how we auto-generate a netmap from the data, too [16:17:51] yes :) [16:17:55] all fun stuff to explore I'm sure :) [16:18:04] and also how to evaluate how we're doing over time [16:18:16] anyway, that can wait until v2 or v3 [16:18:43] i guess i'd call v0 where we are now, and v1 having an automated pipeline that outputs *something* that can be compared over time [16:19:00] 06Traffic, 06Infrastructure-Foundations, 10vm-requests: magru: (2) VMs for ncredir - https://phabricator.wikimedia.org/T363881#9768456 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir7002.magru.wmnet with OS bookworm [16:19:03] (for both countries/subdivisions and ipblocks) [16:19:26] seems like a prereq for doing anything more sophisticated -- you'd want your analysis pipeline to be the thing suggesting your sampling rate map, for instance [16:19:42] yeah [16:20:47] we could look at the RUM stuff from perf-team to see if it had a measurable impact in some places. ditto for loading it up for alt-svc too. [16:21:04] yeah we did that in the past too, and it definitely did [16:21:17] (compared probenet with RUM measurements like first paint) [16:21:34] I was also saying earlier in another chan: we could also potentially solve the "slow editing for JP->eqsin->cores" sort of problem via alt-svc this way as well, as an add-on. [16:22:00] if $edit_session_cookie_exists { use alternate mapping that just wants fastest transit/transport to core write dc } [16:22:05] hah [16:22:45] sure, and we should, but i think that's like, saving 10% of the user-visible latency at maximum [16:22:49] combine probenet plus data for transport latencies in our network, to decide if an editor session would be faster going through their fastest edge, or just sending them straight to eqiad, basically [16:23:27] yeah, I don't know how prevalent or bad it is.
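[Editor's note: the non-uniform sampling idea above (oversample networks with little data or high variance, undersample very popular ones) might be sketched as follows. Everything here is hypothetical -- the NetStats fields, the rate formula, and all constants are invented for illustration, since no such pipeline exists yet.]

```python
# Hypothetical sketch of the variable sampling-rate idea discussed above:
# networks we rarely hear from, or whose past measurements are noisy, get
# a higher per-request probe probability; very popular ones get a lower one.
from dataclasses import dataclass

@dataclass
class NetStats:
    traffic_share: float      # fraction of observed traffic from this network
    latency_stddev_ms: float  # spread seen in this network's past probe results

def sample_rate(stats: NetStats, base: float = 0.001,
                min_rate: float = 0.0001, max_rate: float = 0.05) -> float:
    """Probe-sampling probability for one network (all constants made up)."""
    # Oversample based on (lack of) popularity...
    popularity_boost = 1.0 / max(stats.traffic_share, 1e-6) ** 0.5
    # ...and on variance seen in the data.
    variance_boost = 1.0 + stats.latency_stddev_ms / 50.0
    return min(max(base * popularity_boost * variance_boost, min_rate), max_rate)

# e.g. a tiny, noisy network gets probed at the max rate per request:
# sample_rate(NetStats(traffic_share=1e-5, latency_stddev_ms=80))  -> 0.05
```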
[16:23:51] but for cases like magru and eqsin, there can be some pretty large triangle-shaped problems, where the direct path to core DCs is much better if it's all uncacheable anyways. [16:24:04] yeah [16:37:58] 06Traffic: Craft geo-maps file to create lowest-latency routes from south america - https://phabricator.wikimedia.org/T363722#9768516 (10CDanis) magru is a clear win for: UY, CL, AR, BR, PY It's better for some but not all users in: BO, PE {F49974214} [16:43:04] 06Traffic: Craft geo-maps file to create lowest-latency routes from south america - https://phabricator.wikimedia.org/T363722#9768538 (10BBlack) We could choose to use subdivision-level mapping in cases where it makes sense. [16:59:20] 06Traffic: Craft geo-maps file to create lowest-latency routes from south america - https://phabricator.wikimedia.org/T363722#9768621 (10CDanis) Unfortunately subdivision-level mapping didn't help in PE -- there are many regions where magru is both better and worse than eqiad. And over half our data points so f... [17:04:15] 06Traffic, 06Infrastructure-Foundations, 10probenet, 06SRE: decide on an aggregation function to combine multiple probes into a single measurement - https://phabricator.wikimedia.org/T337318#9768653 (10CDanis) [17:04:56] 10netops, 06Infrastructure-Foundations, 10probenet, 06SRE, and 2 others: Add probenet configuration for magru - https://phabricator.wikimedia.org/T362902#9768659 (10CDanis) [17:05:09] cdanis: yeah if we're pre-gaming how an objective script would make decisions: what's the math there? do we compare medians? or something more complex as expressed in those violins? [17:06:03] or maybe even in some equity sense do we look at like the p75s in case they turn up a different decision than the median? [17:06:07] so far we've been very conservative and only changed mappings based on this data when one distribution looks to be strictly better than the other one [17:06:08] it's complicated :) [17:06:17] because there were enough cases where that was true [17:06:45] it's hard to come up with a good answer without understanding why the latency data is shaped like it often is [17:06:54] yeah [17:07:02] but yeah, i think comparing something like p75 is very reasonable to do [17:07:10] you can even make an argument for comparing the mean [17:07:12] I mean there's some obvious factors to do with last-mile access types, and distance of rural areas from regional network hubs, etc [17:07:22] but that they end up with different latencies to our DCs is still kinda odd [17:07:24] because your metric being influenced by the long tail is arguably a benefit [17:08:59] maybe this is a case where geography and networks just don't align: maybe in one region we have an urban ISP with great connectivity out through a certain global transit network, and in the rural parts of the same region we have mostly cellphone users on a totally different network with different transit arrangements. [17:09:18] (and thus different pings to our DCs) [17:09:19] yeah, totally [17:09:47] I count that as +1 for ipblock mapping :) [17:10:44] yeah, for example I had 110ms latency from the DC's wifi, with the first hops a few ms away (so not a wifi issue) [17:10:56] so here's another question: how do you define an ipblock? [17:11:41] with the size of IX.BR (one of the largest IXPs in the world), I'm hopeful that just peering with the route servers will significantly improve latency [17:12:31] XioNoX: can you poke me whenever that happens, or we add a new transit, etc?
because we can do some before/after pictures if you want [17:13:19] cdanis: I'm off most of next week, sukhe or topranks can probably though if they go live before I'm back [17:13:23] ahh ok ok [17:13:27] enjoy :) [17:13:42] 110ms is upsetting for being in the same physical facility haha [17:14:08] haha yeah [17:14:20] 06Traffic, 06Infrastructure-Foundations, 10vm-requests: magru: (2) VMs for ncredir - https://phabricator.wikimedia.org/T363881#9768715 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir7002.magru.wmnet with OS bookworm completed: - ncredir7002 (**PASS**)... [17:18:29] cdanis: the "easy" way is to let them define themselves naturally. start with collecting data against the minimum routable networks: ipv4 /24 and ipv6 /48. If the math results on latency produce the same lists for adjacent networks, merge them into supernets. [17:19:05] the problem with that is our data is gonna have plenty of "holes" in dumb places that prevent merging, but we can maybe close the gaps by looking at how maxmind merges network ip blocks in their db too, even if we're using our data. [17:20:04] that would be a pretty cool project! [17:20:05] (also there's probably very little point using all 7 DCs in every result. Just list the top N (3? 4?), and then you get more merge opportunities) [17:21:10] at the gdnsd level it does the same thing with our current geo-maps: it rips through all the networks in maxmind, maps them to dclists based on country, then does the same kind of "merge adjacent networks into supernets" thing, internally in gdnsd. [17:21:54] it's a nice optimization to do anyways, because in the edns-client-subnet cases, we can output larger supernets in our responses than we might otherwise naturally [17:22:14] which makes life easier for edns-client-subnet-splitting dns caches that query us (fewer misses in their network-varied caches) [17:23:20] interesting chat - just finished catching up! [17:23:37] cdanis I'll make sure to let you know if/when any other transits or peering come online next week [17:24:04] yeah, once transit/peering stabilizes, plus a week or so of data, things may look slightly different [17:24:07] awesome thanks :) [17:26:12] for the curious, the gdnsd code that does it is nlist_normalize() here: https://github.com/gdnsd/gdnsd/blob/master/libgdmaps/nlist.c#L282 [17:26:44] (at that point in processing, all the mapped network blocks from maxmind are in a big list, and "normalize" means do all this adjacent merging and such, before later converting the optimized list to a tree for lookups) [17:28:36] but it has the advantage of starting with maxmind data, which has full coverage (they have some kind of metadata on the entire IP space, there are no holes) [17:31:45] am I naive for thinking we can keep that MaxMind data / coverage of the whole space as a fall-back? [17:32:19] with more-specific overrides based on our probe data for certain networks? [17:37:12] yeah I think so [17:37:47] but what's now the gdnsd geo-maps file (mapping country => dclist manually) would be an input to whatever python script that's munging probenet data, to use as fallback.
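[Editor's note: the "merge adjacent networks with identical dclists into supernets" step bblack describes -- the idea nlist_normalize() implements in C inside gdnsd -- could be prototyped in Python roughly as below. This is a sketch of the idea only, not a port of the gdnsd code; the netmap input format is invented for illustration.]

```python
# Sketch: collapse adjacent networks that map to the same dclist into
# supernets, per the discussion above. Input is a hypothetical mapping of
# /24s to DC preference lists. (IPv4 and IPv6 would need separate passes,
# since collapse_addresses() only accepts one address family at a time.)
import ipaddress
from collections import defaultdict

def merge_supernets(netmap: dict[str, tuple[str, ...]]) -> dict[str, tuple[str, ...]]:
    """Merge adjacent networks with identical dclists into supernets."""
    by_dclist: dict[tuple[str, ...], list] = defaultdict(list)
    for net, dclist in netmap.items():
        by_dclist[dclist].append(ipaddress.ip_network(net))
    merged = {}
    for dclist, nets in by_dclist.items():
        # collapse_addresses() joins adjacent/overlapping networks.
        for supernet in ipaddress.collapse_addresses(nets):
            merged[str(supernet)] = dclist
    return merged

netmap = {
    "198.51.100.0/24": ("magru", "eqiad", "codfw"),
    "198.51.101.0/24": ("magru", "eqiad", "codfw"),  # adjacent, same dclist
    "198.51.102.0/24": ("eqiad", "codfw", "magru"),
}
print(merge_supernets(netmap))
# {'198.51.100.0/23': ('magru', 'eqiad', 'codfw'),
#  '198.51.102.0/24': ('eqiad', 'codfw', 'magru')}
```

[Limiting each entry to a top-N dclist, as suggested above, makes more entries identical and thus creates more of these merge opportunities.]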
and then it would deliver the singular output that gdnsd actually loads in that new world [17:38:10] (all ipblocks=>dclist) [17:39:44] and then as an additional kind of "fallback", we can use maxmind for hole-mapping too (e.g. if maxmind says a whole /16 is one network in one location/asn/etc from their pov, and we have probenet data for only 54% of the /24s within it, but they all look similar... we take the structural hint from maxmind and map that whole /16 according to the probe results we got) [17:42:35] 06Traffic, 06Infrastructure-Foundations: Q4:magru VM tracking task - https://phabricator.wikimedia.org/T364016#9768835 (10ssingh) [17:43:12] gotcha yeah [17:43:25] a lot of work in all that but sounds achievable [17:48:24] yeah, it's stuff we can iterate and improve on over time. like cdanis was saying, we can start simple :) [17:48:42] the nice thing is we'll be in a world where everyone can look at those kinds of things in a python script [17:48:58] instead of the logic being buried in C code somewhere that's a pain to update safely. [18:44:31] 06Traffic, 06Infrastructure-Foundations, 10vm-requests: magru: (2) VMs for ncredir - https://phabricator.wikimedia.org/T363881#9768923 (10BCornwall) 05In progress→03Resolved
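[Editor's note: the "structural hint" hole-mapping bblack describes could look something like this in the eventual python script. A hypothetical sketch only -- the function name, the probed-data format, and the coverage threshold are all invented for illustration.]

```python
# Hypothetical sketch of the maxmind "structural hint" fallback described
# above: if maxmind treats a whole /16 as one network/location, and the /24s
# we *do* have probe data for all agree on a dclist, map the entire /16 that
# way; otherwise leave it to the geo-maps fallback.
import ipaddress

def infer_supernet(maxmind_block: str,
                   probed: dict[str, tuple[str, ...]],
                   min_coverage: float = 0.5) -> tuple[str, ...] | None:
    """Return a dclist for the whole block, or None if the data disagrees.

    `probed` maps /24 strings to measured dclists; `min_coverage` (a made-up
    knob) is the fraction of child /24s that must have probe data.
    """
    block = ipaddress.ip_network(maxmind_block)
    children = [str(c) for c in block.subnets(new_prefix=24)]
    dclists = {probed[c] for c in children if c in probed}
    covered = sum(c in probed for c in children) / len(children)
    if covered >= min_coverage and len(dclists) == 1:
        return dclists.pop()  # all probed /24s agree: apply to the whole block
    return None  # too little data, or conflicting results: use geo-maps fallback

# e.g. with probe results for ~54% of the /24s under a maxmind /16, all
# agreeing on the same dclist, the whole /16 gets mapped to that dclist:
# infer_supernet("198.18.0.0/16", probed_24s)
```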