[00:44:44] 06Traffic: Track WMF owned non-canonical domains - https://phabricator.wikimedia.org/T247618#9983765 (10BCornwall) 05Open→03Resolved a:03BCornwall
[00:51:11] 06Traffic, 10GitLab (Project Migration), 13Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623#9983778 (10BCornwall)
[01:21:37] 06Traffic, 10GitLab (Project Migration), 13Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623#9983815 (10BCornwall)
[06:57:59] 10netops, 06Infrastructure-Foundations, 06SRE: cr3-ulsfo flapping on July 14 - https://phabricator.wikimedia.org/T370048#9984008 (10ayounsi) 05Open→03Resolved a:03ayounsi Closing this task in favor of {T364092}.
[07:01:18] 10netops, 06Infrastructure-Foundations, 06SRE: Upgrade core routers to Junos 22.4R3 - https://phabricator.wikimedia.org/T364092#9984011 (10ayounsi) There has been a spike of CPU usage on cr1-eqiad (with no impact), not sure if just a coincidence.
[08:29:01] 06Traffic, 10conftool: Allow integrating requestctl rules into haproxy - https://phabricator.wikimedia.org/T369606#9984187 (10Joe) There's an interesting problem to manage with haproxy, which is making me think we should support a much simplified syntax. Let's say the user wants to disable all traffic that: *...
[08:51:20] 06Traffic, 06collaboration-services, 06Release-Engineering-Team, 06SRE, 13Patch-For-Review: implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#9984285 (10Jelto)
[08:53:22] 06Traffic, 06collaboration-services, 06Release-Engineering-Team, 06SRE, 13Patch-For-Review: implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#9984291 (10Jelto) I migrated the GitLab hosts to nftables which unblocks us using nftables built-in...
[08:53:35] 06Traffic, 06collaboration-services, 06Release-Engineering-Team, 06SRE, 13Patch-For-Review: implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#9984292 (10Jelto)
[08:55:55] 06Traffic, 06collaboration-services, 06Release-Engineering-Team, 06SRE, 13Patch-For-Review: implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#9984312 (10Jelto)
[09:54:57] 06Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9984500 (10Vgutierrez) a:03Vgutierrez I'm taking a look today and I'll report back, sorry about the delay
[12:15:41] 06Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9984938 (10Vgutierrez) As mentioned in Slack, the CDN enforces a max cap on the TTL of 24 hours, something that is not being triggered on /w/load...
[12:39:48] 06Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9984975 (10CDanis) @Vgutierrez Out of curiosity, is a 304 response the only way to produce an x-cache of `miss, hit/X` ?
[12:42:11] 06Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9984998 (10Vgutierrez) >>! In T366517#9984975, @CDanis wrote: > @Vgutierrez Out of curiosity, is a 304 response the only way to produce an x-cac...
[13:14:23] vgutierrez: ah of course, I forgot that ATS-misses will show as cached hits in Varnish on second serving
[13:14:25] thanks
[13:14:37] cdanis: yes, unless you restart varnish in the middle
[13:14:43] so varnish is cold and ATS is warm
[13:14:46] yeah
[13:15:03] is a 304 response status with ats-miss/varnish-hit always a revalidation?
[13:15:40] I think so
[13:15:54] basically I'm idly wondering if there's an easy way to get some observability in webrequest/analytics around the case you highlighted
[13:15:57] I'm leaning towards adding a hit-refresh cache-status
[13:16:03] 👍
[13:16:15] it would be really helpful when staring at TTFB data
[13:16:18] yes
[13:16:32] sometimes you get hit-local TTFBs of ~5ms
[13:16:36] and it might also be good to have a field in webrequest for when the client sent etag or IMS
[13:16:46] like in x-analytics perhaps
[13:16:49] and sometimes you get hit-local TTFBs of ~160ms
[13:16:54] 😅
[13:17:28] X-cache says the same, but one is a local response from ATS and the other one is a 304 from the applayer
[13:18:10] yeah
[13:18:44] (in general I'm kind of wondering if we want to / have space to record every header the client sends, but that's another discussion)
[13:21:59] what's the client here?
[13:22:09] whatever connects to HAProxy?
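(Editor's aside on the X-Cache exchange above: Wikimedia's X-Cache response header carries one "host status[/hits]" entry per cache layer the request traversed, backend-most layer first, so `miss, hit/X` reads as an ATS miss followed by a Varnish hit. Below is a minimal parsing sketch in Python; the header layout follows the discussion, but `classify()` and the "hit-refresh?" label it emits are hypothetical, mirroring vgutierrez's proposed cache-status rather than any shipped value.)

    import re
    from dataclasses import dataclass

    # One "host status[/hits]" entry per cache layer, backend-most first,
    # e.g. "cp1066 miss, cp1068 hit/3" (layer order as discussed above).
    ENTRY = re.compile(r"(?P<host>\S+)\s+(?P<status>hit|miss|pass|int)(?:/(?P<hits>\d+))?")

    @dataclass
    class CacheHop:
        host: str
        status: str
        hits: int | None

    def parse_x_cache(header: str) -> list[CacheHop]:
        hops = []
        for part in header.split(","):
            m = ENTRY.match(part.strip())
            if m:
                hops.append(CacheHop(m["host"], m["status"],
                                     int(m["hits"]) if m["hits"] else None))
        return hops

    def classify(hops: list[CacheHop]) -> str:
        # Hypothetical: separate a true edge hit from the case above, where
        # Varnish serves a body it cached after ATS revalidated (304) against
        # the applayer -- vgutierrez's proposed "hit-refresh" cache-status.
        statuses = [h.status for h in hops]
        if statuses == ["miss", "hit"]:
            return "hit-refresh?"
        if statuses and all(s == "hit" for s in statuses):
            return "hit-local"
        return ",".join(statuses)

    print(classify(parse_x_cache("cp1066 miss, cp1068 hit/3")))  # hit-refresh?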
[13:22:11] 06Traffic, 10Observability-Metrics, 13Patch-For-Review: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657#9985179 (10fgiunchedi)
[13:49:34] 06Traffic, 10conftool: Allow integrating requestctl rules into haproxy - https://phabricator.wikimedia.org/T369606#9985449 (10CDanis) >>! In T369606#9984187, @Joe wrote: > because, contrary to what the documentation suggests, it's not possible to aggregate logical expressions in conditions. > > The best way t...
[13:49:46] vgutierrez: yes, that's what I was imagining
[14:11:29] 06Traffic, 10conftool: Allow integrating requestctl rules into haproxy - https://phabricator.wikimedia.org/T369606#9985617 (10CDanis) As @Fabfur points out, in haproxy 3.0+ (but not haproxy 2.8.x) we have the option of evaluating many ACLs together with negation, as part of fetching samples. https://docs.ha...
[14:46:45] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9985956 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=36afd2cf-508d-4c02-a8cc-afb66ea29242) set...
[15:01:25] FIRING: SystemdUnitFailed: anycast-healthchecker.service on durum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:01:49] yes known
[15:07:22] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9986058 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=81c0aaa1-44d2-4d05-942a-66bcdfb90d2d) set...
[15:08:26] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9986071 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=58bc700a-b84d-4058-9776-9f6510239089) set...
[15:16:25] RESOLVED: SystemdUnitFailed: anycast-healthchecker.service on durum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:26:32] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9986154 (10cmooney) Upgrade completed, all hosts back online and pinging ok. Thanks all for the assistance!
[15:30:05] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9986188 (10ABran-WMF) dbstore1009 has replication up to date on all 3 instances; all 3 other nodes are repooling ↑
[15:31:47] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9986200 (10MatthewVernon) Swift looks good, thanks.
[16:13:49] 06Traffic, 10GitLab (Project Migration), 13Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623#9986510 (10hashar)
[16:15:33] 06Traffic, 10GitLab (Project Migration), 13Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623#9986517 (10hashar) I have marked https://gerrit.wikimedia.org/r/admin/repos/operations/software/varnish/libvmod-querysort,general read-only in Gerrit
[18:16:26] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9987104 (10Papaul)
[18:38:26] 06Traffic, 06collaboration-services, 06SRE, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#9987283 (10brennen)
[19:21:32] 06Traffic: Clean up Varnish VCL - https://phabricator.wikimedia.org/T370200 (10BCornwall) 03NEW
[19:25:39] 06Traffic: Fix Varnish tests - https://phabricator.wikimedia.org/T370202 (10BCornwall) 03NEW
[19:25:56] 06Traffic: Fix Varnish tests - https://phabricator.wikimedia.org/T370202#9987619 (10BCornwall) p:05Triage→03Medium
[19:26:15] 06Traffic: Clean up Varnish VCL - https://phabricator.wikimedia.org/T370200#9987627 (10BCornwall) p:05Triage→03Medium
[19:35:01] sukhe: interestingly, it seems that gdnsd is still handing out IPs for appservers-ro.discovery.wmnet, even though both DCs are marked DOWN
[19:35:24] dns1004:~$ cat /var/lib/gdnsd/discovery-appservers-ro.state
[19:35:24] 10.2.1.1 => DOWN/300
[19:35:24] 10.2.2.1 => DOWN/300
[19:35:31] I would not have expected that :)
[19:48:16] specifically, if both are DOWN, it seems to revert to just geoip behavior (i.e., hands out the nearest option, per the discovery-map)
[19:50:54] swfrench-wmf: that's the IP for appservers.discovery.wmnet
[19:51:05] 1.1 is codfw, 2.1 is eqiad
[19:51:23] it always has to hand out something
[19:51:36] for an A/A disc record, it's going to be one of the two, even if both are down
[19:51:57] for A/P, IIRC, we set it up so that if both are down, it uses the failoid IP (which is basically an unresponsive IP)
[19:53:03] RhinosF1: yes, indeed, these are the LVS service IPs (i.e., appservers.svc.{dc}.wmnet)
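(Editor's aside: to make the all-down behavior above concrete, here is a toy model of what gets handed out in each mode. The real logic lives in gdnsd's geoip/metafo plugins, not in code like this, and the failoid address below is a placeholder, not the real one.)

    # Toy model of discovery resolution as described above, not gdnsd itself.
    FAILOID = "192.0.2.1"  # placeholder for the unresponsive failoid IP

    def resolve(mode, up, nearest, addrs):
        """mode: "A/A" or "A/P"; up: dc -> bool; nearest: DCs in
        geoip/discovery-map preference order for this client."""
        for dc in nearest:
            if up[dc]:
                return addrs[dc]      # normal case: nearest healthy DC
        if mode == "A/P":
            return FAILOID            # A/P all-down: hand out failoid
        # A/A all-down is assumed to be a mistake: fall back to plain
        # geoip ordering and hand out the nearest option anyway.
        return addrs[nearest[0]]

    addrs = {"codfw": "10.2.1.1", "eqiad": "10.2.2.1"}
    both_down = {"codfw": False, "eqiad": False}
    print(resolve("A/A", both_down, ["eqiad", "codfw"], addrs))  # 10.2.2.1
    print(resolve("A/P", both_down, ["eqiad", "codfw"], addrs))  # 192.0.2.1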
[19:53:49] A/A and A/P meaning active/active and active/passive
[19:53:56] which is the active_active flag in the discovery metadata
[19:53:59] swfrench-wmf: we have it from the source now :)
[19:54:07] if you want it to go nowhere, you can change it to active_passive I guess
[19:54:11] bblack: we were wondering about this the other day and hence
[19:54:21] (but nowhere will still be an IP address, just not a responsive one)
[19:54:27] bblack: exactly, yes - falling back to failoid for the a/p case is pretty clear from the configs, but we weren't sure what it was going to do in the a/a case where there's no fallback :)
[19:54:51] for the A/A case, the assumption is that there's always at least one working, and that all-down is a mistake
[19:55:28] cool, that makes sense - I'll update the wikitech page to make the distinction clearer in the failure cases section
[19:55:33] whereas A/P cases usually involve some critical state-handling that needs switching via break-then-make, so you need to be able to set both down in between switching from one to the other.
[19:56:04] (break-then-make like you see in the electrical world: a switch that has to turn off A before it can turn on B, because A and B can't ever touch)
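(Editor's aside: a toy guard illustrating the break-then-make rule just described, where per-DC, one-flag-at-a-time updates plus a "never both UP" check force an A/P switchover through a DOWN+DOWN blackout, like the open-transition transfer switch. This is an illustration only, not the actual confd check.)

    class ActivePassiveGuard:
        """Toy enforcement of the A/P rules discussed here: never both UP,
        and no direct UP/DOWN swap without a DOWN/DOWN ("open transition")
        state in between."""

        def __init__(self, state):
            self.state = dict(state)  # e.g. {"eqiad": "UP", "codfw": "DOWN"}

        def set(self, dc, value):
            proposed = dict(self.state, **{dc: value})
            ups = [d for d, v in proposed.items() if v == "UP"]
            if len(ups) > 1:
                raise ValueError("A/P record cannot have both DCs UP")
            # Because only one DC changes per call, a swap from UP+DOWN to
            # DOWN+UP necessarily passes through DOWN+DOWN: you must break
            # the old side before you can make the new one.
            self.state = proposed

    g = ActivePassiveGuard({"eqiad": "UP", "codfw": "DOWN"})
    # g.set("codfw", "UP")   # would raise: both UP at once
    g.set("eqiad", "DOWN")   # break: DOWN+DOWN blackout window
    g.set("codfw", "UP")     # make: switchover complete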
[19:57:50] great, yeah that makes a lot of sense given the specific context around a/p services, also good tip re: potentially switching to a/p in order to get failoid behavior
[19:57:56] https://en.wikipedia.org/wiki/Transfer_switch#Types <- I guess wikipedia says the correct term is an open switch vs a closed switch. The A/P discovery entries are like an open-style switch.
[19:58:34] swfrench-wmf: I doubt the transition from a/a to a/p is simple though. I don't think anyone has ever run through a procedure for it :)
[19:58:59] it moves it from one config file to another, and the names probably can't conflict, etc.... I'm sure there be dragons.
[19:58:59] heh, yeah as I wrote that, I was like "hmmm ... I wonder what that would look like in practice" :)
[19:59:56] maybe easier to craft a ferm rule to reject the traffic on the one remaining host in each DC, heh
[20:00:08] lemme dig through what the templating looks like a little bit
[20:01:07] well, the state labels are compatible, there's something nice
[20:01:21] it might actually work out ok
[20:02:36] sukhe: you'd probably want to carefully control an attempt at it. maybe set it to active_active=>false with puppet disabled on the dns nodes, then try running on one, or something
[20:02:37] yeah, IIRC you end up with exactly the same geo-resources entry, it's just that it gets wrapped in a metafo one elsewhere
[20:02:51] in theory, it should just flip from one file to another in the puppet diff and then everything reloads ok
[20:03:13] but "in theory" is basically the SRE equivalent of "Hold my beer and watch this"
[20:03:19] lol
[20:04:29] lucky for you, oncall switched to batphone 4 minutes ago, so everyone gets paged :)
[20:07:40] I'll start with baby steps and see what the PCC diffs look like :)
[20:07:51] but in seriousness, I
[20:07:59] 'll take no action on this this afternoon
[20:08:03] there is the issue of matching mock entries for dns CI
[20:08:26] I don't think you can align that perfectly, but so long as nobody's running authdns-update between the two commits, should be ok? :)
[20:08:50] oh, that's a good point! I forgot about the mocks in operations/dns
[20:09:20] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/utils/mock_etc/
[20:09:32] basically make a matching commit to move the file there too.
[20:10:08] tell people to hold authdns-update and netbox-driven changes for a bit: try the puppet part and push that everywhere if it works, then push + authdns-update the matching change in mock_etc
[20:10:29] maybe?
[20:10:50] oh there's also the change to the zonefile itself...
[20:11:19] so!
[20:11:49] it's 3 dns changes in sequence: add the new entry in mock_etc's metafo file, switch the zonefile entry to a/p style, then remove the old one from mock_etc geo file
[20:11:53] something like that
[20:14:34] 06Traffic, 10MW-on-K8s, 06serviceops, 06SRE, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9987802 (10Scott_French) Current status: * appservers-rw and api-rw are depooled everywhere, and resolve to failoid as of 17:45 UTC * api-ro is serving only...
[20:14:46] let me make some mock commits, that will be simpler
[20:16:16] https://gerrit.wikimedia.org/r/c/operations/dns/+/1054658 (and the two that immediately follow, pushed together)
[20:16:38] I /think/ each of these 3 commits will pass DNS CI one by one.
[20:16:50] and taken as a whole, the combined 3 commits get you from one state to another
[20:17:05] but there is no clean way to switch this + the puppet part, without asking everyone to hold authdns-update/netbox
[20:17:46] basically you'd push the puppet part, then merge all 3 commits and run a single authdns-update for the combination. I think that will work. If anyone does anything else between the two, it probably breaks.
[20:18:34] indeed, that seems to have placated CI :)
[20:19:18] got it, yeah that makes sense re: sequencing, and the window where no conflicting commits should be allowed
[20:19:33] all of this seems to make a strong argument that perhaps we should structure the a/a entries differently
[20:20:08] I'm pretty sure the way it's structured now, if you turn on both sides of an A/P disc, it will work like an A/A.
[20:20:29] I think the only thing that prevents that is the confd check
[20:20:33] which is already a sort of footgun
[20:20:38] yeah maybe
[20:20:43] I didn't look for it
[20:21:11] but we could do the same metafo->geoip + failoid setup for both
[20:21:40] and use just confd enforcement of rules to differentiate them at the gdnsd level (A/A can't have both down, A/P can't have both up)
[20:21:52] and then it would be far easier to flip that flag in a situation like this
[20:22:37] ah, that's a neat idea
[20:24:01] arguably for A/P confd should also additionally enforce: can't transition directly from UP+DOWN to DOWN+UP. has to transition through a DOWN+DOWN state on the way.
[20:27:17] hmmmm ... from the looks of the way confd at least wields its etcd v2 client, it could in theory see coalescing if you transition too quickly
[20:28:05] but that's arguably holding it wrong, of course, if the intention is to have a blackout period of sorts :)
[20:28:44] yeah you're supposed to have some downtime in the middle, while you do the last of your data syncing and then flip mysql masters or whatever equivalent operation to ensure clean transitions of stateful things.
[20:30:00] hopefully etcd can make a bit flip visible in less time than that takes, I guess!
[22:12:44] 06Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9988347 (10Jdlrobson) Could varnish be behaving differently on the mobile domain?
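(Editor's aside on the coalescing concern at [20:27:17]: a toy simulation, plain Python with no etcd or confd involved, of how a polling observer can miss a short-lived DOWN+DOWN window entirely; this is why the blackout period needs to outlast the watcher's effective sampling interval.)

    import threading
    import time

    state = ("UP", "DOWN")   # shared A/P state: (eqiad, codfw)
    seen = []

    def watcher(poll_interval, duration):
        # A polling observer only sees the state at sample times; any value
        # that comes and goes between two polls is coalesced away.
        end = time.monotonic() + duration
        while time.monotonic() < end:
            if not seen or seen[-1] != state:
                seen.append(state)
            time.sleep(poll_interval)

    t = threading.Thread(target=watcher, args=(0.05, 0.5))
    t.start()
    time.sleep(0.1)
    state = ("DOWN", "DOWN")  # blackout window opens...
    time.sleep(0.01)          # ...but is far shorter than the poll interval
    state = ("DOWN", "UP")
    t.join()
    print(seen)  # typically [("UP", "DOWN"), ("DOWN", "UP")]: the DOWN+DOWN
                 # state was never observed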
[22:15:24] just checked PCC diffs for a hypothetical switch to `active_active: false`, and indeed this all seems like it should work together with the example dns patches b.black posted / sequencing - at least in theory :)
[22:15:24] definitely not going to touch this for now, heh
[22:15:24] thanks again, b.black!
[22:15:33] oh and I updated https://wikitech.wikimedia.org/wiki/DNS/Discovery#Failure_scenario
[22:53:25] FIRING: SystemdUnitFailed: update-public-suffix-list.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:13:25] RESOLVED: SystemdUnitFailed: update-public-suffix-list.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:14:25] FIRING: SystemdUnitFailed: update-public-suffix-list.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:33:40] RESOLVED: SystemdUnitFailed: update-public-suffix-list.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
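(Editor's closing footnote on the discovery thread: a hedged sketch of checking from the outside what the authoritative servers actually hand out for a discovery record, using dnspython. The nameserver IP below is a placeholder, in practice you would query one of the prod authdns hosts, e.g. dns1004, from inside the network, and the IP-to-DC map comes from the conversation above.)

    import dns.resolver  # dnspython (pip install dnspython)

    # IP -> DC mapping for appservers-ro, per the conversation above.
    DC_BY_IP = {"10.2.1.1": "codfw", "10.2.2.1": "eqiad"}

    def discovery_answer(name, authdns_ip):
        """Query an authoritative server directly, bypassing recursor caches."""
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [authdns_ip]
        return [rr.to_text() for rr in r.resolve(name, "A")]

    # "198.51.100.53" is a placeholder: substitute a real authdns host's IP.
    for ip in discovery_answer("appservers-ro.discovery.wmnet", "198.51.100.53"):
        print(ip, "->", DC_BY_IP.get(ip, "not a known DC (failoid?)"))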