[07:19:15] Traffic, SRE, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm)
[07:22:40] Traffic, Content-Transform-Team-WIP, Mobile-Content-Service, RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (Jgiannelos) I think for wikiwand we only allow requests based on referer should we add or replace the rule with the user ag...
[08:55:55] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[09:10:25] netops, DC-Ops, Infrastructure-Foundations, SRE: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (cmooney) p:Triage→High
[09:20:38] Traffic, netops, DC-Ops, Infrastructure-Foundations, SRE: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (Volans) Adding #traffic for awareness.
[09:20:43] Traffic, netops, DC-Ops, Infrastructure-Foundations, SRE: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (Volans)
[09:39:26] Traffic, Observability-Alerting, SRE, collaboration-services, serviceops-radar: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (fgiunchedi) Open→Resolved a:fgiunchedi >>! In T341039#8995349, @Aklapper wrote: > Hmm. The problem //could// be rel...
[10:05:56] hnowlan: so, ATS considers the response provided by api-gateway.discovery.wmnet as non-cacheable
[10:06:03] [Jul 10 09:57:44.666] [ET_NET 58] DEBUG: (http_trans) [33094] [is_response_cacheable] config doesn't allow storing, and cache control does not say to ignore no-cache and does not specify never-cache or a ttl
[10:06:03] [Jul 10 09:57:44.666] [ET_NET 58] DEBUG: (http_trans) [33094] [hcoofsr] response is not cacheable
[10:06:03] [Jul 10 09:57:44.666] [ET_NET 58] DEBUG: (http_trans) [33094] [hcoofsr] response code: 200
[10:06:40] * vgutierrez reading source code ATM
[10:10:29] vgutierrez: does the api.wikimedia.org cache::alternate_domains rule saying "pass" come into play here? Wondering if connection reuse or something might be causing it to apply the rule for all requests to that origin
[10:11:08] hnowlan: request here is against wikimedia.org, not api.wm.o
[10:11:24] yeah, just wondering if there's some kind of side-effect
[10:11:47] doesn't make sense but it's the only exception I can think of to the setup
[10:13:38] hnowlan: a pass there would definitely trigger a cache miss
[10:13:48] but that should also happen on cp2039
[10:13:54] ahh right
[10:14:02] and that's per vhost, and definitely wikimedia.org != api.wikimedia.org
[10:19:43] hnowlan: arg, I think you're right: dest_host=api-gateway.discovery.wmnet action=never-cache
[10:20:13] api.wikimedia.org is also hosted on api-gateway.discovery.wmnet?
[10:20:33] yeah :(
[10:20:37] so we have a problem there
[10:21:56] * vgutierrez checking cache.config syntax
[10:32:48] so cache.config lets us specify cache config rules (currently used to enforce the caching: 'pass' directive) based on the host or URL of the request
[10:32:58] that match happens after remap is done
[10:34:07] so at that point we can't tell between requests against api.wikimedia.org and wikimedia.org/api/rest_v1/metrics/unique-devices/(.+) based on the host
[10:37:13] ahh, dang
[10:41:08] we have some workarounds that can be done at ATS level
[10:41:47] especially if we are going to a scenario where public hostname and private hostname mapping isn't 1:1 anymore
[10:43:15] What do the workarounds look like?
[10:44:21] avoid using cache.config and perform something similar with lua code based on the pristine URL rather than the host after remapping is performed
[10:48:34] _joe_: maybe you have some input here for us?
[10:48:52] * _joe_ reads backlog
[10:49:56] current cache_reqhandling (https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/cache/text.yaml#L10) implementation for ATS assumes that backend servers won't have conflicting policies regarding cache behavior
[10:50:13] this was true till last week apparently
[10:50:15] <_joe_> So my 2c would be
[10:50:43] <_joe_> let the api gateway specify caching headers
[10:50:58] <_joe_> specifically, right now *almost everything* is non-cacheable
[10:51:19] <_joe_> but we might soon have a mechanism to purge the CDN for the api gateway for specific urls
[10:51:30] <_joe_> and in those cases we should actually cache them
[10:53:55] yeah, ATS would observe a no-cache header from the API gateway AFAIK
[10:56:38] but then we should trust that api-gateway.wm.o won't mess that up :)
[10:56:52] actually api-gateway.d.w
[10:57:40] aaand that would be an issue with varnish AFAIK, cause right now cache::req_handling is also used to enforce the same at varnish level AFAIK
[10:57:44] AFAIK && IIRC
[10:58:07] yeah
[10:58:07] modules/profile/manifests/cache/varnish/frontend.pp: Profile::Cache::Sites $req_handling = lookup('cache::req_handling'),
[11:02:20] so I'd feel more comfortable with varnish and ATS having the same behavior regarding cache::req_handling TBH
[11:03:07] <_joe_> I am not sure why we'd need it to be different in my hypothesis
[11:03:16] <_joe_> I'm proposing to send no-cache headers from the api gw
[11:03:21] <_joe_> which SREs control anyways
[11:03:25] <_joe_> and to add
[11:03:33] <_joe_> caching: 'normal' for api.wikimedia.org
[11:03:39] _joe_: it's different right now
[11:04:27] varnish enforces a cache pass based on the Host header sent by the user and ATS based on the backend server hostname
[11:04:39] and that's why we are having issues ATM
[11:04:43] <_joe_> ok, I see
[11:05:01] your approach avoids hitting that issue
[11:05:11] <_joe_> yeah, well, not completely
[11:05:27] <_joe_> it does in this case because wikimedia.org is 'normal', right?
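The mismatch described at 11:04:27 (varnish keys the 'pass' policy on the client's Host header, ATS keys it on the backend hostname after remap) can be sketched roughly as below. This is only an illustrative model, not ATS or Varnish code: the remap table, the function names and the `other-origin.example` origin are all made up for the example, and the real remap rules may look different.

```python
# Illustrative model only; not real ATS or Varnish code.
from typing import Dict, List, Tuple

# cache::req_handling-style policy keyed by public hostname. As in the chat,
# api.wikimedia.org is 'pass' and anything not declared is 'normal'.
REQ_HANDLING: Dict[str, str] = {"api.wikimedia.org": "pass"}

# Hypothetical remap rules: (public host, path prefix, origin host).
REMAP: List[Tuple[str, str, str]] = [
    ("api.wikimedia.org", "/", "api-gateway.discovery.wmnet"),
    ("wikimedia.org", "/api/rest_v1/metrics/unique-devices/",
     "api-gateway.discovery.wmnet"),
    ("wikimedia.org", "/", "other-origin.example"),  # made-up origin name
]


def remap(host: str, path: str) -> str:
    """Return the origin for the first rule matching host and path prefix."""
    for rule_host, prefix, origin in REMAP:
        if host == rule_host and path.startswith(prefix):
            return origin
    raise LookupError(f"no remap rule for {host}{path}")


def policy_by_client_host(host: str) -> str:
    """Varnish-style evaluation: look at the Host header the user sent."""
    return REQ_HANDLING.get(host, "normal")


def policy_by_origin_host(host: str, path: str) -> str:
    """ATS cache.config-style evaluation: match on the host after remap."""
    pass_origins = {
        remap(public_host, "/")
        for public_host, policy in REQ_HANDLING.items()
        if policy == "pass"
    }
    return "pass" if remap(host, path) in pass_origins else "normal"


url_path = "/api/rest_v1/metrics/unique-devices/example"
print(policy_by_client_host("wikimedia.org"))            # client-host view
print(policy_by_origin_host("wikimedia.org", url_path))  # origin-host view
```

In this model the two evaluations disagree only for the path that shares an origin with a 'pass' vhost, which is the situation the chat arrives at. Evaluating on the pristine (pre-remap) URL, as suggested at 10:44:21, would correspond to using `policy_by_client_host` on both layers.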
[11:05:40] anything not declared there == normal
[11:06:42] IMHO we should unify varnish and ATS evaluation criteria for that config
[11:07:43] and also we should clarify if api/rest_v1/metrics/unique-devices/(.+) responses served by api-gateway should be cached or not
[11:08:01] <_joe_> heh yeah
[11:08:06] <_joe_> sorry, going to lunch now
[11:08:52] hnowlan: I don't know what your team thinks about not being able to PURGE those URLs and caching them
[11:09:58] we have several options for that, api-gateway setting a no-cache header or s-maxage=0 or even leverage gateway-check.lua and just set those as non-cacheable there
[11:12:21] vgutierrez: I don't think not being able to purge would be an issue, assuming we're keeping cache-control behaviours as expected
[11:12:28] and I think the gateway defaulting to sending a no-cache would be fine
[11:13:02] with exceptions for things like unique-devices ofc
[11:15:59] happy to move that logic to api-gateway considering that we won't be able to filter those based on host header for restbase replacements
[11:17:56] I'll submit a phab task for our side of the work (unifying varnish and ATS criteria) after taking care of the dogs :9
[11:19:50] thanks!
[11:23:21] For your scope as soon as api-gateway sends the proper headers you can drop the api.wm.org reqhandling entry and ats should cache those requests as expected
[12:07:45] Traffic: Unify host evaluation criteria on cache::req_handling for Varnish and ATS - https://phabricator.wikimedia.org/T341461 (Vgutierrez)
[12:08:25] hnowlan: ^^ that's the task for our side
[12:14:04] Traffic: Unify host evaluation criteria on cache::req_handling for Varnish and ATS - https://phabricator.wikimedia.org/T341461 (Vgutierrez) p:Triage→Medium
[12:17:44] <_joe_> vgutierrez: uh wait, why not be able to purge?
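The header options floated at 11:09:58 (api-gateway sending no-cache or s-maxage=0, with exceptions such as unique-devices sending a real TTL) could be checked with something like the sketch below. It is a deliberately simplified reading of Cache-Control for a shared cache and not how ATS or Varnish decide cacheability: heuristic freshness is ignored, and no-cache is treated as "do not serve from cache" for brevity. The function names are invented for the example.

```python
# Simplified sketch; not the cacheability logic of ATS or Varnish.
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control header value into {directive: argument or None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def shared_cache_can_serve(value: str) -> bool:
    """True if a shared cache could reuse the response without revalidating."""
    directives = parse_cache_control(value)
    if any(d in directives for d in ("no-store", "no-cache", "private")):
        return False
    # s-maxage takes precedence over max-age for shared caches.
    if "s-maxage" in directives:
        ttl = directives["s-maxage"]
    else:
        ttl = directives.get("max-age")
    if ttl is None:
        return False  # no explicit TTL; heuristic caching is out of scope here
    try:
        return int(ttl) > 0
    except ValueError:
        return False


for header in ("no-cache", "s-maxage=0", "public, s-maxage=86400"):
    print(header, "->", shared_cache_can_serve(header))
```

Under these assumptions a gateway default of no-cache or s-maxage=0 keeps responses out of the CDN, while an endpoint that sets a positive s-maxage becomes cacheable once the 'pass' entry is dropped.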
[12:18:04] Traffic, MW-on-K8s, SRE, serviceops, and 3 others: Direct 0.5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341078 (Clement_Goubert) In progress→Resolved
[12:18:13] _joe_: wdym?
[12:18:16] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[12:18:44] <_joe_> vgutierrez> hnowlan: I don't know what your team thinks about not being able to PURGE those URLs and caching them
[12:18:57] _joe_: I thought you said that api-gateway based services currently aren't able to PURGE the CDN
[12:19:04] maybe I misread that
[12:19:19] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (Clement_Goubert)
[12:19:22] <_joe_> oh yeah, kamila has the tool we'd need almost ready
[12:19:36] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[12:19:48] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (Clement_Goubert) Open→In progress p:Triage→High
[12:41:43] netops, Infrastructure-Foundations, SRE: BFD flapping from cloudsw1-c8-eqiad (QFX5100) - https://phabricator.wikimedia.org/T341466 (cmooney) p:Triage→Medium
[12:54:36] Traffic, netops, DC-Ops, Infrastructure-Foundations, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (RobH) Order Number - 1-228138359365 entered for remote hands to power cycle the device and reply back to the ticket to let us...
[13:18:05] Traffic, SRE: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ssingh)
[13:47:40] Traffic, SRE, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host dns1005.wikimedia.org with OS bullseye
[14:13:30] Traffic, netops, DC-Ops, Infrastructure-Foundations, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (cmooney) Equinix came back and said they rebooted. Device is reachable again: ` cmooney@mr1-eqsin> show system uptime Curre...
[14:23:31] Traffic, SRE, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host dns1005.wikimedia.org with OS bullseye completed: - dns1005 (**WARN**) - Removed from Puppet an...
[14:49:27] Traffic, SRE, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1006.wikimedia.org with OS bullseye
[14:59:05] Traffic, netops, DC-Ops, Infrastructure-Foundations, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (cmooney) p:High→Medium Device remains healthy after over an hour. In terms of what caused the initial problem the lo...
[15:25:19] Traffic, SRE: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1006.wikimedia.org with OS bullseye completed: - dns1006 (**PASS**) - Removed from Puppet and PuppetDB if present...
[16:02:28] netops, Infrastructure-Foundations, SRE: BFD flapping from cloudsw1-c8-eqiad (QFX5100) - https://phabricator.wikimedia.org/T341466 (cmooney) Open→Resolved Session to cloudlb1001 is stable after over an hour so think this is good to close now with the fix of using longer timers ` cmooney@cloud...
[16:07:05] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (bd808) Is there any particular reason that the "[ ] Wikitech is ideal to dogfood mw-on-k8s, there are challenges though that we need to over come T292707" step w...
[16:39:56] Traffic, Content-Transform-Team-WIP, Mobile-Content-Service, RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (Jgiannelos) Resolved→Open
[16:42:29] Traffic, Content-Transform-Team-WIP, Mobile-Content-Service, RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (Jgiannelos) From comms with wikiwand: It seems User-Agent and Api-User-Agent (for client-side requests) are ignored, can y...
[16:52:06] Traffic, Content-Transform-Team-WIP, Mobile-Content-Service, RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (akosiaris) `Wikiwand/0.1 (https://www.wikiwand.com; admin@wikiwand.com)` added to the list of user-agents. Please advise if...
[18:14:04] Traffic, ops-eqiad: eqiad dns100[1-3] unified decommission task - https://phabricator.wikimedia.org/T341507 (ssingh)
[18:46:20] Traffic, SRE, ops-eqiad: eqiad dns100[1-3] unified decommission task - https://phabricator.wikimedia.org/T341507 (ssingh)
[18:47:06] Traffic, SRE, ops-eqiad: eqiad dns100[1-3] unified decommission task - https://phabricator.wikimedia.org/T341507 (ssingh) The hosts have been decommissioned and are ready for the hardware part.
[18:51:03] Traffic, SRE: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns[1002-1003].wikimedia.org` - dns1002.wikimedia.org (**WARN**) - Downtimed host on Icinga/Alertmanager - Found ph...
[18:54:42] Traffic, SRE: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ssingh)
[18:55:09] Traffic, SRE: Reduce toil in provisioning and decommissioning of DNS/NTP servers by automating generation of resolv.conf and NTP peers - https://phabricator.wikimedia.org/T340479 (ssingh)
[18:55:29] Traffic, SRE: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ssingh) In progress→Resolved Traffic has commissioned these boxes. Many thanks to dc-ops!