[08:50:35] Traffic, Observability-Alerting, SRE, serviceops: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (fgiunchedi) I have extracted the `maniphest.edit` event duration from phab1004 access log, and on the 29th the operation started to take a whole lot longer: ` 2...
[08:51:39] Traffic, Observability-Alerting, SRE, serviceops: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (fgiunchedi) @brennen I saw your updates to phab in SAL, does the above (`maniphest.edit` taking a lot longer to create tasks) ring a bell?
[09:16:12] (LVSHighRX) firing: Excessive RX traffic on lvs6001:9100 (enp175s0f0np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs6001 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[09:21:12] (LVSHighRX) resolved: Excessive RX traffic on lvs6001:9100 (enp175s0f0np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs6001 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[09:24:23] were those alert supposed to page? they didn't....
[09:25:03] soft page? aka just on IRC?
[09:35:09] mmhh I'm taking a look, they are severity=page and should page
[09:37:38] ok the alerts are team=traffic severity=page and there isn't a route for that, will add the route
[09:40:04] https://gerrit.wikimedia.org/r/c/operations/puppet/+/935990 for your eyes
[09:57:15] {{done}}
[09:57:51] cheers godog
[10:00:38] sure np!
[10:03:27] oh, I didn't get paged for that one
[10:06:02] yeah it was a missing route, I've fixed that now
[10:31:23] mmmh wouldn't be weird to have those alert in this channel and page but not in operation?
[10:31:34] akosiaris: you're not oncall so you shouldn't have been paged anyway :-P
[11:00:07] ah, yes, it's Thursday
[11:16:13] I'm the only one thinking that if an alert it's paging it should not be team-specific but in the sre group and alert in -operations on irc? At least as long as we have a single flat oncall for everything
[11:17:29] IIRC when c.danis defined that alert he intended to keep it strictly IRC only
[11:17:48] and that's why it was restricted to -traffic rather than -operations
[11:22:03] sure, I'm talking only about the page version of it
[11:23:58] vgutierrez: if you have some time today I'd like to try deploying the script again :) https://gerrit.wikimedia.org/r/c/operations/puppet/+/935464 different service this time, similarly low-traffic. already tested the incoming/outgoing URLs as far as restbase-compliant patterns are concerned
[11:24:44] hnowlan: nice :)
[11:28:37] commented with some concerns
[11:29:19] yep, good point. Same problem as existed with the rest-gateway I think
[11:29:47] yeah, you might want to cover all the SANs included in restbase/unified cert
[11:29:58] otherwise the lua code needs to be patched
[11:30:09] to validate Host header
[11:30:28] and issue a 400 response if an unexpected Host is used
[11:30:53] we don't want ATS resetting connections to the applayer due to TLS validation issues
[11:31:55] connection reuse is quite important considering that everything is TLS between ATS and the applayer and latency can be quite high from ATS to the applayer (consider esams, drmrs and eqsin)
[11:32:17] quite high being >10ms in this context
[11:34:37] oh yeah, definitely. I think the SAN is the way to go. This code isn't complex as such but I would rather it not do more things right now
[11:35:47] ok :)
[11:49:46] vgutierrez: looking a bit more reasonable now. What time would suit? Don't want to keep you from lunch
[12:31:30] hnowlan: having lunch ATM
[12:31:38] hnowlan: give me 15 minutes and I'm ready
[12:40:06] hnowlan: whenever you like :)
[12:51:55] netops, Infrastructure-Foundations, SRE: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (cmooney) p:Triage→Medium
[12:54:29] netops, Infrastructure-Foundations, SRE: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (cmooney)
[12:55:13] netops, Infrastructure-Foundations, SRE: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (cmooney)
[13:24:33] vgutierrez: sorry, ran off for lunch myself - I have a meeting in 6 mins, could we do it in 30 minutes?
[13:24:53] I need to go away at 16:30 to run an errand
[13:25:03] gonna be tight :)
[13:25:06] but no problem
[13:26:14] I'll be quick! :D thanks
[13:56:13] Traffic, SRE, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye
[14:01:16] hnowlan: around? :)
[14:02:49] Traffic, SRE, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye executed with errors: - dns1004 (**FAIL**) - Removed fro...
[14:02:59] Traffic, SRE, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye
[14:04:09] vgutierrez: yep! good to go?
[14:04:16] yep
[14:04:19] cp2027 as usual?
[14:04:26] or it was cp2037?
[14:04:29] * vgutierrez getting old :)
[14:04:41] cp2037!
[14:04:51] I'll depool it now
[14:05:09] lovely
[14:05:19] remember to disable puppet A:cp-text wide
[14:05:33] even A:cp
[14:07:53] ah, I've been doing cp-text - I'll do A:cp
[14:09:24] technically the code gets deployed everywhere
[14:09:42] done
[14:09:46] merging
[14:11:43] running puppet on cp2037
[14:13:56] Traffic, SRE, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye executed with errors: - dns1004 (**FAIL**) - Removed fro...
[14:14:13] Traffic, SRE, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye
[14:14:17] hnowlan: that 301 is expected? :)
[14:14:47] wrong URL :)
[14:15:55] I'm missing something
[14:16:09] getting 404 with my tetss
[14:16:10] *tests
[14:16:53] Minor config tweak was missing from the gateway - looks passing to me now
[14:17:01] yeah
[14:17:03] 200 now
[14:18:02] hnowlan: hmm is it just me or something is off with caching?
[14:18:50] yeah.. for some reason caching isn't working with the new endpoint
[14:19:17] subsequent requests with curl -H "Host: wikimedia.org" -H "X-Forwarded-Proto: https" 127.0.0.1:3128/api/rest_v1/metrics/unique-devices/en.wikipedia.org/all-sites/daily/20230506/20230704 -v -o /dev/null -s 2>&1 |grep "<" |sort
[14:19:17] on cp2039 return a cache-hit
[14:19:21] always a miss in 2037
[14:19:45] https://www.irccloud.com/pastebin/Ah9N21uD/
[14:19:47] VS
[14:20:04] https://www.irccloud.com/pastebin/wQAd8HRz/
[14:20:57] huh, weird... what could the service be returning that would cause that?
[14:21:15] you can hit the service directly at https://device-analytics.discovery.wmnet:4972/metrics/unique-devices/es.wikipedia.org/desktop-site/daily/20160401/20160410 btw
[14:22:41] so in this endpoint restbase doesn't provide an etag
[14:22:45] and the new one does
[14:22:59] hnowlan: it could be related to the additional remap
[14:23:13] and how that impacts cache lookup
[14:23:26] (combined with the URL mangling performed on rb-mw-mangling.lua)
[14:23:48] cache-control header in the new service looks good
[14:25:17] wouldn't be a huge surprise if the remaps mess things up. One option for next time would be to create a more specific match in the ATS config for the service, change urls in the gateway and remove the rb-mw-mangling.lua call entirely
[14:25:22] If you need to go soon I can roll back
[14:26:01] so atslog-backend shows that ATS doesn't attempt to write the response on the cache
[14:26:15] Date:2023-07-06 Time:14:25:38 ConnAttempts:0 ConnReuse:235 TTFetchHeaders:87 ClientTTFB:89 CacheReadTime:0 CacheWriteTime:0 TotalSMTime:89 TotalPluginTime:1 ActivePluginTime:1 TotalTime:89 OriginServer:restbase.discovery.wmnet OriginServerTime:89 CacheResultCode:TCP_MISS CacheWriteResult:FIN ReqMethod:GET RespStatus:200 OriginStatus:200
[14:26:15] ReqURL:http://wikimedia.org/api/rest_v1/metrics/unique-devices/en.wikipedia.org/all-sites/daily/20230506/20230704?vgutierrez=17406 ReqHeader:User-Agent:curl/7.74.0 ReqHeader:Host:wikimedia.org ReqHeader:X-Client-IP:- ReqHeader:Cookie:- BerespHeader:Set-Cookie:- BerespHeader:Cache-Control:s-maxage=14400, max-age=14400 BerespHeader:Connection:- RespHeader:X-Cache-Int:cp2039 miss RespHeader:Backend-Timing:-
[14:26:20] that's cp2039 working as expected
[14:26:27] see CacheWriteResult:FIN
[14:27:02] Date:2023-07-06 Time:14:26:49 ConnAttempts:0 ConnReuse:0 TTFetchHeaders:73 ClientTTFB:92 CacheReadTime:0 CacheWriteTime:0 TotalSMTime:92 TotalPluginTime:0 ActivePluginTime:0 TotalTime:92 OriginServer:api-gateway.discovery.wmnet OriginServerTime:74 CacheResultCode:TCP_MISS CacheWriteResult:- ReqMethod:GET RespStatus:200 OriginStatus:200
[14:27:02] ReqURL:http://wikimedia.org/api/rest_v1/metrics/unique-devices/en.wikipedia.org/all-sites/daily/20230506/20230704 ReqHeader:User-Agent:curl/7.74.0 ReqHeader:Host:wikimedia.org ReqHeader:X-Client-IP:- ReqHeader:Cookie:- BerespHeader:Set-Cookie:- BerespHeader:Cache-Control:s-maxage=14400, max-age=14400 BerespHeader:Connection:- RespHeader:X-Cache-Int:cp2037 miss RespHeader:Backend-Timing:
[14:27:11] cp2037 shows CacheWriteResult:-
[14:27:19] no error, just no cache write operation
[14:28:50] so weird. might there be some other restbase-specific rule elsewhere?
[14:29:26] netops, Infrastructure-Foundations, SRE: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (aborrero)
[14:30:11] hnowlan: nah.. I need to debug this further
[14:30:25] netops, Infrastructure-Foundations, SRE, User-aborrero: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (aborrero)
[14:30:28] doesn't make a lot of sense right now but I need to go afk for a while
[14:30:59] sure. I'll roll back in the mean time?
[14:31:01] netops, Infrastructure-Foundations, SRE, User-aborrero: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (cmooney)
[14:31:05] hnowlan: yep, thanks
[14:31:37] vgutierrez: https://gerrit.wikimedia.org/r/c/operations/puppet/+/935859
[14:32:44] TTL is way lower in the new endpoint
[14:32:47] s-maxage=14400
[14:32:59] but the minimum lifetime set in ATS is 3600
[14:33:00] proxy.config.http.cache.heuristic_min_lifetime: 3600
[14:33:44] and un restbase is also 14400 for this endpoint
[14:33:47] s/un/in/
[14:33:56] * vgutierrez back soon
[14:36:28] reverted and everything's back as it was
[14:37:17] yeah the idea was to keep everything as close to restbase as possible to start
[14:45:47] Traffic, SRE, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye executed with errors: - dns1004 (**FAIL**) - Removed fro...
[14:46:01] Traffic, SRE, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye
[15:22:02] Traffic, SRE, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye completed: - dns1004 (**PASS**) - Removed from Puppet an...
[15:37:31] Traffic, Content-Transform-Team-WIP, Mobile-Content-Service, RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (akosiaris) @MSantos, change deployed today. e.g. https://en.wikipedia.org/api/rest_v1/page/mobile-sections now returns a 40...
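Editor's sketch: the cp2039/cp2037 comparison made by eye in the atslog-backend pastes earlier in the log can also be done mechanically. The snippet below is a minimal illustration, not a tool used in the channel. It assumes the whitespace-separated `Key:value` layout seen in those pastes, and `parse_atslog`, `diff_entries`, and the shortened sample entries are hypothetical names and data introduced only for this example.

```python
import re

# Keys with these prefixes carry a header name after the prefix,
# e.g. "BerespHeader:Cache-Control:s-maxage=14400, max-age=14400".
HEADER_PREFIXES = ("ReqHeader:", "BerespHeader:", "RespHeader:")


def parse_atslog(entry: str) -> dict:
    """Split one atslog-style entry into a {field: value} dict.

    Assumption: a new field starts at whitespace followed by letters and
    a colon, which holds for the entries pasted in the log but may not
    hold for every possible value.
    """
    fields = {}
    for token in re.split(r"\s+(?=[A-Za-z]+:)", entry.strip()):
        prefix = next((p for p in HEADER_PREFIXES if token.startswith(p)), "")
        name, _, value = token[len(prefix):].partition(":")
        fields[prefix + name] = value
    return fields


def diff_entries(a: dict, b: dict, keys: list) -> dict:
    """Return only the listed fields whose values differ between a and b."""
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Shortened excerpts of the two entries pasted in the log.
CP2039 = (
    "ConnAttempts:0 ConnReuse:235 OriginServer:restbase.discovery.wmnet "
    "CacheResultCode:TCP_MISS CacheWriteResult:FIN RespStatus:200 "
    "BerespHeader:Cache-Control:s-maxage=14400, max-age=14400 "
    "RespHeader:X-Cache-Int:cp2039 miss"
)
CP2037 = (
    "ConnAttempts:0 ConnReuse:0 OriginServer:api-gateway.discovery.wmnet "
    "CacheResultCode:TCP_MISS CacheWriteResult:- RespStatus:200 "
    "BerespHeader:Cache-Control:s-maxage=14400, max-age=14400 "
    "RespHeader:X-Cache-Int:cp2037 miss"
)

INTERESTING = [
    "OriginServer",
    "CacheResultCode",
    "CacheWriteResult",
    "BerespHeader:Cache-Control",
]
DIFF = diff_entries(parse_atslog(CP2039), parse_atslog(CP2037), INTERESTING)
```

With these excerpts, `DIFF` contains only `OriginServer` and `CacheWriteResult`, matching what was observed in the channel: same Cache-Control from both origins, but no cache write attempted on cp2037.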
[15:39:10] hnowlan: we don't have a deployment-prep environment right?
[15:40:21] Traffic, Observability-Alerting, SRE, collaboration-services, serviceops-radar: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (akosiaris)
[15:42:05] vgutierrez: unfortunately no - setting one up would be a bit of a hassle as we'd need to stand up the cassandra datastores too
[15:42:28] but if that's the only way to get to the bottom of this I can spend some time seeing how much work it would be tomorrow
[15:43:32] hnowlan: nah, I can build a dummy backend server with a static response
[15:43:48] mimicking https://device-analytics.discovery.wmnet:4972/metrics/unique-devices/es.wikipedia.org/desktop-site/daily/20160401/20160410
[15:46:46] cool
[15:47:06] netops, Infrastructure-Foundations, SRE, User-aborrero: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (cmooney)
[15:47:08] just for completeness the API gateway also uses URLs of the format https://api-gateway.discovery.wmnet:8087/wikimedia.org/v1/metrics/unique-devices/es.wikipedia.org/desktop-site/daily/20160401/20160410
[15:49:31] yeah, I don't care about URL format right now
[15:49:44] that shouldn't matter at all for cacheability
[15:55:11] for sure, just in case the gateway was adding more variables
[16:05:54] https://www.irccloud.com/pastebin/T11sgSYm/
[16:06:02] using this simple mock
[16:09:25] nice
[16:10:12] Traffic, Content-Transform-Team-WIP, Mobile-Content-Service, RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (MSantos) Open→Resolved a:akosiaris >>! In T340036#8994407, @akosiaris wrote: > @MSantos, change deployed today....
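Editor's sketch: the mock mentioned above lives in a pastebin that is not reproduced in this log, so the following is only a guess at what a "dummy backend server with a static response" could look like, not the code that was actually used. It serves a fixed body with the Cache-Control value seen in the atslog entries and an ETag, since the new service was reported to send one. The payload, the ETag value, and the names `MockBackend` and `make_server` are assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder payload; the real service's response body is not shown in the log.
BODY = json.dumps({"items": []}).encode()

# Cache-Control copied from the atslog entries; the ETag value is made up.
HEADERS = {
    "Content-Type": "application/json",
    "Cache-Control": "s-maxage=14400, max-age=14400",
    "ETag": '"mock-etag"',
}


class MockBackend(BaseHTTPRequestHandler):
    """Answer every GET with the same static, cacheable response."""

    def do_GET(self):
        self.send_response(200)
        for name, value in HEADERS.items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, fmt, *args):
        # Silence per-request logging to keep test output clean.
        pass


def make_server(port: int = 0) -> HTTPServer:
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), MockBackend)
```

To run it by hand, something like `make_server(8080).serve_forever()` would do, with the cache under test pointed at that port. Keeping the response static removes the origin as a variable, so any remaining hit/miss difference would come from the cache configuration.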
[16:15:24] Traffic, RESTBase, RESTBase-API, SRE: REST API is not invalidating caches after template and/or module changes - https://phabricator.wikimedia.org/T335770 (Brycehughes) Open→Resolved
[16:22:59] hnowlan: o_O
[16:23:00] X-Cache-Int: traffic-cache-bullseye hit
[16:23:32] hmm wait.. my test instance is running ATS 9.2.1
[16:23:34] rather than 9.1.4
[16:23:44] sukhe: ^^ I'm gonna rollback your upgrade, sorry
[16:24:28] vgutierrez: go for it, much better than rolling back reprepro :)
[16:25:23] Traffic, Observability-Alerting, SRE, collaboration-services, serviceops-radar: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (brennen) > @brennen I saw your updates to phab in SAL, does the above (maniphest.edit taking a lot longer to create tasks) ring...
[16:26:13] hnowlan: I can't reproduce with 9.1.4 here in our test environment
[16:29:24] that's definitely weird
[16:29:31] my remap.config rules are basically the same as in production
[16:29:33] regex_map http://(.*)/api/rest_v1 http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/api/rest_v1 @plugin=/usr/lib/trafficserver/modules/tslua.so @pparam=/etc/trafficserver/lua/normalize-path.lua @pparam="3A 40 21 24 28 29 2A 2C 3B 27" @pparam="5B 5D 26 2B 3D" @plugin=/usr/lib/trafficserver/modules/tslua.so @pparam=/etc/trafficserver/lua/gateway-check.lua @plugin=/usr/lib/trafficserver/modules/tslua.so
[16:29:34] @pparam=/etc/trafficserver/lua/rb-mw-mangling.lua @plugin=/usr/lib/trafficserver/modules/conf_remap.so @pparam=proxy.config.http.server_session_sharing.match=ip
[16:30:00] main difference is how cache is stored
[16:31:01] hnowlan it looks like I need to debug in production :_)
[16:31:22] hnowlan: I'll go ahead and do it on my own on Monday and report back to you
[16:31:38] (I'm OoO tomorrow)
[16:31:51] hnowlan: sorry about these delays :_)
[16:33:03] vgutierrez: no worries at all! thanks for all of your help
[16:37:39] one thing I considered and discounted but wanted to be sure could be ruled out - api.wikimedia.org has "caching: pass" set in cache::alternate_domains. However, we're not using that domain at any point - even if the api gateway *also* serves that domain.
[17:04:30] Traffic, SRE, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns1001.wikimedia.org` - dns1001.wikimedia.org (**WARN**) - Downtimed host on Icinga/Alertmanag...
[17:20:42] Traffic, RESTBase, RESTBase-API, SRE: REST API is not invalidating caches after template and/or module changes - https://phabricator.wikimedia.org/T335770 (akosiaris) >>! In T335770#8988938, @Brycehughes wrote: > @akosiaris Yep all clear now from Georgia (the country). However, this lasted much m...
[17:24:30] Traffic, RESTBase, RESTBase-API, SRE: REST API is not invalidating caches after template and/or module changes - https://phabricator.wikimedia.org/T335770 (Brycehughes) @akosiaris Fair enough. Ah, the joys of caching. Thanks.
[18:01:56] Traffic, Observability-Alerting, SRE, collaboration-services, serviceops-radar: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (Aklapper) Hmm. The problem //could// be related to deploying the bug fix (see non-public T338611#8965304 for details) in 6b59a3...
[22:15:02] Traffic, WMF-Legal, Patch-For-Review, Performance-Team (Radar), Privacy: Add no-transform to Cache-Control header - https://phabricator.wikimedia.org/T218618 (BCornwall) Open→Stalled Setting this as stalled since there's been no response from @Vgutierrez, @ssingh, @SCherukuwada or @o...