[08:45:26] hnowlan: any response from the team regarding minifying?
[09:08:49] vgutierrez: they're discussing it atm but I don't think it'll be included today. Would it be okay to roll out the service as is given the relatively small amount of traffic? Hopefully later versions will add it
[09:11:59] hnowlan: do you have a time frame on that?
[09:12:48] vgutierrez: not clear tbh, but I'd guess a ~week?
[09:16:11] ok
[09:17:08] +1ed
[09:17:27] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (thiemowmde) Is this possibly related to toolforge.org being entirely unavailable right now? No matter what I try, everything fails with 502 or 503. Prominent examp...
[09:18:23] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (RhinosF1) >>! In T341780#9159438, @thiemowmde wrote: > Is this possibly related to toolforge.org being entirely unavailable right now? No matter what I try, everyt...
[09:20:37] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (Vgutierrez) >>! In T341780#9159438, @thiemowmde wrote: > Is this possibly related to toolforge.org being entirely unavailable right now? No matter what I try, ever...
[09:23:47] cool. Going to disable puppet and test on cp2037 as usual
[09:33:51] vgutierrez: change is in place on cp2037, looks okay to me
[09:34:04] * vgutierrez doublechecking
[09:34:08] testing with commands like `curl -H 'Host: en.wikipedia.org' -H 'X-Forwarding-Proto: https' http://127.0.0.1:3128/api/rest_v1/metrics/editors/by-country/en.wikipedia/5..99-edits/2018/01`
[09:37:33] hnowlan: interesting.. that shouldn't work :)
[09:37:56] that returns a 404 in cp2039
[09:39:06] hnowlan: so the old service just accepts requests on wikimedia.org
[09:39:21] the new one doesn't seem to enforce that
[09:41:32] vgutierrez: yeah, that behaviour change is expected - for all of these APIs we pass the wiki name in the URL already
[09:41:56] that's gonna introduce some cache fragmentation on our side, but ok
[09:44:27] ah of course - I can try to tighten this on the gateway after
[09:48:09] sure
[09:58:54] that should be doable in ~20m or so. for the time being should I proceed or would you prefer to wait until that's in place?
[10:01:02] go ahead
[10:01:58] thanks!
[10:08:47] here's a change to limit the domains for the AQS2 APIs https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/956826
[10:13:51] nice
[10:32:09] hnowlan: btw, what's the response for an "invalid" domain? 404?
[10:32:43] vgutierrez: yeah
[10:32:56] hnowlan: cacheable one?
[10:33:39] the CDN sets a cap of 10 minutes for caching 404s but that's still useful in case of some request bursts
[10:34:36] vgutierrez: what criteria makes a 404 (un)cacheable?
[10:35:05] lack of cache-control header
[10:36:10] Traffic, Infrastructure-Foundations, SRE: LVS servers using autoconf SLAAC IPv6 addresses - https://phabricator.wikimedia.org/T336505 (cmooney) I'm gonna close this, I think we can probably deal with it under T102099.
[10:39:01] Traffic, SRE, cloud-services-team, Patch-For-Review: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463 (taavi) `lang=shell-session taavi@cumin1001 ~ $ confctl select "cluster=(lab|cloud)web" get {"cloudweb1003.wikimedia.org": {"weight": 10, "pooled": "yes"}, "tag...
[10:40:50] vgutierrez: it seems envoy won't emit one for general 404s
[10:41:01] by default at least
[11:00:06] ok
[11:00:43] it's tricky... to have something like the backend TLS termination layer setting cache-control though rather than the app itself
[11:02:47] I believe envoy can do custom error handling/overrides with https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_conn_man/local_reply
[11:02:53] I'd need to dig into it a little
[11:09:51] vgutierrez: what would be a good default cache-control for a 404?
[11:12:18] hnowlan: currently on wp 404 we have `cache-control: s-maxage=600`
[11:16:10] fabfur: cool
[11:17:57] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/956833 this should do it
[11:28:03] netops, Infrastructure-Foundations, SRE, cloud-services-team: neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (cmooney) @aborrero feel free to close this one if it's not being worked on, the status...
[11:35:30] netops, Infrastructure-Foundations, SRE, cloud-services-team: neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (aborrero) Open→Declined OK, closing for now and hoping some more modern BGP-bas...
[12:19:44] (VarnishHighThreadCount) firing: (7) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[12:24:27] (PurgedHighEventLag) firing: (15) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:24:44] (VarnishHighThreadCount) firing: (24) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[12:25:09] netops, Infrastructure-Foundations, SRE, IPv6, User-jbond: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (cmooney) In the medium term I think we need to carefully consider how this operates, probably as part of a move away from using...
[12:29:27] (PurgedHighEventLag) resolved: (29) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:29:44] (VarnishHighThreadCount) firing: (24) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[12:49:44] (VarnishHighThreadCount) firing: (25) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:04:44] (VarnishHighThreadCount) firing: (31) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:09:45] (VarnishHighThreadCount) firing: (44) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:24:45] (VarnishHighThreadCount) firing: (33) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:29:45] (VarnishHighThreadCount) resolved: (21) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:54:37] we're now limiting requests to the AQS APIs to wikimedia.org as before btw
[13:55:24] Would it be okay to do another change today? We previously routed wikifeeds and it was mostly successful but there were some issues with the mobile app using incorrect domains and wikifeeds not knowing how to handle it. It's since been patched, and we know the service routing etc is correct https://gerrit.wikimedia.org/r/c/operations/puppet/+/945558
[13:55:36] essentially redoing what's in that CR
[14:04:41] Traffic, WMF-Legal, Patch-For-Review, Privacy: Add no-transform to Cache-Control header - https://phabricator.wikimedia.org/T218618 (Isaac) > Sorry for the horrendous response timing. I think that this would be best created in a new ticket. Thanks for bringing it up, though! Hah, no worries. I wa...
[14:10:37] netops, Infrastructure-Foundations, SRE: Set idle-timeout for Juniper logins - https://phabricator.wikimedia.org/T345710 (cmooney) Open→Resolved a:cmooney Thanks all, config applied now. @volans I left the timeout at 30 mins. I think (esp. in an emergency situation) it's not unlikely yo...
[14:39:48] netops, Infrastructure-Foundations, SRE: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (cmooney) Just to mention here, but the restriction described in T322937#8847201 no longer seems to be the case. In codfw with devices on JunOS 22.2R3....
[14:50:40] o/ I seem to be having some issues with getting to toolforge sites via wikimedia dns. just me or also others? for example: https://wikidata-game.toolforge.org/distributed/#
[14:58:40] hi isaacj, thanks for reporting! seems like we are getting a SERVFAIL for some reason
[14:59:03] heading to a meeting but will check shortly and follow up here
[15:00:05] thanks sukhe. no urgency from me but glad to hear it's not on my end :) if i can do anything to help test etc., don't hesitate to let me know.
[15:00:22] thanks! and yes, definitely not on your end
[15:53:14] (VarnishHighThreadCount) firing: Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=99&var-site=eqsin&var-instance=cp5022 - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:58:14] (VarnishHighThreadCount) firing: (9) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[16:08:14] (VarnishHighThreadCount) firing: (23) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[16:11:26] (PurgedHighEventLag) firing: (3) High event process lag with purged on cp5019:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[16:13:14] (VarnishHighThreadCount) firing: (24) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[16:13:29] (VarnishHighThreadCount) firing: (24) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[16:16:26] (PurgedHighEventLag) resolved: (28) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[16:43:14] (VarnishHighThreadCount) firing: (24) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[16:53:14] (VarnishHighThreadCount) firing: (34) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[16:58:14] (VarnishHighThreadCount) firing: (46) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:03:14] (VarnishHighThreadCount) firing: (45) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:13:14] (VarnishHighThreadCount) firing: (32) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:18:14] (VarnishHighThreadCount) resolved: (23) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:24:11] netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (cmooney) I took a little look at the routed-mode docs from [[ https://github.com/grnet/gnt-networking/blob/develop/docs/routed.rst | here ]]. Overall the setup looks a...
[18:21:19] Traffic, RESTBase-API, Documentation: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (BCornwall) @BBlack: Friendly poke
[18:33:05] Traffic, SRE: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (BCornwall) @ssingh do you think this is still an issue that's worth keeping open, and should it then be tagged to IF?
[18:35:08] Traffic, SRE: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (ssingh) Thanks @BCornwall, I think we can close this one as we have done some other reimages in eqsin and not observed this issue.
[18:35:58] Domains, Traffic, SRE: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (BCornwall) Hi, @CRoslof! Have you been able to look into this? Thanks so much!
[18:37:27] Traffic, DNS, Infrastructure-Foundations, SRE-tools, Patch-Needs-Improvement: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (BCornwall) @Vgutierrez and @BBlack friendly poke :)
[18:40:28] Traffic, SRE, Patch-For-Review: Incorrect X-Cache-Status reported by deployment-prep caches - https://phabricator.wikimedia.org/T269825 (BCornwall) p:Lowest→Low
[20:54:26] Traffic, Movement-Insights, SRE: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (Fabfur) Hi, any update on this?
[21:01:26] Traffic, SRE, Security-Team, SecTeam-Processed, Security: Denial of Service due to repeated hits from a particular IP - https://phabricator.wikimedia.org/T305863 (sbassett)
[21:01:47] Traffic, SRE, Security-Team, SecTeam-Processed, Security: Denial of Service due to repeated hits from a particular IP - https://phabricator.wikimedia.org/T305863 (sbassett)
[21:18:12] Traffic, SRE: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (BCornwall) Stalled→Resolved a:BCornwall
[22:04:44] Traffic, Patch-For-Review: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (BCornwall)
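
Note on the 404 caching exchange (10:32 to 11:12): the rule described there, as far as the log states it, is that a 404 without a cache-control header is not cached, and that the CDN caps 404 caching at 10 minutes. A minimal Python sketch of that rule follows. The function name `cacheable_404_ttl`, the constant `CDN_404_TTL_CAP`, and the handling of `max-age` and of a header with no lifetime directive are assumptions made for illustration; the real CDN logic is not shown in the log and is likely more involved.

```python
import re
from typing import Mapping, Optional

# The "cap of 10 minutes for caching 404s" mentioned in the log, in seconds.
CDN_404_TTL_CAP = 600


def cacheable_404_ttl(status: int, headers: Mapping[str, str]) -> Optional[int]:
    """Return the TTL (seconds) a 404 would be cached for, or None if uncacheable.

    Simplified, hypothetical model of the rule discussed in the channel:
    no Cache-Control header means the 404 is not cached; otherwise the
    lifetime is taken from s-maxage (or max-age) and capped at 10 minutes.
    """
    if status != 404:
        return None  # this sketch only models 404 responses

    # Header names are case-insensitive, so look the header up accordingly.
    cache_control = next(
        (v for k, v in headers.items() if k.lower() == "cache-control"), None
    )
    if cache_control is None:
        return None  # "lack of cache-control header" makes the 404 uncacheable

    # Prefer s-maxage (shared caches), fall back to max-age (assumption).
    match = re.search(r"\bs-maxage\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    if match is None:
        match = re.search(r"\bmax-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    if match is None:
        return None  # assumption: no explicit lifetime means do not cache

    return min(int(match.group(1)), CDN_404_TTL_CAP)
```

With the header quoted at 11:12, `cacheable_404_ttl(404, {"cache-control": "s-maxage=600"})` stays at the cap, while a 404 with no headers at all yields `None`, which matches the envoy default behaviour noted at 10:40.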