[11:10:21] ema fyi i sent a first stab at adding general public_cloud rate limits to varnish https://gerrit.wikimedia.org/r/c/operations/puppet/+/740545 (cc _joe_)
[11:12:55] thanks jbond
[11:15:37] jbond: I think it's easier/cleaner to just move the logic to the common VCL
[11:15:54] modules/varnish/templates/wikimedia-frontend.vcl.erb
[11:17:22] there we hook in the varnish FSM and call functions that we defined such as cluster_fe_miss, which are then defined in text-frontend.inc.vcl.erb, upload-frontend.inc.vcl.erb and misc-frontend.inc.vcl.erb with cluster-specific behavior
[11:19:09] however, if we want the logic to be the same for text and upload as I think is the case here, we can simply move it from cluster_fe_ratelimit (text-frontend.inc.vcl.erb) to vcl_miss and vcl_pass in wikimedia-frontend.vcl.erb instead
[11:19:24] jbond: does that make sense?
[11:20:47] so we can define something like "sub shutdown_public_clouds {" in wikimedia-frontend.vcl.erb and call it in vcl_miss and vcl_pass
[11:25:58] ema: ack that all makes sense and i think it is definitely the correct place to put the 403 shutdown_public_clouds bit of the code, however i'm not sure if we should add the general rate limiting there. With the current patch i have put the limits in upload to 100/10s and in text to 1000/100s (these were picked somewhat out of the air). however 100/10s seems a bit aggressive for text and 1000/100s
[11:26:05] is already the default for all upload requests
[11:26:27] however one of the things i wanted input on is if those rates make sense?
[11:29:47] so in both cases it's 10 rps long term isn't it
[11:31:12] anyways: let's have the shutdown switch in the cluster-agnostic file and keep the rate limiting separate for now
[11:39:04] ema: i have updated to move the shutdown function. as for rate limits currently we only have a limit on misses from public cloud at 100/10s.
This cr adds
[11:39:15] upload limit of 100/10s
[11:39:24] text cached limit of 1000/10s
[11:39:47] however as said i'm happy to change those limits; they seemed sane to me but i definitely want more input
[11:47:37] ack! gotta go make lunch now, looking at the CR later this afternoon :)
[11:47:40] thanks!
[11:54:05] ack thx
[13:14:48] netops, Infrastructure-Foundations, SRE, SRE-swift-storage, ops-codfw: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (ayounsi) The above command doesn't commit on a pre-provisioned VC. I did this instead: ` [edit virtual-chassis member 2] - role routing-engine; +...
[13:38:56] (EdgeTrafficDrop) firing: 53% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[13:53:56] (EdgeTrafficDrop) resolved: 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[13:54:56] (EdgeTrafficDrop) firing: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[14:09:11] (EdgeTrafficDrop) resolved: 64% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[14:57:18] <_joe_> hello traffic team
[14:57:39] <_joe_> can I safely merge changes to the loadbalancers or are we still in the "can't restart pybal" situation?
[15:00:05] _joe_: lvs2007 is back in service
[15:00:26] _joe_: double-check with arzhel, but I think he put lvs2007 back in service a short while ago
[15:00:34] <_joe_> cool
[15:00:40] <_joe_> thanks both :)
[15:00:47] I'll repool codfw later today
[15:00:52] well
[15:01:06] oh ok, I was going to say, stats don't look right, but I guess because the DC is still edge-depooled
[15:01:09] so all good :)
[15:13:12] <_joe_> bblack: about to restart pybals then
[15:44:56] (EdgeTrafficDrop) firing: 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[15:49:56] (EdgeTrafficDrop) resolved: 65% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[16:01:14] Traffic, Platform Engineering, SRE, Patch-For-Review, Wikimedia-production-error: Wikimedia\Assert\PostconditionException: Postcondition failed: makeTitleSafe() should always return a Title for the text returned by getRootText(). - https://phabricator.wikimedia.org/T290194 (Umherirrender) >>!...
[16:55:27] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4032.ulsfo.wmnet with OS buster
[17:01:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp4032:9331 is unreachable - https://alerts.wikimedia.org
[17:06:19] ^^ expected, host is being reimaged
[17:23:42] Traffic, Platform Engineering, SRE, Patch-For-Review, Wikimedia-production-error: Wikimedia\Assert\PostconditionException: Postcondition failed: makeTitleSafe() should always return a Title for the text returned by getRootText(). - https://phabricator.wikimedia.org/T290194 (MdsShakil) I tryin...
[17:41:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp4032:9331 is unreachable - https://alerts.wikimedia.org
[17:47:02] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4032.ulsfo.wmnet with OS buster c...
[17:48:02] I'm repooling codfw
[17:58:33] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (cmooney) So we had some unexpected consequences over the weekend following this change. Example mail from ISP below: ` > Cc'ing Wikimedia NOC. > > We have...
[18:35:14] netops, Infrastructure-Foundations, SRE, SRE-swift-storage, ops-codfw: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (ayounsi) Open→Resolved Codfw repooled, everything is back to normal.
[19:09:15] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez)
[19:09:40] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez)
[20:25:31] Traffic, Platform Engineering, SRE, Patch-For-Review, Wikimedia-production-error: Wikimedia\Assert\PostconditionException: Postcondition failed: makeTitleSafe() should always return a Title for the text returned by getRootText(). - https://phabricator.wikimedia.org/T290194 (Umherirrender) >>!...
[20:40:34] Traffic, Platform Engineering, SRE, Patch-For-Review, Wikimedia-production-error: Wikimedia\Assert\PostconditionException: Postcondition failed: makeTitleSafe() should always return a Title for the text returned by getRootText(). - https://phabricator.wikimedia.org/T290194 (MdsShakil) Done. L...
[23:26:24] bblack: I touched cache/text.yaml. removed scholarships.wikimedia.org (already deleted from DNS). cp1079: - alternate_domains.add("\Qscholarships.wikimedia.org\E"); puppet does: Scheduling refresh of Exec[load-new-vcl-file-frontend]
[23:26:55] I don't see a problem, but it does do the frontend refresh, so sharing.
[23:28:36] hospital service, heh, but it looks like it's working as intended. Service[varnish-frontend-hospital]/ensure: ensure changed 'stopped' to 'running' (corrective)
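
Note on the rate limits discussed between 11:25 and 11:39: the limits are quoted as "requests/window" (100/10s, 1000/100s, 1000/10s). The short Python sketch below only does the arithmetic behind the "10 rps long term" remark. It assumes that a limit of N requests per window also permits a burst of up to N requests, which is how a token-bucket style limiter usually behaves; the helper name `sustained_rps` is hypothetical and is not taken from the patch under review.

```python
from fractions import Fraction


def sustained_rps(limit: int, period_s: int) -> Fraction:
    """Long-term average requests per second allowed by `limit` requests per `period_s` seconds."""
    return Fraction(limit, period_s)


# Limits quoted in the discussion, as (requests, window in seconds).
limits = {
    "100/10s": (100, 10),
    "1000/100s": (1000, 100),
    "1000/10s": (1000, 10),
}

for name, (limit, period_s) in limits.items():
    rate = float(sustained_rps(limit, period_s))
    # Assumption: the burst allowance equals the per-window limit.
    print(f"{name}: {rate:g} rps sustained, burst up to {limit} requests")
```

Under that assumption, 100/10s and 1000/100s allow the same sustained rate and differ only in how large a burst they tolerate, while 1000/10s allows ten times the sustained rate of either.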