[09:48:56] (VarnishTrafficDrop) firing: (3) 44% GET drop in text@codfw during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[09:49:52] ummmh
[09:53:56] (VarnishTrafficDrop) resolved: (3) 60% GET drop in text@codfw during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[10:04:57] majavah: the alerts there were likely due to some spammy traffic in multiple DCs
[10:05:09] see the comparison in raw requests rate vs last week:
[10:05:21] https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&orgId=1&from=1634028329037&to=1634033034123&var-cluster=text&var-site=codfw&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5
[10:06:04] same pattern in eqiad and ulsfo, to a lesser extent in eqsin too
[10:06:49] so yeah the alert in this case is a bit of a false positive in the sense that traffic did not really drop, rather it spiked
[10:08:29] we really should change the dashboard link T292820 - the current one may lead to think that something is wrong while a baseline comparison makes things clearer I think
[10:08:30] T292820: Create runbook for VarnishTrafficDrop alert, change dashboard link - https://phabricator.wikimedia.org/T292820
[10:18:52] --
[10:19:14] I've merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/726912/ with puppet disabled, applied the change on cp4027 only
[10:19:39] it looks good, /etc/varnish/directors.frontend.vcl unchanged despite the update to /etc/confd/templates/_etc_varnish_directors.frontend.vcl.tmpl
[10:22:22] I'll try and depool ats-be on cp4028 to see if directors.frontend.vcl gets updated accordingly
[10:26:28] it does
[12:14:34] new meeting pad is up
[12:17:56] thanks question_mark
[12:18:24] I wrote a runbook for the VarnishTrafficDrop alert, edit at will: https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop
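The false positive described at 10:06:49 (traffic spiked, then settled back, and the alert read the return to normal as a drop) can be illustrated with a minimal Python sketch. The alert's real expression is not shown in this log, so the function name `percent_drop`, the comparison against the rate 30 minutes earlier, and all sample rates below are assumptions made purely for illustration.

```python
def percent_drop(reference: float, current: float) -> float:
    """Drop of `current` relative to `reference`, in percent.

    A negative result means the rate went up rather than down.
    """
    if reference <= 0:
        raise ValueError("reference rate must be positive")
    return (reference - current) / reference * 100.0


# Hypothetical GET rates in requests/second (made-up numbers, not real data).
last_week_same_time = 1000.0  # baseline from the same time last week
thirty_minutes_ago = 2500.0   # during the spammy spike
now = 1000.0                  # spike is over, traffic is back to normal

# Compared with the recent spike, normal traffic looks like a large drop.
drop_vs_recent = percent_drop(thirty_minutes_ago, now)
# Compared with the weekly baseline, nothing dropped at all.
drop_vs_baseline = percent_drop(last_week_same_time, now)

print(f"vs 30 minutes ago: {drop_vs_recent:.0f}% drop")
print(f"vs last week:      {drop_vs_baseline:.0f}% drop")
```

Under these assumed numbers the short-window comparison reports a large drop while the weekly baseline reports none, which is why a last-week comparison dashboard is the more useful link for this alert.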
[12:18:47] Traffic, SRE, User-ema: Create runbook for VarnishTrafficDrop alert, change dashboard link - https://phabricator.wikimedia.org/T292820 (ema) Runbook created: https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop
[13:16:39] Traffic, Analytics, Analytics-Kanban, Data-Engineering, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (Ottomata) Okay, deployed, and I see HomepageVisit events with real client IPs. @nettrom_WMF, f it look...
[13:37:48] Traffic, SRE: DNS Discovery for active/passive failover within a data centre - https://phabricator.wikimedia.org/T287584 (Ottomata) Hahah, I think declining this is fine for now, but intra DC failover is probably something our traffic infrastructure should support, ya? I'm not opposed to Ben's corosync/...
[14:17:09] Traffic, SRE: DNS Discovery for active/passive failover within a data centre - https://phabricator.wikimedia.org/T287584 (BBlack) I think (but I'm sure it can be debated!) that from the Traffic POV, a service's resiliency/failover within a DC shouldn't be managed via DNS automations like the discovery se...
[14:31:42] sukhe, topranks: any final decision for T289536 to go in one way or the other?
[14:31:42] T289536: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536
[14:47:26] volans: nothing from my side; I think we can keep it manual for now and set everything to "keep manual DNS"
[14:48:19] volans: was just about to say I'm happy with whatever sukhe thinks is best, so that suggestion is ok with me :)
[14:48:52] we should probably clean up the records with the DNS name and "manual" description set, just to keep Netbox clean
[14:48:58] ok I'll cleanup netbox then
[14:50:43] {done}, running the cookbook
[14:50:53] thanks!
[14:51:05] thanks volans and topranks!
[15:02:20] cookbook done too
[15:04:33] Traffic, SRE, Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (ssingh)
[15:04:40] Traffic, SRE, Patch-For-Review: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536 (ssingh) Open→Resolved Thanks to everyone for helping with the task. We just discussed this in IRC but for those following along: we have decided to go with managing the recor...
[15:12:28] Traffic, Infrastructure-Foundations, SRE: Anycast: Add IPv6 support to bird and anycast-healthchecker (Puppet) - https://phabricator.wikimedia.org/T292737 (ssingh) >>! In T292737#7415977, @ayounsi wrote: > Thanks that's great! > > Could you update the doc to reflect the new config knobs? Thanks for...
[15:24:17] ema: still up for that reimage?
[15:25:45] volans: hi! I'm about to go afk, so how about tomorrow? sorry!
[15:26:08] that works too
[16:06:58] Traffic, Browser-Support-Firefox: Firefox: Referrer Policy: Less restricted policies, including ‘no-referrer-when-downgrade’, ‘origin-when-cross-origin’ and ‘unsafe-url’, will be ignored soon for the cross-site request - https://phabricator.wikimedia.org/T293109 (AntiCompositeNumber)
[16:12:40] Traffic, Browser-Support-Firefox: Firefox: Referrer Policy: Less restricted policies, including ‘no-referrer-when-downgrade’, ‘origin-when-cross-origin’ and ‘unsafe-url’, will be ignored soon for the cross-site request - https://phabricator.wikimedia.org/T293109 (AntiCompositeNumber) Per https://develope...
[16:47:12] Traffic, Analytics, Analytics-Kanban, Data-Engineering, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (nettrom_WMF) Open→Resolved I've verified that there are now events in the Data Lake with client...
[17:03:57] (VarnishTrafficDrop) firing: 60% GET drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=eqsin - https://alerts.wikimedia.org
[17:08:57] (VarnishTrafficDrop) resolved: 65% GET drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=eqsin - https://alerts.wikimedia.org
[18:27:52] netops, Infrastructure-Foundations, SRE: Eqiad Expansion - LVS Connectivity Options - https://phabricator.wikimedia.org/T292630 (RobH)
[18:42:30] Traffic, SRE, Performance-Team (Radar), User-ema: Package and deploy Varnish 6.0.8 - https://phabricator.wikimedia.org/T292290 (dpifke)
[18:49:36] netops, Infrastructure-Foundations, SRE: Eqiad Expansion - LVS Connectivity Options - https://phabricator.wikimedia.org/T292630 (RobH)
[21:41:00] netops, Infrastructure-Foundations, SRE, Traffic-Icebox: externally-hosted NEL report forwarders for more timely report reception - https://phabricator.wikimedia.org/T292870 (CDanis) >>! In T292870#7415867, @ayounsi wrote: > I'd wary of the complexity of the setup. Yeah, a fair criticism. > As...
[22:09:46] Does the CDN edge limit what HTTP verbs can pass through to a backend service? I'm trying to call a PATCH endpoint on the new Toolhub k8s deployment and getting a "Wikimedia Error" 405 page that I think could only be coming from the CDN.
[22:44:01] Traffic, Toolhub: Toolhub API requests with PATCH verbs blocked by CDN - https://phabricator.wikimedia.org/T293157 (bd808) p:Triage→High
[22:45:07] I got some help from mutante and rzl in finding the place where PATCH is blocked ^
[22:47:16] Traffic, SRE, Toolhub: Toolhub API requests with PATCH verbs blocked by CDN - https://phabricator.wikimedia.org/T293157 (bd808) The timing of this is on me, but I just found the block of PATCH today and the app is planned to be announced to the community tomorrow. I can hold the announce if necessary...
[23:05:25] bblack: you are probably the only person near my timezone to talk with about this. Would adding PATCH to the allowed verbs be the worst idea ever? I see the default in the VCL is more closed down than the current hiera setting, so maybe it's just that nobody made a case for PATCH yet?
[23:22:23] Traffic, SRE, Toolhub: Toolhub API requests with PATCH verbs blocked by CDN - https://phabricator.wikimedia.org/T293157 (bd808)
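The PATCH question above comes down to a method allowlist checked at the edge before a request reaches the backend. A minimal Python sketch of that idea follows; it is not the actual VCL or hiera setting, neither of which is shown in this log. The allowlist contents and the name `edge_status` are hypothetical.

```python
from http import HTTPStatus
from typing import Optional

# Hypothetical allowlist, for illustration only; the real configured
# value is not shown in this log.
ALLOWED_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "POST", "PUT", "DELETE"})


def edge_status(
    method: str, allowed: frozenset = ALLOWED_METHODS
) -> Optional[HTTPStatus]:
    """Return 405 if the edge would reject `method`, else None (pass through)."""
    if method not in allowed:
        return HTTPStatus.METHOD_NOT_ALLOWED
    return None


# With the assumed allowlist, PATCH never reaches the backend service.
print(edge_status("PATCH"))
# Adding PATCH to the allowlist lets the request through.
print(edge_status("PATCH", ALLOWED_METHODS | {"PATCH"}))
```

In this sketch the backend is never consulted for a rejected method, which matches the symptom reported at 22:09:46: the 405 page comes from the CDN rather than from Toolhub.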