[07:43:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:48:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:52:35] netops, Infrastructure-Foundations, SRE: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (dcaro) > Is it possible we can allocate these IP addresses on the cloud switches, from the existing 192.168.4.0/24 range? That's ok yes, w...
[09:07:31] netops, Infrastructure-Foundations, SRE, Patch-For-Review: drmrs: network configuration - https://phabricator.wikimedia.org/T283050 (ayounsi) Open→Resolved a:ayounsi This is done.
[09:15:49] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (cmooney) > If there is any kind of anycast with the k8s prefixes (same prefix advertised from multiple locations), we should als...
[09:51:19] netops, Infrastructure-Foundations, SRE: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (cmooney) @dcaro thanks! That "ip route show" output would work perfectly yes. Although I was suggesting maybe to add a route for 192.168....
[10:09:56] (HAProxyEdgeTrafficDrop) firing: (2) 53% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:14:56] (HAProxyEdgeTrafficDrop) resolved: (2) 53% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:21:41] (HAProxyEdgeTrafficDrop) firing: (2) 53% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:23:14] netops, Infrastructure-Foundations, SRE: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (dcaro) > Let me know what you find on the /22 question. It's ok to use /24 for each, no problem there, the /22 I think was just to scope t...
[10:26:26] (HAProxyEdgeTrafficDrop) resolved: (2) 57% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[11:06:03] !log temporarily install bpfcc-tools on kubernetes1013
[11:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:16] !log purged bpfcc-tools from kubernetes1013
[11:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:37] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) Well I'm not getting anywhere very fast with this. I now understand from @akosiaris...
[13:29:31] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I identified the node process that was running eventgate-analytics-external, then ra...
[13:58:52] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I've done more analysis of packet captures from eventgate-analytics-external and I s...
[14:49:03] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (akosiaris) >>! In T306181#7917390, @BTullis wrote: > > We have proposed creating a new bucke...
[14:54:52] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (Ottomata) > all that is needed is to deploy a version of eventgate On it. I had issues with...
[15:20:52] Hello. Quick heads up, that razzi and I plan to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/779915 if that's OK. It adds a new realserver to two existing low-traffic LVS clusters. Look sane?
[15:45:45] netops, Infrastructure-Foundations, SRE, fundraising-tech-ops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (Papaul) @Cmjohnson @Jclark-ctr I upload the junos-srxsme-20.1R1.11.tgz to apt.wikimedia.org under /srv/junos . if you have time this week can you please copy that i...
[16:00:34] netops, Infrastructure-Foundations, SRE, fundraising-tech-ops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (Papaul) @Cmjohnson @Jclark-ctr the right image is junos-srxentedge-x86-64-20.4R3-S1.3.tgz and not junos-srxsme-20.1R1.11.tgz since codfw is using junos-srxentedge...
[16:59:21] netops, Infrastructure-Foundations, SRE, fundraising-tech-ops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (Papaul) @Jgreen hello. I am planning on doing this on the 16th at 10:00am CT . let me know it time works for you. Thanks
[17:08:19] hi traffic, clinic duty here - anyone free to answer a user question about rate limits in https://phabricator.wikimedia.org/T307610?
[18:04:44] lol
[18:04:49] 200/s
[18:16:58] Traffic, RESTBase-API, SRE: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (BBlack) The limit you're hitting is an intentional one, from this block in our edge code: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/...
[18:17:10] rzl: ^
[18:18:23] bblack: thanks!
[18:28:02] Traffic, RESTBase-API, SRE: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (Mitar) Hm, but documentation for REST API says I can use 200 requests per second? https://en.wikipedia.org/api/rest_v1/ > Limit your clients to no more than 200 requests/s to this...
[18:28:59] Traffic, RESTBase-API, SRE: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (Mitar) Sadly bulk downloads do not have HTML dumps, and Enterprise dumps do not offer them for template/module documentation (only articles, categories, and files). Also, there are...
[18:52:43] Traffic, RESTBase-API, SRE: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (BBlack) >>! In T307610#7918538, @Mitar wrote: > Hm, but documentation for REST API says I can use 200 requests per second? https://en.wikipedia.org/api/rest_v1/ > >> Limit your cli...
[18:54:59] Traffic, RESTBase-API, SRE: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (Mitar) Hm, it seems that [comments are out of sync](https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/templates/text-frontend....
[19:27:26] Hey, wondering if anyone in traffic can maybe review this change for me?
[19:27:28] https://gerrit.wikimedia.org/r/c/operations/dns/+/790744
[19:27:46] volan.s looked at the last one I submitted but he's out sick this week
[19:28:10] I think it's all good, I ran the netbox dns cookbook a moment ago and it created all the files referenced
[19:32:54] Traffic, RESTBase-API, SRE: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (Mitar) > Because our edge traffic code enforces a stricter limit of ~100/s (for responses that aren't frontend cache hits due to popularity), before the requests ever get to the Res...
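(Editor's note on the T307610 thread above: the log mentions a documented 200 requests/s guideline and an edge limit of roughly 100/s for responses that are not frontend cache hits. One way a client might stay under such a limit is a token bucket. The sketch below is illustrative only and is not taken from the task or from Wikimedia code; the class name `TokenBucket`, the 50 requests/s rate, and the burst size of 5 are all assumptions chosen for the example.)

```python
import time


class TokenBucket:
    """Hypothetical client-side throttle (not Wikimedia code).

    Tokens refill at `rate` per second up to `burst`; each request
    spends one token. The clock is injectable so the logic can be
    checked without sleeping.
    """

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)        # tokens added per second
        self.capacity = float(burst)   # maximum stored tokens
        self.tokens = float(burst)     # start with a full bucket
        self.clock = clock
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last check.
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def try_acquire(self):
        """Return True and spend a token if one is available."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self):
        """Seconds until the next token should be available."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.rate)


if __name__ == "__main__":
    # Assumed rate: 50 requests/s, well under the ~100/s quoted in the log.
    bucket = TokenBucket(rate=50, burst=5)
    sent = 0
    while sent < 20:
        if bucket.try_acquire():
            sent += 1  # a real client would issue its HTTP request here
        else:
            time.sleep(bucket.wait_time())
    print("sent", sent, "requests")
```

(A caller would check `try_acquire()` before each request and sleep for `wait_time()` when it returns False. The actual limit that applies, and whether cache hits count against it, is whatever the task and the edge configuration say, not this sketch.)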
[22:00:57] (HAProxyEdgeTrafficDrop) firing: 24% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[22:05:57] (HAProxyEdgeTrafficDrop) firing: (6) 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[22:10:57] (HAProxyEdgeTrafficDrop) resolved: (6) 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[22:17:26] (HAProxyEdgeTrafficDrop) firing: (6) 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[22:22:26] (HAProxyEdgeTrafficDrop) resolved: (6) 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop