[01:26:00] Traffic, SRE, envoy, serviceops, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (RLazarus) FWIW, this error message comes from En...
[05:38:27] Traffic, SRE, envoy, serviceops, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (MoritzMuehlenhoff) p:Triage→Low
[05:56:16] Traffic, SRE, envoy, serviceops, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Ladsgroup) This might be helpful: {T113114} I th...
[07:25:56] Traffic, SRE, envoy, serviceops, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (ema) >>! In T287983#7257682, @RLazarus wrote: >...
[07:53:41] vgutierrez: I'm around whenever you are
[07:55:16] Traffic, Analytics, SRE, Patch-For-Review: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (ema) >>! In T254317#7255820, @elukey wrote: > In theory a lot of `tls = '-'` should be redirects from http to https, that hit Varnish and...
[08:01:06] Traffic, Analytics, SRE, Patch-For-Review: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (elukey) A http to https redirect is probably not really a webrequest (following https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Tr...
[08:01:45] legoktm: oh sorry, I'm here
[08:05:32] vgutierrez: cool, let's move to -operations?
[08:05:39] ack
[08:11:56] (VarnishTrafficDrop) firing: 46% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[08:16:56] (VarnishTrafficDrop) resolved: 64% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[08:21:38] netops, Infrastructure-Foundations, SRE: Create an alert for output discards on network devices - https://phabricator.wikimedia.org/T284593 (ayounsi) Current values for `ifOutDiscards_delta`: > 'asw2-b-eqiad.mgmt.eqiad.wmnet', 'xe-7/0/41', '509' > 'asw2-b-eqiad.mgmt.eqiad.wmnet', 'xe-2/0/41', '1601'...
[08:24:02] Traffic, Analytics, SRE, Patch-For-Review: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (elukey) I had a chat with Ema on IRC, reporting a summary: * At the current state of the TLS termination layer, it is likely that ATS-TLS...
[08:28:46] netops, Infrastructure-Foundations: Switch buffer re-partition - cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T288036 (cmooney)
[08:29:14] netops, Infrastructure-Foundations: Switch buffer re-partition - cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T288036 (cmooney)
[08:29:20] netops, Infrastructure-Foundations, SRE: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[08:32:00] netops, Infrastructure-Foundations: Switch buffer re-partition - cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T288037 (cmooney)
[08:32:24] netops, Infrastructure-Foundations: Switch buffer re-partition - cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T288037 (cmooney)
[08:32:26] netops, Infrastructure-Foundations, SRE: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[08:41:32] netops, Infrastructure-Foundations, SRE: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[08:47:28] Traffic, Platform Engineering, SRE, Wikimedia Enterprise (Okapi Wikimedia Enterprise): Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams - https://phabricator.wikimedia.org/T280628 (AnnaMikla)
[08:52:21] Traffic, SRE, Patch-For-Review, Wikimedia Enterprise (Okapi Wikimedia Enterprise): "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (AnnaMikla)
[09:08:42] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T288036 (dcaro)
[09:09:10] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T288037 (dcaro)
[09:26:34] Traffic, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (ayounsi) a:ayounsi→None
[12:36:56] (VarnishTrafficDrop) firing: 69% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[12:41:56] (VarnishTrafficDrop) resolved: 69% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[13:47:44] netops, Infrastructure-Foundations, SRE: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (cmooney) a:cmooney
[13:50:12] Traffic: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (ema)
[14:22:22] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T288037 (MoritzMuehlenhoff) p:Triage→Medium
[14:22:35] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T288036 (MoritzMuehlenhoff) p:Triage→Medium
[14:31:51] netops, DC-Ops, Infrastructure-Foundations, SRE: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (cmooney) @Papaul do you know what the status is with this device? I can confirm there are some characters visible via serial console / port 47...
[15:22:50] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T288036 (dcaro) @cmooney hey, I acknowledge that tomorrow is a good time, ping me whenever you want to get it going :)
[15:29:49] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T288037 (dcaro) Ack for tomorrow too (same as T288036)
[15:54:34] netops, DC-Ops, Infrastructure-Foundations, SRE: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (Papaul) I as far as i know I removed the old faulty device, replaced it with this on, connected the console, power and network to the device, t...
[16:00:12] Traffic, SRE, envoy, serviceops, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (RLazarus) @ema That makes sense, thanks for the...
[21:24:24] Traffic, SRE, serviceops, Datacenter-Switchover: During DC switch, helm-charts failed verification because it doesn't have a service IP - https://phabricator.wikimedia.org/T285707 (Legoktm)
[22:27:58] Traffic, SRE, envoy, serviceops, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Legoktm) >>! In T287983#7257682, @RLazarus wrote...