[01:46:57] (HAProxyEdgeTrafficDrop) firing: 67% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[02:01:57] (HAProxyEdgeTrafficDrop) firing: (2) 45% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[02:06:56] (HAProxyEdgeTrafficDrop) resolved: (2) 51% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[02:32:56] 10Traffic, 10SRE, 10Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (10AndyRussG) @BBlack @Vgutierrez hiii! The CentralNotice change will go out with the main cluster deploy train this week! The related config change to set the requested st...
[05:56:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:01:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:34:43] 10netops, 10Infrastructure-Foundations: BFD flapping between cr1-eqiad and cr2-drmrs - https://phabricator.wikimedia.org/T321034 (10ayounsi) p:05Triage→03High
[07:36:31] 10netops, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10dcaro) > You also have to bear in mind, with some tasks like Ceph initial syncing, that a well tuned/performant system will use whatever bandwidth is...
[08:02:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:07:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:25:22] hi folks, what do you think of https://gerrit.wikimedia.org/r/c/operations/alerts/+/841905 ? and the related task too, commenting on either sounds good
[08:28:57] * vgutierrez checking
[08:32:40] replied on the CR
[08:33:11] thanks vgutierrez !
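The HAProxyEdgeTrafficDrop notifications above report the percentage by which edge request traffic dropped over a 30-minute window. One plausible way to express that kind of check in PromQL is a week-over-week comparison like the sketch below; the metric name, labels and threshold are assumptions for illustration, not the actual rule (which is documented at the Monitoring/EdgeTrafficDrop wikitech page linked in the alert):

  # Hypothetical sketch only: flag sites where the text-cache request rate over
  # the last 30m is more than 50% below the same window one week earlier.
  # haproxy_frontend_http_requests_total, "site" and "cache" are assumed names.
  (
    1 -
      sum by (site) (rate(haproxy_frontend_http_requests_total{cache="text"}[30m]))
    /
      sum by (site) (rate(haproxy_frontend_http_requests_total{cache="text"}[30m] offset 1w))
  ) > 0.5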
appreciate it, and good point
[08:33:51] I don't know if administrative status is published into prometheus though
[08:34:24] cause effectively pybal can ignore it to meet the depool threshold
[08:34:33] but we got an alert for that in place already
[08:36:57] I'm failing to find an example of a depooled host that shows up as pybal_monitor_status == 0, did you find some vgutierrez ?
[08:37:13] hmm nope
[08:37:33] by depooled here I'm assuming we're talking about enabled: false on config-master.w.o
[08:37:38] indeed
[08:39:55] I'm seeing some errors on the low traffic load balancers in eqiad & codfw
[08:40:02] but those are for pooled hosts
[08:41:05] mmhh actually I'm finding some other things I hadn't noticed, e.g. cp4021 has pybal_monitor_status == 0 and doesn't show up in here https://config-master.wikimedia.org/pybal/ulsfo/upload (as I expected, maybe I'm wrong)
[08:41:13] ditto for cp4027
[08:43:43] are those fresh metrics?
[08:44:22] fresh as in prometheus collected them recently, yes, not sure about the pybal end
[08:44:25] I'm looking at this
[08:44:27] https://thanos.wikimedia.org/graph?g0.expr=pybal_monitor_status%20%3D%3D%200&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[08:44:35] cp4021 and cp4027 were decommissioned on 2022-09-30
[08:45:05] probably pybal hasn't been restarted since then on the upload/secondary load balancers in ulsfo
[08:45:18] so the stale metric is still there
[08:45:42] 10netops, 10Infrastructure-Foundations, 10SRE: BFD flapping between cr1-eqiad and cr2-drmrs - https://phabricator.wikimedia.org/T321034 (10cmooney) I think I may have solved this, although through nothing logical, similar to the earlier BGP bounce restoring the IPv6. I disabled OSPF for the interface and re...
[08:45:43] yeah that would track
[08:46:14] yup... pybal on lvs4006 hasn't been restarted in 2 months and 7 days
[08:48:37] *nod* so far I think we've found signal (?) in the metrics, though I still can't find an example of a depooled host with monitors down
[08:49:37] not right now :)
[08:50:11] heheh true
[08:51:18] I'll try to simulate the scenario
[08:51:41] we had that with cp5001 recently
[08:51:58] we depooled it instead of setting it as inactive
[08:52:20] so it stayed 1 month in depooled state before being decomm'ed
[08:52:53] ow :|
[08:54:27] and of course, as pybal monitors depooled hosts, we got 1 month of failed monitors for cp5001 as well
[08:56:45] mmhh so in that case the warning alert would have helped to catch the condition (?)
[08:58:50] yep
[08:59:08] in that specific case yes
[09:04:42] ok thanks! what do you think re: going ahead with the warning alert and seeing how often it mis-fires? the host needs to fail its monitors for >12h in this case, of course the proper fix would also be to have pybal export pooled status, which I think we might want anyways
[09:07:56] 10netops, 10Infrastructure-Foundations, 10SRE: BFD flapping between cr1-eqiad and cr2-drmrs - https://phabricator.wikimedia.org/T321034 (10ayounsi) 05Open→03Resolved a:03cmooney Awesome, thanks! I cleared the Icinga downtimes now that it's all back to normal.
[09:09:18] bbiab
[09:26:38] godog: sure, let's do that
[09:32:45] vgutierrez: ack!
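A rough sketch of the PromQL involved in the discussion above. The first expression is the one from the Thanos link; the second is a hypothetical shape for the proposed warning condition (monitors failing for more than 12h), not the actual expression from change 841905:

  # Backends whose pybal monitors currently report as down (query from the
  # Thanos link above); note this can include stale series for decommissioned
  # hosts (e.g. cp4021/cp4027) until pybal is restarted on the load balancer.
  pybal_monitor_status == 0

  # Hypothetical warning shape: backends whose monitors have been failing
  # continuously for the past 12 hours.
  max_over_time(pybal_monitor_status[12h]) == 0

In an actual Prometheus alerting rule the same effect is usually achieved with the plain == 0 expression plus a for: 12h clause; either way it cannot distinguish depooled hosts, which is why exporting pooled status from pybal (mentioned at 09:04) remains the more complete fix.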
thank you, will merge
[11:16:59] 10netops, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10cmooney) > I'll give the test a go somewhere just to see what the throughput bottleneck looks like in grafana 👍 Cool. If you're starting with iperf I c...
[12:25:56] (HAProxyEdgeTrafficDrop) firing: 18% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[12:30:56] (HAProxyEdgeTrafficDrop) resolved: 14% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[12:59:56] (HAProxyEdgeTrafficDrop) firing: 5% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[13:04:56] (HAProxyEdgeTrafficDrop) resolved: 7% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:07:17] 10netops, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10SRE: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10cmooney) I had a good chat with @aborrero today on some ideas on how to progress towards this goal. Some notes / additional thoughts...
[15:14:05] 10netops, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10SRE: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10aborrero) >>! In T314847#8325727, @cmooney wrote: > I had a good chat with @aborrero today on some ideas on how to progress towards th...
[15:41:42] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Jclark-ctr) 05Open→03Resolved Completed lvs connection moves
[16:37:10] 10netops, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10SRE: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10taavi) > Probably makes sense to choose a /16 from 172.16.0.0/12 for the supernet, and allocate per-rack /24s from this. Please keep i...
[18:10:50] 10netops, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10SRE: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10cmooney) >> /32 Service IPs should be from the cloud realm public /24 (185.15.56.0/24) if the service needs to be reachable from inter...
[18:28:15] 10netops, 10Infrastructure-Foundations, 10SRE: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (10cmooney) FWIW I didn't get to the bottom of the MTU difference.
But I was able to confirm that it is a real issue, i.e. there is a 4-byte "blackhole" where the switches will transmit packets wit...
[20:02:46] 10netops, 10Infrastructure-Foundations, 10SRE, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10cmooney) Myself and @ayounsi were able to narrow down the issue a bit more during testing yesterday. It seems the iss...
[20:29:30] 10Traffic, 10SRE: Create program to interact with Atlas RIPE API - https://phabricator.wikimedia.org/T315536 (10BCornwall) 05Open→03Resolved
[23:01:46] 10Traffic, 10SRE, 10Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (10AndyRussG) The change has rolled out to group 0 wikis, and should go to groups 1 and 2 this week. An example of a page with the ESI comment in the base HTML is [[ https:...