[05:45:02] (HAProxyEdgeTrafficDrop) firing: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:54:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:42:56] (HAProxyEdgeTrafficDrop) firing: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:47:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:03:56] (HAProxyEdgeTrafficDrop) firing: 65% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:08:56] (HAProxyEdgeTrafficDrop) resolved: 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:01:29] Traffic, netops, Infrastructure-Foundations: lvs500[1-3] are unable to establish BGP sessions with cr3-eqsin.wikimedia.org - https://phabricator.wikimedia.org/T321545 (Vgutierrez)
[10:01:57] Traffic, netops, Infrastructure-Foundations: lvs500[1-3] are unable to establish BGP sessions with cr3-eqsin.wikimedia.org - https://phabricator.wikimedia.org/T321545 (Vgutierrez) p:Triage→Medium
[10:02:48] XioNoX, topranks ^^
[10:06:17] * topranks looking
[10:08:52] Traffic, netops, Infrastructure-Foundations: lvs500[1-3] are unable to establish BGP sessions with cr3-eqsin.wikimedia.org - https://phabricator.wikimedia.org/T321545 (Volans)
[10:18:12] Traffic, netops, Infrastructure-Foundations: lvs500[1-3] are unable to establish BGP sessions with cr3-eqsin.wikimedia.org - https://phabricator.wikimedia.org/T321545 (cmooney) Open→Resolved a:cmooney Ugh. So this is rather embarrassing but it seems the PyBal group was manually de-activa...
[10:18:52] vgutierrez: seems this was due to a mistake I made, sessions were deactivated for past 6 weeks after our upgrade :(
[10:19:20] thanks for the heads up, all ok now. it didn't affect traffic as cr3 got the routes forwarded in iBGP from cr2
[10:20:03] that was fast
[10:20:11] thanks topranks
[10:22:09] it seems like we need to work a little bit on the PyBalBGPUnstable alert
[10:22:20] slightly :D
[10:22:47] I found it totally by chance vgutierrez, I looked at the metric because of the CR doing a curl
[10:22:59] and I picked lvs5001 as my first random host to test
[10:24:59] it isn't a big deal
[10:25:29] as long as pybal is able to establish a session against one of the core routers is ok
[10:25:42] ok as is a WARNING and not a CRITICAL :)
[10:29:45] Traffic, SRE: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (Vgutierrez)
[10:29:59] Traffic, SRE: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (Vgutierrez) p:Triage→Medium
[10:31:35] Traffic, SRE: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (Vgutierrez)
[10:34:04] vgutierrez, volans: yes. well certainly there was a layer-8 issue here in that I didn't do my checks correctly.
[10:34:17] and in the normal scenario with both routers up it doesn't affect traffic flows.
[10:34:56] *but* if cr2 had failed or reloaded we could have been in trouble, so better alerting would help I think
[10:35:34] topranks: yep, that's reported under T321547
[10:35:35] T321547: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547
[10:38:01] thanks vgutierrez
[11:38:31] Traffic, observability: rate() requires at least >=2m for HAProxy metrics in upload@(eqiad|codfw) - https://phabricator.wikimedia.org/T321553 (Vgutierrez)
[11:41:36] godog: ^^ that one is puzzling me, let me know what you think when you get the chance please
[12:09:28] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti4002.ulsfo.wmnet` - ganeti4002.ulsfo.wmnet (**PASS**)...
[12:10:08] topranks: also now we can use the "shutdown" keyword instead of deactivate, so they will show up as "idle" on the router side
[12:10:36] and be caught by the icinga check
[12:10:39] vgutierrez: for sure -- will take a look shortly
[12:16:05] vgutierrez: do you have a link for the 'explore' queries you posted a screenshot of ?
[12:16:42] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (MoritzMuehlenhoff) I have setup ganeti4005 as a node in the ulsfo Ganeti cluster and moved a VM to it to confirm it works as expected. @RobH : I've als...
[12:17:10] XioNoX: Good tip, noted thanks!
[12:27:38] godog: https://grafana.wikimedia.org/goto/14wkdONVz?orgId=1
[12:29:36] vgutierrez: thank you!
[12:38:00] Traffic, SRE, observability: rate() requires at least >=2m for HAProxy metrics in upload@(eqiad|codfw) - https://phabricator.wikimedia.org/T321553 (fgiunchedi) I can't reproduce the issue ATM via https://grafana.wikimedia.org/goto/14wkdONVz?orgId=1 however your intuition is correct: the interval for...
[12:38:46] {{done}} ^
[12:58:42] Traffic, SRE, observability: rate() requires at least >=2m for HAProxy metrics in upload@(eqiad|codfw) - https://phabricator.wikimedia.org/T321553 (Vgutierrez) oh got it, thanks @fgiunchedi @BCornwall please update the min step to 2m in the dashboard.. maybe adding a hidden variable and referencing...
[13:11:04] Traffic, SRE: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (fgiunchedi) I've been scratching my head a little on this because the alert seemingly *has* fired: {F35624931} {F35624934} Yet I can't find any notification ATM
[14:05:24] netops, Infrastructure-Foundations, SRE, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (ayounsi) As data point I tried: `asw2-d-eqiad# run request virtual-chassis vc-port set pic-slot 0 member 2 port 49` th...
[14:35:15] netops, Infrastructure-Foundations, SRE, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (cmooney) Thanks @ayounsi, was worth a shot :) I'm thinking we probably proceed as follows: 1. Perform master switch...
[14:37:21] netops, Infrastructure-Foundations, SRE, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (cmooney) Just a note that I should have added previously that Juniper wouldn't provide support due to JunOS 14.1 being...
[15:03:56] (HAProxyEdgeTrafficDrop) firing: 58% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:28:37] Traffic, SRE: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (fgiunchedi) >>! In T321547#8341438, @Vgutierrez wrote: > nice catch @fgiunchedi. Actually I've assumed that it wasn't fired cause we didn't get the recovery on the traffic IRC channel when T321545 got fixed...
[15:32:11] that text@ulsfo graph doesn't look happy..?
[15:33:11] TheresNoTime: that's expected, see: https://gerrit.wikimedia.org/r/c/operations/dns/+/849105
[15:33:26] ah, good! :D
[15:33:28] we depooled ulsfo as there will be a bunch of cp hosts offline for the hardware refresh
[15:33:33] thanks for flagging it though!
[15:33:39] just in case.. :)
[15:33:43] yep, always good
[15:48:40] netops, Infrastructure-Foundations, SRE, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (ayounsi) I don't remember the impact of a switchover (eg. if it's none or tiny). So to be done carefully. At least the...
[16:33:56] (HAProxyEdgeTrafficDrop) resolved: 64% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:02:09] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) sub-ports are ready for cr2-eqiad ` papaul@re0.cr2-eqiad# run show interfaces terse | match xe-1/0/* xe-1/0/1:0 down down xe-1/0/1:1...
[17:13:44] Traffic, SRE, observability: rate() requires at least >=2m for HAProxy metrics in upload@(eqiad|codfw) - https://phabricator.wikimedia.org/T321553 (BCornwall) Open→Resolved a:BCornwall Thanks!
[18:13:44] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH)
[18:19:28] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH) failure of provision script against cp4039 ` [1/30, retrying in 30.00s] Polling task: JID_667217070909 not completed yet: status=OK, state=Running, complete...
[18:36:06] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH)
[18:51:27] Traffic, DNS, SRE, Abstract Wikipedia team (Phase κ – Clean-up): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (Jdforrester-WMF)
[19:12:57] (HAProxyEdgeTrafficDrop) firing: 66% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[19:17:57] (HAProxyEdgeTrafficDrop) resolved: 67% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[20:29:56] (HAProxyEdgeTrafficDrop) firing: 65% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[20:34:56] (HAProxyEdgeTrafficDrop) resolved: 67% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[21:02:16] (VarnishTrafficDrop) firing: Varnish traffic in esams has dropped 69.44577406963825% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[21:02:57] (HAProxyEdgeTrafficDrop) firing: 20% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[21:07:16] (VarnishTrafficDrop) firing: (8) Varnish traffic in codfw has dropped 55.0576436920031% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[21:07:57] (HAProxyEdgeTrafficDrop) firing: (6) 59% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[21:12:16] (VarnishTrafficDrop) resolved: (10) Varnish traffic in codfw has dropped 55.0576436920031% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[21:12:57] (HAProxyEdgeTrafficDrop) resolved: (6) 62% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[22:07:53] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH)
[22:26:03] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH)
[22:35:56] Traffic, SRE: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (BCornwall) Perhaps this is because the severity is set to warning rather than critical?
[23:07:38] Traffic, DC-Ops, SRE, ops-ulsfo: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (RobH)