[08:21:57] (VarnishTrafficDrop) firing: 62% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[08:24:47] Traffic, netops, Infrastructure-Foundations: lvs2007, lvs2009 and lvs2010 connected to the same row A switch - https://phabricator.wikimedia.org/T286879 (Vgutierrez)
[08:27:26] Traffic, DC-Ops, ops-codfw: lvs2007, lvs2009 and lvs2010 connected to the same row A switch - https://phabricator.wikimedia.org/T286879 (Vgutierrez)
[08:28:14] Traffic, DC-Ops, ops-codfw: lvs2007, lvs2009 and lvs2010 connected to the same row A switch - https://phabricator.wikimedia.org/T286879 (ayounsi) Other rows need to be audited as well.
[08:30:00] Traffic, DC-Ops, ops-codfw, Sustainability (Incident Followup): lvs2007, lvs2009 and lvs2010 connected to the same row A switch - https://phabricator.wikimedia.org/T286879 (Majavah)
[08:42:33] Traffic, DC-Ops: Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (Vgutierrez)
[08:43:46] Traffic, DC-Ops, ops-codfw, Sustainability (Incident Followup): lvs2007, lvs2009 and lvs2010 connected to the same row A switch - https://phabricator.wikimedia.org/T286879 (Vgutierrez) >>! In T286879#7220243, @ayounsi wrote: > Other rows need to be audited as well. You're right, I've created T28...
[08:46:15] Traffic, DC-Ops, SRE, Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (RhinosF1)
[08:47:30] Traffic, DC-Ops, SRE, ops-codfw, Sustainability (Incident Followup): lvs2007, lvs2009 and lvs2010 connected to the same row A switch - https://phabricator.wikimedia.org/T286879 (Vgutierrez)
[08:47:33] Traffic, DC-Ops, SRE, Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (Vgutierrez)
[08:48:02] jajaaj
[08:48:07] oops, wrong window :)
[09:16:57] (VarnishTrafficDrop) resolved: 69% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[09:18:29] ^ those are expected, it's been triggered by depooling text@codfw
[10:29:22] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (MoritzMuehlenhoff)
[10:38:39] topranks, vgutierrez, about tomorrow's row D maintenance, that reminds me that all LVS have links to row D, so we need to depool eqiad too, not sure what is the impact for the internal VIPs
[10:39:07] or is the CP depool enough?
[10:53:01] XioNoX: tricky question, depooling the cp servers should be enough, I'm looking into the pybal bug I detected on Friday though
[10:55:13] oh of course...
[10:55:25] authdns2001 being unreachable is messing with acme-chief as well :)
[10:56:08] how
[10:56:09] ?
[10:57:09] acme-chief injects the DNS challenges on every DNS server
[10:57:18] and it needs a 100% success rate to consider it done
[10:57:41] I see
[10:58:53] it fetches the list from authdns_servers on hieradata/common.yaml
[11:01:44] add etcd support to depool it? :)
[11:07:03] a little bit of hiera magic: https://gerrit.wikimedia.org/r/c/operations/puppet/+/705359
[11:30:08] vgutierrez: https://phabricator.wikimedia.org/T279457#7038822 :)
[11:30:49] XioNoX: nice catch :)
[11:31:37] so for cp servers the impact is quite limited as we only have 2 upload and 2 text servers in eqiad's row D
[11:32:11] mw is currently being served from codfw...
[11:32:22] but we have a variety of services on the low-traffic LVS
[12:58:27] effie: I'm checking the "Hosts in IPVS but unknown to PyBal" icinga alert cause on Friday we detected some false positives
[12:59:14] right now lvs2009 is still reporting CRITICAL: Hosts in IPVS but unknown to PyBal: set(['maps2005.codfw.wmnet'])
[12:59:23] but obviously https://config-master.wikimedia.org/pybal/codfw/kartotherian maps2005 is right there
[13:00:48] and it's also listed on http://127.0.0.1:9090/pools/kartotherian-ssl_443 (pybal metrics endpoint)
[13:01:56] it looks to me like the check is too restrictive: https://github.com/wikimedia/puppet/blob/production/modules/pybal/files/check_pybal_ipvs_diff.py#L79
[13:05:36] if 'enabled/up/pooled' seems too much
[13:09:40] Traffic, SRE, Patch-For-Review: Decomission malmok.wikimedia.org - https://phabricator.wikimedia.org/T286480 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1001 for hosts: `malmok.wikimedia.org` - malmok.wikimedia.org (**PASS**) - Downtimed host on Icinga - Found Gan...
[13:12:33] considering the scope of the check, from https://phabricator.wikimedia.org/T134893, it looks like we shouldn't look at it in this kind of scenario :)
[13:12:58] cause the check only attempts to detect pooled servers on IPVS that are being untracked on the pybal side or the other way around
[13:20:17] Traffic: False positives on PyBal IPVS diff check - https://phabricator.wikimedia.org/T286913 (Vgutierrez)
[13:26:51] vgutierrez: what is the current status of authdns2001?
[13:27:14] and where are we tracking the steps to run before re-pooling it once we have a switch back?
[13:27:33] volans: unreachable cause the A2 switch is still offline
[13:28:06] volans: yup.. we're going to need a task for that, and not only for authdns2001
[13:28:25] because I was about to patch the dns cookbook to skip that host
[13:28:40] to unblock a bunch of people ( mutante, jayme )
[13:29:03] but that should be reverted and we should run some command to make sure we have authdns2001 in sync before repooling it into the dns
[13:31:46] sukhe: your run of the dns cookbook will fail, sending a patch now ^^^
[13:32:31] volans: you have a crystal ball! thanks :D
[13:33:32] I've created T286914
[13:33:33] T286914: Track actions to perform before repooling authdns2001 - https://phabricator.wikimedia.org/T286914
[13:35:14] effie: let me know if https://gerrit.wikimedia.org/r/c/operations/puppet/+/705375 makes sense to you
[13:39:50] netops, Data-Persistence-Backup, Infrastructure-Foundations, SRE, bacula: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (jcrespo) One interesting effect is that, since the datacenter...
[13:42:08] Traffic, Patch-For-Review: False positives on PyBal IPVS diff check - https://phabricator.wikimedia.org/T286913 (Vgutierrez) p:Triage→Medium
[13:55:17] Traffic, SRE, Patch-For-Review: Decomission malmok.wikimedia.org - https://phabricator.wikimedia.org/T286480 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1001 for hosts: `malmok.wikimedia.org` - malmok.wikimedia.org (**FAIL**) - **Failed downtime host on Icinga (like...
[14:01:42] Traffic, SRE, Patch-For-Review: Decomission malmok.wikimedia.org - https://phabricator.wikimedia.org/T286480 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1001 for hosts: `malmok.wikimedia.org` - malmok.wikimedia.org (**FAIL**) - **Failed downtime host on Icinga (like...
[14:26:46] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[14:36:31] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[14:38:50] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[14:40:23] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[14:50:49] Traffic, DNS, SRE: Track actions to perform before repooling authdns2001 - https://phabricator.wikimedia.org/T286914 (Vgutierrez)
[14:51:21] volans: I took the liberty of copying the list of actions from your comment to the ticket description and added one... could you wipe your comment to avoid confusions?
[15:34:35] Traffic: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (Vgutierrez)
[15:34:43] Traffic: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (Vgutierrez) p:Triage→High
[15:44:59] Traffic: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (Vgutierrez)
[15:50:05] Traffic: LVS can't handle losing a NIC on eqiad and codfw - https://phabricator.wikimedia.org/T286924 (Vgutierrez)
[16:02:34] Traffic, SRE: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (Vgutierrez)
[16:03:07] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[16:31:47] vgutierrez: sure
[17:11:10] vgutierrez: FYI I'll be performing the revert actions in few minutes, the switch seems to be ready
[17:11:59] cool
[17:12:14] vgutierrez: I kept the LVS ports disabled
[17:12:28] thanks... I'll be ready in 40 minutes
[17:12:53] vgutierrez: should I go ahead with the revert for acme chief?
[17:13:09] yes please
[17:14:45] I can't ssh to authdns2001 though
[17:14:52] am I the only one?
[17:15:57] well, I can't ping it, did it get powered down?
[17:16:08] wait, no, I'm learning a MAC
[17:16:29] XioNoX: ?
[17:17:03] https://librenms.wikimedia.org/device/device=95/tab=port/port=19556/view=fdb/
[17:18:09] cabling issue, working with Papaul
[17:18:21] ok
[17:19:32] host up for icinga
[17:19:35] this is the reason trying this via remote-hands would have been a nightmare.
[17:19:54] ok, authdns is up
[17:20:06] * volans running cookbook to push latest changes
[17:20:52] I'll check that it got also the codfw depooled one
[17:21:36] netops, DC-Ops, SRE, ops-codfw, Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (Papaul) switch backup online and Netbox update
[17:21:59] I can now ssh to authdns2001 , fwiw
[17:22:50] yep, ns1 is back on authdns2001
[17:23:16] auto-generated data up to date
[17:23:42] you added back the static route as well I'm guessing XioNox?
[17:23:42] * volans running authdns-update to force-update authdns2001
[17:24:08] topranks: rolledback, yep
[17:24:57] cool
[17:26:04] last thing to do are the LVS when vgutierrez is back
[17:26:27] ETA 20 minutes
[17:26:33] Sorry about that
[17:27:01] vgutierrez: no pb, I'm going for a run anyway :) back in 1h or so
[17:27:26] Ok
[17:28:06] Traffic, DNS, SRE: Track actions to perform before repooling authdns2001 - https://phabricator.wikimedia.org/T286914 (Volans)
[17:28:28] Traffic, DNS, SRE: Track actions to perform before repooling authdns2001 - https://phabricator.wikimedia.org/T286914 (Volans) Open→Resolved a:Volans All done, resolving for now.
[17:28:56] vgutierrez: I can probably help, certainly if it's just re-enabling the ports feel free to drop me a line
[17:29:30] Will do
[17:54:57] (VarnishTrafficDrop) firing: 18% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[17:55:10] back online :)
[17:57:18] vgutierrez: could give authdns2001 a second pair of eyes
[17:57:21] just to check is all good
[18:00:17] responds to my dig's consistently from here anyway fwiw
[18:01:29] wfm as well
[18:04:35] thanks
[18:07:01] * volans afk for a bit
[18:10:59] topranks: could you enable xe-2/0/45 lvs2007 port please?
[18:11:12] yep.. one minute
[18:12:06] I've disabled puppet and stopped pybal via the mgmt console so it shouldn't interfere at all with lvs2010
[18:14:57] (VarnishTrafficDrop) resolved: 67% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[18:16:11] ok, port should be coming up momentarily now.
[18:16:41] yep, I got link already
[18:16:57] (VarnishTrafficDrop) firing: 67% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[18:17:00] MACs learnt on both vlans.
[18:17:21] and I can SSH the host via the regular NIC
[18:17:23] great
[18:18:27] great :)
[18:21:57] (VarnishTrafficDrop) resolved: 68% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[18:22:20] Traffic, SRE: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (Vgutierrez)
[18:23:16] (VarnishTrafficDrop) firing: 60% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[18:23:59] topranks: everything good with lvs2007, let's move forward. Could you enable xe-2/0/44 lvs2010?
[18:24:15] yep np
[18:29:00] vgutierrez: should see it come up now in one moment
[18:29:37] Traffic, SRE: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (Vgutierrez)
[18:29:55] Now enabled on switch, but it is showing physically down.
[18:31:04] yup... I'm seeing it down as well
[18:31:27] it's up now :)
[18:31:29] hmm.
[18:31:30] ah ok
[18:31:42] yeah -1.55dBm light on the switch now.
[18:33:33] Learning MAC on both Vlans now also.
[18:38:52] Traffic, SRE: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (Vgutierrez)
[18:42:51] topranks: lvs2010 looking good as well, can we enable xe-2/0/43 lvs2009? thanks :D
[18:42:57] (VarnishTrafficDrop) resolved: 64% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[18:44:57] (VarnishTrafficDrop) firing: 68% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[18:45:53] vgutierrez: yep no problem
[18:48:52] Traffic, SRE: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (Vgutierrez)
[18:49:26] Ok should be coming up now momentarily.
[18:49:57] (VarnishTrafficDrop) resolved: 66% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[18:50:02] Link up :)
[18:51:08] And MAC's on both vlans :)
[18:51:25] yep, everything looking good on lvs2009
[18:55:35] Traffic, SRE, WikimediaDebug, Performance-Team (Radar): Allow ATS to route traffic to mwdebug deployment on kubernetes - https://phabricator.wikimedia.org/T286482 (dpifke) The debug extension now fetches the list of backends from noc.wikimedia.org, so this hopefully shouldn't require any changes...
[18:56:47] Traffic, SRE: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (Vgutierrez)
[18:59:15] topranks: thanks for your help, lvs2007, lvs2009 and lvs2010 are now happy :D
[19:03:40] np, excellent news :)
[19:04:23] topranks: considering what XioNoX mentioned this morning about eqiad<->codfw links.. there is no harm in repooling text@codfw, right?
[19:10:48] vgutierrez: better to repool it even
[19:10:59] I didn't read scrollback yet
[20:30:08] Traffic, SRE, Patch-For-Review: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (Vgutierrez)
[20:30:41] Traffic, SRE, Patch-For-Review: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (Vgutierrez) Open→Resolved a:Vgutierrez
[20:31:10] vgutierrez: great job today!
[22:03:00] netops, DC-Ops, SRE, ops-codfw, Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (Papaul) Case Number:2021-0719-0629 create with Juniper
[23:28:42] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Bstorm)
[23:52:25] netops, DC-Ops, SRE, ops-codfw, Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (Papaul) Dear Juniper Networks Customer, A Return to Factory (RTF) RMA has been created. Details of which are provided below. ***** RMA DETAILS ***** RMA Number: R200361...