[04:35:49] Traffic, Infrastructure-Foundations, SRE, ops-codfw: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (Papaul) I took a quick look at lvs2012, the server can ping 10.192.16.1 and 10.192.32.1 but the server can not ping 10.192.0.1 and 10...
[05:37:02] <_joe_> hi, who should I bother with a trafficserver lua review
[05:37:06] <_joe_> well reviewS
[09:35:58] Traffic, Infrastructure-Foundations, SRE, ops-codfw: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (cmooney) >>! In T336428#8843550, @Papaul wrote: > I took a quick look at lvs2012, the server can ping 10.192.16.1 and 10.192.32.1 but...
[10:15:17] Traffic, Infrastructure-Foundations, SRE, ops-codfw: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (cmooney) Ok I think I see what the issue is. Looking at the [[ https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt | k...
[11:11:57] Traffic, Infrastructure-Foundations, SRE, ops-codfw, Patch-For-Review: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (jbond) > Ok I think I see what the issue is Nice work on the investigation > I'm also not sure if this config...
[11:21:16] netops, Infrastructure-Foundations, SRE-tools: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans) p:Triage→Medium
[11:24:40] Traffic, Infrastructure-Foundations, SRE, ops-codfw, Patch-For-Review: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (cmooney) Ah cool John thanks for the explanation. > Seems like it would be an improvement to what we currently...
[12:44:15] netops, Infrastructure-Foundations, SRE, SRE-tools: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (cmooney) Thanks for filing this one! I'm happy with the script in the private repo, but I think it would help if @ayounsi also had a quick look...
[12:52:46] netops, Infrastructure-Foundations, SRE, SRE-tools: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans)
[12:54:31] Traffic, Infrastructure-Foundations, SRE, ops-codfw, Patch-For-Review: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (Papaul) >>! In T336428#8844009, @cmooney wrote: >>>! In T336428#8843550, @Papaul wrote: >> I took a quick look...
[12:56:07] netops, Infrastructure-Foundations, SRE, SRE-tools, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans)
[13:01:54] netops, Infrastructure-Foundations, SRE, SRE-tools, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans)
[13:40:33] Traffic, Infrastructure-Foundations, SRE, ops-codfw: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (ssingh) Thanks @cmooney and @jbond for the extensive debugging! Looking at the above discussion, I think I should have mentioned tha...
[14:17:44] netops, Infrastructure-Foundations, SRE, SRE-tools, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans)
[14:18:13] Traffic, Infrastructure-Foundations, SRE: LVS servers using autoconf SLAAC IPv6 addresses - https://phabricator.wikimedia.org/T336505 (cmooney) p:Triage→Low
[15:02:05] netops, DC-Ops, Infrastructure-Foundations: Access port speed <= 100Mbps False posatives - https://phabricator.wikimedia.org/T336511 (jbond) p:Triage→Medium
[16:15:19] <_joe_> anyone around? sukhe brett bblack
[16:15:45] <_joe_> did we recently change anything re: how we manage cookies and logged in users in the caching layer?
[16:16:43] _joe_: not that I'm aware of, no
[16:16:46] <_joe_> I would double check given https://phabricator.wikimedia.org/T336504
[16:16:58] <_joe_> I do think it's probably mediawiki-related, but one never knows
[16:19:57] yeah I double-checked recent VCL commit history, nothing really strange or dangerous or related in at least a few months.
[16:20:44] no changes I am aware of either, except that we are restarting varnish-frontend for the shared memory log bump
[16:21:00] currently in codfw today, though I don't think it should matter
[16:21:31] which will wipe caches as it goes, but they mostly expire in <24h anyways, so even if it's uncovering a problem that old cache entries were hiding, that shouldn't have a huge effect.
[16:23:22] <_joe_> So the issue seems codfw-only :)
[16:23:38] <_joe_> meaning, only people served by mw in codfw read-only can reproduce
[16:24:41] Thu, May 11, 10:15 AM when it was filed
[16:25:00] 10:14:58 < sukhe> !log sudo cumin -b1 -s1200 'A:cp and A:codfw' 'varnish-frontend-restart': T253093
[16:25:01] T253093: varnish-frontend-fetcherr: Assert error in vslc_vtx_next, 100% CPU usage - https://phabricator.wikimedia.org/T253093
[16:26:22] note the referenced VPT thread was created a few hours before the task
[16:27:05] _joe_: has anyone failed to repro from other codfw-side DCs? (eqsin, ulsfo)
[16:27:13] that would be telling
[16:27:53] <_joe_> sorry, juggling between slack and IRC isn't easy
[16:28:07] <_joe_> but the person who was able to repro is probably served by ulsfo :)
[16:28:16] the restart + ticket referenced above, the TL;DR is that our shmlog based metrics/analytics stuff was sometimes having failures because it was overrunning the shm output buffers, and we had to restart to double the size. Shouldn't directly relate to any content/cookie/etc issues.
[16:28:25] If only there were some sort of IRC bridge with Sla---- oh........
[16:28:33] :P
[16:28:39] <_joe_> I think the problem might be at the applayer, code or otherwise
[16:29:53] bblack: additionally the timing doesn't match up so that's a good confirmation
[16:30:06] but I could see this somehow being codfw-side-only (meaning codfw+ulsfo+eqsin) if it's related to differing storage/applayer-caching in codfw-vs-eqiad
[16:30:30] or something about sessions and the ro/rw split
[16:30:36] hmmm
[16:32:36] netops, Infrastructure-Foundations, SRE, SRE-tools, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Papaul) @Volans i have some switches ready for testing. 2 leaves in different rows and the 2 spines lsw1-a8 lsw1-b8 ssw1-...
[16:32:40] <_joe_> yeah but that would need a major replication issue in cassandra or mysql
[16:32:57] <_joe_> I'm still trying to gather information
[16:56:48] Traffic, Data-Engineering-Planning: varnishkafka / ATSkafka should support setting the kafka message timestamp - https://phabricator.wikimedia.org/T277553 (BCornwall)
[18:28:47] netops, Infrastructure-Foundations, SRE, SRE-tools, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans)
[19:52:47] netops, Infrastructure-Foundations, SRE, ops-codfw: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (Papaul)
[19:55:31] netops, Infrastructure-Foundations, SRE, ops-codfw: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (Papaul)
[20:55:05] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul) @Jclark-ctr i check again those servers from the switch side see below. Those are using NON-JNPR compatible cables. that is m...
[21:30:26] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Jclark-ctr) Replaced both cables. they where newer wave2wave dac cables
[21:46:46] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul) ` Xcvr 31 REV 01 740-030077 H70824500300 SFP+-10G-CU3M Xcvr 5 REV 01 740-030077 G1807123036-1...
[22:25:22] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul)
[23:19:40] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul)
[23:42:11] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul) @Andrew yes we can still do the os install part and resolve this task when we will will be ready to do network changes we can...