[08:16:09] Hi, there seems to be a little issue with prometheus file gen on cp hosts : https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DNodeTextfileStale
[08:40:08] netops, Infrastructure-Foundations, SRE: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (ayounsi) JTAC case 2022-1207-600204 opened asking for an RMA as it's the 2nd time the issue happens.
[09:02:18] Traffic, DC-Ops, SRE, ops-eqsin: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ayounsi) FYI, there are outstanding Homer diffs for asw1-eqsin: `lang=diff [edit interfaces] - ge-0/0/16 { - description DISABLED; - disable; - } [edit interfaces xe-0...
[11:42:50] Traffic, DC-Ops, SRE, ops-eqsin: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ssingh) >>! In T322048#8449909, @ayounsi wrote: > FYI, there are outstanding Homer diffs for asw1-eqsin: > `lang=diff > [edit interfaces] > - ge-0/0/16 { > - description DISAB...
[11:44:00] claime: thanks
[11:44:01] brett: ^
[12:51:56] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ayounsi) >>! In T322048#8450256, @ssingh wrote: >>>! In T322048#8449909, @ayounsi wrote: >> FYI, there are outstanding Homer diffs for asw1-eqsin: >> `lang=diff...
[13:54:56] Traffic, DC-Ops, SRE, ops-eqsin: ganeti500[567] implementation tracking for serviceops - https://phabricator.wikimedia.org/T324610 (MoritzMuehlenhoff) Ack, decomming these by mid January sounds doable!
[14:28:18] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns5002.wikimedia.org` - dns5002.wikimedia....
[14:28:58] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (ssingh)
[14:33:06] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns5003.wikimedia.org with OS buster
[14:33:40] Just a heads-up, I was checking the puppetmaster cert renew alert, and it's the varnishkafka certificate, notAfter=Dec 13 15:55:06 2022 GMT
[14:42:22] claime: thanks, we should be good on that but appreciate the reminder
[14:44:57] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs5005.eqsin.wmnet with OS buster
[14:45:01] np ;)
[14:49:07] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (ayounsi)
[15:36:31] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns5003.wikimedia.org with OS buster completed: - dns5003 (**PASS**)...
[15:41:40] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs5005.eqsin.wmnet with OS buster completed: - lvs5005 (**PASS**)...
[15:57:03] Hello claime We are working on the varnishkafka certs here https://phabricator.wikimedia.org/T323771 kindly let us know if there's any question or issues with the plan mentioned
[15:59:20] steve_munene: Thanks for the heads up, I don't see issues with the plan b.tullis laid out, but I don't know enough about this part to give a valuable opinion ;)
[15:59:28] Thanks for keeping me up to date though :)
[16:10:46] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ssingh)
[16:33:23] np claime I was just about to notify the channel
[17:17:26] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs5002.eqsin.wmnet` - lvs5002.eqsin.wmnet...
[17:19:36] Hm, the confd-reload-vcl.prom file isn't my doing! I think there is genuinely a problem here!
[17:20:46] brett: ^?
[17:24:24] brett: oh sorry! I thought you and valentin were working on that
[17:24:33] for some reason, I saw some backlog
[17:25:06] nope, that's godog
[17:25:12] lol
[17:25:16] hi
[17:25:34] vgutierrez: bye :P
[17:25:44] * vgutierrez back to vacation mode
[17:30:25] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (ssingh)
[17:31:53] godog: Do you need any help with the confd-reload-vcl.prom staleness?
[17:43:20] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ssingh)
[17:52:06] Hello, about to update varnishkafka certificates which will entail,
[17:52:06] Disabling puppet on all cp servers
[17:52:06] Merging the changes made
[17:52:06] verifying the keypair is updated
[17:52:07] verifying restarting of the varnishkafka instance, if not performing a restart
[17:52:07] re-enabling and running puppet on all varnishkafka instances
[17:52:07] T323771
[17:52:07] T323771: Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771
[17:52:49] steve_munene: will this disrupt analytics data enough for someone to care?
[17:53:05] (restarting all the varnishkafkas while traffic continues)
[17:55:57] for that matter, it's been so long since I've looked at vk restart, I'm not even sure if it triggers cache daemon restart.
[17:56:59] bblack: I don't think that it will disrupt analytics data, but I would still do it in batches. Varnish itself still runs and logs, so the messages will all still get picked up and put into kafka. We shouldn't lose anything, right?
[17:58:00] regarding my last point: I checked, and the dependency is the other way around: varnishkafka.service requires/bindsto varnish.service, not the other way around
[17:58:05] so no problem
[17:58:14] great
[17:58:35] on the data disruption though: varnish is logging to a small ring buffer in memory, which VK picks up data from
[17:59:23] Right, thanks for that. steve_munene are you going to test the restart on a single or a subset of cp hosts?
[17:59:27] I don't know how many seconds are typically in the ring buffer, but it's entirely possible that a restart causes at least some data loss.
when it happens on 1/N nodes it's probably not that great a disturbance in the force, but if they all do it quickly, I donno
[18:00:06] btullis: single host possibly 1075
[18:02:56] Is tomorrow a better target for it? The cert expires on Tuesday, so we need to do it before then, whatever happens. Maybe seek input from e.lukey tomorrow morning for his view on the action plan?
[18:04:33] I suspect even if there's a blip of data loss, they can account for it / flag it / whatever (they do that for organic issues as well). It may just be a matter of making sure they're aware.
[18:05:17] anyways, strictly from the pov of what traffic directly cares about, I don't think we have any reasons to be worried about functional impact. The rest is up to you and/or analytics I guess :)
[18:06:30] technically the data loss could be avoided, by fulling depooling each node around the vk restart, but that will take longer to execute (caches have to be given time to refill, etc) and is probably not worth it.
[18:06:36] s/fulling/fully/
[18:08:02] elukey: ping ^ in case you have quick answers/opinions
[18:11:08] sure, let's set this for tomorrow
[19:53:18] bblack: sorry just seen it, yes I agree, small batches with some slow pace between them is good enough for the use case (maybe we could depool a cp node when testing the first vk restart just in case, but that's it)
[19:53:50] (can't follow up tomorrow, bank holiday in Italy)
[19:56:22] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH) In progress→Open a:RobH→None
[19:57:02] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH) a:ssingh @ssingh, Once the final OS installations are completed please resolve this task. Thanks!
[20:43:27] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs5006.eqsin.wmnet with OS buster
[21:32:18] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs5006.eqsin.wmnet with OS buster completed: - lvs5006 (**PASS**)...
[21:44:39] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs5003.eqsin.wmnet` - lvs5003.eqsin.wmnet...
[21:44:42] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (ssingh)
[21:48:34] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ssingh)
[21:50:57] Traffic, DC-Ops, SRE, ops-eqsin, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ssingh) Open→Resolved Thanks to @RobH, @Papaul, @Bblack, @cmooney, @MoritzMuehlenhoff, @Volans for all their help in the eqsin refresh.