[01:42:42] Traffic, DNS, SRE, Patch-For-Review: Additional DNS entry for WikiLearn - https://phabricator.wikimedia.org/T338280 (ssingh) ` $ dig app.dev.learn.wiki +short 52.44.207.59 ` Thanks to @Dzahn for the patch!
[01:42:49] Traffic, DNS, SRE, Patch-For-Review: Additional DNS entry for WikiLearn - https://phabricator.wikimedia.org/T338280 (ssingh) Open→Resolved a:ssingh
[03:06:16] Traffic, Infrastructure-Foundations, Patch-For-Review: Set cookie in Varnish to start a probe - https://phabricator.wikimedia.org/T335637 (JameelKaisar) Increase NetworkProbeLimit from 0.0001 (0.01%) to 0.001 (0.1%).
[08:42:13] Traffic, Patch-For-Review: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (Fabfur)
[08:48:44] Traffic, Patch-For-Review: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (Fabfur)
[08:48:59] Traffic, Patch-For-Review: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (Fabfur)
[08:49:22] Traffic, Patch-For-Review: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (Fabfur)
[08:49:30] sorry for the spam :)
[09:12:17] fabfur: o/ time for breaking varnishkafka?
[09:12:50] elukey: yes (but really don't know if I can help you :) )
[09:13:36] fabfur: nono it is just to show you all the beautiful things that we ran on caching nodes :D
[09:14:16] let's see how deep is the rabbit hole
[09:14:18] so varnishkafka is a simple daemon that reads from the varnish frontend's shared memory log, formats every HTTP request into a JSON message and sends it to some Kafka topics
[09:14:19] :D
[09:14:22] :D
[09:14:41] all the code in https://github.com/wikimedia/operations-software-varnish-varnishkafka
[09:15:33] we tried in the past to replace it with a thing called "atskafka", written in Go and more modern, but so far we have stopped the migration since it requires a ton of effort from multiple teams (Data Engineering, Traffic, etc..)
[09:15:58] on cache text we run 3 instances of varnishkafka, that correspond to different streams
[09:16:02] 1) statsv
[09:16:06] 2) eventlogging
[09:16:08] 3) webrequest
[09:16:23] the first two are low volume, mostly for misc analytics purposes
[09:16:34] the third is the biggest, basically all HTTP requests hitting us
[09:16:46] k
[09:16:55] metrics in https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1 if you want to check
[09:17:17] so what I am trying to do is to move varnishkafka's client TLS cert from cergen to our PKI infra
[09:17:45] and the first step is to add a varnishkafka-all unit, that puppet can use to restart all the other ones when a new cert is issued
[09:18:01] https://gerrit.wikimedia.org/r/c/operations/puppet/+/924506
[09:18:12] so the idea that I have in mind is to disable puppet on cp nodes
[09:18:30] enable on one in ulsfo, run puppet, check that varnishkafka instances are good and can be restarted etc..
[09:18:33] and re-enable puppet
[09:18:36] does it sound good?
[09:19:52] reading the CR
[09:21:48] ok makes sense
[09:22:30] super, my idea is to disable puppet on
[09:22:31] cumin 'C:varnishkafka' 'disable-puppet "elukey - precaution for https://gerrit.wikimedia.org/r/c/operations/puppet/+/924506"'
[09:22:51] 96 nodes, should be all cp nodes
[09:23:17] {{done}}
[09:24:06] now I merge the change and deploy on cp4037
[09:24:50] err
[09:25:00] that's gonna interfere with the cookbook that fabfur is running on eqiad
[09:25:08] (potentially)
[09:25:15] sorry I was thinking that was just on ulsfo
[09:25:19] basically the other way around
[09:25:19] ah ok I missed this part
[09:25:30] we got a cookbook running against A:eqiad
[09:25:35] s/A:eqiad/A:cp-eqiad
[09:25:48] so puppet has already been disabled there on some nodes (those pending in our cookbook)
[09:25:52] yeah, at the moment all cp nodes on eqiad are puppet-disabled and under rolling restart
[09:26:04] and it will potentially get enabled before your change has been safely tested
[09:26:23] so can we hold this till fabfur's cookbook is done?
[09:27:24] vgutierrez: in theory my disable will not get removed until a corresponding enable is issued, so probably no chance to get puppet enabled beforehand (unless you do it with something like puppet agent --enable but I don't think so)
[09:27:42] I can revert if you want and remove my disable action
[09:28:15] could also be impacted in some way by this? https://phabricator.wikimedia.org/T284555
[09:29:07] elukey: we got puppet disabled with another message, so your message got ignored on those instances AFAIK
[09:30:35] vgutierrez: on 1075 I see
[09:30:36] The last Puppet run was at Wed Jun 7 09:04:16 UTC 2023 (26 minutes ago). Puppet is disabled.
elukey - precaution for https://gerrit.wikimedia.org/r/c/operations/puppet/+/924506
[09:31:05] ah but not on all sigh
[09:31:42] created the revert
[09:33:18] sorry about that :_)
[09:33:33] nono I didn't notice the cookbook, will check better next time
[09:33:49] we also logged that puppet was disabled on A:cp-eqiad IIRC
[09:34:25] ok removed my disable as well
[09:34:27] all good :)
[09:34:50] elukey: my fault, I thought that you were applying that on ulsfo too, I'll read better next time!
[09:35:28] fabfur: same from me, didn't read the SAL, should have checked :)
[09:36:12] luckily we have this nice tool called cumin that saves us from making mistakes
[09:41:12] :D
[11:01:50] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Users management on SONiC - https://phabricator.wikimedia.org/T338028 (ayounsi) Yes, that would be possible even though there is no documented way on how to do this and what is supported or not. The two main options I see is either via a...
[12:27:17] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Users management on SONiC - https://phabricator.wikimedia.org/T338028 (ayounsi)
[14:10:20] Traffic, netops, Commons, Infrastructure-Foundations, WMF-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Jidanni) Today, using First World-grade Internet connections, I could still very simply reproduce the bug....
[14:43:05] Traffic: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (Fabfur)
[14:43:49] Good News Everyone! HAProxy listening on port 80 on all DCs!
[14:44:02] nice job fabfur and vgutierrez!
[14:46:50] Traffic: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (Vgutierrez) `vgutierrez@cumin1001:~$ sudo -i cumin A:cp "ss --listen -t -p '( sport = :http )' |grep haproxy |wc -l" 96 hosts will be targeted: cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[50...
[14:47:10] Traffic: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (Fabfur) In progress→Resolved All cp* hosts are now updated with HAProxy listening on port 80
[14:49:51] Traffic, SRE, ops-codfw, Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs2010.codfw.wmnet` - lvs2010.codfw.wmnet (**WARN**) - Downtimed ho...
[14:53:56] hi traffic, I just made a DNS change and without noticing also merged changes to remove lvs2010, which is being decommissioned by sukhe; however, when things were pushed to the DNS servers we got the following error https://phabricator.wikimedia.org/P49113
[14:54:47] oh hmm
[14:55:52] jbond: thanks, finishing the decomm and will look more closely
[14:55:53] the change from sukhe also caused the removal of some zones https://phabricator.wikimedia.org/P49113#198702 which probably need updating manually, but I would like a set of traffic eyes to make sure things don't get worse
[14:58:57] yes that's "normal": when you remove the last IP for a zone, the integration will actually delete that file, and the related include needs to be removed from the ops/dns repo too
[14:59:22] when that's expected, to avoid errors you can follow https://wikitech.wikimedia.org/wiki/DNS/Netbox#Atomically_deploy_auto-generated_records_and_a_manual_change
[15:00:15] I am patching it
[15:00:24] oh jbond already did it
[15:00:33] sukhe: yes was just about to say https://gerrit.wikimedia.org/r/c/operations/dns/+/928072
[15:00:52] thanks volans
[15:01:13] and sorry for the
interruption sukhe
[15:01:21] jbond: not at all! I was done
[15:06:21] all fixed thanks
[15:06:48] thanks everyone!
[15:08:29] fabfur, vgutierrez - ok to work on varnishkafka now?
[15:08:52] also to all traffic team - I'd need to stop puppet on all caching nodes for 10/15 mins
[15:08:55] to apply a patch
[15:09:12] elukey: indeed, all clear
[15:09:33] ack perfect, doing it :)
[15:10:07] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero)
[15:13:47] elukey: you shall be known as "master of reverts" from now on
[15:14:17] we have a longer record there I think
[15:14:24] but we can declare elukey master regardless
[15:14:41] * elukey gets the crown
[15:15:02] now I hope to not have to do it again because the patch fails
[15:15:44] ok so testing on cp4037, so far all good
[15:16:41] all the vk instances restarted as expected, going to wait a bit and I'll restart varnishkafka-all again to make sure that all works
[15:16:45] then I'll re-enable puppet
[15:17:13] netops, Infrastructure-Foundations, SRE, cloud-services-team: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (Papaul) @cmooney all those connections are no longer on the old switch we can delete those. thanks
[15:17:55] elukey: ack
[15:20:43] all good afaics from my tests
[15:20:57] btullis: o/ I rolled out the vk patch for the extra unit on cp4037
[15:21:35] tried a restart, also checked with kafkacat and traffic is flowing
[15:21:45] elukey: Nice. Thanks for the ping.
[15:21:54] going to re-enable puppet to complete the rollout, but vk instances will be roll restarted
[15:23:20] {{done}}
[15:24:48] Traffic, Data-Engineering, Data-Platform-SRE, SRE, Patch-For-Review: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (elukey) The new `varnishkafka-all` unit is being rolled out across all cp nodes. Next steps: * Merge https://gerrit.wikimedia.org/r/924507 (no-op, just...
[15:25:53] netops, Infrastructure-Foundations, SRE, SRE-tools: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Papaul) @Volans on lsw1-a1 which is a new switch, after running the cookbook it did PASS . However no configuration was done on the switch itsel...
[15:27:58] netops, Infrastructure-Foundations, SRE, cloud-services-team: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) @papaul thanks I'll remove them from netbox cheers.
[15:29:29] Traffic, SRE, ops-codfw, Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (Jhancock.wm) Cable IDs em1 - 11995 em2 - 11997 nic2 port 1 - 11996 nic2 port 2 - 11998
[15:32:00] Traffic, Infrastructure-Foundations, SRE: Reimaging cookbook not forcing a Puppet agent run on lvs2011, lvs2012 - https://phabricator.wikimedia.org/T336593 (ssingh) Open→Resolved a:ssingh I am going to mark this as resolved as lvs2013 didn't have this issue. Thanks again for the help and...
[15:37:37] netops, Infrastructure-Foundations, SRE, SRE-tools: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans) It didn't complete successfully, it failed to check the uptime of the switch and asked the operator what to do, and when it was answered...
[15:37:42] (SystemdUnitFailed) firing: user@0.service Failed on cp5026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:38:33] ^^ elukey is that you?
[15:39:20] vgutierrez: checking
[15:40:09] the change was applied in there but all works
[15:41:28] seems it exited with user@0.service: Main process exited, code=exited, status=219/CGROUP
[15:42:41] vgutierrez: it is now gone, have you done anything?
[15:43:06] nope just checking journalctl and friends
[15:43:10] vgutierrez@cp5026:~$ journalctl -u user@0.service --grep 219
[15:43:10] -- Journal begins at Thu 2023-05-04 02:16:32 UTC, ends at Wed 2023-06-07 15:42:44 UTC. --
[15:43:10] Jun 07 15:35:11 cp5026 systemd[1]: user@0.service: Main process exited, code=exited, status=219/CGROUP
[15:43:29] first time in the last month though
[15:43:52] never seen it before, vk runs as root though so it may be related, can't think of why
[15:44:12] and cumin says that it only happened on cp5026 so far
[15:44:58] ack
[15:45:48] error happened at 15:35:11 but...
[15:45:59] Jun 07 15:38:35 cp5026 systemd[1]: Starting Varnishkafka - All Instances...
[15:45:59] Jun 07 15:38:35 cp5026 systemd[1]: Finished Varnishkafka - All Instances.
[15:47:38] puppet run applying your Revert^2 started at 15:37:20 so I guess it's totally unrelated
[15:47:42] (SystemdUnitFailed) resolved: user@0.service Failed on cp5026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:50:32] vgutierrez: thanks for checking
[16:11:16] Traffic, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (Papaul)
[17:08:01] Traffic, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2014.codfw.wmnet with OS bullseye
[18:04:20] Traffic, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2014.codfw.wmnet with OS bullseye executed...