[08:31:02] hi topranks XioNoX , we'd like to repool cp1113 that yesterday gave us some headache, in case are you available to assist with some troubleshooting on the interface side? [08:31:49] we should capture some traffic and try to understand why that host receives that amount of SYN traffic [08:37:29] hmm the amount of SYN traffic should be the failed connection attempts + retries [08:37:46] we need to understand why 3-way handshakes are failing [08:37:53] (if that's the case) [08:39:06] on the lvs we were seeing an extra 100k rps of SYN packets [08:39:35] per https://librenms.wikimedia.org/graphs/to=1700642100/id=27171/type=port_upkts/from=1700555700/ those weren't arriving at cp1113 [08:42:22] fabfur: could you remind us the time frame when the host was pooled please? [08:43:51] 14:43 to 15:17 [08:46:45] fabfur: I'm around [08:47:24] and indeed a packet capture here would help [08:47:42] yep [08:48:15] XioNoX: https://grafana.wikimedia.org/goto/RUJEXCSSk?orgId=1 [08:48:29] it looks like packets weren't hitting cp1113 at all [08:49:24] healthchecks were ok though, so >L3 is fine between lvs1018 and cp1113 [08:50:25] https://grafana.wikimedia.org/d/000000366/network-performances-global?orgId=1&from=1700575438876&to=1700582724498&viewPanel=21 [08:50:29] what's the other upload node in row D at the moment in eqiad? [08:51:07] XioNoX: what's that measuring exactly? [08:51:20] netstat IpInHdrErrors [08:51:25] wrong L3 headers? [08:51:54] yeah, wrong or malformed [08:52:00] fabfur: what's the racking task? do we have the row information there? [08:52:11] XioNoX: measured by..? [08:52:13] I'm looking [08:52:30] it's the lvs node complaining? [08:52:54] T342159 [08:52:54] T342159: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 [08:52:59] https://phabricator.wikimedia.org/T342159 racking task [08:53:32] cp1112 and 1114 [08:53:41] fabfur: upload, not text [08:53:46] vgutierrez: it's the LVS cluster yeah [08:53:50] cp1115 - D 7.
U 20 [08:54:02] sorry then just cp1115 [08:54:06] whatever creates those clusters [08:54:11] that isn't pooled yet [08:54:24] https://netbox.wikimedia.org/dcim/devices/?q=cp11&location_id=8 [08:55:06] XioNoX: that's weird.. I'd expect the SYN packets coming from the internet to be OK [08:55:24] XioNoX: cause that's a spike on inbound packets with "wrong" headers [08:55:48] still digging [08:57:37] https://grafana.wikimedia.org/d/000000365/network-performances?orgId=1&var-server=lvs1018&var-datasource=eqiad%20prometheus%2Fops&from=1700572391185&to=1700585191185 [08:58:53] hmm [08:59:01] routing configuration for cp1113 looks weird to me [08:59:03] 10.64.53.0/24 dev eno12399np0 proto kernel scope link src 10.64.53.17 [08:59:19] per netbox, 10.64.53.0/24 is an analytics vlan [08:59:55] indeed [09:00:02] should be 10.64.32 [09:00:18] 10.64.48.0/22 [09:00:20] 10.64.48.0/22 [09:00:27] that's private-1d [09:00:33] ok [09:00:56] indeed cp1115 has the correct one [09:01:16] I guess the provision script in netbox has been run with the wrong param [09:01:18] (cp1115 was reimaged yesterday) [09:01:23] https://netbox.wikimedia.org/extras/changelog/?request_id=58ccdbbf-a0d3-4bb1-aeef-d7a082c2c5fb [09:06:13] yep, confirmed, 'vlan_type': 'analytics', [09:09:22] let's fix the vlan and that should fix the issue [09:09:58] you need to renumber the host [09:10:17] so basically decom and reimage [09:10:56] decom + provision + reimage, like it was a rename, or some manual hack (undocumented, untested) [09:11:03] so cp1113 will be lost forever?
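The misconfiguration pinned down above — cp1113 provisioned with an address in the analytics vlan instead of private1-d — boils down to a subnet-membership check. A minimal sketch (the helper is illustrative only, not Wikimedia tooling; the subnets are the ones quoted in the conversation):

```python
import ipaddress

# Subnets as quoted above: private1-d is 10.64.48.0/22, while the
# analytics vlan 10.64.53.0/24 lies entirely outside it.
PRIVATE1_D = ipaddress.ip_network("10.64.48.0/22")

def in_expected_vlan(addr: str, expected=PRIVATE1_D) -> bool:
    """True if the host address falls inside the vlan we expect."""
    return ipaddress.ip_address(addr) in expected

# cp1113's kernel route showed src 10.64.53.17 -- outside private1-d:
print(in_expected_vlan("10.64.53.17"))  # False: wrong vlan
print(in_expected_vlan("10.64.48.17"))  # True: what a row-D cp host should have
```

Running such a check against netbox data would have flagged the host before pooling; here it was only spotted via the `ip route` output.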
[09:11:08] no [09:11:19] pretty much https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging [09:11:24] without the rename [09:11:36] sorry connection problem [09:11:48] or we can do a slim version of it on the fly [09:11:50] volans: ack, I misunderstood the "renumber the host" part [09:12:00] I meant changing IP [09:12:04] not changing hostname :D [09:12:12] sorry for the confusion [09:12:17] I'm OK with a complete reimage, but up to fabfur [09:12:38] back [09:12:54] ok for me for a complete reimage, I can take care of it [09:14:41] volans: what do you mean with "slim version" ? [09:15:14] stripping some part of it, avoiding the decom, manually editing netbox and then reimaging [09:17:09] basically that procedure is the "I need to change everything" one, and we could potentially generate 3 different docs, one for IP renumbering (with or without physical relocation), one for in-place renaming of the host, and keep this one for changing everything [09:17:45] but it needs some thoughts to make sure we do the right thing and don't skip required steps [09:18:12] jbond: o/ I may still be asleep but I am wondering if we need https://gerrit.wikimedia.org/r/c/operations/puppet/+/976659 [09:18:19] after the new wmf-certificates package [09:20:33] volans: I assume that manually changing the ip address(es) on netbox with the correct ones and then running the provision cookbook isn't enough.. ? [09:20:48] not at all [09:34:20] elukey: looking [09:36:17] elukey: lgtm thanks [09:36:45] jbond: ack perfect, IIUC this will trigger the creation of new bundles fleetwide, but it shouldn't really affect anything [09:38:05] elukey: yep [09:38:07] should we check that the file is everywhere to be sure? :D [09:38:16] volans: it is [09:43:52] volans: anything that you want to check/do before I merge?
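On why the fleetwide bundle re-creation is expected to be harmless: a CA bundle is just the concatenated *content* of the individual certs, so rebuilding it from unchanged cert content yields an identical bundle regardless of where the source files live. A hypothetical sketch (paths, filenames, and the helper are made up for illustration):

```python
import pathlib
import tempfile

def build_bundle(cert_paths, out_path):
    """Concatenate PEM files into one bundle; only their content survives."""
    pem = "".join(pathlib.Path(p).read_text() for p in cert_paths)
    pathlib.Path(out_path).write_text(pem)
    return pathlib.Path(out_path).read_text()

# demo with throwaway files standing in for the wikimedia CA certs
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "ca1.crt").write_text("-----BEGIN CERTIFICATE-----\nAAA\n-----END CERTIFICATE-----\n")
(tmp / "ca2.crt").write_text("-----BEGIN CERTIFICATE-----\nBBB\n-----END CERTIFICATE-----\n")
first = build_bundle([tmp / "ca1.crt", tmp / "ca2.crt"], tmp / "bundle.pem")
again = build_bundle([tmp / "ca1.crt", tmp / "ca2.crt"], tmp / "bundle2.pem")
print(first == again)  # True: same content in, same bundle out
```

This is the property relied on later in the discussion: consumers that read the bundle file never see the per-cert paths, so a path change in the package is a no-op for them.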
[09:44:18] elukey: nah, just 2 hosts without it, I guess with puppet disabled [09:45:11] volans: which two hosts? it was deployed with debdeploy so puppet being disabled shouldn't matter [09:45:33] volans: for safety I can disable puppet fleetwide (never done so I'd need some help) and run on some nodes first [09:45:57] probably not needed elukey [09:46:13] ok proceeding then :) [09:46:57] jbond: (2) aqs1016.eqiad.wmnet,kubernetes2041.codfw.wmnet [09:47:11] ls: cannot access '/usr/share/ca-certificates/wikimedia/Puppet5_Internal_CA.crt': No such file or directory [09:47:19] jbond: kubernetes2041 was down yesterday and was fixed overnight, I just updated it [09:47:45] and aqs hosts are being reimaged by Eric, it was probably being reimaged yesterday when you rolled out the package [09:47:50] moritzm: ahh ok cheers, and i have just updated aqs1016 [09:47:50] yeah [09:47:58] thanks [09:48:01] super, change merged [09:48:26] no bundle re-creation afaics, no-op [09:48:27] what about k8s images? [09:48:44] this is a very good point [09:48:54] https://debmonitor.wikimedia.org/packages/wmf-certificates [09:48:54] the new builds will have a different path in theory [09:49:02] elukey: yes i think those files are used by the jks bundle creation.
however the content is the same so it should be fine [09:49:14] jbond: ack I figured, thanks for confirming [09:49:22] the bundles just copy the content so no path issues [09:49:45] ah right we use the bundle file, not the specific ones [09:50:00] checking in deployment-charts to be sure [09:50:43] the only reference to /usr/share/ca-certificates/wikimedia/Puppet_Internal_CA.crt is in some fixtures for kserve-inference, nothing to worry about [09:51:57] thanks elukey [09:54:28] the other reference in puppet is related to the ml-cache cassandra cluster, I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/976658 to fix it [09:56:31] codesearch agrees [10:36:38] <_joe_> elukey: yes I'm a bit worried about the docker images using old layers [10:37:04] <_joe_> in general if we depend on swiftly updating wmf-certificates across all environments, it's not something that will actually work [10:38:58] _joe_ but do we use Puppet_Internal_CA.crt directly? IIRC we should be using the bundle file nowadays, so no config change required [10:39:38] <_joe_> elukey: yes typically it should work [13:48:10] jbond: we have a puppet-merge clash. Feel free to merge mine if you wish. [13:48:27] btullis: merged [15:04:08] Emperor: shall I merge your puppet change? [15:11:17] jhathaway: feel free to merge mine too, if it asks [15:11:39] will do, as soon as Emperor gives the go ahead [15:16:05] jhathaway: do you want to wait a bit longer or should we revert Emperor's change for now? [15:16:43] my change is not urgent, so I am happy to wait it out a bit [15:17:48] we're trying to deploy mine in a call so it'd be nice to not have to wait out that much longer :/ [15:19:02] nod [15:20:07] I poked him on slack...
[15:24:17] (apparently I joined the puppet-merge party, my patch can be merged if it comes to it) [15:24:28] woohoo [15:24:49] I have one I can add to the queue :-) [15:25:25] <_joe_> in such situations please go on and merge the change, then revert it and merge the revert [15:25:37] if things get too blocked ^ what joe says [15:26:03] here's a revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/976775 [15:26:18] looking [15:26:19] jhathaway still has the lock [15:26:21] better safe than sorry [15:26:40] oh, sorry, my bad :( [15:27:03] I haven't reverted yet, still good to go Emperor? [15:27:10] jhathaway: oh, yes, please do [15:27:21] done [15:27:33] thanks, and sorry for causing bonus hassle [15:27:49] no worries! [15:28:01] ...now I can re-do those reimages. Doh. [15:32:46] * Emperor goes to rectify the evident brain^Wcaffeine defect [15:34:09] 🫖🍵☕🧉, enjoy! [16:26:48] my reimage failed because run-puppet-agent --quiet failed on cumin2002...? [16:27:28] [am trying said rune in a standalone tmux] [16:28:12] Emperor: you have the option to retry [16:28:20] https://puppetboard.wikimedia.org/report/cumin2002.codfw.wmnet/c36ac1e7f599b7bf90890a47009b539b0a1d7ffd [16:28:23] jbond: not obviously, the cookbook has bombed out [16:28:53] volans: yes that looks like the puppet-merge issue [16:29:08] what do you mean? [16:29:19] I opened puppetboard and there were ~130 hosts with failed puppet runs [16:29:20] I just ran "sudo run-puppet-agent" in a tmux on cumin2002 and it worked OK... [16:29:33] there are now 120 [16:29:39] volans: T350809 [16:29:40] T350809: Sporadic puppet failures - https://phabricator.wikimedia.org/T350809 [16:29:56] i have done a lot of puppet-merges in the last hour [16:30:04] I'll retry the cookbook with --no-pxe [16:30:11] so you think the puppet-merge race condition?
[16:31:04] volans: yes [16:32:01] volans: i just checked and there was a merge going on at 16:20 [16:35:03] ack, but we have to find a better solution :D [16:36:02] volans: agreed, jhath.away is looking into it [16:36:18] I know I know [16:38:19] indeed, happy to hear ideas on the approach to take [16:39:40] jhathaway: i think it depends on whether we want to add more sellotape to the current solution or just bite the bullet and convert puppet-merge to a cookbook https://phabricator.wikimedia.org/T254249 [16:40:36] I'm happy to do either, but the latter will probably take more time, as my cookbook writing knowledge is nascent [16:40:39] but i think updating to use some symlinks we swap around sounds like it would be pretty much atomic to me [16:41:12] yes, agree, the latter would take a bit of time [16:41:17] nod, let's look at that first, to at least bandaid over the current issue [16:41:35] sgtm [16:43:36] if needed in bash we do swap symlinks here https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/python_deploy/files/python_deploy_venv.sh#54 [16:45:04] re-running the reimage cookbook with --no-pxe worked :) [16:45:27] :) [16:45:34] volans: thanks [16:45:46] (also, all this reimaging of ms frontends resulted in my making the kitten test slightly more efficient) [17:41:52] godog, btullis: you both have ~7GB in your /home on build2001, are all those needed? (we're a bit short on space, I'm looking at large dirs) [17:50:03] volans: I'll have a look now. [17:50:25] thx [17:55:23] volans: I've deleted most of it. [17:55:30] <3 [18:27:52] mutante: I'm around if you would like realtime assistance on the rsync issue [18:29:58] jhathaway: thanks, i think I get why it broke and your request as well. let me make a patch [18:30:15] thanks, sorry for not spotting that use case [18:30:37] no worries [18:30:55] the good part..
automatic ticket told us:) [18:31:03] very nice indeed [18:36:16] jhathaway: fwiw, I grepped the repo for "etc/rsync.d" but seems like this is indeed the only special case using the directory to put a secrets file like that [18:36:47] modules/profile/manifests/aptrepo/staging.pp seems to do the same [18:36:55] jhathaway: patch at https://gerrit.wikimedia.org/r/c/operations/puppet/+/976830 [18:37:02] eyes [18:37:42] +1, mutante [18:38:16] thanks! and I did not see staging.pp because my local repo is at an older commit [18:38:19] merging [18:38:28] np, I'll cut a patch for staging [18:44:26] jhathaway: puppet works again, looks good. I noticed the "secrets file = " file gets added to rsyncd.conf on both hosts, but the actual secrets file is only created on the "active" host. but it does not bother rsyncd that it doesn't exist. I restarted it kind of expecting it might break.. but it did not [18:45:12] I meant "the "secrets file =" line" gets added to the config [18:45:28] got it, I assume that was how it was before? [18:45:36] I think so, yea [18:45:56] I am almost surprised it doesn't cry over that.. but it must have been like that before [18:46:08] so a failover would require a repuppet, to create the secrets? [18:47:16] a failover would require a change in Hiera where it's defined in one central place which host is the rsync source. then puppet has logic like "$is_active = $::fqdn == $active_host" [18:47:40] we have the same pattern for other things where stuff is rsynced from one to the other [18:47:52] got it [18:47:57] have to say somewhere in Hiera which is "active" or the source [18:55:58] jhathaway: can you review https://gerrit.wikimedia.org/r/c/operations/puppet/+/976835/ too? [19:40:34] taavi: yup [20:45:12] sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 1.5 ... [20:45:17] error: argument --memory: Memory must be at least 1.5G [20:45:43] sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 1.5G
[20:45:50] error: argument --memory: invalid validate_memory value: '1.5G' [20:46:13] didn't we have VMs with as little as 256MB before.. btw [20:47:08] it works with 1.6. so seems like there is an " > 1.5" rather than " >= 1.5" [21:22:15] would love a sanity +1 on this small puppet change, https://gerrit.wikimedia.org/r/c/operations/puppet/+/976863 [21:29:48] jhathaway: +1ed [21:29:56] thanks! [22:01:11] mutante: we never had VMs with less than 1G, but starting with bookworm d-i fails with just 1G, hence the bumped requirement, see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1035854 [22:01:50] moritzm: aha! thanks for the detail, gotcha [22:02:12] so yea, the only minor thing left then is you can't pick exactly 1.5 [22:02:17] I did 1.6 [22:03:03] I'll have a closer look at the cookbook, might be some kilobyte/kibibyte shenanigans [22:37:47] !log my latest commit may have broken puppet-merge, I'm investigating [22:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:17] I am meanwhile using makevm and reimage cookbook :) [22:42:35] and having some minor issues because I forgot to do the puppet7 thing in the right order [22:42:42] but I _think_ i got it [22:47:44] I did also use puppet-merge fwiw [22:47:53] ok thanks [22:49:26] having puppet cert issues on the new VM but probably because I should have merged earlier...unless puppet-merge isn't working [22:57:12] mutante: I think I have it fixed, testing... [23:03:39] mutante: should be fixed now [23:22:47] nope, still broken, looking :( [23:28:41] I have failed puppet runs on both new VMs so far [23:28:47] nod
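On the makevm memory validation seen earlier: the symptoms (1.5 rejected with "must be at least 1.5G", 1.6 accepted, "1.5G" reported as an invalid value) fit an argparse type callback with a strict comparison. This is a hypothetical reconstruction, not the actual sre.ganeti.makevm code:

```python
import argparse

MIN_GB = 1.5

def validate_memory(value):
    """Guess at the validator: reproduces the behaviour seen in the log."""
    gb = float(value)    # "1.5G" fails here, and argparse reports it as
                         # "invalid validate_memory value: '1.5G'"
    if gb <= MIN_GB:     # strict bound: exactly 1.5 is rejected too;
                         # "if gb < MIN_GB" would accept 1.5 as intended
        raise argparse.ArgumentTypeError("Memory must be at least 1.5G")
    return gb
```

With argparse, a `type=` callable that raises ValueError produces the generic "invalid … value" message, while ArgumentTypeError carries the custom text, matching the two different errors in the transcript.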