[06:31:29] Does anyone know how to see from varnish why the cache hit rate dropped so low at a particular point in time? (Not WMF related, brain just fried)
[06:41:48] <_joe_> RhinosF1: I can't parse your question :)
[06:42:53] _joe_: I can see from https://grafana.miraheze.org/d/LaSguOMZz/varnish?orgId=1 that cache hit rates dropped (thanks to majavah, didn't know that existed) but I want to know why it dropped so low
[06:44:14] <_joe_> RhinosF1: you can start by analyzing the varnish logs (using varnishncsa or varnishlog) to check which requests are not being cached
[06:45:44] _joe_: I saw varnishtop -i BereqURL and the varnishlog command but how do I filter it so it's only showing during the outage
[06:46:42] <_joe_> oh you want to go retroactively back to look at that point in time when the cache hit ratio dropped suddenly
[06:46:52] <_joe_> yeah you can't do that
[06:46:59] <_joe_> unless you're saving the logs to disk
[06:47:38] _joe_: I assume it's manual reading then?
[08:31:48] I have ack'ed some netbox alerts that have been firing for quite some weeks and created a task instead
[08:36:25] I checked those earlier but they seem to be generated by an unexpected exception, so I wanted Xio or Vol to check them later
[08:38:14] jynus: I created https://phabricator.wikimedia.org/T283483 and added willy for now, feel free to add others if you think they can help
[08:38:32] jynus: what's the exception?
[08:38:36] I will add the info I got
[08:38:50] thx
[08:39:52] it was a TLS verification error
[08:49:52] FYI I've updated https://wikitech.wikimedia.org/wiki/PartMan with a brief description of the standard recipes, I'm not super familiar with the reuse bits but it would be good to have a minimal intro for that too
[08:54:13] if I have to guess the TLS issue might be related to joh.n's addition of a discovery record for puppetdb last week, I'll check with him
[09:03:10] hi all internet is down and mobile signal is a bit bad, had storms last night so guessing something got hit.
as such i may be bouncing a bit today
[09:04:04] time to subscribe to Starlink ;)
[09:04:09] good luck!
[09:14:05] i have been monitoring progress closely ;)
[09:24:33] <_joe_> jbond: is there a way I can get the "discovery" CA public pem onto a random server?
[09:25:16] <_joe_> context: I want to add that and the puppet CA to our base images going forward (and yes, default to installing ca-certificates too)
[09:26:54] we should also probably think how the renewal would work in that case
[09:27:04] AFAIK it is triggered by puppet right now
[09:27:40] _joe_: you can get the discovery cert along with the root from http://pki.discovery.wmnet/bundles/discovery.pem
[09:28:54] <_joe_> volans: we would need to rebuild the images, all the chain, and redeploy them. So yeah maybe I should instead just let it happen as it happens today (we inject the certs in kubernetes via configmaps)
[09:29:50] _joe_: in theory you should just need the root though http://pki.discovery.wmnet/bundles/Wikimedia_Internal_Root_CA.pem as services using the discovery intermediate should be configured to send that as part of the chain in the handshake
[09:30:44] <_joe_> jbond: so we need to add that to anything using envoy on kubernetes, as we transition services
[09:32:55] yes, fyi the ats servers use this file in puppet puppet:///modules/profile/trafficserver/ats_trusted_ca.pem
[09:32:56] <_joe_> I'm not sure how we tell envoy to use two CAs :P
[09:33:15] jobo: its the same directive as with one file
[09:33:22] _joe_: even
[09:33:27] sorry for the ping jobo
[09:33:40] _joe_: however the order of the certificates is important
[09:33:56] <_joe_> oh?
[09:34:02] let me see how i did it in puppet as it tripped me up first time round
[09:34:53] <_joe_> looks like you just need to set multiple filter chains
[09:35:11] yes in puppet we do `cat client_cert int_ca > chained.pem` and that worked
[09:35:37] <_joe_> sorry, can you show me the patch where you added the new cert to envoy?
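A minimal sketch of the `cat client_cert int_ca > chained.pem` step mentioned above, for anyone wanting to do the same outside puppet. It only illustrates the ordering point (leaf certificate first, intermediate CA after it); the function name `build_chain` and the file names are hypothetical and are not taken from the puppet code.

```python
from __future__ import annotations

from pathlib import Path

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"


def build_chain(leaf: Path, intermediates: list[Path], out: Path) -> int:
    """Concatenate PEM files leaf-first and return how many certificates were written."""
    parts = []
    # Order matters: the leaf (client/server) certificate goes first,
    # followed by the intermediate CA(s) that signed it.
    for path in [leaf, *intermediates]:
        text = path.read_text().strip()
        if PEM_BEGIN not in text:
            raise ValueError(f"{path} does not look like a PEM certificate")
        # Normalise to exactly one trailing newline so blocks never run together.
        parts.append(text + "\n")
    out.write_text("".join(parts))
    return sum(part.count(PEM_BEGIN) for part in parts)
```

Usage would be along the lines of `build_chain(Path("client_cert"), [Path("int_ca")], Path("chained.pem"))`; the root CA is left out on purpose, since the peer is expected to have it in its own trust bundle already.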
[09:36:00] i dont think i changed anything on the envoy side
[09:36:11] <_joe_> oh you changed the contents of ca.crt
[09:36:46] no i changed the content of the client certificate
[09:37:04] <_joe_> oh yeah we have trusted_ca: {filename: /etc/ssl/certs/ca-certificates.crt}
[09:37:08] <_joe_> in puppet
[09:37:08] 'certificate_chain'
[09:37:11] <_joe_> ofc it works
[09:37:21] <_joe_> oh you mean for serving
[09:37:37] yes sorry have we been talking cross wires :)
[09:37:42] <_joe_> serving is less of a worry tbh, I'm worried about envoy making requests
[09:38:24] yes for that part just need the http://pki.discovery.wmnet/bundles/Wikimedia_Internal_Root_CA.pem in the ca-bundle
[09:38:25] <_joe_> did you transition any production cluster already? because if they don't serve a cert signed by the puppet ca, nothing on k8s can communicate with it
[09:39:06] debmonitor is using the new pki for both client submissions and ats -> debmonitor
[09:39:25] <_joe_> oh ok
[09:39:45] <_joe_> did you check that the docker images are still correctly reported?
[09:40:36] erm yes they do although now you mention it im not sure why as we dont inject the wmf_root cert
[09:41:05] * jbond will take a look later
[09:41:24] last update of /images is from 3 hours, 42 minutes ago
[09:41:56] its possible they are still using a puppet client cert i think debmonitor submissions will still accept both for client auth
[09:42:14] the submission works I think because it's done outside of k8s
[09:42:25] <_joe_> yeah I think they do, anyways you can check docker-report's repository to be sure
[09:42:33] <_joe_> volans: yes it's running docker on deneb
[09:42:43] so the client runs on deneb
[09:42:48] <_joe_> yes
[09:42:52] not on the docker world
[09:43:02] <_joe_> well no, it runs in a docker container
[09:43:08] the reporter?
[09:43:16] <_joe_> I don't remember tbh
[09:43:40] <_joe_> yeah I think we do submit to debmonitor from the host itself
[09:43:51] it runs /usr/bin/docker-report
[09:44:20] btw it's failing for docker-registry.wikimedia.org/python3-build-jessie, that should be removed
[10:01:39] <_joe_> it's kind-of a problem for now, we can add it to the exclude filter
[10:02:50] is it still needed?
[10:02:57] and/or deployed
[10:03:39] <_joe_> no, and not sure what you mean, respectively
[10:05:51] I meant do we have any jessie-based image still deployed (I guess not) and so if we could just delete them all from the registry, or are they still used somehow
[10:22:05] <_joe_> the issue is... the registry software doesn't really allow you to completely remove an image reference
[10:22:26] <_joe_> you can remove all the actual images, but the reference will still persist when you ask for manifests to the registry
[10:24:45] ah... and can't be marked as deprecated/disabled or similar?
[10:24:53] <_joe_> yes
[10:25:01] <_joe_> <_joe_> it's kind-of a problem for now, we can add it to the exclude filter
[10:25:25] <_joe_> :)
[10:42:28] I got an error while doing a reimage- could it be due to these puppet changes?
[10:42:40] (the ones about the cert?)
[10:43:11] I understand "Unable to run wmf-auto-reimage-host: Unable to find certificate fingerprint in:" but that makes no sense on a --new host
[10:43:34] It also says "sh: puppet: not found"
[10:44:59] mmh, something weird happened- I think it tried to run puppet on the debian installer- investigating
[10:46:48] jynus: which hosts and from which cumin host?
[10:46:57] it may be in a loop
[10:47:12] it was backup2004 from cumin2001
[10:47:59] yeah, it is going in a loop, so I will research what is failing on the installer
[10:48:15] (or boot process)
[10:51:53] I have a theory- that has happened to me before
[10:52:05] the boot disk wasn't set as bootable on the bios
[10:54:22] ack, I'm gonna step out for lunch, if you still have problems after I can have a look
[10:54:43] thanks, volans but installer is working as expected, so likely to be a hw config issue
[10:54:54] k
[10:55:26] I may have some "nice to have" suggestions for the script, but not super-important
[11:35:03] _joe_: volans: fyi in relation to debmonitor-client. /usr/bin/docker-report installs debmonitor-client in the container, then runs `debmonitor-client -n > /tmp/report`. then later in the process `debmonitor-client-unpriv -f /tmp/report` is run from the host machine (i.e. deneb) as such the submission does not happen in the container
[11:35:34] and to answer your original question joe i dont think we do have anything running in a container that also connects to a pki TLS service
[13:33:56] arturo dcaro I have fixed an alert about: cluster=cloudelastic file=nic_firmware.prom instance=cloudelastic1006 job=node site=eqiad , so far so good, if it comes back again, I will file a task for WMCS to take a further look
[14:12:07] thanks marostegui, confusingly that server is under the search team umbrella
[14:23:00] can someone invite me in team channel says i need invite
[14:28:31] (should it? it seems odd it does when i should be allowed to join?)
[14:28:56] _joe_: ^ if you are about, were you handling this? I added my name to that etherpad last week but seems it didnt stick?
[14:29:18] (says i need invite to join)
[14:31:29] robh: I tried but seems like you need a chanop to be able to invite
[14:31:42] can you try /msg chanserv invite #wikimedia-sre-private?
[14:31:53] well, if a chan op is in there ping them for me otherwise meh
[14:31:59] i just am not in any private channels
[14:32:43] <_joe_> robh: done
[14:32:52] <_joe_> robh: wait, in *none*
[14:33:00] ?
[14:33:01] <_joe_> I'm sure I added you to the other one
[14:33:21] hrm,m,
[14:33:25] i got invite, but it didnt join
[14:33:26] wtf...
[14:33:33] <_joe_> no you were kicked
[14:33:35] im still not in -private, i dunno why
[14:33:37] yeah it kicked you out saying not authorized
[14:33:51] <_joe_> robh: are you identified with nickserv?
[14:33:53] yep
[14:34:06] with enforce on, so it would kick me and rename if i didnt do it right
[14:38:51] ....
[14:38:52] wtf
[14:38:55] and out again?!?
[14:39:07] <_joe_> yes it's chanserv removing +I from your nick
[14:39:10] <_joe_> no idea why
[14:39:19] i wasnt in flag list last week
[14:39:28] when i checked, perhaps it never stuck?
[14:39:47] <_joe_> can you join _security though?
[14:39:50] <_joe_> you should be able to
[14:39:53] i dunno i never tried, ill try
[14:39:57] i dont use that channel cuz its not NDA
[14:40:03] ie: i cannot put anything actually private in it
[14:40:31] <_joe_> ok, can't really figure out what's wrong with -private, sorry
[14:41:33] Ok, who is heading up standardizing so they can fix this eventually?
[14:41:41] cuz not being able to join the private team channel is... bad.
[14:42:04] i am happy to file a task detailing the issue.
[14:43:05] (I get it may not get solved today, but long term I need to be able to join our team channels ; )
[14:44:16]
[14:44:26] sigh... frustrating way to start the week.
[14:44:56] * robh goes looking in phab
[14:45:09] this is why i want in team channel, public channel is pointless echo chamber ; D
[14:48:46] sigh
[14:48:50] someone edited the irc channel list
[14:48:53] and dcops is no longer on it!
[15:01:18] _joe_: if chanserv is configured in a certain way it might require them to have +i chanserv flag to have the invex
[15:01:57] <_joe_> yeah I think we did something wrong with the SET modes by copying them over
[15:03:25] <_joe_> robh: are you connecting with TLS?
[15:03:52] irc.libera.chat via ssl 6697?
[15:03:59] 150342 -- | [robh] is using a secure connection
[15:04:01] yes
[15:04:51] <_joe_> ok try again please
[16:19:26] volans, in the end I had a minor hiccup FYI: I ran "wmf-auto-reimage-host --no-pxe --new backup2004.codfw.wmnet -p T277323"
[16:19:27] T277323: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323
[16:20:09] and cumin in one of the icinga downtimes said: "Unknown 'backup2004'. must exist in /etc/icinga/objects/puppet_hosts.cfg or /etc/nagios/nagios_host.cfg. (Did you accidentally use FQDN instead of short hostname?)"
[16:20:18] maybe because of last cumin update?
[16:23:25] (nothing fatal, though)
[16:30:12] but it is curious to see one bot says it was successful and the other failed :-)
[16:34:06] jynus: so the reimage script runs the downtime cookbook in a subprocess, and the failure you got was that one
[16:34:21] ah, that makes sense
[16:34:48] but because the downtime is best effort it doesn't make the reimage fail just for that
[16:34:54] that's cool
[16:35:27] I will check why the recipe doesn't use the fqdn
[16:35:42] not sure if it used to work and now it got more strict
[16:35:42] ?
[16:35:56] it tried to run "cumin hostname"
[16:35:56] icinga uses hostnames, not FQDN
[16:36:30] did the first puppet run start normally?
[16:36:45] let me check the logs
[16:36:59] interesting, there are more errors
[16:37:02] jynus: ah wait
[16:37:06] [failure] An exception occurred: `SSLError: HTTPSConnectionPool(host='puppetdb-api.discovery.wmnet', port=8090)
[16:37:07] you used cumin2001?
[16:37:15] yes, is that "bad?"
[16:37:59] Moritz was making that a DBA-only host, leaving the cumin/spicerack/reimage and such available only on cumin1001/2002, but I'm not sure how far he got on Friday on that
[16:38:02] and he's off today
[16:38:17] I don't think that was an issue here
[16:38:24] I see 2 errors:
[16:39:17] Unknown 'backup2004' it is on cumin, but could be icinga being icinga (race condition)
[16:39:46] and a tls error re: netbox
[16:39:54] yeah I'm seeing the SSL one on the logs
[16:40:00] that seems more important
[16:40:11] and related to what I mentioned about cert changes
[16:40:19] cc jbond
[16:40:47] I don't want to create tickets but I am happy to do jbond prefers it
[16:40:57] *if
[16:41:58] volans: jynus: was about to finish for the day can this wait until tomorrow?
[16:42:03] of course
[16:42:06] yes it can
[16:42:07] same here
[16:42:16] I was about to disconnect too
[16:42:19] ok cool
[16:42:48] basically is this
[16:42:50] https://netbox.wikimedia.org/extras/scripts/results/811441/
[16:43:02] let me see if I can fix it quickly
[16:43:14] I won't install more things today
[16:43:17] it can wait
[16:43:52] I did a manual change on netbox, which was set it as active BTW
[17:00:36] https://gerrit.wikimedia.org/r/c/operations/puppet/+/693921 plus https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/693654 should do the trick, testing it
[17:09:39] jynus: it should be fixed now, I've run the puppetdb import script in netbox as the reimage would have done if it had worked, all good: https://netbox.wikimedia.org/extras/scripts/results/811553/
[17:10:58] * volans out too for now
[18:27:32] !issync
[18:27:37] !issync #wikimedia-sre
[18:27:37] Syncing #wikimedia-sre
[18:27:42] Error: Unable to get opped in #wikimedia-sre
[18:27:44] 👀
[18:28:10] !issync #wikimedia-sre
[18:28:10] Syncing #wikimedia-sre
[18:28:12] Set /cs flags #wikimedia-sre wmopbot +o
[18:28:14] Set /cs flags #wikimedia-sre *!*@libera/staff/* +o
[18:28:16] Set /mode #wikimedia-sre +b 
$j:#wikimedia-bans
[18:29:21] the config lives at https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/irc/ircservserv-config/+/refs/heads/master/channels/wikimedia-sre.toml
[18:29:32] I will document this shortly
[18:40:35] ohoho
[18:40:52] legoktm: do you plan to sync channel accesslists to -tech, -operations with this?
[18:41:22] anyone else using cookbooks + an ssh key that requires a passphrase? I'm getting prompted for the first passphrase (for the proxy) but then cumin says connection denied before it can reach the actual destination
[18:43:42] urbanecm: yep. and it'll be open for anyone to use once we get a few more things set up, see https://phabricator.wikimedia.org/T283491 for tracking
[18:44:00] also one neat thing is we can mirror ACLs for channels by just symlinking the config :)
[18:44:45] cool legoktm :)
[18:58:11] andrewbogott: what do you mean? running cookbooks from where?
[18:58:20] volans: from my laptop
[18:58:24] (actually, a VM on my laptop)
[18:59:47] that's a setup used only for WMCS, so surely no one else has got that in SRE. I don't know what setup you have, but it should use the same ssh config that you use to connect to the WMF infra, and you can pass that to cumin's configuration
[19:01:17] volans: ok
[19:01:32] I'm using a setup that works fine for direct ssh but not in cumin. It's ok, though, I will work around it for now
[19:01:37] I didn't realize we were such an edge case.
[19:03:32] do you have a local config file for cumin?
[19:04:45] you should have something like
[19:04:46] clustershell:
[19:04:46]     ssh_options:
[19:04:46]         - '-F ~/.ssh/config'
[19:04:47] inside it
[19:04:52] among the other options andrewbogott
[19:05:15] assuming that your same user ssh config is the correct one ofc
[19:07:26] yeah, the issue isn't that it's not picking up config, it's something with tty juggling during the proxy jump
[19:09:43] you can add -vvv to the options to get more info from ssh if you think it is related to that
[19:09:57] and to check that the correct username and keys are offered
[19:10:41] thanks, I've already worked around it for now. I may dig in later on.
[19:13:01] ok
[20:56:50] I'm trying to follow https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_to_manually_allocate_a_special_purpose_IP_address_in_Netbox (via https://wikitech.wikimedia.org/wiki/LVS#etcd_data_for_DNS_Discovery) and am a bit lost
[20:57:37] I'm at step #2 "Search for the correct VLAN based on datacenter, type, row (if applicable), etc." - how do I know which is the correct VLAN? I assume I want a private IP in codfw and eqiad, but there's multiple that meet that
[21:12:15] legoktm: I think the VLANs page is actually kind of a red herring here, despite the instructions?
[21:12:27] for LVS service IPs specifically, there's a given prefix for each datacenter
[21:14:33] so then I just need the "Netbox Prefixes" link?
[21:14:58] the cookbook for creating VMs tells you the list you can pick from
[21:15:28] and whether you pick row A or C can be just a matter of preference
[21:15:44] https://netbox.wikimedia.org/ipam/prefixes/?expand=on&page=1 still has multiple private1-[abcd]-eqiad listed
[21:16:24] when creating one machine in each DC I would normally pick the same row in both, but still "randomly" pick A or C or D, just because D was newer and less full, for example
[21:17:06] mutante: I'm trying to add shellbox to LVS, which apparently requires a "special purpose IP address", which is where I'm stuck
[21:17:31] mutante: this is for a LVS VIP though, not for a specific machine
[21:18:24] legoktm: I am not certain but think the advice about vlans mostly applies to when you need a special-purpose IP address for a specific machine in a specific row, not an LVS IP
[21:19:05] so am I just supposed to pick any of them? should I look at what other k8s services do?
[21:19:35] they seem to just pick next-available in both https://netbox.wikimedia.org/ipam/prefixes/92/ip-addresses/ and https://netbox.wikimedia.org/ipam/prefixes/93/ip-addresses/
[21:19:41] legoktm: if it's a LVS IP there is a special subnet, https://netbox.wikimedia.org/ipam/prefixes/93/
[21:19:46] and as the instructions say, to also use the same final octet in both eqiad and codfw
[21:20:03] aha
[21:20:08] so you could grab for instance 10.2.1.58 and also 10.2.2.58
[21:20:08] thank you both :)
[21:20:26] you can click "prefixes" in netbox and ctrl+f for "LVS"
[21:25:27] ok, I think I created 10.2.1.51 and 10.2.2.51 correctly
[21:25:38] https://netbox.wikimedia.org/ipam/ip-addresses/8581/ and https://netbox.wikimedia.org/ipam/ip-addresses/8582/
[21:29:36] that looks right to me
[21:30:18] legoktm: looks reasonable, like other services, same last octet in both DCs
[21:30:33] was .51 used by something else before and just free?
[21:30:43] since it's not the highest
[21:31:03] I think so
[21:31:14] there's a gap between .50 and .52 in all the configs I just edited: https://gerrit.wikimedia.org/r/c/operations/dns/+/693957/
[21:31:21] (if someone could also +1 that)
[21:33:18] +1ed
[21:34:49] I haven't done this since we have to reserve in netbox, but I knew I would need the same thing soon, so glad you are going through the wiki docs
[21:35:52] ty
[21:36:06] once I'm done and this works, I'll edit the DNS page with a pointer to the two LVS prefixes
[21:36:24] when adding a new LVS service there was a part where Icinga pybal checks start alerting and then I wasn't comfortable just restarting all of pybal, but afair others fixed it
[21:43:34] since we are at it, I will follow your lead and get my service IPs reserved, won't hurt to have them early
[21:45:24] :D
[21:45:28] here's the docs I added https://wikitech.wikimedia.org/w/index.php?title=DNS%2FNetbox&type=revision&diff=1913268&oldid=1892411
[21:45:32] thanks cdanis and mutante
[21:46:37] thanks!
[21:46:44] thanks legoktm! since in netbox you don't actually pick the IP and it's that green button, I guess you could be unlucky to get a different last octet in a different DC, but only if something else was already unbalanced
[21:46:58] let me try the same thing now and take .58
[21:47:20] I guess if two people are trying to do it at the same time :p
[21:48:02] "If this is a VIP, make sure you get the same last octect in both eqiad and codfw datacentres" heh, ok :)
[21:51:31] ah, yea, also this is the part where eqiad is 2 and codfw is 1 and it messes with me a bit because eqiad is usually always older
[21:53:33] I had to double check that like 3 times
[21:54:49] ok, can you review me as well in netbox?
[21:54:59] .58 octet in eqiad and codfw and should be just like yours
[21:55:43] I changed mask to /32 , picked VIP type, no tenant..
[21:56:33] * legoktm looks
[21:57:12] mutante: lgtm
[21:57:26] thanks! making gerrit change
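
The "same last octet in both DCs" rule discussed above can be sketched as a small check before clicking through Netbox. This is only an illustration: the /24 prefix widths are an assumption inferred from the 10.2.1.x (codfw) and 10.2.2.x (eqiad) addresses in the chat, the function names are hypothetical, and the lists of used addresses would have to come from Netbox by whatever means is available.

```python
import ipaddress

# Assumed LVS service prefixes, inferred from the addresses quoted in the chat.
CODFW_PREFIX = "10.2.1.0/24"
EQIAD_PREFIX = "10.2.2.0/24"


def next_shared_octet(used_codfw, used_eqiad, start=1):
    """Return the lowest final octet that is free in both datacenters, or None."""
    taken = {int(ipaddress.ip_address(a)) & 0xFF for a in used_codfw}
    taken |= {int(ipaddress.ip_address(a)) & 0xFF for a in used_eqiad}
    for octet in range(start, 255):
        if octet not in taken:
            return octet
    return None


def vip_pair(octet, codfw_prefix=CODFW_PREFIX, eqiad_prefix=EQIAD_PREFIX):
    """Build the (codfw, eqiad) address pair that shares the given final octet."""
    return tuple(
        str(ipaddress.ip_network(prefix).network_address + octet)
        for prefix in (codfw_prefix, eqiad_prefix)
    )
```

With `.50` and `.52` taken in both prefixes, `next_shared_octet(..., start=50)` would give 51 and `vip_pair(51)` the matching pair, which mirrors the gap that was filled in the conversation.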