[07:52:53] relocating
[09:42:30] errand + lunch
[12:29:48] lunch
[12:57:54] gehel,ryankemper: any objection to merge https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/723153 ?
[12:58:26] volans: looking
[12:58:41] nope, no objection
[12:59:12] thanks
[13:16:06] open hangout is now opened!
[13:16:06] https://meet.google.com/ugw-nsih-qyw
[14:03:05] while reviewing WDQS board, this might be a good first task: T201354
[14:03:05] T201354: Migrate WDQS tools from JewelCli to actively supported command-line parsing library - https://phabricator.wikimedia.org/T201354
[14:18:47] seems well isolated indeed
[15:46:47] ryankemper: should we try and ship lvs today, need any help with the patches?
[15:48:31] ebernhardson: yup gearing up to ship lvs part 1 in a few minutes, I think from the research I've done part 1 will be very safe to ship since we're just putting it in the `service_setup` state, and then in a followup patch we'll switch to `lvs_setup` which is when we either want traffic to be overseeing it or at least for us to have a very good idea of what we're doing (since it involves pybal restarts)
[15:48:59] ebernhardson: there's some systemd failures on icinga so i'm seeing if those are something quick to resolve before deploying lvs, i'm hoping it's just about that `gc-log-cleanup` job that I/we broke
[15:49:44] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wcqs1001&service=Check+systemd+state `CRITICAL - degraded: The following units failed: query-service-gc-log-cleanup.service` okay yeah it's just that
[15:50:09] ebernhardson: remind me again on https://gerrit.wikimedia.org/r/c/operations/puppet/+/721646/1/modules/query_service/manifests/common.pp, is the issue just that it needs a trailing `/`?
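On the trailing `/` question, a minimal Python sketch of the underlying path behaviour may help. It assumes a POSIX system, where a trailing slash makes the kernel resolve a symlink to the directory it points at. The directory names and the `walk_files` helper are made up for illustration; `walk_files` is only a rough stand-in for `find START -type f` without `-L` or `-H`, not the actual cleanup job.

```python
import os
import stat
import tempfile


def describe(path):
    """Classify a path as lstat sees it, i.e. without following a final symlink."""
    mode = os.lstat(path).st_mode
    if stat.S_ISLNK(mode):
        return "symlink"
    if stat.S_ISDIR(mode):
        return "directory"
    return "other"


def walk_files(start):
    """Rough stand-in for `find START -type f`: a symlink start point is not descended into."""
    if describe(start) != "directory":
        return []
    found = []
    for root, _dirs, files in os.walk(start, followlinks=False):
        found.extend(os.path.join(root, name) for name in files)
    return sorted(found)


def demo():
    """Build real_logs/gc.log plus a symlink logs -> real_logs, then compare both spellings."""
    with tempfile.TemporaryDirectory() as tmp:
        real = os.path.join(tmp, "real_logs")  # hypothetical directory name
        link = os.path.join(tmp, "logs")       # hypothetical symlink name
        os.mkdir(real)
        open(os.path.join(real, "gc.log"), "w").close()
        os.symlink(real, link)
        return (
            describe(link),               # the symlink itself
            describe(link + "/"),         # trailing slash: already resolved to the directory
            len(walk_files(link)),        # nothing found behind the bare symlink
            len(walk_files(link + "/")),  # the log file is found
        )


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that the resolution happens in path lookup, before any traversal logic runs, so nothing inside the directory tree is followed that would not be followed otherwise.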
[15:51:22] looking
[15:51:48] ryankemper: a trailing slash should make it all work, yes
[15:52:25] i always forget that find doesn't follow symlinks, but probably for the best i can imagine all kinds of oddities if it traversed to unexpected places
[15:53:12] ryankemper: how about the rest of LVS? I didn't write those patches because the docs made it seem like the patches are a couple lines and all the work is in how they are deployed
[15:53:44] ebernhardson: yeah I imagine the symlinks would make an infinite loop fairly easy to accidentally do
[15:54:02] ebernhardson: okay I'll add the trailing `/` and deploy that and make sure all the wcqs hosts look happy, then proceed to lvs pt 1
[15:54:06] ok
[15:55:24] ebernhardson: as for the other lvs work, your understanding seems to match up with mine, that it should be mostly in the deploy work itself. legoktm linked me to https://gerrit.wikimedia.org/r/q/topic:shellbox2-lvs which is his/their recent work for shellbox and it did look like the subsequent state transitions (`service_setup`->`lvs_setup`->`monitoring_setup`) etc were all very small puppet changes
[15:56:15] one thing that's interesting is I haven't found anywhere that puppet or some other code actually branches based off that `state` value, so I get the impression that the actual value of the state doesn't impact stuff and that it's more of a placeholder...I could be wrong though
[15:56:47] sounds good. I'm going to try and figure out the blazegraph nginx config again...i realized recently it still doesn't capture the difference between the three. But if we can get things far enough that a request sent to the .svc.
endpoints queries blazegraph or directs to the microsite would finally feel like getting somewhere :)
[15:56:48] for ex in https://codesearch.wmcloud.org/search/?q=lvs_setup&i=nope&files=&excludeFiles=&repos= I see no actual branching
[15:57:22] hmm, interesting
[15:57:27] i hadn't actually looked :)
[15:57:42] ebernhardson: that (my last comment re state) does bring back a question I asked traffic yesterday that I didn't hear back about, briefly:
[15:57:53] https://wikitech.wikimedia.org/wiki/LVS#Create_an_entry_in_the_service::catalog says that
[15:58:00] > Here we've defined state to be "service_setup"; this means that this service will not be included in monitoring, LVS configuration, or DNS Discovery at the moment. Until you perform the next step in the procedure, adding this stanza will be a no-op.
[15:58:35] that's mostly straightforward except that "next step" is a little vague, the next immediate section in the docs is https://wikitech.wikimedia.org/wiki/LVS#Add_the_IPs_on_the_backend_servers which is a step we've already done
[15:58:39] ryankemper: they did say the first patch is basically a no-op on the lvs hosts, so seems plausible lvs_setup is a do-nothing sigil
[15:59:17] whereas https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers is not the immediate next section - it's the section after the immediate next section - but it's the part that has the very clear "be careful with this step" stuff
[15:59:42] reading
[15:59:56] so I'm wondering if the fact that we've done the "add IPs on backend servers" part means that the stanza actually isn't a no-op for us
[16:01:25] ryankemper: adding ips says `Once puppet runs on the backends, the LVS ip will be configured on their loopback device` so i guess i would check the lvs hosts to see if they are bound to the ip we used (.67?)
[16:01:39] zpapierski: I saw you sent the patch for spicerack, lmk if you need any help to get it to pass CI and/or any early feedback.
Alternatively I can totally wait until it's ready from your side :)
[16:01:40] if it's already assigned, then we did things out of order
[16:02:11] ryankemper: actually, hmm i wonder which backends it means there. It's not clear if thats lvs backends or svc backends.
[16:02:11] volans: thanks! I'm going to look at it tomorrow
[16:02:28] ebernhardson: yeah I wasn't super clear on that, but it sounds like that's going to modify the wcqs hosts (the backends) to be able to listen to the LVS ip?
[16:02:56] ryankemper: checking an elastic server, seems sensible. Those have an `lo:LVS` interface with the .svc. ip assigned, and wcqs do not
[16:03:26] I guess based off the `lvs_setup` is just a semantic thing theory, we should remove that step from the lvs part 1 patch and make it part of the `lvs_setup` patch
[16:03:31] zpapierski: ack, feel free to ping me when ready (tomorrow I'm OOO fwiw)
[16:03:41] I suspect it'd probably be fine either way since we're not restarting pybal etc, but that's probably the safest option for now
[16:04:36] ryankemper: it doesn't seem to say here, but i think we also have to put the `realserver` in role::wcqs::public as well for that to happen
[16:04:52] volans: that's ok, from what I see it complains about style, should be easy to fix (famous last words)
[16:05:13] oh, it does say to do that :)
[16:05:15] ebernhardson: doesn't wcqs already include realserver in the profile or something?
[16:05:20] if you read it carefully it tells you which invocation of tox to run to fix it ;) autoformatting
[16:05:27] in our patch that is
[16:05:38] ahh, maybe. checking
[16:05:39] but there are also other things missing here and there
[16:05:46] * ebernhardson has too many different patches to remember...
[16:06:10] Sorry not in the profile, in the class
[16:06:18] ryankemper: i guess following the instructions, we should split the lvs step 1 patch?
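The state semantics quoted from the LVS docs above can be sketched as a small filter. This is an illustration only: the `services_for_lvs` function and the catalog layout are hypothetical, and the assumption that every state other than `service_setup` is included in LVS configuration is a reading of the quoted docs, not confirmed puppet behaviour. Only the three states named in this log are modelled.

```python
from enum import Enum


class State(Enum):
    """The three states named in the log; the real list may be longer."""
    SERVICE_SETUP = "service_setup"
    LVS_SETUP = "lvs_setup"
    MONITORING_SETUP = "monitoring_setup"


def services_for_lvs(catalog):
    """Return the names of services that would get LVS configuration.

    Assumption: only `service_setup` is excluded, per the quoted doc text
    ("will not be included in monitoring, LVS configuration, or DNS Discovery").
    """
    return sorted(
        name
        for name, entry in catalog.items()
        if State(entry["state"]) is not State.SERVICE_SETUP
    )


if __name__ == "__main__":
    # Hypothetical catalog entries, loosely shaped like a service::catalog stanza.
    catalog = {
        "wcqs": {"state": "service_setup"},
        "wdqs": {"state": "lvs_setup"},
    }
    print(services_for_lvs(catalog))
```

Under that reading, the `service_setup` value is the one that carries behaviour (exclusion), which would be consistent with a search for `lvs_setup` turning up no branching.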
[16:06:35] ryankemper: i guess it's not super clear to me how this is supposed to be ordered either :)
[16:07:10] but it seems like this is asking to deploy the service::catalog changes first, then another patch to include realserver and have things assigned
[16:08:17] ebernhardson: yes that's what I'm thinking too. basically lvs part 1 is what we have now minus the inclusion of realserver and minus the `conftool` stuff as well
[16:08:57] and then that stuff we stick back in in the immediate followup patch to switch to `lvs_setup`
[16:09:09] which we'll want to aggressively ping traffic about before actually deploying
[16:10:07] sounds right
[16:10:54] okay will do that. first things first though, trailing / patch :P
[16:12:07] :)
[16:13:52] ebernhardson: sorry last question, what is the problem with the symlinks again?
[16:14:04] ebernhardson: meaning, if the problem is find won't follow symlinks, why does trailing slash resolve it
[16:14:22] ryankemper: `find /path/to/symlink` doesn't follow symlinks, so it stops before it does anything.
`find /path/to/symlink/` forces find to follow the symlink and start from inside the directory
[16:14:45] maybe force to follow is wrong description, it basically resolves the symlink before find does any magic
[16:16:41] got it
[16:18:21] afk a couple
[16:23:29] cool all the `wcqs` hosts are happy now (after doing a `sudo systemctl restart query-service-gc-log-cleanup.service` post-merge)
[16:25:05] back
[16:34:35] ebernhardson: okay made the changes to https://gerrit.wikimedia.org/r/c/operations/puppet/+/713959
[16:35:01] Just pulled the realserver stuff out of `modules/role/manifests/wcqs/public.pp` and removed the `profile::lvs::realserver::pools` stuff from `hieradata/role/common/wcqs/public.yaml`
[16:35:55] will ship it once the test build succeeds
[16:37:06] lgtm
[16:47:19] ebernhardson: not sure what all I'm supposed to check but the conftool stuff looks right: https://phabricator.wikimedia.org/T280001#7374787
[16:47:47] ebernhardson: here's the followup patch for the `lvs_setup` work: https://gerrit.wikimedia.org/r/c/operations/puppet/+/723254 I'll post in #wikimedia-traffic
[16:51:40] seems sane
[16:52:02] I wonder if there's code in pybal that looks for `lvs_setup`. not sure where pybal code lives but it's possible that the `codesearch.wmcloud.org` doesn't search pybal's code
[16:52:11] because the way the docs are written really makes it sound like `lvs_setup` actually does something
[16:52:28] > To add the configuration to PyBal and add the LVS endpoint on the load-balancers, you just need to change the state of your service to lvs_setup:
[16:53:59] ryankemper: pybal should be https://gerrit.wikimedia.org/r/admin/repos/operations/debs/pybal
[16:54:17] mildly surprised to find it under debs, i would have thought that was just packaging, but works :)
[16:54:27] no mention of lvs_setup though
[16:56:33] finally found service_setup, used in wmflib::service::get_services_for_lvs to exclude things
[17:02:10] huh, i wonder what we changed in github.
When i did a codesearch against our org it asked me to re-login and provide a 2-factor code
[17:02:17] seems to suggest we have not-free code in our github?
[17:06:07] best i can tell, service_setup says 'dont install to lvs, but make config available for the final service group', lvs_setup says 'go ahead and install to lvs too, but no monitoring', so i guess it makes sense that lvs_setup doesn't seem to do anything
[17:06:24] since service_setup was the one that implemented 'dont install to lvs hosts'
[17:10:01] dinner
[18:15:07] errm, i wonder if that breaks anything
[18:15:26] i started wcqs-data-reload.sh manually on wcqs-beta-01 a few days ago, it's still running but cron started up a second one
[18:16:05] I would have to assume so, unless one of the commands the data reload script is running gets blocked on some sort of semaphore
[18:16:39] they are both running, but one is importing to wcqs20210921 and the other to wcqs20210920, so maybe can just kill the newer one?
[18:31:37] ebernhardson: just need to glance at what the reload script is doing and make sure it hasn't mutated anything
[18:31:45] IIRC it does munging first so we might be clear just killing the newer
[18:34:27] ryankemper: since it seems to be importing to a separate namespace, it seems safe to kill. Doing so
[18:34:37] ebernhardson: ack
[18:34:46] i'm not sure if that namespace needs to be deleted somehow, possibly. But i guess that doesn't free up disk space so not sure if important either :)
[18:34:51] oh right cause we import to a namespace and then remap the alias after
[18:45:06] ryankemper: by chance did you get added to puppet-diffs in cloud?
The docs say PCC can now target cloud VMs, but you have to run an export from the puppetmaster (and we aren't using the default cloud puppetmaster)
[18:45:15] * ebernhardson was hoping to compile nginx changes for wcqs-beta-01
[18:46:52] https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet-diffs#How_to_update_the_facts_for_cloud_VMs?_(e.g._INFO:_Unable_to_find_facts_for_host_util-abogott-stretch.testlabs.eqiad.wmflabs,_skipping)
[18:53:20] asked in -sre as well, maybe someone will be around
[19:02:35] ebernhardson: ack, working with gehel on resolving the confd errors that popped up for wcqs, so can take a look in 10-15 mins if no-one else looks
[19:03:16] kk
[19:14:01] ebernhardson: Looks like I'm not in https://openstack-browser.toolforge.org/project/puppet-diffs so I'll ask to be added as a user
[19:14:20] ahh, ok. Wasn't sure if brian added you the other day when he offered :)
[19:14:45] (and it doesn't look like i can see user lists for projects i'm not a member of)
[19:14:55] not surprised re not being able to see it
[19:15:23] ebernhardson: yeah he didn't, but I never followed up on that membership / should have
[19:18:24] ebernhardson: looks like there is a readonly version that you might be able to use: https://openstack-browser.toolforge.org/project/puppet-diffs
[19:19:09] * gehel just learned about that one
[19:20:48] gehel: cool! hadn't seen that
[19:42:33] ebernhardson: Not sure if you got local version working, but updating the facts now
[19:44:31] ebernhardson: done
[19:44:36] excellent, lemme see if that works
[19:52:03] ebernhardson: good catch on the pybal::web stuff (gehel and i looked at this but missed where it's actually creating the files), I wonder if we actually do need https://gerrit.wikimedia.org/r/c/operations/puppet/+/723254/2/hieradata/role/common/wcqs/public.yaml#40 or if the pools stuff is irrelevant
[19:52:07] hmm, it "works" but fails the prod compile.
Looks like i need to put fake secret keys for oauth somewhere
[19:52:17] but, seems the PCC part was a success :)
[19:52:25] > # This file is generated from the etcd directories:
[19:52:26] > # ["/pools/eqiad/wdqs/wdqs/"]
[19:52:46] is what it says in `/srv/config-master/pybal/eqiad/wdqs`, but not actually sure where those etcd directories are
[19:53:34] looking, didn't actually trace where the variables came from in hiera
[19:56:09] ryankemper: the impl that provides services is modules/wmflib/functions/service/fetch.pp, invoked with lvs_only=True. This should simply be looping service::catalog in hiera which we defined in the lvs step 1 patch
[19:56:21] i'm not clear on what the etcd stuff does though :(
[19:57:13] right, I guess we can actually ignore the etcd stuff with respect to the problem of "it's not rendering the damn file at all" :P
[19:57:26] Yeah to your point it's just grabbing
[19:57:28] https://www.irccloud.com/pastebin/MQTfGqOd/
[19:57:37] And that's exactly what we've got here:
[19:57:41] https://www.irccloud.com/pastebin/2hmRqcTu/
[20:01:19] * ebernhardson really isn't sure whats going on :P
[20:03:10] same
[20:05:15] Unless `define pybal::conf_file` is actually what creates the check / watch, not the file itself that's missing
[20:05:51] https://github.com/wikimedia/puppet/blob/24325c572a9b7f6d3158a86ba525699b38d89951/modules/pybal/manifests/conf_file.pp#L22-L27
[20:06:22] It's a little confusing cause the comment at the top says `Writes the actual config file for pybal`, which makes it sound like it's writing the file we care about...but maybe it's just writing the thing that tells confd to check for that file
[20:07:01] It's unlikely but maybe this isn't supposed to work until the `lvs_setup` stuff is done
[20:07:04] ryankemper: hmm, well confd can essentially be thought of as a thing we feed templates into, and then it renders those templates with etcd data?
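The "templates in, etcd data out" mental model can be sketched in a few lines of Python. Everything here is hypothetical apart from the two header comment lines quoted from the generated file above: the `render_pool` function, the per-host line format, and the `weight`/`pooled` keys are assumptions for illustration, and the real template is a confd (Go) template, not Python.

```python
def render_pool(etcd_dir, entries):
    """Render a text file from key/value data, loosely the way confd fills a template.

    `entries` stands in for the keys found under `etcd_dir`; the per-host
    line format below is invented for this sketch.
    """
    lines = [
        "# This file is generated from the etcd directories:",
        f'# ["{etcd_dir}"]',
    ]
    for host in sorted(entries):
        value = entries[host]
        lines.append(f"{host} weight={value['weight']} pooled={value['pooled']}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical data; with no entries only the two header lines are produced.
    print(render_pool("/pools/eqiad/wdqs/wdqs/", {
        "host-a.example": {"weight": 10, "pooled": "yes"},
    }))
```

In this model puppet would only ship the template and the watch definition, while the file contents depend on whatever data exists under the watched directory at render time.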
[20:07:29] ebernhardson: sounds reasonable
[20:07:53] So I guess I need to see what `pybal/host-pool.tmpl.erb` contains
[20:08:05] (https://github.com/wikimedia/puppet/blob/24325c572a9b7f6d3158a86ba525699b38d89951/modules/pybal/manifests/conf_file.pp#L25)
[20:08:06] i would guess then the "actual config file" for pybal services has to come from the confd template, probably at some point in the past the real config was written there, but then later abstracted away
[20:08:16] seems plausible at least :P
[20:08:53] https://github.com/wikimedia/puppet/blob/production/modules/pybal/templates/host-pool.tmpl.erb
[20:09:53] So yeah I'm thinking we need the lvs realserver pool stuff...or if not that, then this step: https://wikitech.wikimedia.org/wiki/LVS#For_active/active_services
[20:10:24] In any case I'm going to proceed with the `lvs_state` stuff since it feels likely to fix the problem, and if not we're at least making forward progress
[20:10:37] also because we either have to merge it now or not till monday :P
[20:11:43] seems plausible. Only other thing i can think of is to see if confd is complaining anywhere else with more info, doesn't look like anything from puppetmaster1001 ends up in logstash
[20:27:09] ebernhardson: Gearing up to roll forward with the `lvs_state` stuff. The final step in https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers is a sanity check by curl'ing the new svc
[20:27:23] For us will that just be `http://wcqs.svc.eqiad.wmnet`, i.e. port 80? (and same for codfw)
[20:27:26] i've no goats to sacrifice, but good luck :)
[20:27:43] Don't worry I already sacced one last midnight, that should carry us through EOD today :)
[21:06:23] ebernhardson: any thoughts on a curl command to check the service health following pybal restarts?
[21:06:37] I guess I might need to find the internal port envoy is expecting?
[21:07:30] ryankemper: hmm, should be the same readiness-probe lvs normally uses?
[21:07:41] oh we don't have envoy set up yet do we
[21:08:27] hmm, envoy is in the wcqs puppet catalog at least
[21:08:57] I was looking in `hieradata/common/profile/services_proxy/envoy.yaml`, may have been the wrong place
[21:09:29] I guess discovery might be a later step though
[21:09:41] ryankemper: on the wcqs instances, curl localhost/readiness-probe returns the 200 OK
[21:10:43] ebernhardson: got it, I think we want to exercise `wcqs.svc.[eqiad,codfw].wmnet` to show that routing works now
[21:11:16] But curling `wcqs.svc.codfw.wmnet` just gives `curl: (7) Failed to connect to wcqs.svc.codfw.wmnet port 80: Connection refused`, so maybe I shouldn't be using port 80 internally
[21:11:38] hmm, nginx has `TCP *:http (LISTEN)` which should listen to all interfaces
[21:12:08] ryankemper: looks like we have an ip mixup somewhere
[21:12:24] ryankemper: wcqs1001.eqiad.wmnet is assigned 10.2.1.67 to lo:LVS. ping wcqs.svc.eqiad.wmnet hits 10.2.2.67
[21:12:42] oof
[21:12:48] okay at least that's pretty fixable
[21:13:57] Yeah 10.2.2 should be codfw and 10.2.1 should be eqiad per https://wikitech.wikimedia.org/wiki/LVS#DNS_changes_(svc_zone_only)
[21:14:34] the patch would be https://gerrit.wikimedia.org/r/c/operations/dns/+/713929 but not seeing problem
[21:15:12] Neither do I see the problem in https://gerrit.wikimedia.org/r/c/operations/puppet/+/713959/10/hieradata/common/service.yaml#2844
[21:16:45] I guess if the problem is the actual routing of the `.svc` it'd be just the dns and not the puppet anyway, but good to know they line up
[21:16:55] That is confusing though
[21:17:11] Ah let me check netbox itself, that feels like the most likely culprit
[21:17:17] The step w/ manually adding the new IPs
[21:17:24] oh, yea thats another good place to check
[21:18:48] ebernhardson: Yup that's it, 10.2.2.67/32 for wcqs.svc.eqiad.wmnet
[21:19:08] okay so I need to fix those and then re-run the netbox cookbook, and possibly the `sudo auth-dns` step too
[21:19:33] ryankemper:
checking the rest, the only other part i'm suspicious of is that `wcqs.discovery.wmnet` doesn't resolve.
[21:20:03] but envoy reports :443 as being wcqs.discovery.wmnet, so might be fine. I have no clue where that comes from :)
[21:20:11] (i.e. this works: https://gerrit.wikimedia.org/r/c/operations/puppet/+/713959/10/hieradata/common/service.yaml#2844)
[21:20:13] meh wrong paste
[21:20:21] curl --resolve 'wcqs.discovery.wmnet:443:10.2.1.67' https://wcqs.discovery.wmnet/readiness-probe
[21:20:55] ebernhardson: btw not sure if you saw but ~50 mins ago in #wikimedia-operations:
[21:20:57] > ebernhardson: icinga is complaining about there being no hostgroup matching wcqs_codfw. I think it's missing an entry in `monitoring::groups`
[21:21:06] from shdubsh
[21:21:18] That's presumably unrelated to the discovery stuff but just mentioning
[21:21:58] ahh, i didn't notice. Will check
[21:32:07] bblack: (pinging here to not cross wires with the ongoing discussion in #wikimedia-traffic) went through the pybal restart process, and still seeing the `ipvs diff check` alerts hanging around. in the post-checks I discovered that I'd mixed up the IPAM addresses for codfw/eqiad in netbox. would that be why those alerts are hanging around?
[21:32:35] (note: I'm running the netbox cookbook right now after fixing the issue in netbox)
[21:33:05] to elaborate on the IPAM stuff, `wcqs.svc.eqiad.wmnet` was going to where `wcqs.svc.codfw.wmnet` should have been, and vice versa
[21:38:57] netbox cookbook run is done, but still seeing ping routing as it was before (i.e. to the wrong one). is this a typical TTL type issue and I just need to wait, or is there some other button I need to push I wonder?
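The by-hand comparison of service names against address ranges can be sketched as a small check. The /24 prefix lengths are an assumption drawn only from the "10.2.1 should be eqiad, 10.2.2 should be codfw" remark above, and the helper names are hypothetical. The check takes the IP as an argument and does no DNS lookup, so it says nothing about caching or TTLs.

```python
import ipaddress

# Assumed /24 ranges, inferred only from the "10.2.1 is eqiad, 10.2.2 is codfw" remark.
SVC_RANGES = {
    "eqiad": ipaddress.ip_network("10.2.1.0/24"),
    "codfw": ipaddress.ip_network("10.2.2.0/24"),
}


def datacenter_of(hostname):
    """Pull the datacenter label out of a name like wcqs.svc.eqiad.wmnet."""
    parts = hostname.split(".")
    return parts[parts.index("svc") + 1]


def ip_matches_datacenter(hostname, ip):
    """True if `ip` falls in the range assumed for the datacenter named in `hostname`."""
    return ipaddress.ip_address(ip) in SVC_RANGES[datacenter_of(hostname)]


if __name__ == "__main__":
    # The two observations from the log: the lo:LVS address and the ping result.
    print(ip_matches_datacenter("wcqs.svc.eqiad.wmnet", "10.2.1.67"))
    print(ip_matches_datacenter("wcqs.svc.eqiad.wmnet", "10.2.2.67"))
```

A mismatch from a check like this only says that one of the inputs is off; which source is wrong (DNS, netbox, or a stale lookup) still has to be established separately.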
[21:39:22] * ryankemper just realized bblack isn't in this channel, duh
[22:23:04] looks like it works :)
[22:28:41] ebernhardson: yup, turns out the IPAM was totally fine :P
[22:28:55] also the confd stuff resolved when we pooled the `wcqs*` hosts
[22:29:06] excellent
[22:29:42] The docs don't actually explicitly say to pool the service before/during the https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers, so we should probably add a blurb about that somewhere
[22:29:54] * ryankemper is tired though so it's not happening now :P
[22:29:58] fair :)
[22:35:18] Okay I added a blurb at the bottom https://wikitech.wikimedia.org/w/index.php?title=LVS&type=revision&diff=1926400&oldid=1907196 I think it (pooling) should actually be done before the whole pybal restart, since otherwise we get those confd errors, but that's good enough for now