[01:06:09] _joe_: updated https://wikitech.wikimedia.org/w/index.php?title=HTTP_timeouts&type=revision&diff=1931153&oldid=1871715&diffonly=0#Notes
[01:06:19] It's not pretty, but I guess it captures what we know today.
[01:06:38] It does seem a bit contradictory that we both want to avoid triggering it but then also intentionally lower it for the videoscaler use case.
[01:07:47] anyway, I guess there isn't a great solution if we find it important to stop videoscaler jobs that occupy non-syscall php code for >1h, since we currently have no other means for that, given the surrounding layers use walltime
[04:56:52] In about 1h we'll switch over the s1 (enwiki) master
[05:46:08] In about 15 minutes we'll switch over the s1 (enwiki) master
[12:26:39] how are mac addrs for dhcp for new machines handled these days? it seems there's some automagic voodoo, but I don't seem to be able to discover much about it easily :)
[12:27:11] (and the server lifecycle wikitech page still says to patch puppet for macaddrs the old way, which I think isn't true for physical machines anymore?)
[12:28:44] <_joe_> bblack: exported from netbox IIRC
[12:29:04] right, but... I don't see the data filled in for recently-installed hosts that worked fine
[12:29:25] bblack: no more macs, but uses netbox and dhcp option 82
[12:29:28] <_joe_> IIRC it gets inserted and removed by the reimage cookbook
[12:29:36] ah!
[12:29:40] so there is no permanent record
[12:29:43] got it
[12:30:15] https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#DHCP_Automation
[12:36:05] this is the ref I stumbled on that I mentioned earlier (in a different part of the lifecycle docs) - I edited it out already:
[12:36:08] https://wikitech.wikimedia.org/w/index.php?title=Server_Lifecycle&type=revision&diff=1931212&oldid=1930649
[13:33:26] bblack: sorry I was AFK, reading backlog
[13:33:52] how can I help?
[13:34:08] volans: I just had dumb questions, already answered :)
[13:34:23] thanks for the edit on the page, I missed that bit
[13:37:17] np!
[14:15:11] pie-in-the-sky puppet question: is there any way to set a hiera value based on a combination of role + other hiera value(s)?
[14:15:47] <_joe_> kormat: the role-based part you obtain from the hierarchy
[14:16:01] yes
[14:16:27] <_joe_> based on other hiera values, it depends a lot on the backend, but in theory %{lookup("other::hiera::key")} might work
[14:16:48] <_joe_> jbond: ^^ do we support nested hiera lookups in our backends?
[14:16:54] "might" is certainly the level of confidence i was looking for
[14:16:56] <_joe_> we used to, but not so sure anymore
[14:17:20] you can surely use %{alias('foo')}
[14:17:22] <_joe_> kormat: it should work, and you can easily test it
[14:17:53] _joe_: kormat: should be supported, ping me if not
[14:18:11] <_joe_> volans: that works only if you just need the other value directly, not if you want to reuse the value IIRC
[14:18:37] it should work if you do bar::baz: "bar%{alias('prometheus_nodes')}baz"
[14:18:38] <_joe_> kormat: but I smell an XY problem here - I need to work on my stuff first though
[14:18:44] indeed alias preserves the type of the looked-up value. lookup always returns a string
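A minimal hiera sketch of the two interpolation forms being discussed, assuming made-up `profile::foo::*` keys (`prometheus_nodes` is just the example key from the chat): `alias()` has to be the entire string value and preserves the looked-up value's type, while `lookup()` interpolation yields a string and so can be embedded inside a larger value.

```yaml
# Hypothetical keys for illustration only; not taken from the real repo.
# alias() must be the whole string and keeps the original data type
# (e.g. the array behind prometheus_nodes):
profile::foo::monitoring_hosts: "%{alias('prometheus_nodes')}"
# lookup() stringifies the result, so it can be combined with other text:
profile::foo::metrics_url: "http://%{lookup('profile::foo::hostname')}:9100/metrics"
```

As the next messages point out, embedding `alias()` inside a larger string (the `bar%{alias(...)}baz` example above) is the case that doesn't work; that's where `lookup()` is needed.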
[14:19:00] volans: that won't work with alias, you have to use lookup
[14:19:26] ok
[14:19:44] ^ I just got in a very similar pickle xd
[14:19:53] what I meant is also that if we allow alias it should work with lookup too
[14:20:36] <_joe_> in general if you need to do nested hiera lookups, it's a code smell
[14:20:41] <_joe_> a strong one
[14:20:58] <_joe_> but there are rare cases where it's legit and not a workaround for a refactor you've got to do
[14:21:48] * kormat grins innocently
[14:22:51] what language is `%{lookup(...)}`?
[14:29:29] "The percent-and-braces %{variable} syntax is a Hiera interpolation token." - ok, that's a start
[14:31:59] ah. well crap. doesn't look like i can use this.
[14:33:36] <_joe_> kormat: you still need to get to the zen of puppet, I think. It's not you that hacks puppet, it's puppet that hacks you.
[14:35:10] the zen of puppet is resignation, i have decided.
[14:39:39] puppet has no zen :P
[14:40:32] <_joe_> obligatory link https://bash.toolforge.org/quip/AVfTAUmefIH_7EDsriqu
[14:41:43] but more seriously, on a related meta-philosophical note that's of no practical use to help us in the here and now:
[14:42:27] I see sort of "stage 1" of lifting the art of what we do as: automate / CM all the things. Basically, the path the industry has been on for many years now to replace arcane memorization and manually-typed administrative duties with scripts and puppet repos.
[14:43:04] but in the same sense that automation is a meta-step up from doing things manually... in both cases they're still basically lots of duct tape binding bits of software together for a purpose.
[14:43:42] some duct tape is inevitable, but where there are large amounts of duct tape (sometimes more tape than the things it holds together), it screams for the next level up, where you improve the software itself so that it needs less tape.
[14:44:08] investing in larger and larger globs of duct tape is not a long-term winning vision, in other words
[14:45:01] sometimes duct tape is our hammer though, so we build lots of it by default
[14:45:58] but there's probably a lot of room for saying "this stack we built out of software components A+B+C is important to us, but we need 9000 lines of duct tape to sew it together into a solution. Maybe we need upstream improvements/replacements to A+B+C so that they can be put together in this shape with only 100 lines of duct tape"
[14:48:12] <_joe_> so you prefer the duct tape to be written in C or php by a developer, I see.
[14:48:18] <_joe_> or worse, nodejs
[14:50:16] :P
[14:50:51] no, I think there's a real difference between duct tape and application/daemon code
[14:52:02] duct tape is often just making up for deficiencies (at least, for the production purpose at hand) in those components and how they interact with each other and/or how they get configured/run.
[14:53:05] basically I'm arguing that we treat the application layers (mostly I mean the open source ones we use around here) as products, and the duct tape we assemble them with as a measure of our pain-of-use of these products, and get the products fixed to be usable for our purposes with less pain.
[14:53:34] (even if that means patching them ourselves, in an upstreamable-friendly fashion)
[14:54:18] I agree with that, to a large extent
[14:55:42] I agree in principle, but duct tape is almost always the expedient option
[14:55:48] yeah it is
[14:56:07] the main problem as I see it is that the time investment required to change those upstream projects becomes too much of a drawback compared to just duct taping
[14:56:17] but imagine, say, if we added up all the duct tape time/effort/hours spent hacking around some woes related to Debian, and applied it to making Debian better upstream in ways that would've obviated the duct tape.
[14:56:58] (though, I also agree that duct taping creates long-lasting debt that you'll have to pay multiplied in the mid-long term)
[14:57:05] (and specifically I'm thinking of installer/partman woes there, for Debian heh)
[14:57:14] partman 😬
[14:57:43] someone already did that! https://nick-black.com/dankwiki/images/b/b9/Parting_ways_with_partman.pdf
[14:59:10] nice!
[14:59:12] and if/when that has matured to the point that it'll be available in d-i for testing, I'll make sure our workflows are covered/help with testing/making it solid
[15:08:58] <_joe_> bblack: sure stuff like partman is indefensible
[15:09:25] <_joe_> but take the way we add/remove backends from varnish, which is indeed layers of duct tape
[15:09:41] <_joe_> I am not unhappy I don't have to deal with a solution from varnish itself
[15:10:19] yeah
[15:10:28] <_joe_> but I'm also ok with envoy's approach which is "we give you all the infra and the apis to change configuration dynamically, just write a service or some yaml files into a specific dir"
[15:10:38] <_joe_> but that's a lot of c++ code just there
[15:27:47] legoktm[m]: does this sound good to you? https://gerrit.wikimedia.org/r/c/operations/puppet/+/731286/
[15:28:56] yeah, it's just a straight copy-paste right?
[15:29:09] yup and PCC is noop basically
[15:30:34] lgtm :)
[16:15:57] hnowlan: just to make sure, you're all done with the wikidiff2 rollout + restarts? if so, I'll start on the PHP 7.2 upgrade
[16:25:34] hnowlan, legoktm: it seems the job runners are not yet updated: https://debmonitor.wikimedia.org/packages/php-wikidiff2
[16:26:07] and snapshot, labweb, parsoid
[16:26:08] legoktm: still restarting the codfw api servers but nearly done
[16:26:19] moritzm: ack, didn't realise they needed it
[16:27:09] I think it's mostly just installed on those for consistency
[16:27:29] yeah that. not sure whether any of the code paths actually make use of it, but when rolling out updates let's upgrade all installed packages, otherwise we'll only create corner cases
[16:28:10] yeah for sure
[16:28:31] mhm
[16:28:37] hnowlan: please ping me whenever you're done :)
[16:29:01] legoktm: I'll let you know when the restarts are finished and I can do the other machine groups after you're done
[16:29:25] oh, no, you should finish the wikidiff2 rollout everywhere first I think
[16:29:49] ah ok. Shouldn't take too long
[16:30:01] just so that in case there are issues with either upgrade we have distinct times for when each happened/finished
[16:31:00] semi-related, I filed https://phabricator.wikimedia.org/T294802 a few days ago to have A:mw* match parsoid servers too, since they're running MediaWiki these days
[16:34:00] good idea. and maybe also one of the two labwebs? since they are also a full mw workload, but with a little icing on the cake
[16:59:14] legoktm: all done, thanks for waiting
[17:08:29] awesome, ty!
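As an aside on the `A:mw*` idea above: cumin host aliases are plain YAML mappings from an alias name to a host-selection query, so widening an alias to also cover the parsoid hosts could look roughly like the sketch below. The alias names and query here are purely illustrative, not the actual definitions from the puppet repo.

```yaml
# Hypothetical aliases.yaml-style entry for cumin; real alias names/queries differ.
# Aliases can reference other aliases via A:<name> and be combined with "or".
mw: 'A:mw-appserver or A:mw-api or A:mw-jobrunner or A:parsoid'
```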
[21:04:01] `wcqs` has been having issues due to underlying problems on wcqs* hosts that we're still troubleshooting (https://phabricator.wikimedia.org/T294865). Because the backend hosts are not healthy it's led to noise on the pybal / ipvs side. Am I correct in thinking that moving the service state back to `monitoring_setup` will suppress the alert noise while we work on fixing the underlying problem?
[21:04:05] (See https://gerrit.wikimedia.org/r/c/operations/puppet/+/736564)
[21:09:09] ryankemper: if I read correctly, monitoring_setup means that alerts will exist but they won't page -- is that your intent?
[21:09:23] vs. rolling back to lvs_setup, which would fully remove monitoring
[21:09:48] rzl: based on your understanding, is it that it wouldn't page, but would still post in #wikimedia-operations? Or would it not post entirely
[21:09:56] It'd be nice to have the checks running but without any associated noise
[21:10:03] I believe it would still go to IRC, yeah
[21:10:18] I guess for further context we have the wcqs* hosts downtimed but that doesn't handle non-directly-host-related alerts
[21:10:22] modules/service/manifests/monitor.pp is where that's defined
[21:10:32] rzl: okay thanks, we likely want `lvs_setup` then
[21:10:32] (cc mutante who was interested in this)
[21:11:23] yes, thanks for cleaning up the alerts. downtiming the affected hosts does not affect them because these are alerts on lvs* directly
[21:12:02] rzl: last question if you know, is it possible to go straight from `production` to `lvs_setup` or do we need to go to `monitoring_setup` first as an intermediate step? https://wikitech.wikimedia.org/wiki/LVS#/media/File:Lvs_state.png seems to imply we need to do one step at a time but that might be a limitation of the diagram
[21:12:06] and we don't want to downtime those checks because then we wouldn't see alerts for other services
[21:12:19] mutante: yup, definitely agreed on both those points
[21:12:54] ryankemper: from https://wikitech.wikimedia.org/wiki/LVS#Remove_monitoring_in_icinga I infer that it's fine to go direct
[21:13:15] but note the point there about the DNS record, if it's relevant to your setup
[21:13:28] haven't checked on your current state :)
[21:13:36] a bit earlier it says "The procedure for removal of a service should more or less follow the inverse order of what gets done adding it. It is important to perform the following actions in order. Specifically:"
[21:14:16] yeah for sure, just wrt that state in particular, the instructions never have you traverse `monitoring_setup` on the way back
[21:16:40] Okay I think I'm comfortable going straight to `lvs_setup` given that understanding. The fact that the instructions don't specifically say to traverse the `service.yaml` state change step by step seems to imply that nothing will explode (famous last words ofc)
[21:16:58] I will need to heed this: `The DNS record must have been removed in the previous step, otherwise it will trigger an alert.` so I will make sure to run that DNS step
[21:17:33] We don't understand the problem exactly but it can be summarized as "the nodes are borked" so it won't hamper our ability to investigate if we remove the DNS entry
[21:18:01] note this would have the next step: Check in with #wikimedia-traffic that your change looks good and that now is a good time for a PyBal restart.
[21:18:48] Hmm
[21:21:41] mutante: is that true? I think if we're only backing out to `lvs_setup` then he only needs to update icinga and authdns, he shouldn't need to restart pybal
[21:22:52] if we were fully removing the service then yeah we'd need to restart pybal and that definitely requires coordination with traffic, but I think the plan is only to back out the monitoring
[21:22:58] I don't know, the Wikitech page talks about the pybal restarts right after "you just need to change the state of your service to lvs_setup:" though
[21:23:16] rzl / mutante: yes I was about to say the same, looking at https://wikitech.wikimedia.org/wiki/LVS#Remove_the_service_from_the_load-balancers_and_the_backend_servers that step is for going back one step before `lvs_setup`
[21:23:35] rzl's summary is correct, the plan is only to back out monitoring
[21:23:50] I'm hoping it's just the following:
[21:23:53] remove dns discovery (https://wikitech.wikimedia.org/wiki/LVS#Remove_the_discovery_DNS_record), state change to `lvs_setup`, run puppet
[21:23:59] yeah, that's under "Remove the service from the load-balancers and the backend servers" which is the next step in fully removing a service -- if we're stopping before that, I think we're fine
[21:24:42] that summary lgtm, where "run puppet" means specifically on hosts `'A:icinga or A:dns-auth'`
[21:25:05] if you want to get a tie-breaker, I'd ask in -traffic to be sure ;) but based on the docs I feel good about that plan
[21:27:34] Great
[21:27:37] I'm going to tie-break in reverse alphabetical order of first name, which coincidentally means we go with that plan :P
[21:27:43] ahaha
[21:27:52] Updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/736564 accordingly
[21:28:42] (I think it's the correct answer but I'd still like to get mutante's okay before we go ahead, or else dig further until we get consensus on this)
[21:29:41] No argument there, consensus never hurts when we're touching LVS
[21:30:48] Ideally we still have the hosts in pybal but just not alerting, thus avoiding the need to restart, which should be the plan we've outlined above, but there are definitely plenty of moving parts for sure
[21:31:39] oh, you don't need a tie break, do it if you are confident with it. I'm ok with it
[21:31:50] 👍
[21:31:56] I want to do the same thing tomorrow :p
[21:33:43] https://gerrit.wikimedia.org/r/c/operations/dns/+/736585 is the DNS side of things
[21:37:37] rzl: mutante: either of you two mind +1ing https://gerrit.wikimedia.org/r/c/operations/dns/+/736585 if it looks as expected?
[21:38:34] looking
[21:39:13] The only uncertainty I have (and this uncertainty doesn't affect the DNS patch itself) is whether I need to muck around with removing the `confctl` discovery entries. I think I won't have to though
[21:39:31] (ie reversing step 2 of https://wikitech.wikimedia.org/wiki/LVS#For_active/active_services)
[21:40:10] you shouldn't need to
[21:40:23] you're backing up to an earlier stage, but that stage already has those entries present
[21:40:41] or, hm, maybe not
[21:41:14] Yeah on second thought if pybal is using the presence of those entries as a way to know what to do it might affect things
[21:41:25] I guess it depends how the monitoring is implemented, will glance at the puppet code
[21:41:40] all I really know is discovery DNS has to be the last thing merged.. after LVS config
[21:42:34] ryankemper: DNS patch looks good to me, +1ing -- don't forget to authdns-update after merging
[21:42:37] since monitoring is already alerting I think you can just try it though
[21:43:17] Good point
[21:43:32] I do think https://github.com/wikimedia/puppet/blob/production/modules/service/manifests/monitor.pp#L6 and https://github.com/wikimedia/puppet/blob/production/modules/service/manifests/monitor.pp#L48 imply that we're okay not removing the entry... but we'll see shortly
[21:47:26] DNS changes rolled out, proceeding to the `lvs_setup` step: https://gerrit.wikimedia.org/r/c/operations/puppet/+/736564
[21:56:47] Hmm, maybe we do need to remove the conftool entries. Not seeing the alerts go away after the dns and puppet changes were rolled out
[21:58:23] fwiw, I tried it with both pooled=no (in config) and pooled=inactive (not in config)
[21:58:39] and originally thought setting it to inactive would fix this right away
[21:58:55] but it did not.. it added 2 new alerts
[21:59:22] first it was just the "marked as down but pooled"
[22:00:04] Yeah in my experience if the underlying hosts are down the "marked as down but pooled" is unavoidable since pybal can't depool all the hosts without crossing the depool threshold
[22:00:07] then it was also the "known to pybal but not to ipvs"
[22:00:35] so then I changed that one host back to "no"
[22:00:49] Inactive should just take the corresponding host out of consideration entirely I think
[22:00:54] also they always just talk about wcqs2003 and not the other backends
[22:01:21] yea, so if the alert is "marked as down but pooled" and you.. depool them.. wouldn't you expect a recovery?
[22:01:30] I did.. but did not get one
[22:01:58] but also these alerts are often an issue when new services are added
[22:02:06] I forget how pybal behaves when the operator manually tries to depool
[22:02:09] and I remember wondering about those not going away
[22:02:17] until eventually someone from traffic restarted pybal
[22:02:21] and then it was fixed
[22:02:29] But at least as far as the automation is concerned, pybal "wants" to depool but can't because the threshold ties its hands
[22:02:36] I think I remember the docs saying manual might even work the same way... sec
[22:02:52] it was a similar "weird" case and then it was like "yea, pybal needed a restart" for some reason
[22:03:08] > the presence of a safety measure like the depool threshold in pybal means that setting "pooled=no" doesn't mean just changing the value in etcd will guarantee the server is not serving traffic anymore
[22:03:19] from https://wikitech.wikimedia.org/wiki/Load_Balanced_Services_And_Conftool
[22:03:50] still not entirely clear to me what exactly that means though :P like if we manually depool does pybal end up repooling it because it detects it under the threshold, or does the host never get depooled in the first place, or neither
[22:05:00] Anyway... so what I currently need to figure out is whether the alerts are hanging around because of those conftool entries, or if this is a case where we need to actually restart pybal
[22:05:05] I'm hoping/thinking the former but not quite sure
[22:07:30] if the alert says "... but pooled" even though none are pooled that is already strange to me
[22:07:57] ryankemper: what about wcqs2001 and 2002.. they are still pooled, I only ever depooled 2003
[22:08:10] because the alerts only talk about that one
[22:08:23] but let's depool the other 2 as well and refresh the checks again?
[22:09:05] mutante: seems worth a try, I just dropped a question in #traffic in case one of the traffic team is around. I'll see what happens when I depool `wcqs200[1,2]`
[22:09:35] I would say let's set all 3 to "no" so they are False here:
[22:09:37] https://config-master.wikimedia.org/pybal/codfw/wcqs
[22:09:50] and then refresh the "...but pooled" alerts
[22:10:06] if they still don't change.. then my bet is on pybal restart to clear them
[22:10:53] mutante: all of the wcqs2* hosts are difficult or impossible to reach via ssh, so I'd need to manually depool with conftool
[22:11:04] Which is fine but does make me wonder if I should just nuke the conftool entries entirely and see what happens :P
[22:11:43] ryankemper: should I do that? I already have that confctl open
[22:11:55] nuking entirely if this does not help?
[22:12:05] mutante: yeah I guess depool for now, then we'll see about nuking
[22:12:08] no -> inactive -> remove
[22:12:12] <_joe_> ryankemper: what state is the service in?
[22:12:28] _joe_: `lvs_setup` now (from `production` previously)
[22:12:31] <_joe_> if it's in service_setup, you can safely nuke them
[22:12:51] It's one step past service_setup, so presumably not
[22:12:59] <_joe_> ok so you need first to roll to service_setup (IIRC, don't trust my memory past 11pm)
[22:13:17] And actually looking at the docs again I was thinking lvs_setup was where we actually add the conftool entries, but it looks like those entries are already there but not set to pooled until that step
[22:13:19] <_joe_> that will remove the services from pybal after a restart
[22:13:20] he is trying to roll backwards because the hosts turned out to be broken
[22:13:37] <_joe_> so you need
[22:13:42] _joe_: okay, we don't want to go all the way back to `service_setup`, but I think what you said fixed my understanding of when the conftool entries are first added
[22:14:09] taking a step back, what we're trying to figure out is why removing the dns discovery entry and merging the puppet change was not sufficient to clear the alerts
[22:14:20] <_joe_> which alerts?
[22:14:34] <_joe_> sorry I didn't read all the backlog
[22:14:35] `PyBal IPVS diff check` and `PyBal backends health check`
[22:14:38] <_joe_> oh yes
[22:14:42] No worries, there's plenty of backlog, ask away :)
[22:14:51] <_joe_> those will only clear once you move to service_setup
[22:14:55] <_joe_> and restart the pybals
[22:15:03] <_joe_> after running puppet on the lvs nodes
[22:15:28] there's that restart needed, right after moving to service_setup, as per comment earlier
[22:15:30] <_joe_> but I can take care of that tomorrow morning if you drop me an email if you don't feel confident restarting pybal
[22:15:36] <_joe_> mutante: yes
[22:15:38] Hmm, that's unfortunate
[22:15:58] <_joe_> ryankemper: lvs_setup is the state where you've added the services to lvs
[22:16:11] <_joe_> so pybal considers the service set up
[22:16:13] _joe_: I've restarted pybal before so can handle that (altho will presumably need traffic's confirmation that there's nothing going on that should prevent that)
[22:16:14] <_joe_> and alerts
[22:16:28] can we do 2 things at a time and merge my LVS config tomorrow together while also fixing this? :)
[22:16:33] Interesting, so I guess the monitoring referred to in `monitoring_setup` is other monitoring?
[22:16:35] will need the same thing
[22:16:45] and I just did not want to do it while it has existing alerts, heh
[22:16:50] <_joe_> ryankemper: yes it's the service monitoring
[22:16:57] <_joe_> the one that pages us if it breaks
[22:17:28] <_joe_> ryankemper: to explain why this makes sense
[22:17:35] <_joe_> imagine you're setting up the service
[22:17:44] <_joe_> you get to lvs_setup, but your backends are broken
[22:17:53] <_joe_> pybal will tell you (and in fact, it is)
[22:18:05] <_joe_> so you don't go on enabling monitoring and paging everyone :)
[22:20:32] Ah, that makes sense :)
[22:20:51] Okay rolling the pybal restarts tomorrow sounds good to me
[22:23:46] <_joe_> ack
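For reference, the state machine discussed above is driven by the `state` field of the service's entry in the hiera service catalog; a minimal sketch of the rollback applied here, with all other fields elided and the surrounding schema not shown:

```yaml
# Sketch only: the real catalog entry has many more fields (ip, port, lvs,
# monitoring, ...) which are elided here.
wcqs:
  # was: production. Rolling back to lvs_setup keeps the LVS/pybal config but
  # drops the paging service monitoring; per _joe_, the pybal-level alerts
  # only clear after a further rollback to service_setup plus a puppet run on
  # the lvs hosts and a pybal restart.
  state: lvs_setup
```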