[00:50:04] jhathaway: any west-coasters around who have used etcd/conftool/dbctl?
[00:50:15] oops, sorry for the ping, that was supposed to be a general call for help :)
[00:50:22] So, once again... any west-coasters around who have used etcd/conftool/dbctl?
[00:51:26] Very basically
[01:00:04] andrewbogott: how can I help?
[01:00:59] brett: I am slightly further along now than when I asked that question... but possibly now I'm at the hard part.
[01:01:29] This is deployment-prep. I'm rebuilding the etcd setup, but the dump I generated seems to not contain whatever it was that it needed to contain :(
[01:02:04] So until a minute ago my question was 'how do I get dbctl working?' but now my question is the much broader 'how do I configure this?'
[01:02:45] I believe the data I want is approximately the dump in https://phabricator.wikimedia.org/T276462#6882968
[01:03:00] -- would you try to inject that with dbctl or just via curling etcd directly?
[01:03:25] sure
[01:04:12] Oh, sorry, I was asking for advice rather than asking you to do it, but if you want to actually do it, better yet :)
[01:04:22] lawl
[01:04:26] dbctl is on deployment-cumin-3.deployment-prep.eqiad1.wikimedia.cloud
[01:04:35] Well, I'm afraid I am not the right person for advice but I'm happy to be a duck
[01:06:16] I'm confused by the action/get bits in {"action":"get","node":{"key":"/conftool","dir":true,"nodes":[{"key":"/conftool/v1","dir":true,"modifiedIndex":5,"createdIndex":5}],"modifiedIndex":5,"createdIndex":5}}
[01:07:07] why's that?
[01:07:25] just trying to figure out where the key/value begins and where the etcd overhead ends
[01:08:05] I guess all of that just boils down to 'there is a dir named conftool'
[08:30:43] kamila_: I'll proceed to bump partitions for the mw accesslog topic https://phabricator.wikimedia.org/T369256
[08:32:10] ack, thanks godog
[15:43:47] herron: bblack: shortly after 17:00 UTC we'll be depooling both DCs from the appservers-r[ow] and api-r[ow] discovery services as a "soft" turn-down, before removing the LVS services tomorrow.
[15:43:47] no action required, just flagging it for visibility in case something goes wrong.
[15:43:47] sukhe: ^ I guess we'll find out the answer to the servfail vs. nxdomain question empirically :)
[15:44:51] swfrench-wmf: nice!
[15:45:26] swfrench-wmf: 👍
[15:52:47] nice :D
[16:01:22] fyi, we switched the firewall provider for production gerrit from iptables to nftables, and gitlab as well, so this is becoming slightly more common. We have about 50 hosts out of 2000 on it now. You could consider it for other things.
[17:22:58] FYI, holding for the moment on the depooling, as it seems there's some analytics workload still hitting the api appservers in eqiad
[17:27:02] gmodena: btullis ^
[17:28:25] I see 16:30 btullis@deploy1002: Finished deploy [analytics/refinery@a203f30] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@a203f30c] (duration: 03m 41s)
[17:28:25] in https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:37] but I think we need a puppet patch to bump the version used?
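To unpack the [01:06:16] confusion earlier in this log: everything besides the nested "key"/"value" pairs is just the etcd v2 keys-API response envelope. "action", "dir", "modifiedIndex", and "createdIndex" are etcd bookkeeping; the conftool data itself is the leaf key/value pairs under "node". Below is a minimal sketch of peeling the envelope away, using a made-up response shaped like the one quoted there (the host/port in the comment and the sample leaf key/value are illustrative, not the real deployment-prep dump):

```python
import json

# A response shaped like the output of
#   curl -s "http://<etcd-host>:2379/v2/keys/conftool?recursive=true"
# (host, port, and the sample leaf key/value below are illustrative).
raw = """
{"action": "get",
 "node": {"key": "/conftool", "dir": true, "nodes": [
   {"key": "/conftool/v1", "dir": true, "nodes": [
     {"key": "/conftool/v1/mediawiki-config/common/example",
      "value": "{\\"val\\": \\"eqiad\\"}"}]}]}}
"""

def leaves(node):
    """Yield (key, value) pairs, skipping the etcd envelope (dirs, indexes)."""
    if node.get("dir"):
        for child in node.get("nodes", []):
            yield from leaves(child)
    else:
        yield node["key"], node.get("value")

for key, value in leaves(json.loads(raw)["node"]):
    print(key, "=", value)
```

Either injection path mentioned above should then be workable: dbctl for the portion of the tree it manages, or direct PUTs to /v2/keys/<path> with a value parameter to recreate the layout by hand, in which case the key paths from the dump have to be reproduced exactly.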
[17:28:50] (16:14 btullis@deploy1002: Finished deploy [analytics/refinery@a203f30]: Regular analytics weekly train [analytics/refinery@a203f30c] (duration: 09m 23s)
[17:28:50] )
[17:29:28] ottomata: thank you for digging that up
[17:32:42] I'm going to go ahead and depool eqiad on the -rw services (i.e., the one pooled DC), as it's highly unlikely that's involved here
[17:33:18] it should also be safe to depool one DC (e.g., codfw) on the -ro services
[17:33:38] yes, we don't use -rw
[17:33:49] there are puppet patches needed to apply this change, looking now...
[17:34:51] ottomata: ack, thank you
[17:41:57] uhh, that SAL entry I mentioned is from weeks ago. grep fail!
[17:41:58] i need to deploy
[17:48:17] herron: bblack: as an update, all DCs are depooled on api-rw and appservers-rw, so both are now resolving to failoid. api-ro and appservers-ro are pooled only in eqiad, and I'm holding off for now on depooling there while the work above ^ is in progress
[17:51:14] ottomata: many thanks again for your help here. I see there's one or more puppet patches involved - do you need me to merge those?
[17:53:41] swfrench-wmf: i can merge, need the refinery scap deployment to finish first.
[17:53:45] test analytics cluster looks good
[17:56:31] unrelatedly, it seems something has gone awry with mw-on-k8s envoy metrics: https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s - going to investigate that shortly
[18:04:44] ah, correction - these are not envoy metrics, these are `mediawiki_http_requests_duration_count`, which IIRC comes from benthos ...
[18:36:37] swfrench-wmf: deployments all done. looking okay from our side
[18:36:42] let us know if you still see requests to the old endpoints
[18:40:06] ottomata: thank you very much for your help. alas, I'm still seeing requests from an-launcher1002 =/
[18:40:32] is the expectation that the workload that launches around :35 after the hour should no longer be running?
[18:41:31] hm
[18:41:37] i'm not sure which workload that is...
[18:41:39] alas, I can still see that in https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red (e.g.)
[18:41:56] let me hop over to that host and see if I can catch one in the act
[18:42:07] k
[18:43:12] actually, any job that uses an older refinery-job jar version will have this problem, even if the job itself isn't configured to use the service.
[18:43:43] The stupid EventStreamConfig WikimediaDefaults are statically instantiated (my fault), which (IIUC) means that any usage of the codebase will end up calling that API
[18:44:16] there are a lot of airflow usages....
[18:44:18] erg.
[18:47:42] ottomata: ah, got it. that makes sense, yeah (i.e., anything with an older jar could be doing this).
[18:48:28] FWIW, just going by the systemd timers, this looks like either gobblin-webrequest or gobblin-netflow
[18:48:36] oh gobblin...hmm
[18:48:36] right
[18:49:23] the one I definitively caught in the act yesterday was gobblin-webrequest_frontend_rc0.timer
[18:50:57] okay... thank you. swfrench-wmf: are there really only those few? if that is so, then maybe my concern about the older jars is unfounded?
[18:54:38] ottomata: it's a bit hard to tell, to be honest. the best I've been able to do for the moment is a shell one-liner on an-launcher1002 that periodically polls for connections to the eqiad api LVS address, then correlates them with a pid
[18:54:55] swfrench-wmf: gobblin def is a problem. i can see it in the code now.
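For reference, the gist of that "shell one-liner" approach (poll for established connections to the service address and map each one back to a local process) looks roughly like the Python sketch below. This is not the actual command used; the TARGET address is a placeholder for the eqiad api LVS VIP, and since it shells out to ss it needs root to see other users' pids. A pid found this way can then be mapped to a systemd unit, e.g. a gobblin timer's service, with `systemctl status <pid>`.

```python
import re
import subprocess
import time

TARGET = "10.2.2.22"  # placeholder: substitute the real eqiad api LVS address

while True:
    # List established TCP connections to TARGET: numeric, with owning process.
    out = subprocess.run(
        ["ss", "-tnp", "dst", TARGET],
        capture_output=True, text=True, check=False,
    ).stdout
    stamp = time.strftime("%H:%M:%S")
    for line in out.splitlines()[1:]:  # skip the header row
        fields = line.split()
        proc = re.search(r'users:\(\("([^"]+)",pid=(\d+)', line)
        if proc:
            print(stamp, fields[4], "<-", f"{proc.group(1)} (pid {proc.group(2)})")
        else:
            print(stamp, line.strip())
    time.sleep(5)
```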
[18:54:59] so we need to at least fix that
[18:55:05] I think that is not going to get done today...i'm sorry
[18:55:18] I'm sorry we didn't catch this when you all emailed. Gabriele and I did a codesearch and thought we were fine.
[18:56:01] ottomata: no worries, and thanks so much for your help! yeah, we similarly didn't see anything in codesearch that we had not otherwise covered
[18:56:33] IMO, it's fine to pause here while we get this sorted
[18:57:38] if it would be _helpful_ as a brute-force way of identifying what's affected, I can go ahead and temporarily depool the DCs on the api-ro service
[18:57:48] and we could see what breaks :)
[19:17:39] bblack: last update on changes for today - I'm going to move ahead and depool the one pooled DC on appservers-ro, since the straggler analytics workloads all should be on api-ro (which I'll leave pooled in eqiad)
[19:25:03] ok :)
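As a closing note on verifying the "soft" turn-down state described above: the quickest check is simply to resolve each discovery name and see what it currently points at (per the log, the depooled -rw names should now resolve to the failoid address, and this is also where the servfail vs. nxdomain question would show up empirically). A small sketch, assuming the usual <service>.discovery.wmnet naming:

```python
import socket

# Discovery names from the turn-down above; the .discovery.wmnet suffix is
# the usual convention and is assumed here.
NAMES = [
    "api-rw.discovery.wmnet",
    "appservers-rw.discovery.wmnet",
    "api-ro.discovery.wmnet",
    "appservers-ro.discovery.wmnet",
]

for name in NAMES:
    try:
        addrs = sorted({info[4][0] for info in socket.getaddrinfo(name, None)})
        print(f"{name:32} -> {', '.join(addrs)}")
    except socket.gaierror as exc:
        # A SERVFAIL or NXDOMAIN on a fully turned-down name lands here.
        print(f"{name:32} -> resolution failed: {exc}")
```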