[07:28:16] brett: mmhh probably the easiest is to try and re-enroll the client i.e. modules/pontoon/files/enroll.py --stack yourstack --force
[07:28:35] brett: that will nuke the client's puppet keypair from the client and the server
[10:09:04] Would somebody be willing to add me to #Trusted Contributors in Phabricator please? https://phabricator.wikimedia.org/project/members/3104/
[10:14:13] btullis: I'm not 100% convinced you're trusted™ enough :-P
[10:14:24] I didn't even know about that group, but I'm part of it... added you :D
[10:14:35] do you know what it's used for?
[10:14:41] should be part of the onboarding?
[10:16:00] volans: Many thanks. I think it gives me rights to create havoc :-) (Actually, I think it gives me rights to create/edit workboards in Phab)
[10:19:43] Quite possibly should be part of onboarding, but I've managed without it for nearly a year so maybe not. Cheers anyway.
[10:20:41] "The purpose: To serve as a minimal policy control for access to certain features in phabricator which might be prone to abuse. Things like creating a paste, editing tasks, uploading files, or anything else that we identify to be an easy target for spammers or other kinds of abuse from outside the community."
[10:22:00] Unexpected benefit - it allows me to see burndown charts. I've been looking for those for ages. :-)
[10:23:14] ultimate source of truth
[10:24:03] IDK they were restricted
[10:26:24] volans: it's probably something worth adding to onboarding
[10:28:08] I agree
[14:09:35] For https://phabricator.wikimedia.org/T307641 we need some IP addresses for aqs20[01-12]-{a,b}.codfw.wmnet (like e.g. aqs1015{,-a,-b}.eqiad.wmnet) - they can then be added in the placeholder places in https://gerrit.wikimedia.org/r/c/operations/puppet/+/802604/5/hieradata/role/common/aqs_next.yaml . How does one go about allocating these / getting them allocated?
[14:13:13] Emperor: o/ the IP addresses for cassandra are allocated by DC-OPS when prepping the nodes, in theory
[14:13:31] at provisioning time
[14:13:43] you can add them later on manually on netbox (basically adding a new address on the main interface) but it may be tedious for 12 nodes
[14:13:53] and two instances each
[14:15:08] this is what the automation does:
[14:15:09] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-extras/+/refs/heads/master/customscripts/interface_automation.py#1260
[14:15:36] the UI just asks for how many cassandra instances to allocate
[14:16:56] So is it just that the new nodes (which appear in netbox) haven't been fully allocated yet, and we can just wait for DCops to finish this work?
[14:17:26] no
[14:17:34] the provision script has already been run, like for example https://netbox.wikimedia.org/dcim/devices/4157/interfaces/
[14:17:43] but was run with no cassandra instances
[14:17:54] either it was forgotten or it was not specified in the provision task
[14:20:01] So is the only option now to manually add 24 IPs by hand? :(
[14:21:00] depends how much you're willing to pay ;)
[14:22:32] * Emperor considers the merit of buck-passing instead ;-p
[14:23:43] looks like https://phabricator.wikimedia.org/T305568 didn't explicitly say "please provision as cassandra hosts"
[14:25:08] alas, looks like the equivalent new eqiad provisioning ticket likewise. https://phabricator.wikimedia.org/T305570
[14:25:27] tbh it's not the first time, I've had to fix this multiple times...
[14:26:12] :(
[14:26:53] should we have something like if name starts with "aqs" then don't accept 0 cassandra instances?
[14:27:05] or require the user to input 0 explicitly
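A minimal sketch of the first suggestion, assuming the provision script receives the hostname and the requested cassandra instance count as inputs; the names below are illustrative only, not the actual variables in interface_automation.py:

    # Hypothetical guard, not the real interface_automation.py logic: refuse a
    # zero cassandra-instance count for hosts whose name implies per-instance IPs.
    CASSANDRA_HOST_PREFIXES = ("aqs",)  # assumption: the prefix is a reliable signal

    def validate_cassandra_instances(hostname: str, cassandra_instances: int) -> None:
        """Raise if a cassandra-style host is being provisioned with 0 instances."""
        if cassandra_instances == 0 and hostname.startswith(CASSANDRA_HOST_PREFIXES):
            raise ValueError(
                f"{hostname} looks like a cassandra host but 0 instances were "
                "requested; set an explicit instance count in the provision task."
            )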
[14:27:51] I think that the problem is that it's missing from the provision template
[14:28:11] and/or it's not asked at provision time
[14:28:44] Mmm
[14:28:55] also this whole thing IMHO should disappear, I think they could perfectly well work with the same hostname and different ports
[14:29:29] anyway, I can programmatically provision those via netbox internal APIs if needed
[14:29:41] * Emperor flutters eyelashes at volans
[14:29:51] pretty please? :)
[14:31:19] dropping the DNS records for the instances was discussed before, but eventually discarded: https://phabricator.wikimedia.org/T269328
[14:34:21] ok, so not a technical limitation, but tooling/convenience
[14:35:41] and we have this unicorn and exception in the provision script for just ~40 hosts in eqiad
[14:36:52] I know, I also would like to standardize it
[14:37:00] at least they're on private IPs :)
[14:37:12] yeah. It might be worth reopening; if this is mostly about convenience we could also extend the cassandra tool to use a programmatic local name which identifies the instance
[14:37:49] cool, yeah I can re-open it for further discussion
[14:38:28] I think urandom (who isn't in this channel) is in general thinking about how to do Cassandra better, but I think that's a post-deploying-these-hosts sort of thing
[14:51:01] Emperor: could you confirm all of these?
[14:51:01] ['aqs2001', 'aqs2002', 'aqs2003', 'aqs2004', 'aqs2005', 'aqs2006', 'aqs2007', 'aqs2008', 'aqs2009', 'aqs2010', 'aqs2011', 'aqs2012']
[14:51:09] -a and -b (not -c)
[14:56:54] volans: Yes, that's correct, thanks
[14:58:02] ack, proceeding
[15:02:16] running the dns cookbook now
[15:09:13] thanks, you're a star :)
[15:09:50] weird, the cookbook is failing to pull on the authdns hosts, checking
[15:14:26] XioNoX: this is weird, I'm wondering if it's a problem with the new infra
[15:14:45] ah? what's the error?
[15:14:46] Auth DNS is Opinionated, doesn't like recipes
[15:14:53] git fetch is not fetching the new commit
[15:15:11] chrisj just ran the cookbook a couple of minutes before me and it worked
[15:15:27] the remote is
[15:15:27] origin https://netbox-exports.wikimedia.org/dns.git (fetch)
[15:15:35] is that passing via the CDN?
[15:15:42] yep
[15:15:52] let me try modifying it to the discovery address
[15:16:14] is there a discovery for that one?
[15:16:16] I guess not
[15:16:33] let me check, I think the CDN uses the discovery
[15:17:04] yeah netbox-exports.discovery.wmnet is an alias for netbox.discovery.wmnet.
[15:17:04] I mean I can't put netbox.discovery as remote, it would not hit the right virtualhost in nginx or apache
[15:17:09] ah there is
[15:17:35] nah
[15:17:35] SSL: certificate subject name (netbox.wikimedia.org) does not match target host name 'netbox-exports.discovery.wmnet'
[15:17:44] it's probably not in the vhost alias
[15:17:56] I see
[15:18:59] git clone works on those hosts (e.g. in my home dir)
[15:19:06] could it be a transient issue?
[15:19:36] what's your latest SHA1?
[15:19:49] or you mean clone with netbox-exports.discovery.wmnet
[15:20:12] with https://netbox-exports.wikimedia.org/dns.git
[15:20:22] fetch works, says there is nothing to fetch
[15:20:33] yeah, latest commit is from chris f013f207adf1b75d44eb8254a712360991bac940
[15:20:41] it should be fc5c5b3607087d4f67337012d135f9d9c8d9229b
[15:21:06] could it be cache related?
[15:21:21] that's what I'm thinking, but also checking other things to exclude
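One quick check for the cache theory is to fetch the repository's smart-HTTP refs advertisement through the CDN and see whether the expected tip SHA1 is in it; a small sketch, with an illustrative helper name (the URL is the repo remote above, and the Age/x-cache headers are only inspected if the edge happens to set them):

    import urllib.request

    REFS_URL = ("https://netbox-exports.wikimedia.org/dns.git"
                "/info/refs?service=git-upload-pack")

    def cdn_advertises(expected_sha1: str, url: str = REFS_URL) -> bool:
        """Fetch the git smart-HTTP refs advertisement and report whether the
        expected tip SHA1 appears in it; if it doesn't, the edge is likely
        serving a stale cached copy."""
        with urllib.request.urlopen(url) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            # Cache-related headers, if present, hint whether this came from cache.
            print("Age:", resp.headers.get("Age"), "x-cache:", resp.headers.get("x-cache"))
        return expected_sha1 in body

    # e.g. with the SHA1 expected in the conversation above:
    # cdn_advertises("fc5c5b3607087d4f67337012d135f9d9c8d9229b")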
[15:22:28] but why now?
[15:22:35] why was it working for various runs in the last few days
[15:22:46] is it just that they were sparse enough for the cache to get evicted?
[15:22:54] could be, yeah
[15:26:12] but I got the update locally...
[15:26:27] what do you mean?
[15:27:01] that my local copy got updated to the latest SHA1
[15:27:10] so I cloned it from eqsin and I have your change there
[15:27:34] probably the cache there didn't have it in memory
[15:27:49] all authdns hosts failed
[15:27:59] also the ones in eqsin...
[15:28:02] I can retry
[15:28:16] I tried from bast5002
[15:28:53] * volans trying on dns5001
[15:29:56] that stays stuck at Cloning into 'dns'... for now
[15:30:06] it took ages, yes
[15:31:45] XioNoX: so the clone on dns5001 didn't get the latest SHA1
[15:31:58] that's a .wikimedia.org host
[15:32:20] so is bast5002
[15:32:58] yeah
[15:34:04] I made https://gerrit.wikimedia.org/r/c/operations/puppet/+/804345 just in case
[15:34:11] but would be nice to be sure it's the issue
[15:35:54] thx
[15:37:25] XioNoX: can we easily wipe the caches for that domain?
[15:37:29] to test if it's that
[15:37:51] the 'pass' patch seems something we should use anyway, if that's the correct value to tell ats to not cache
[15:38:48] volans: there is a tool to clear per URL, but dunno if there is one for a full domain
[15:39:20] I don't know the exact urls git needs
[15:39:32] did you debug something similar the other day with jo.hn?
[15:41:05] yeah, might be the same issue
[15:41:10] jo.hn did find https://netbox-exports.wikimedia.org/dns.git/info/refs?service=git-upload-pack
[15:41:13] let me try
[15:41:38] https://wikitech.wikimedia.org/wiki/Kafka_HTTP_purging#One-off_purge for the one-off command if you need it
[15:41:47] yeah, I'm doing that
[15:41:52] from https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging#One-off_purge
[15:42:48] and now it magically works...
[15:42:51] so yes, it's the cache
[15:42:58] and needs to be tweaked to not cache that domain
[15:43:44] I'd guess my patch does the right thing, but that's only based on similar records
[15:44:30] yes, please deploy, it got 2 +1s :D
[15:44:55] alright :)
[15:44:57] Emperor: your records are live
[15:45:07] probably need to add them to hiera...
[15:45:36] details in https://phabricator.wikimedia.org/T305568#7992643
[15:48:55] volans: I ran puppet manually on cp2041.codfw.wmnet and nothing is on fire
[15:49:47] volans: thank you!
[15:49:47] running it on all the cache::text_haproxy
[15:50:30] ack, thx
[15:58:38] !log ganeti3002 rebooting into firmware update then reimage via T308238
[15:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:41] T308238: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238
[16:00:16] godog: Regarding your pontoon reply (thanks for that), are you talking about enrolling a client *to* pontoon? The issue seems to be in pontoon itself rather than a client under pontoon's stewardship
[16:03:38] brett: handling a page, will reply shortly
[16:03:58] apologies :(
[16:17:26] brett: hehe no worries! ok, my reading of the message is that a pontoon client can't talk to the pontoon puppet server (?) if that's the case then re-enrolling might fix the issue, of course that's not really addressing the root cause, anyway I have to go shortly but happy to follow up, say on phab, with more info
[16:22:08] * godog off
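For reference, the "programmatically provision those via netbox internal APIs" step that produced Emperor's records above is done inside the Netbox custom script linked earlier; a rough external equivalent with the pynetbox client might look like the sketch below. The token, the domain argument, and the way the parent prefix is picked are illustrative assumptions, not the actual automation:

    import pynetbox

    # Hypothetical sketch: the real allocation happens inside the Netbox custom
    # script (interface_automation.py), not via an external pynetbox client.
    nb = pynetbox.api("https://netbox.wikimedia.org", token="PLACEHOLDER")

    def add_cassandra_instance_ips(device_name, domain, instances=("a", "b")):
        """Attach one extra IP per cassandra instance to the interface that
        carries the device's primary IPv4 address (illustrative only)."""
        device = nb.dcim.devices.get(name=device_name)
        primary = nb.ipam.ip_addresses.get(device.primary_ip4.id)
        # Pick the most specific prefix containing the primary address.
        candidates = nb.ipam.prefixes.filter(contains=str(primary.address).split("/")[0])
        prefix = max(candidates, key=lambda p: int(str(p.prefix).split("/")[1]))
        for suffix in instances:
            prefix.available_ips.create({
                "assigned_object_type": "dcim.interface",
                "assigned_object_id": primary.assigned_object_id,
                "dns_name": f"{device_name}-{suffix}.{domain}",
            })

    # e.g. add_cassandra_instance_ips("aqs2001", "codfw.wmnet")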
[16:35:04] has anyone else had trouble reimaging Dell chassis to bullseye? Supposedly a firmware update fixed it here (https://phabricator.wikimedia.org/T309343#7971329), but before I ask to have all the elastic hosts' fw updated, curious if anyone's seen this too
[16:37:47] inflatador: not sure about the install but there have been other issues that were fixed with firmware updates too, like "Icinga says mgmt host is flapping"
[16:38:22] papaul would know about the install i'm sure
[16:39:08] yeah, my next question would be if there's an automated way to do those FW upgrades, probably a question for papaul too
[16:39:38] that's on the way, not before EOQ though
[16:40:23] nice
[17:51:02] inflatador: how does it fail? does it initiate the PXE sequence, but eventually fall through to loading the existing OS?
[17:51:38] that's what happened on several hosts and was only fixable via a firmware update
[17:54:20] moritzm: more context in dcops IRC, but yes, that's exactly what's happening. I'm going to eat lunch and then upgrade FW and see what happens
[17:57:53] I'm afraid firmware updates are the only option :-(
[17:58:36] I ran into it with the ganeti reimages stretch->buster and given how early it fails, there's also no way to work around this
[18:00:05] I suppose that between 4.9 and 4.19 (as this also happens with buster) the Linux kernel enforces something stricter and only the FW update rectifies the underlying issue
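On the automation question, current firmware/BIOS versions can at least be read out of band via the standard Redfish API on the iDRAC before deciding which hosts need the update; a minimal sketch, assuming the usual Dell resource id System.Embedded.1 and a reachable .mgmt hostname (the host, credentials, and function name are placeholders, and this only inspects the version, it does not perform any upgrade):

    import base64
    import json
    import ssl
    import urllib.request

    def idrac_bios_version(mgmt_host: str, user: str, password: str) -> str:
        """Read the BIOS version reported by a Dell iDRAC via Redfish."""
        url = f"https://{mgmt_host}/redfish/v1/Systems/System.Embedded.1"
        auth = base64.b64encode(f"{user}:{password}".encode()).decode()
        req = urllib.request.Request(url, headers={"Authorization": f"Basic {auth}"})
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE  # iDRACs commonly use self-signed certs
        with urllib.request.urlopen(req, context=ctx) as resp:
            return json.load(resp)["BiosVersion"]

    # e.g. print(idrac_bios_version("SOMEHOST.mgmt.codfw.wmnet", "root", "PASSWORD"))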