[07:28:16] brett: mmhh probably the easiest is to try and re-enroll the client i.e. modules/pontoon/files/enroll.py --stack yourstack --force
[07:28:35] brett: that will nuke the client's puppet keypair from the client and the server
[10:09:04] Would somebody be willing to add me to #Trusted Contributors in Phabricator please? https://phabricator.wikimedia.org/project/members/3104/
[10:14:13] btullis: I'm not 100% convinced you're trusted™ enough :-P
[10:14:24] I didn't even know about that group, but I'm part of it... added you :D
[10:14:35] do you know what it's used for?
[10:14:41] should be part of the onboarding?
[10:16:00] volans: Many thanks. I think it gives me rights to create havoc :-) (Actually, I think it gives me rights to create/edit workboards in Phab)
[10:19:43] Quite possibly should be part of onboarding, but I've managed without it for nearly a year so maybe not. Cheers anyway.
[10:20:41] "The purpose: To serve as a minimal policy control for access to certain features in phabricator which might be prone to abuse. Things like creating a paste, editing tasks, uploading files, or anything else that we identify to be an easy target for spammers or other kinds of abuse from outside the community."
[10:22:00] Unexpected benefit - it allows me to see burndown charts. I've been looking for those for ages. :-)
[10:23:14] ultimate source of truth
[10:24:03] IDK they were restricted
[10:26:24] volans: it's probably something worth adding to onboarding
[10:28:08] I agree
[14:09:35] For https://phabricator.wikimedia.org/T307641 we need some IP addresses for aqs20[01-12]-{a,b}.codfw.wmnet (like e.g. aqs1015{,-a,-b}.eqiad.wmnet) - they can then be added in the placeholder places in https://gerrit.wikimedia.org/r/c/operations/puppet/+/802604/5/hieradata/role/common/aqs_next.yaml . How does one go about allocating these / getting them allocated?
[14:13:13] Emperor: o/ the IP addresses for cassandra are allocated by DC-OPS when prepping the nodes, in theory
[14:13:31] at provisioning time
[14:13:43] you can add them later on manually on netbox (basically adding a new address on the main interface) but it may be tedious for 12 nodes
[14:13:53] and two instances each
[14:15:08] this is what the automation does:
[14:15:09] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-extras/+/refs/heads/master/customscripts/interface_automation.py#1260
[14:15:36] the UI just asks for how many cassandra instances to allocate
[14:16:56] So is it just that the new nodes (which appear in netbox) haven't been fully allocated yet, and we can just wait for DCops to finish this work?
[14:17:26] no
[14:17:34] the provision script has already been run, like for example https://netbox.wikimedia.org/dcim/devices/4157/interfaces/
[14:17:43] but was run with no cassandra instances
[14:17:54] either it was forgotten or it was not specified in the provision task
[14:20:01] So is the only option now to manually add 24 IPs by hand? :(
[14:21:00] depends how much you're willing to pay ;)
[14:22:32] * Emperor considers the merit of buck-passing instead ;-p
[14:23:43] looks like https://phabricator.wikimedia.org/T305568 didn't explicitly say "please provision as cassandra hosts"
[14:25:08] alas, looks like the equivalent new eqiad provisioning ticket likewise. https://phabricator.wikimedia.org/T305570
[14:25:27] tbh it's not the first time, I've had to fix this multiple times...
[14:26:12] :(
[14:26:53] should we have something like if name starts with "aqs" then don't accept 0 cassandra instances?
[14:27:05] or require the user to input 0 explicitly
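A minimal sketch of the first suggestion, assuming the provision script receives the hostname and the requested cassandra instance count as inputs; the names below are illustrative only, not the actual variables in interface_automation.py:

    # Hypothetical guard, not the real interface_automation.py logic: refuse a
    # zero cassandra-instance count for hosts whose name implies per-instance IPs.
    CASSANDRA_HOST_PREFIXES = ("aqs",)  # assumption: the prefix is a reliable signal

    def validate_cassandra_instances(hostname: str, cassandra_instances: int) -> None:
        """Raise if a cassandra-style host is being provisioned with 0 instances."""
        if cassandra_instances == 0 and hostname.startswith(CASSANDRA_HOST_PREFIXES):
            raise ValueError(
                f"{hostname} looks like a cassandra host but 0 instances were "
                "requested; set an explicit instance count in the provision task."
            )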
[14:27:51] I think that the problem is that it's missing from the provision template
[14:28:11] and/or it's not asked at provision time
[14:28:44] Mmm
[14:28:55] also this whole thing IMHO should disappear, I think they could perfectly well work with the same hostname and different ports
[14:29:29] anyway, I can programmatically provision those via netbox internal APIs if needed
[14:29:41] * Emperor flutters eyelashes at volans
[14:29:51] pretty please? :)
[14:31:19] dropping the DNS records for the instances was discussed before, but eventually discarded: https://phabricator.wikimedia.org/T269328
[14:34:21] ok, so not a technical limitation, but tooling/convenience
[14:35:41] and we have this unicorn and exception in the provision script for just ~40 hosts in eqiad
[14:36:52] I know, I also would like to standardize it
[14:37:00] at least they're on private IPs :)
[14:37:12] yeah. It might be worth reopening; if this is mostly about convenience we could also extend the cassandra tool to use a programmatic local name which identifies the instance
[14:37:49] cool, yeah I can re-open it for further discussion
[14:38:28] I think urandom (who isn't in this channel) is in general thinking about how to do Cassandra better, but I think that's a post-deploying-these-hosts sort of thing
[14:51:01] Emperor: could you confirm all of these?
[14:51:01] ['aqs2001', 'aqs2002', 'aqs2003', 'aqs2004', 'aqs2005', 'aqs2006', 'aqs2007', 'aqs2008', 'aqs2009', 'aqs2010', 'aqs2011', 'aqs2012']
[14:51:09] -a and -b (not -c)
[14:56:54] volans: Yes, that's correct, thanks
[14:58:02] ack, proceeding
[15:02:16] running the dns cookbook now
[15:09:13] thanks, you're a star :)
[15:09:50] weird, the cookbook is failing to pull on the authdns hosts, checking
[15:14:26] XioNoX: this is weird, I'm wondering if it's a problem with the new infra
[15:14:45] ah? what's the error?
[15:14:46] Auth DNS is Opinionated, doesn't like recipes
[15:14:53] git fetch is not fetching the new commit
[15:15:11] chrisj just ran the cookbook a couple of minutes before me and it worked
[15:15:27] the remote is
[15:15:27] origin https://netbox-exports.wikimedia.org/dns.git (fetch)
[15:15:35] is that passing via the CDN?
[15:15:42] yep
[15:15:52] let me try modifying it to the discovery address
[15:16:14] is there a discovery for that one?
[15:16:16] I guess not
[15:16:33] let me check, I think the CDN uses the discovery
[15:17:04] yeah netbox-exports.discovery.wmnet is an alias for netbox.discovery.wmnet.
[15:17:04] I mean I can't put netbox.discovery as remote, it would not hit the right virtualhost in nginx or apache
[15:17:09] ah there is
[15:17:35] nah
[15:17:35] SSL: certificate subject name (netbox.wikimedia.org) does not match target host name 'netbox-exports.discovery.wmnet'
[15:17:44] it's probably not in the vhost alias
[15:17:56] I see
[15:18:59] git clone works on those hosts (e.g. in my home dir)
[15:19:06] could it be a transient issue?
[15:19:36] what's your latest SHA1?
[15:19:49] or you mean clone with netbox-exports.discovery.wmnet
[15:20:12] with https://netbox-exports.wikimedia.org/dns.git
[15:20:22] fetch works, says there is nothing to fetch
[15:20:33] yeah, latest commit is from chris f013f207adf1b75d44eb8254a712360991bac940
[15:20:41] it should be fc5c5b3607087d4f67337012d135f9d9c8d9229b
[15:21:06] could it be cache related?
[15:21:21] that's what I'm thinking, but also checking other things to exclude
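One quick check for the cache theory is to fetch the repository's smart-HTTP refs advertisement through the CDN and see whether the expected tip SHA1 is in it; a small sketch, with an illustrative helper name (the URL is the repo remote above, and the Age/x-cache headers are only inspected if the edge happens to set them):

    import urllib.request

    REFS_URL = ("https://netbox-exports.wikimedia.org/dns.git"
                "/info/refs?service=git-upload-pack")

    def cdn_advertises(expected_sha1: str, url: str = REFS_URL) -> bool:
        """Fetch the git smart-HTTP refs advertisement and report whether the
        expected tip SHA1 appears in it; if it doesn't, the edge is likely
        serving a stale cached copy."""
        with urllib.request.urlopen(url) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            # Cache-related headers, if present, hint whether this came from cache.
            print("Age:", resp.headers.get("Age"), "x-cache:", resp.headers.get("x-cache"))
        return expected_sha1 in body

    # e.g. with the SHA1 expected in the conversation above:
    # cdn_advertises("fc5c5b3607087d4f67337012d135f9d9c8d9229b")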
[15:22:28] but why now?
[15:22:35] why was it working for various runs in the last few days
[15:22:46] is it just that they were sparse enough for the cache to get evicted?
[15:22:54] could be, yeah
[15:26:12] but I got the update locally...
[15:26:27] what do you mean?
[15:27:01] that my local copy got updated to the latest SHA1
[15:27:10] so I cloned it from eqsin and I have your change there
[15:27:34] probably the cache there didn't have it in memory
[15:27:49] all authdns hosts failed
[15:27:59] also the ones in eqsin...
[15:28:02] I can retry
[15:28:16] I tried from bast5002
[15:28:53] * volans trying on dns5001
[15:29:56] that stays stuck at Cloning into 'dns'... for now
[15:30:06] it took ages, yes
[15:31:45] XioNoX: so the clone on dns5001 didn't get the latest SHA1
[15:31:58] that's a .wikimedia.org host
[15:32:20] so is bast5002
[15:32:58] yeah
[15:34:04] I made https://gerrit.wikimedia.org/r/c/operations/puppet/+/804345 just in case
[15:34:11] but would be nice to be sure it's the issue
[15:35:54] thx
[15:37:25] XioNoX: can we easily wipe the caches for that domain?
[15:37:29] to test if it's that
[15:37:51] the 'pass' patch seems something we should use anyway, if that's the correct value to tell ats to not cache
[15:38:48] volans: there is a tool to clear per URL, but dunno if there is one for a full domain
[15:39:20] I don't know the exact urls git needs
[15:39:32] did you debug something similar the other day with jo.hn?
[15:41:05] yeah, might be the same issue
[15:41:10] jo.hn did find https://netbox-exports.wikimedia.org/dns.git/info/refs?service=git-upload-pack
[15:41:13] let me try
[15:41:38] https://wikitech.wikimedia.org/wiki/Kafka_HTTP_purging#One-off_purge for the one-off command if you need it
[15:41:47] yeah, I'm doing that
[15:41:52] from https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging#One-off_purge
[15:42:48] and now it magically works...
[15:42:51] so yes, it's the cache
[15:42:58] and needs to be tweaked to not cache that domain
[15:43:44] I'd guess my patch does the right thing, but that's only based on similar records
[15:44:30] yes, please deploy, it got 2 +1s :D
[15:44:55] alright :)
[15:44:57] Emperor: your records are live
[15:45:07] probably need to add them to hiera...
[15:45:36] details in https://phabricator.wikimedia.org/T305568#7992643
[15:48:55] volans: I ran puppet manually on cp2041.codfw.wmnet and nothing is on fire
[15:49:47] volans: thank you!
[15:49:47] running it on all the cache::text_haproxy
[15:50:30] ack, thx
[15:58:38] !log ganeti3002 rebooting into firmware update then reimage via T308238
[15:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:41] T308238: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238
[16:00:16] godog: Regarding your pontoon reply (thanks for that), are you talking about enrolling a client *to* pontoon? The issue seems to be in pontoon itself rather than a client under pontoon's stewardship
[16:03:38] brett: handling a page, will reply shortly
[16:03:58] apologies :(
[16:17:26] brett: hehe no worries! ok, my reading of the message is that a pontoon client can't talk to the pontoon puppet server (?) if that's the case then re-enrolling might fix the issue, of course that's not really addressing the root cause, anyway I have to go shortly but happy to follow up, say on phab, with more info
[16:22:08] * godog off
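For reference, the "programmatically provision those via netbox internal APIs" step that produced Emperor's records above is done inside the Netbox custom script linked earlier; a rough external equivalent with the pynetbox client might look like the sketch below. The token, the domain argument, and the way the parent prefix is picked are illustrative assumptions, not the actual automation:

    import pynetbox

    # Hypothetical sketch: the real allocation happens inside the Netbox custom
    # script (interface_automation.py), not via an external pynetbox client.
    nb = pynetbox.api("https://netbox.wikimedia.org", token="PLACEHOLDER")

    def add_cassandra_instance_ips(device_name, domain, instances=("a", "b")):
        """Attach one extra IP per cassandra instance to the interface that
        carries the device's primary IPv4 address (illustrative only)."""
        device = nb.dcim.devices.get(name=device_name)
        primary = nb.ipam.ip_addresses.get(device.primary_ip4.id)
        # Pick the most specific prefix containing the primary address.
        candidates = nb.ipam.prefixes.filter(contains=str(primary.address).split("/")[0])
        prefix = max(candidates, key=lambda p: int(str(p.prefix).split("/")[1]))
        for suffix in instances:
            prefix.available_ips.create({
                "assigned_object_type": "dcim.interface",
                "assigned_object_id": primary.assigned_object_id,
                "dns_name": f"{device_name}-{suffix}.{domain}",
            })

    # e.g. add_cassandra_instance_ips("aqs2001", "codfw.wmnet")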
[16:35:04] has anyone else had trouble reimaging Dell chassis to bullseye? Supposedly a firmware update fixed it here (https://phabricator.wikimedia.org/T309343#7971329), but before I ask to have all the elastic hosts' fw updated, curious if anyone's seen this too
[16:37:47] inflatador: not sure about the install but there have been other issues that were fixed with firmware updates too, like "Icinga says mgmt host is flapping"
[16:38:22] papaul would know about the install i'm sure
[16:39:08] yeah, my next question would be if there's an automated way to do those FW upgrades, probably a question for papaul too
[16:39:38] that's on the way, not before EOQ though
[16:40:23] nice
[17:51:02] inflatador: how does it fail? does it initiate the PXE sequence, but eventually fall through to loading the existing OS?
[17:51:38] that's what happened on several hosts and was only fixable via a firmware update
[17:54:20] moritzm: more context in dcops IRC, but yes, that's exactly what's happening. I'm going to eat lunch and then upgrade FW and see what happens
[17:57:53] I'm afraid firmware updates are the only option :-(
[17:58:36] I ran into it with the ganeti reimages stretch->buster and given how early it fails, there's also no way to work around this
[18:00:05] I suppose that between 4.9 and 4.19 (as this also happens with buster) the Linux kernel enforces something stricter and only the FW update rectifies the underlying issue
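On the automation question, current firmware/BIOS versions can at least be read out of band via the standard Redfish API on the iDRAC before deciding which hosts need the update; a minimal sketch, assuming the usual Dell resource id System.Embedded.1 and a reachable .mgmt hostname (the host, credentials, and function name are placeholders, and this only inspects the version, it does not perform any upgrade):

    import base64
    import json
    import ssl
    import urllib.request

    def idrac_bios_version(mgmt_host: str, user: str, password: str) -> str:
        """Read the BIOS version reported by a Dell iDRAC via Redfish."""
        url = f"https://{mgmt_host}/redfish/v1/Systems/System.Embedded.1"
        auth = base64.b64encode(f"{user}:{password}".encode()).decode()
        req = urllib.request.Request(url, headers={"Authorization": f"Basic {auth}"})
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE  # iDRACs commonly use self-signed certs
        with urllib.request.urlopen(req, context=ctx) as resp:
            return json.load(resp)["BiosVersion"]

    # e.g. print(idrac_bios_version("SOMEHOST.mgmt.codfw.wmnet", "root", "PASSWORD"))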