[01:48:49] netops, DC-Ops, Infrastructure-Foundations, SRE, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS...
[11:05:39] Bringing you the very best and most urgent code fixes: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/803482 ...
[11:05:48] already +2ed :D
[11:05:58] keep them coming ;)
[11:06:22] thanks Emperor
[11:06:32] I definitely need to get out more, I've just seen that prompt too many times over the last few weeks, and had to fix it :)
[11:06:45] ahahah
[11:16:34] :D
[12:53:36] fyi all, for the interested: me and XioNoX are currently at step 7 in the following rough plan to upgrade netbox https://etherpad.wikimedia.org/p/nTbAb5yPvBFATUylV7Vj
[12:54:14] thanks! so I don't know the old one, but there are a couple of things in the browser console that we might want to look at later, nothing urgent/blocking, so go ahead with the plan
[12:56:03] Puppet, puppet-compiler, Infrastructure-Foundations, SRE, User-jbond: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (jbond)
[12:56:23] ack thanks volans
[13:30:35] volans: in relation to netbox-extra we have a bit of a chicken and egg, see https://integration.wikimedia.org/ci/job/operations-dns-lint-docker/3920/console
[13:30:44] checking
[13:30:48] i think the best thing is to override CI and merge this CR but interested in your thoughts
[13:31:52] jbond: if the dns repo becomes private CI will fail for the operations/dns repo
[13:31:59] so it must stay public IMHO
[13:32:39] why can't we keep it public with the dyna?
[13:32:40] good point
[13:33:05] volans: we can, but we need to put it behind the caches too
[13:33:08] volans: we definitely can, i'll abandon that and create a change for the cache
[13:34:53] btw we need also a patch to https://github.com/cdanis/tunnelencabulator/blob/master/tunnelencabulator.py ;)
[13:35:12] I think we can contact the upstream maintainer :-P
[13:35:31] volans: he will tell you to send a PR :)
[13:38:01] [unrelated] for spicerack to get CI pass on your last CR we need https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/803316
[13:38:11] I can self-merge if you're too busy, it's just pylint stuff
[13:38:36] volans: XioNoX: https://gerrit.wikimedia.org/r/c/operations/dns/+/803514 https://gerrit.wikimedia.org/r/c/operations/puppet/+/803515
[13:38:49] volans: please self merge anything that is already looking good
[13:41:26] jbond: wrong name
[13:42:02] er, right, netbox-exports
[13:42:22] ahh one sec
[13:42:40] ah, netbox going behind text-lb? should be a very simple patch to tunnelencabulator indeed
[13:43:31] cdanis: correct, +CDN, -LVS, -public IP :D
[13:44:08] volans: I invite you to make the PR against tunnelencabulator :D
[13:44:23] volans: XioNoX: have updated the name, can you take another pass
[13:44:29] yep
[13:44:44] [nit] commit message is still the old one
[13:46:12] jbond: left a comment
[13:49:06] I edited in place the dns repo patch
[13:49:13] I can do the other too if needed
[13:49:22] (the puppet cache one)
[13:49:59] XioNoX: good catch, updated
[13:50:15] volans: the commit message is the same (as we just made the same change in a different zone)
[13:50:30] jbond: commit message has an "extra", then we're good
[13:50:31] jbond: I meant with the typo
[13:50:40] ahh doh!
[13:50:53] I fixed the one in the dns repo and +1ed
[13:50:53] done
[13:51:16] ack thanks
[13:51:22] +1ed
[13:51:31] overriding CI and merging the dns one now
[13:52:27] ok, I'm wondering if authdns-update will work
[13:52:45] yeah it should IIRC
[13:52:52] seems to be
[13:52:58] because it's using the cached exports data locally
[13:53:03] yes it worked
[13:53:05] ack
[13:57:02] ok merging cache change
[13:58:22] jbond: lmk when netbox-exports should be reachable again so I can test both locally (I have a prior checkout) and from CI with a recheck
[13:58:33] ack will do
[14:03:55] volans: fyi the change has gone through but it's not currently working, im checking
[14:04:18] did puppet run already across the caches?
[14:04:23] yes
[14:04:41] and i cleared cache for both netbox-exports.wikimedia.org and netbox-exports.discovery.wmnet
[14:05:22] I'm getting a 200 with curl, a Forbidden with my browser :/
[14:05:27] same
[14:05:44] but the 200 is some sort of landing page
[14:05:59] or the homepage of wikimedia.org
[14:06:04] i get a 302 for /dns.git but a 403 for /dns.git/
[14:06:28] yeah I was getting the same 200 before the cache change with my browser
[14:06:30] and it is coming from the netbox server
[14:06:55] jbond: is the vhost in the new server there?
[14:06:58] is the data there?
[14:07:34] $ sudo git log -1
[14:07:34] fatal: your current branch 'master' does not have any commits yet
[14:07:36] vhost is there and is responding correctly, checking data now
[14:07:51] /srv/netbox-exports/dns.git is empty
[14:08:08] let me try to fetch from the other netbox
[14:08:17] ack
[14:08:37] fyi the hiera has data, checking that
[14:10:10] the git config is missing the remote, adding it manually
[14:10:51] ahh it only has the two new servers not the old ones
[14:11:03] run
[14:11:03] # runuser -u netbox -- git -C "/srv/netbox-exports/dns.git" fetch netbox1001.wikimedia.org master:master
[14:11:20] removed manually added remote
[14:12:14] manually run git update-server-info
[14:12:18] not sure why the hook didn't run it
[14:12:27] now a git fetch from my local repo works
[14:12:37] nice
[14:12:38] testing CI on https://gerrit.wikimedia.org/r/c/operations/dns/+/803460
[14:12:46] ack thanks
[14:13:00] FYI CI for puppet (which uses the hiera repo) is working https://puppet-compiler.wmflabs.org/pcc-worker1002/35773/
[14:13:24] not sure why: curl "https://netbox-exports.wikimedia.org/" gives me the wikimedia.org homepage
[14:14:07] volans: not a problem for now
[14:14:18] CI seems to work on the dns repo
[14:14:37] although there is no 'netbox' reference in the logs anywhere, and I thought it should be there
[14:14:56] hmm im getting the following
[14:14:56] git clone https://netbox-exports.wikimedia.org/dns.git
[14:14:56] Cloning into 'dns'...
[14:14:57] fatal: repository 'https://netbox-exports.wikimedia.org/dns.git/' not found
[14:15:51] works for me
[14:15:52] $ git clone https://netbox-exports.wikimedia.org/dns.git foo-bar
[14:15:55] Cloning into 'foo-bar'...
[14:15:56] from my laptop
[14:17:38] I'm starting to make the spicerack release, ping me if you need anything else
[14:17:47] ack thanks will do
[14:19:42] oh i suspect whatever cache im hitting has the 404 cached
[14:21:25] jbond: https://netbox-exports.wikimedia.org/dns.git/ still returns a "forbidden" on my browser
[14:21:39] jbond: you can try this https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging#One-off_purge
[14:21:56] and git clone with no local directory also fails for me
[14:22:54] XioNoX: i think the 403 in the browser is valid (i also get that with curl)
[14:23:01] ok
[14:23:01] i get a not found when i do a git clone
[14:23:15] cdanis: thanks although im not 100% sure what i need to purge
[14:23:26] im guessing whatever git clone fetches first to see if a repo exists
[14:26:52] cdanis: thanks i think that got it after tracking down the url (https://netbox-exports.wikimedia.org/dns.git/info/refs?service=git-upload-pack)
[14:27:11] XioNoX: can you try cloning again
[14:27:29] jbond: all good now!
[14:27:33] cool
[14:32:42] XioNoX: volans: in relation to the following change https://gerrit.wikimedia.org/r/c/operations/puppet/+/803468/2/hieradata/common.yaml#1893 this would prevent users from running homer and spicerack from their laptop i think, as such im thinking we should ditch it?
[14:33:25] jbond: looking
[14:33:56] jbond: I think it only sets it in their respective config files
[14:34:36] jbond: why should a laptop run read netbox_api_url?
[14:34:49] XioNoX: ack that makes sense thanks :)
[14:35:30] so I think that would be safe to merge once spicerack is fixed
[14:35:37] ack
[14:35:44] release in progress... :)
[14:35:50] thanks :)
[14:35:51] thanks
[14:36:52] * jbond going to grab coffee
[14:51:28] release done, I'm upgrading cumin2002 and testing a few things, then running the dns.netbox cookbook
[14:52:08] ack
[14:52:09] cool
[14:57:30] * volans waiting for jenkins :D
[14:58:56] always
[15:03:12] running cookbook -d sre.dns.netbox "test" first to avoid too much noise
[15:03:23] should also be a noop in theory :D
[15:03:38] it's running on netbox1002.eqiad.wmnet, so that's something :)
[15:04:20] :D
[15:04:23] volans: the icinga state file says "message": "Netbox has uncommitted DNS changes, but last edit in Netbox is within 30 minutes",
[15:04:35] so there might be some changes
[15:04:46] no the file is stale
[15:04:51] because the cookbook has never run there
[15:04:59] but there are also changes :D
[15:05:15] weird though
[15:05:23] it's not getting some last code changes
[15:06:05] there is a local diff in reports/accounting.py
[15:06:11] - http = Http(proxy_info=ProxyInfo(PROXY_TYPE_HTTP, proxy_host, proxy_port))
[15:06:14] + http = Http(proxy_info=ProxyInfo(PROXY_TYPE_HTTP, proxy_host, int(proxy_port)))
[15:06:22] I'm reverting it and allowing the repo to get updated
[15:06:44] volans: the cookbook never ran, but the "check_netbox_uncommitted_dns_changes.service" did run, so it should compare what has been pulled from the git repo and the current status of the DB, no?
[15:06:57] hmm that relates to ... (goes to find cr)
[15:07:04] yeah there are diffs but is an artifact
[15:07:21] https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/802127
[15:07:44] could you please check also netbox2002 if it has local modifications in any repo?
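The one-line diff pasted at 15:06 wraps `proxy_port` in `int()`. A minimal sketch of the same idea follows, assuming the port arrives as text (for example from a `host:port` setting; the log does not show where `proxy_port` actually comes from). `parse_proxy` and `proxy.example.org` are hypothetical names used for illustration only, not part of netbox-extras.

```python
from typing import Tuple


def parse_proxy(setting: str) -> Tuple[str, int]:
    """Split a 'host:port' string and return the port as an int.

    Hypothetical helper: values read from config files or environment
    variables are strings, so the port is converted explicitly.
    """
    host, _, port = setting.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {setting!r}")
    return host, int(port)


# Placeholder hostname, not a real proxy.
proxy_host, proxy_port = parse_proxy("proxy.example.org:8080")
```

Converting once, at the point where the setting is read, keeps later calls that expect a numeric port from receiving a string.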
[15:07:52] yes
[15:09:53] jbond: the only diff was the line above in the local modification
[15:10:04] I'll leave it to you if that needs to be committed or not
[15:10:09] it's just the int() AFAICT
[15:10:37] volans: yes the CR above does need to be committed, it's a slight reworking of what i manually tested
[15:10:37] ok
[15:10:44] could use a quick review but i dont think its critical right now
[15:10:59] (will just break the accounting report)
[15:11:37] there were no diffs on netbox2002 apart from /srv/deployment/netbox/deploy which looks like something XioNoX may have done for the deploy, or possibly something scap did
[15:11:41] ?
[15:12:45] have the same thing on netbox1001
[15:12:51] ok then :D
[15:12:53] might be scap
[15:13:20] yes could be
[15:14:12] I'm stupid
[15:14:56] it was also on a branch, not only with local modifications :/
[15:17:28] ok
[15:17:28] 2022-06-07 15:17:22,223 [INFO] Nothing to commit!
[15:17:34] now we're good on that one
[15:17:41] alright
[15:18:05] so all outstanding issues are solved, right?
[15:18:41] the dns cookbook took ~2 minutes, compared to previous ~4 minutes
[15:18:46] that's already a nice improvement!
[15:19:20] uh
[15:19:43] I'm wondering why
[15:20:36] no idea, maybe i was lucky :D
[15:20:42] possibly because it's going via the cache
[15:20:51] it shouldn't
[15:21:04] it's localhost
[15:21:25] not sure then i dont think we really changed anything that could affect that
[15:21:37] Executing commands [cumin.transports.Command('cd /tmp && runuser -u netbox -- python3 /srv/deployment/netbox-extras/dns/generate_dns_snippets.py commit --batch "volans@cumin2002: test"', ok_codes=[0, 99])] on 1 hosts: netbox1002.eqiad.wmnet
[15:21:50] 3.7 vs 3.9?
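The cumin command logged at 15:21:37 is created with `ok_codes=[0, 99]`, so more than one exit code appears to count as success for that run. The sketch below illustrates the idea with the standard library only. It is not cumin's implementation, `succeeded` is a hypothetical helper, and the log does not say what exit code 99 means for `generate_dns_snippets.py`.

```python
import subprocess
import sys
from typing import Sequence


def succeeded(cmd: Sequence[str], ok_codes: Sequence[int] = (0,)) -> bool:
    """Run cmd and report success if its exit code is listed in ok_codes."""
    result = subprocess.run(list(cmd), check=False)
    return result.returncode in ok_codes


# A child process that exits with code 99, standing in for a real script.
exit_99 = [sys.executable, "-c", "import sys; sys.exit(99)"]

accepted = succeeded(exit_99, ok_codes=[0, 99])  # 99 is whitelisted here
rejected = succeeded(exit_99)                    # default accepts only 0
```

Listing the accepted codes explicitly lets a caller tell an expected non-zero exit apart from a real failure without parsing the script's output.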
[15:22:15] oh i guess could be that and some of the artifacts probably got updated as well
[15:22:53] the script generate_dns_snippets.py, is configured to pull data from api = https://netbox.wikimedia.org/
[15:23:15] so it does go through the caches, but not sure if anything is actually being cached, etc
[15:23:36] I was about to say the same, just checked
[15:23:45] that should be changed to the discovery address too I guess
[15:23:51] next step is to use the discovery address anyway, or jbond's suggestion to point directly yo localhost
[15:23:58] to*
[15:24:18] re: /etc/hosts - https://gerrit.wikimedia.org/r/c/operations/puppet/+/803508
[15:24:32] same for ganeti-sync.cfg and scripts.cfg
[15:24:49] not sure about report_check.yaml
[15:25:38] jbond: I think I'd prefer to use the discovery address instead of pinning to localhost, so the behavior is coherent between all the hosts. But happy to be convinced otherwise
[15:26:03] XioNoX: im happy with that
[15:26:41] just running pcc on 803468
[15:28:06] volans: https://puppet-compiler.wmflabs.org/pcc-worker1001/35776/netbox1002.eqiad.wmnet/index.html i think thats all of them
[15:28:12] XioNoX: ^^
[15:28:22] checking
[15:38:55] ok i have merged the config above to have everything use the discovery address
[15:39:10] we are now up to step 9 from https://etherpad.wikimedia.org/p/nTbAb5yPvBFATUylV7Vj
[15:39:19] i.e. give a grace period while we are still on netbox 2.10.4 to allow any issues to surface
[15:40:07] XioNoX: if you agree suggest we halt now before doing the actual netbox upgrades? not sure how long we want the grace period to be
[15:40:23] jbond: yeah of course
[15:40:40] cool
[15:41:17] jbond: a day or two minimum I'd say
[15:41:17] XioNoX: volans: i think we can let dc-ops know that they can use netbox now right?
[15:41:40] ack agree fyi its a public holiday here thursday but feel free to upgrade without me
[15:41:48] jbond: :)
[15:41:50] I think so and I agree let's wait a few days to decouple any issue
[15:42:03] but also happy to pick up monday
[15:42:15] monday is fine, yeah
[15:43:02] what do we want to do with the ganeti group support?
[15:43:11] do it before or after the upgrade?
[15:43:24] volans: does it work with < 3.2 ?
[15:43:39] ack for monday
[15:44:03] XioNoX: not 100% sure, we don't have a test host on 2.10 anymore :)
[15:44:27] volans: so yeah let's do it after the upgrade, there is no urgency afaik
[15:44:33] ok
[15:55:52] jbond, XioNoX : all the ganeti systemd timers are failing with
[15:55:55] requests.exceptions.SSLError: HTTPSConnectionPool(host='netbox.discovery.wmnet', port=443): Max retries exceeded with
[15:56:15] ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer c ...
[15:57:04] volans: ahh i think this is the ca bundle issue we have spoken about
[15:57:24] in my tests on netbox-dev2002 I was running it with
[15:57:24] REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt /srv/deployment/netbox/venv/bin/python3 /home/volans/ganeti-netbox-sync.py -c /home/volans/ganeti-sync.cfg codfw_test
[15:57:34] if that helps
[15:58:50] volans: ack thanks that does, i should be able to inject that environment variable via systemd. however it would be better to fix the venv (which will take a bit more research)
[15:59:32] ack
[16:05:48] why only the ganeti sync though and not everything that uses pynetbox on the discovery address?
[16:07:14] cookbooks and spicerack don't use venvs, so they read the system CAs and all works fine
[16:07:32] only the things that run using the netbox venv and do API calls would fail
[16:08:22] interesting, didn't know that venv used different CAs (or didn't have access to the system ones)
[16:08:59] requests in the venv ships its own one (something like python-certifi)
[16:09:10] and is not easily extendable
[16:09:40] there is also this nice bug: https://github.com/psf/requests/issues/3829
[16:10:17] that if you set the env variable, and then use some session that wants to turn off verify it still verifies it
[16:10:33] secure!
[16:38:19] XioNoX: volans: any chance you are still around for a review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/803579
[16:38:32] I am
[16:38:37] thanks <3
[16:39:45] yep
[16:41:19] * volans fixed a couple of typos
[16:41:41] * jbond sends <3 to volans
[16:41:42] jbond: that's got the reports, but not for the ganeti_sync_run
[16:42:18] XioNoX: ahh where is that in puppet? git grep ganeti_sync_run comes up empty
[16:42:19] see line 338
[16:42:49] ack updating
[16:42:52] jbond: pick the latest PS
[16:42:57] I've fixed the typos ;)
[16:43:08] redownloading thanks
[16:45:53] updated
[16:48:44] jbond: lgtm!
[16:48:54] great thanks will merge once CI is done
[17:09:48] XioNoX: volans: FYI systemd --failed is all good now
[17:10:22] im going to sign off now and stop touching things but please ping if anyone sees issues
[17:10:44] jbond: have a good evening!
[17:11:03] great
[17:11:09] thanks :)
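The workaround quoted at 15:57:24 sets `REQUESTS_CA_BUNDLE` so that `requests` inside the venv verifies against the system CA file instead of its bundled one. A small sketch of doing the same from a Python wrapper follows, under stated assumptions: `env_with_ca_bundle` is a hypothetical helper, and the default path is simply the one from the log. The fix discussed in the channel was to inject the variable via systemd, which this sketch does not replace.

```python
import os
from typing import Dict, Mapping

# Path taken from the command pasted in the log; adjust per distribution.
SYSTEM_CA_BUNDLE = "/etc/ssl/certs/ca-certificates.crt"


def env_with_ca_bundle(
    base_env: Mapping[str, str],
    ca_bundle: str = SYSTEM_CA_BUNDLE,
) -> Dict[str, str]:
    """Return a copy of base_env with REQUESTS_CA_BUNDLE set to ca_bundle.

    The input mapping is left untouched, so the caller's own
    environment is not modified as a side effect.
    """
    env = dict(base_env)
    env["REQUESTS_CA_BUNDLE"] = ca_bundle
    return env


# Environment to hand to a child process started from the venv.
child_env = env_with_ca_bundle(os.environ)
```

The resulting mapping can be passed as `env=child_env` to `subprocess.run` when launching the venv's interpreter. Note the `requests` issue linked at 16:09:40: with the variable set, a session that tries to turn verification off may still verify.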