[07:45:43] I'm trying to update grafana in reprepro and the jenkins gpg key is expired, I guess it needs updating as per https://www.jenkins.io/blog/2023/03/27/repository-signing-keys-changing/
[07:47:16] godog: I think that was discussed yesterday around 11am my time in this channel
[07:47:35] at least something about jenkins and gpg :)
[07:48:06] hah! checking the scrollback
[07:49:41] indeed, looks like there was no followup
[07:49:47] https://gerrit.wikimedia.org/r/c/operations/puppet/+/905542
[07:54:41] in unrelated news, I wanted to share https://i.redd.it/19wqjyzd35n61.jpg
[08:28:02] someone around for a patch sanity check? https://puppet-compiler.wmflabs.org/output/904514/40517/backup1004.eqiad.wmnet/index.html
[08:30:04] jynus: any particular reason to bind to IPv4 only?
[08:31:43] not really, but I don't want to bind to ''
[08:32:08] (e.g. I don't want to bind to localhost)
[08:32:34] jynus: looks good, left one comment inline
[08:32:45] thanks, moritzm, I didn't notice that
[08:48:12] I fixed the docs but will merge later as I have something else to do first
[08:54:35] jynus: ack, +1d in the meantime
[08:58:45] I have to remember to update the wikitech docs after deploy: https://wikitech.wikimedia.org/wiki/Media_storage/Backups#How_to_access_the_web_UI_of_minio
[12:02:54] heads-up: eqiad row C upgrade in 1h - the WMCS table is empty (cc balloons) and ncredir doesn't have any actions either (cc kwakuofori) - https://phabricator.wikimedia.org/T331882
[12:03:09] please ignore if no actions are needed
[12:04:51] ack :)
[12:05:30] Steve and I are working on the DE front, everything should be ready to go in an hour but I'll ping you otherwise
[12:06:59] awesome
[12:07:24] XioNoX: looking
[12:07:30] Thanks
[12:07:33] thanks!
[12:11:24] XioNoX: all good from Traffic for T331882
[12:11:25] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882
[12:15:55] thanks!
[12:25:24] akosiaris: are you taking care of depooling eqiad like last week?
[12:28:40] sukhe: https://gerrit.wikimedia.org/r/c/operations/dns/+/905603 for the traffic side
[12:39:17] XioNoX: ms-fe node depooled
[12:39:38] godog: you OK to handle the thanos frontend, or should I? it's thanos-fe1003
[12:40:04] Emperor: would you mind taking care of it? thank you
[12:41:58] ack
[12:43:20] XioNoX: likewise, thanos-fe done
[12:44:38] XioNoX: ping me when you want me to do the puppet disable
[12:48:10] XioNoX: depooling started
[12:49:06] at step 13/53 right now
[12:49:18] we sure got a lot of services these days
[12:50:27] we sure do
[12:52:12] XioNoX: we are ready on the WMCS side, ticket should be up to date
[12:52:16] thanks!
[12:53:29] volans :( https://www.irccloud.com/pastebin/RUx15bLU/
[12:53:40] or godog ^ ?
[12:54:50] heh yeah, good old "alert hosts are not in icinga" or something similar, did the host get skipped or did the whole thing fail, XioNoX?
[12:55:05] i.e. silences for the rest are in place?
[12:55:15] godog: the whole thing failed I think, it stopped immediately
[12:55:56] I'm thinking of a solution/bandaid we could use right now
[12:56:14] hnowlan: forgot to ping you, there are some pending hosts for "core platform" on https://phabricator.wikimedia.org/T331882
[12:56:26] godog: what's the status of alert1001?
[12:56:39] XioNoX: just add ' and not P{alert1001*}'
[12:56:48] ah right, it's not in icinga because of an old bug
[12:57:26] volans: we're failed over to alert2001 re: alert1001 status
[12:57:37] yes yes, took me 30s to recall, sorry
[12:57:50] volans: that worked, thanks!
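A minimal sketch of how the suggested exclusion fits into a full Cumin host query; only the "and not P{alert1001*}" part comes from the log above, while the 'A:eqiad' alias and the 'uptime' command are hypothetical placeholders, not the cookbook's actual query:

    sudo cumin 'A:eqiad and not P{alert1001*}' 'uptime'

i.e. take whatever the original selection matched and subtract any host whose name starts with alert1001.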
[12:57:55] there is no bandaid, just skip the host
[12:58:08] yeah that's a nice fix, thanks volans
[12:58:09] until we monitor all alert hosts from the active icinga
[12:58:32] fyi, I depooled the two ores hosts with the indicated command
[12:59:56] I am at step 34/53 of the cookbook
[13:00:05] gimme a couple of minutes more please
[13:00:13] no rush
[13:00:20] XioNoX: thanks, I was about to do it
[13:00:23] I'll go make some tea
[13:00:47] Hello, done with the Data Engineering servers, they are ready for the upgrade.
[13:03:08] thanks!
[13:05:48] XioNoX: done
[13:06:06] jbond: can you disable puppet?
[13:06:11] sure thing, one sec
[13:06:46] * volans getting some super late lunch, ping me directly if needed in the next few minutes
[13:09:51] XioNoX: done
[13:10:00] alright!
[13:10:12] everyone, anything left to do?
[13:10:28] all ok from me
[13:10:53] let's do it!
[13:11:16] System going down in 1 minute
[13:12:16] it's rebooting
[13:12:37] I'm watching the console on the master switch
[13:12:49] Uptime: 1874d12h24m34s
[13:13:16] lol
[13:14:34] notbad.flv
[13:20:28] 3/7 up
[13:21:11] godog: flv!? :P
[13:21:40] sukhe: you are welcome for that free trip down memory lane
[13:22:06] we should start seeing recoveries
[13:23:05] all nodes up
[13:23:34] yay
[13:23:55] * jbond re-enabling puppet
[13:25:48] * jbond done
[13:26:08] thx
[13:26:59] everything looks fine on the network side
[13:27:07] feel free to repool all your services
[13:27:27] nice!
[13:27:58] nice work!
[13:28:15] thank you everybody!
[13:28:19] so smooth!
[13:28:39] very nice indeed
[13:28:49] XioNoX: OK to revert the DNS change?
[13:28:55] guess we will wait for a bit
[13:28:55] sukhe: yep
[13:29:22] Yay \o/
[13:31:03] elukey: are you taking care of the ores repool or should I?
[13:34:36] effie will repool eqiad in about 30m
[13:34:44] er ok
[13:34:51] sorry, services, misread
[13:35:01] I am pooling back DNS
[13:35:24] seems like puppetdb1002.eqiad.wmnet is still offline -- is that on purpose?
[13:36:19] andrewbogott: yeah, I am getting a failed run too, probably because a lot of hosts had puppet enabled again and it might be taking time to catch up
[13:36:24] happened last time as well IIRC
[13:36:40] ok, I'll just ignore it for now :)
[13:48:05] jbond: puppetdb1002 still on fire?
[13:53:01] XioNoX: looking
[13:56:16] andrewbogott: sukhe: what made you think there was an issue with puppetdb?
[13:56:32] https://www.irccloud.com/pastebin/IFuk9u8L/
[13:57:24] similar error for me
[13:57:27] jbond: ^
[13:57:38] + puppet failure reported on 63 servers
[13:57:44] the 63 is me
[13:58:01] i stopped any current runs to investigate
[13:58:23] anyway i think i enabled things before puppetdb had fully started, which meant some servers started their puppet run too early
[13:58:25] the host that produced that paste is now working
[13:58:29] still checking but things should be ok
[13:58:37] XioNoX: sorry, didn't mention it, already done
[13:59:20] cool
[13:59:36] jbond: I had failures with cumin commands not being able to talk to puppetdb, but it's good now
[13:59:49] jbond: resolved for me as well
[14:00:10] yes, i'm running on the failed ones now so everything else should catch up, hopefully
[14:00:16] thank you
[14:00:19] working for us too, thanks!
[14:00:21] all ok jbond
[14:04:42] having some timeouts when fetching files though (puppet:///modules/profile/puppet/ca.production.pem: Net::OpenTimeout)
[14:05:00] dcaro: yes, i'm still looking :(
[14:16:00] is eqiad repooled? I am seeing some weird stuff and want to make sure I don't break stuff
[14:16:43] pooled on the DNS side, yes
[15:19:30] jynus: I am having trouble with a couple of services, but most things should be pooled back
[15:19:55] yeah, no worries, I am also having issues
[15:20:12] so, knowing it can be accessed, I am depooling it
[16:46:23] do any of you use the DNS names "people.eqiad.wmnet" and "people.codfw.wmnet" in local configs? I kind of want to remove them because we also have peopleweb.discovery.wmnet that always points to the current one (right now people2002 but will change again). but of course that is just one, and not one for each DC. I don't see them used in puppet, so I think this is only for human usage?
[16:52:28] or debugging, maybe?
[16:53:17] we have some other "pointless" names like that, but sometimes they're just easier alternatives than IPs when you e.g. want to curl-test against a specific endpoint.
[16:57:42] fair. yea, can't decide what is more work: keep updating it whenever the discovery record is updated, with the potential to miss it.. or tracking this down by surveying all of SRE.. or just deleting it and waiting for complaints :)
[17:01:41] guess I will just keep it "just in case" (maybe an example of how we end up with tech debt)
[17:05:18] we can also define them in terms of the discovery records, although I don't think we've done so before
[17:05:23] (so there's no existing pattern for it)
[17:06:26] we used to have that for the public debugging hostnames, but I think they got moved to netbox
[17:06:40] (which maybe is a mistake, in light of that, but whatever)
[17:06:55] I never use 'em FWIW
[17:06:57] yea, so if people want a way to test eqiad vs codfw - then a single discovery record doesn't cut it
[17:08:02] so in the templates/wmnet zonefile, where the discovery records are. Using this example:
[17:08:05] appservers-ro 300/10 IN DYNA geoip!disc-appservers-ro
[17:08:08] after what you said, bblack, I think maybe we should have test.eqiad.wmnet and test.codfw.wmnet to point to the ping offloaders or something
[17:08:30] there is supported syntax, which we don't currently take advantage of, to use a specific side for this kind of per-dc record, by doing:
[17:08:50] appservers-ro-eqiad 300 IN DYNA geoip!disc-appservers-ro/eqiad
[17:09:07] The "/eqiad" on the end means "don't actually do the geoip IP, just pull out the eqiad IP from this config"
[17:10:21] so we could in theory make lines like that under whatever hostnames we want
[17:10:27] ah, interesting, TIL. though right now people is still one of those that have discovery DNS but not geodns
[17:10:38] discovery yes, but one host is just commented out
[17:10:40] it's the same for all of them IIRC
[17:10:46] ah
[17:10:56] yeah, it only works for the true discovery ones (geoip or metafo)
[17:11:23] but for the "manual" names that happen to be in the discovery subdomain, you could also use CNAMEs to avoid the duplication
[17:11:49] (give the real two IPs to foo.eqiad.wmnet and foo.codfw.wmnet, and then just have "foo.discovery.wmnet CNAME foo.eqiad.wmnet")
[17:11:49] ACK, so I do have a pending change to add metafo and "add to service catalog" for it
[17:12:37] I think I like this option (for right now) :) thanks
[17:12:40] but since we haven't used the /datacenter syntax here before, we don't really have "standards" about how and where we'd structure it to avoid issues on future updates, etc
[17:13:18] hmmm. ok
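A minimal sketch of the two options discussed above, in the style of the templates/wmnet zonefile; the disc-people resource and the people-eqiad/people-codfw labels are invented for illustration (per the discussion, people has no geoip/metafo resource yet, so option 1 would only work once one exists), and the records would live in different subdomains as the next messages point out:

    ; option 1: explicit per-DC names using the /datacenter selector on a geoip resource
    people-eqiad 300 IN DYNA geoip!disc-people/eqiad
    people-codfw 300 IN DYNA geoip!disc-people/codfw

    ; option 2: keep the real per-DC records and alias the discovery name to one of them
    peopleweb 300 IN CNAME people.eqiad.wmnet.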
[17:13:23] probably they should be close-by in the zonefile so they're updated/deleted/whatever together in the future, but then they're in different subdomains (.discovery vs .eqiad), so it's tricky.
[17:13:41] that kind of issue, I mean.
[17:14:00] otherwise later someone will decom the discovery entry and forget the other one, and DNS CI will break or something.
[17:14:18] (for the /datacenter case, I mean)
[17:15:46] yea, let me not be the first user of that. I am also ok just updating this one line every.. hmm.. 2 years.. since it's just needed when we replace VMs, not for each DC switch
[21:28:59] unexpected DNS changes during sync again
[21:29:16] frdat1001 is being removed but wasn't synced
[21:30:37] now stuck at the prompt; if I cancel, my decom fails, but removing things you don't know about feels bad. appears to be a fairly common issue
[21:32:45] found this, there is already an entire thread: https://phabricator.wikimedia.org/T333971#8755678
[21:37:20] merging that as well.. but sigh, see the comments about frack and access for fr-tech in there
[21:37:31] I am only merging it because it's limited to mgmt
[21:43:02] mutante: yeah. sorry. we don't have the rights, so sadly these things crop up. it's rare but we would love to get a resolution.
[21:45:45] dwisehaupt: yea! It seems to me the 2 options are basically.. either you get access to run that cookbook.. or netbox stops handling mgmt for frack
[21:45:58] you should have control of both frack and frack.mgmt
[21:46:02] yeah.
[21:46:23] I got that from the comments, so if I sounded annoyed, sorry, it wasn't a complaint directed at you
[21:46:35] just that we need to fix the process
[21:46:44] i'd hope that after 3 years i could say i promise not to break anything. :)
[21:47:16] oh yeah. i get the general complaint. it's not fun to be in the flow of doing something and get jolted out due to something irregular.
[21:47:22] either root for you or rootless cookbooks...
[21:47:42] unsure how close rootless cookbooks are
[21:48:02] as long as it was just mgmt, ok
[21:48:12] it would have been scarier to remove an actual frack machine from DNS
[21:48:42] <+icinga-wm_> RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes
[21:48:53] we are good for now
[21:48:58] cool.