[10:20:54] do we have alerts for node exporter being unreachable?
[10:23:32] that's not the issue, forget it
[10:28:17] ok.. found the issue, wikikube-worker2198 (10.192.43.8) doesn't have proper DNS configuration in place and it's triggering some issues on pybal monitoring
[10:28:52] jelto: ^^
[10:30:23] jelto: could you provide a PTR for wikikube-worker2198?
[10:31:20] I renamed and reimaged the host last Friday, let me check. PTR should have been created by the cookbooks
[10:31:39] it went wrong at some point
[10:36:14] where are the PTR records missing?
[10:39:38] jelto: ns servers don't have it
[10:40:26] https://www.irccloud.com/pastebin/qXobUMB7/
[10:42:30] I've discovered it while debugging T383661
[10:42:31] T383661: check_pybal_ipvs_diff crashes if a pooled realserver is missing its PTR record - https://phabricator.wikimedia.org/T383661
[11:00:19] I looked at the cookbook run from last Friday and could not find anything obviously wrong.
[11:00:19] So should the PTR records be added manually or should I try another reimage/provision? I'm not sure how to add the records manually and have never done that
[11:00:33] Also if this is causing pybal issues we can depool the host
[11:01:33] both PTR records appear to be on /srv/git/netbox_dns_snippets
[11:01:39] vgutierrez@dns1004:/srv/git/netbox_dns_snippets$ fgrep wikikube-worker2198 *
[11:01:39] 2.193.10.in-addr.arpa:47 1H IN PTR wikikube-worker2198.mgmt.codfw.wmnet.
[11:01:39] 2.2.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa:8.0.0.0.3.4.0.0.2.9.1.0.0.1.0.0 1H IN PTR wikikube-worker2198.codfw.wmnet.
[11:01:39] 43.192.10.in-addr.arpa:8 1H IN PTR wikikube-worker2198.codfw.wmnet.
[11:02:13] so they are just not synced to the dns servers?
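(Editor's note: for readers following along, the snippet filename and the record label together encode the reverse (PTR) name of the IP. A small illustrative sketch of that mapping; in practice you would query the live record with `dig -x 10.192.43.8`.)

```shell
# The PTR name for an IPv4 address is its octets reversed under in-addr.arpa:
# 10.192.43.8 becomes 8.43.192.10.in-addr.arpa, i.e. label "8" inside the
# snippet file named 43.192.10.in-addr.arpa (as seen in the fgrep output above).
ip=10.192.43.8
ptr_name=$(echo "$ip" | awk -F. '{printf "%s.%s.%s.%s.in-addr.arpa\n", $4, $3, $2, $1}')
echo "$ptr_name"
```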
[11:02:51] sre.dns.netbox cookbook isn't reporting anything missing
[11:03:05] https://www.irccloud.com/pastebin/JQpD8sls/
[11:04:15] as an unrelated comment it looks like the cookbook is slower (140s) than the requested lock (60s)
[11:05:21] and the cookbook seems to be right
[11:05:27] at least for dns1004
[11:05:34] vgutierrez@dns1004:/etc/gdnsd/zones/netbox$ fgrep wikikube-worker2198 *
[11:05:34] 2.193.10.in-addr.arpa:47 1H IN PTR wikikube-worker2198.mgmt.codfw.wmnet.
[11:05:34] 2.2.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa:8.0.0.0.3.4.0.0.2.9.1.0.0.1.0.0 1H IN PTR wikikube-worker2198.codfw.wmnet.
[11:05:34] 43.192.10.in-addr.arpa:8 1H IN PTR wikikube-worker2198.codfw.wmnet.
[11:07:33] and that seems to be the case for the 16 dns servers :]
[11:13:47] hmmm
[11:14:10] In /etc/gdnsd/zones/netbox the PTR is also present for the host
[11:14:18] maybe I'm missing something about gdnsd <-> netbox integration
[11:14:45] but /etc/gdnsd/zones/netbox/43.192.10.in-addr.arpa is missing the SOA record, NS records and so on
[11:16:40] and per gdnsd journal output that zone file isn't being loaded at all
[11:17:03] When running the following command I get 3 PTRs, 2 A and 1 AAAA record
[11:17:03] dns1004:/etc/gdnsd/zones/netbox$ fgrep wikikube-worker2198 *
[11:17:14] yes, that's true
[11:18:41] vgutierrez@dns1004:/etc/gdnsd/zones/netbox$ fgrep SOA ../10.in-addr.arpa
[11:18:41] @ 1H IN SOA ns0.wikimedia.org. hostmaster.wikimedia.org. 2024120418 12H 2H 2W 1H
[11:18:41] vgutierrez@dns1004:/etc/gdnsd/zones/netbox$ fgrep SOA 43.192.10.in-addr.arpa || echo "no SOA record"
[11:18:41] no SOA record
[11:19:19] AFAIK it looks like gdnsd is ignoring /etc/gdnsd/zones/netbox
[11:19:38] they need to be included explicitly
[11:20:07] volans: oh.. so those are included in zone files living in /etc/gdnsd/zones?
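(Editor's note: a rough sketch of the layout the discussion converges on: gdnsd only loads real zone files, which carry the SOA/NS records and pull the Netbox-generated snippets in via `$INCLUDE`. The SOA line is copied from the fgrep output above; the NS record, directive placement, and origin handling are my assumptions, not copied from the real repo.)

```text
; /etc/gdnsd/zones/10.in-addr.arpa -- the only real zone file (sketch)
@ 1H IN SOA ns0.wikimedia.org. hostmaster.wikimedia.org. 2024120418 12H 2H 2W 1H
  1H IN NS  ns0.wikimedia.org.

; per-range records live in snippet files with no SOA/NS of their own,
; and each snippet must be pulled in explicitly (RFC 1035 $INCLUDE, with
; the snippet's origin as the optional second argument):
$INCLUDE netbox/2.193.10.in-addr.arpa 2.193.10.in-addr.arpa.
$INCLUDE netbox/43.192.10.in-addr.arpa 43.192.10.in-addr.arpa.
```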
[11:20:08] * volans reading backlog
[11:20:11] yes
[11:20:37] https://wikitech.wikimedia.org/wiki/DNS/Netbox#Infrastructure
[11:20:42] ok, the include for the 43.192.10.in-addr.arpa zone is missing
[11:21:06] I'm updating my local copy and checking
[11:21:15] at least on dns1004 that's it :)
[11:22:04] I'll send a CR soon
[11:22:19] missing 43.192.10.in-addr.arpa
[11:22:21] confirmed
[11:22:27] from https://wikitech.wikimedia.org/wiki/DNS/Netbox#Check_if_any_automatically_generated_zone_file_is_not_included
[11:26:05] Ah interesting, I didn't know the zones have to be included explicitly
[11:34:14] https://gerrit.wikimedia.org/r/c/operations/dns/+/1111201
[11:36:15] could I get a review? :)
[11:39:40] sure
[11:39:57] topranks: ^^^ any reason why it was left out?
[11:43:35] thx volans <3
[11:44:54] topranks: any objections?
[11:51:42] I'll merge it given it's impacting production
[11:55:15] problem solved, thanks jelto & volans
[11:55:38] vgutierrez: thanks for sorting it out!
[11:55:46] apologies, this must have just been an omission on my part when I added them all :(
[11:56:16] thank you too for figuring that out so quickly :)
[11:57:23] when we get some time we'll try to work on T362985, which will remove the headache of having to modify the zone files for every new IP range
[11:57:24] T362985: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985
[12:00:23] fwiw I'll do a quick sweep of all those new ranges in codfw added last quarter just to make sure there aren't others
[12:01:33] vgutierrez: to answer your question on SOA records, NS records etc.
[12:01:51] "43.192.10.in-addr.arpa" is not a zone
[12:01:58] the only zone is "10.in-addr.arpa"
[12:02:49] we simply put some of the records for that zone in a separate file, called 43.192.10.in-addr.arpa, but that's not a zone file, just a text file snippet with some records. We don't delegate a sub-zone with NS entries or have SOA etc. in that file.
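(Editor's note: the sweep mentioned at 12:00:23 — making sure every generated snippet file is actually `$INCLUDE`d by some zone file — can be sketched as below. Paths and contents are a toy reconstruction in a temp directory; the real check is the one documented on the wikitech page linked above.)

```shell
# Toy demo: flag any snippet under netbox/ that no zone file $INCLUDEs.
tmp=$(mktemp -d)
mkdir -p "$tmp/netbox"
echo '47 1H IN PTR wikikube-worker2198.mgmt.codfw.wmnet.' > "$tmp/netbox/2.193.10.in-addr.arpa"
echo '8 1H IN PTR wikikube-worker2198.codfw.wmnet.' > "$tmp/netbox/43.192.10.in-addr.arpa"
# The parent zone file only includes one of the two snippets:
echo '$INCLUDE netbox/2.193.10.in-addr.arpa' > "$tmp/10.in-addr.arpa"
missing=""
for f in "$tmp"/netbox/*; do
  s=$(basename "$f")
  grep -qsF "netbox/$s" "$tmp"/*.arpa || missing="$missing $s"
done
echo "not included:$missing"
rm -rf "$tmp"
```

Run as-is, this reports the 43.192.10.in-addr.arpa snippet as not included, mirroring the bug found in this thread.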
[12:03:24] just a simple INCLUDE
[12:15:35] topranks: no need to sweep, I already did with https://wikitech.wikimedia.org/wiki/DNS/Netbox#Check_if_any_automatically_generated_zone_file_is_not_included
[12:16:16] heh that is very neat
[12:16:41] I will abandon the wonky netbox-api / dns query script I'm halfway through making :P
[12:16:43] thanks!
[14:11:33] thanks cdanis, mortiz commented. I'm updating the README to remove references to groups ensure
[14:12:30] cool, thanks ottomata
[14:12:43] glad my guess wasn't too far off :)
[14:21:03] seeking opinions from anyone who is feeling opinionated: I've been tinkering with the idea of making wikitech-static an actual static snapshot rather than having it run live mediawiki (partly to simplify maintenance and partly to avoid rude surprises when we really need it). I have a very rough prototype here: https://wts.wmcloud.org/wiki/Main_Page.html
[14:21:48] search (with lunr) is only implemented on Main_Page at the moment, and I haven't spent any time making it non-ugly, but it does at least find things.
[14:22:13] please, folks, let me know if you think this is worth pursuing or if it's a dead end! https://phabricator.wikimedia.org/T376400
[14:24:57] how “expensive” is it to grab HTML copies of all the pages on Wikitech?
[14:28:45] It's not terrible; at the moment I'm using httrack which has easily tunable throttles for bandwidth.
[14:29:20] Also it's not 100% of the pages, I'm excluding some things that aren't useful for recovery documentation.
[14:31:52] The expensive part is generating the search index, which lunr does in RAM all in one go. So I'm worried that the VM that hosts this might be a bit pricey unless I can prune down the index a bit more.
[14:47:48] andrewbogott: maybe the artifact generated should be a docker image (with all the content and the index prebuilt)
[14:49:37] cdanis: could be!
I'm a bit stuck on the idea of having the remote host pull the content rather than prebuilding it, because it involves fewer moving parts. But pre-generating would allow for more and simpler hosting options
[14:51:17] yeah I think the extra parts are worth it in this case. it's a lot easier to move around static files (or a tarball, or a deb, or a docker image) than it is a VM image/specification. and there's value in having multiple snapshots of the data available too
[14:52:11] and that also means it's something that people could launch on their own laptops on-demand, and (since static stuff is easy to mirror off-infra) even while a total outage is ongoing
[14:52:42] oh yeah! the laptop-hosted image is a good idea
[14:53:47] of course we could also build a docker container that runs mediawiki rather than just having static files.
[14:53:59] that would be harder for me but probably there are folks in the channel who regard it as trivial :)
[15:10:43] andrewbogott: definitely recommend leaving Rackspace (my former employer). There are like 5-10 people supporting its entire cloud. It's been on deathwatch for a while. If you can get it completely static, maybe Cloudflare R2 is an option? We already have a CF account and they don't charge for egress traffic
[15:46:48] is it safe for me to edit netbox with my semi-broken account (T373702) or would creating activity with my current account just make fixing it harder?
[15:46:49] T373702: Unable to log in to Netbox - https://phabricator.wikimedia.org/T373702
[15:48:33] slyngs: ^^^ this is for you :)
[16:00:43] oncall: nothing to report for EU today
[18:00:06] update - shellbox syntaxhighlight has been migrated to PHP 8.1 in eqiad. no issues encountered thus far, and I'll be keeping an eye out throughout the day.
[18:00:06] https://phabricator.wikimedia.org/T377038#10459391 has a quick summary of the current state and an example `confctl` command to mitigate if issues arise.
[18:10:36] congrats!
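(Editor's note: since depooling came up earlier in the day as a mitigation, a generic depool with `confctl` looks roughly like this. The selector is illustrative, not the specific command referenced in the T377038 summary.)

```text
# depool a backend from service (illustrative host name):
confctl select 'name=wikikube-worker2198.codfw.wmnet' set/pooled=no
# repool once the issue is resolved:
confctl select 'name=wikikube-worker2198.codfw.wmnet' set/pooled=yes
```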
[18:25:35] godog: I see it's already merged. (thanks denisse!) I can confirm puppet is unbroken on contint* hosts. thanks for the quick response :)
[19:17:10] FYI, we're going to be releasing a new version of conftool (4.1.0 -> 4.2.0) shortly. this is a fairly low-risk release, but FYI oncallers - cdanis, bblack
[19:17:53] I'll be deploying first to a cumin host and puppetserver to smoke test read-only ops
[19:42:39] no issues encountered when testing read-only ops for each of confctl, dbctl, and requestctl
[19:43:55] I'll move the rest of the puppetservers and cluster management hosts forward
[19:44:10] thanks!
[19:45:33] cdanis: do you have any recommendations for the remaining bulk of hosts with confctl only installed? I can come up with a mix of queries covering them, but wanted to check if you had a favorite query expression :)
[19:45:41] `all`
[19:45:43] :D
[19:49:17] heh, that works
[19:49:22] swfrench-wmf: `P:conftool::client`
[19:49:25] I think is better :)
[19:50:18] that seems like a solid option, yeah :)
[19:53:32] although now I'm not sure if debdeploy wants just an alias, or allows a whole expression
[19:53:49] I believe with `-Q` it accepts an arbitrary expression
[19:54:00] (testing now with `query_version` first)
[19:54:29] yup, that works
[19:57:48] -s allows a Cumin alias, -Q full cumin syntax
[19:57:54] ah neat
[19:58:28] thanks, moritzm! yeah, using `-Q` to target `P:conftool::client` totally did the trick
[19:59:54] alright, I'm going to pause here for a bit and see what stragglers shake out in debmonitor
[20:10:27] with the exception of mwdebug1001.eqiad.wmnet, where the tool seems to think the host is up to date when deploying (but it isn't), this should be everything
[20:11:25] for reference, the prior version was 4.1.0-1, and the spec file used was `/home/oblivian/2025-01-14-conftool.yaml`
[20:11:36] swfrench-wmf: mwdebug1001 has a broken deb package state, that's why conftool wasn't updated on it
[20:11:49] ah, that'll do it. thank you!
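(Editor's note: summarizing the debdeploy targeting options discussed above, as I understand them from the conversation. Only the flags actually mentioned are shown; the remaining arguments are left elided.)

```text
# target hosts via a predefined Cumin alias:
debdeploy ... -s all
# or via an arbitrary Cumin query expression:
debdeploy ... -Q 'P:conftool::client'
```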
[20:11:50] some mismatch with php7.4-dev it looks like
[20:14:04] I fixed the package state and updated python3-conftool on mwdebug1001
[20:14:24] oh, awesome! thank you very much for doing so :)
[20:14:50] thanks Scott :)
[20:14:53] and thanks Moritz!
[20:15:21] thank you as well, cdanis!
[21:04:40] heads-up data persistence: I just decomm'd a few hosts and there was a DNS diff for wmf5337 and wmf5338 that I agreed to: https://phabricator.wikimedia.org/P72048
[21:49:10] oops ... I just realized I never hit enter on my SAL log for the conftool update =/
[21:49:40] did it just now retroactively (done as of ~20:00 UTC)