[02:06:22] 我是阿里云国际站和亚马逊云的代理商,可以给您提供比官网上更便宜的价格和技术支持,有意者可以联系我。
[02:06:23] I am an agent for Alibaba Cloud International Station and Amazon Cloud. I can provide you with cheaper prices and technical support than on the official website. Interested parties can contact me.
[02:07:00] @telegram moderators ^
[02:29:49] the spam is improving
[02:34:30] I may have taken down codesearch.
[02:34:54] I've added server-side rendering and there's probably too much bot crawling going on.
[02:35:16] I'll revert for now, but ssh is becoming unresponsive.
[02:39:44] was just about to ask if anyone else was having issues, Krinkle
[02:46:52] !log codesearch Reboot host https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=codesearch&var-instance=All&from=now-3h&to=now
[02:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Codesearch/SAL
[02:48:30] Managed to ssh quickly after reboot, before it overloaded again, to deploy the revert
[02:50:13] thanks
[04:04:50] In case someone here has some iptables+puppet skills, I could use help figuring out how to let codesearch-frontend talk to hound_proxy. Both are docker containers on the same host. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1016480
[09:17:51] !log admin manually delete prometheus-node-textfile-wmcs-dnsleaks.service and related files from cloudservices1005/6, leftovers of the designate api to cloudcontrol migration
[09:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[09:45:28] !log tools rebuilding prebuild images for T361457
[09:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:45:31] T361457: Install php-yaml in Toolforge images - https://phabricator.wikimedia.org/T361457
[10:57:44] !log taavi@tools-bastion-12 tools.wikibugs toolforge jobs restart irc
[10:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL
[13:54:24] !log taavi@tools-bastion-12 tools.sal toolforge webservice restart
[13:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sal/SAL
[14:31:21] !log anticomposite@tools-sgebastion-10 tools.stewardbots SULWatcher/manage.sh restart # SULWatchers disconnected
[14:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[14:32:41] !log anticomposite@tools-sgebastion-10 tools.stewardbots ./stewardbots/StewardBot/manage.sh restart # RC reader not reading RC
[14:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[17:42:21] !log bd808@tools-sgebastion-10 tools.ftl Restarted webservice to pick up new service.template defined 2G RAM limit. (T361652)
[17:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.ftl/SAL
[17:58:35] !log bd808@tools-bastion-12 tools.wikibugs Built new image and restarted all tasks to pick up Python 3.12 runtime bump
[17:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL
[18:50:27] Created a new instance in the ldap-dev project, dcl-dev1, but I am unable to ssh in. From the log there is a failure starting cloud-final.service; is that a normal failure?
[18:51:17] https://horizon.wikimedia.org/project/instances/80bac244-e231-4390-9de9-79dc71ac0827/console
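For a failure like the one above, a minimal diagnostic sketch, assuming console or root access to the instance; these are stock cloud-init commands and log paths, not anything Cloud VPS-specific:

    # Overall first-boot status; "error" plus a stage name narrows it down
    cloud-init status --long
    # Journal for the final stage, which is the unit that timed out here
    journalctl -u cloud-final.service --no-pager
    # Full first-boot output, including the puppet run that cloud-init drives
    less /var/log/cloud-init-output.log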
[19:20:10] jhathaway: can you see in the console if the puppet run has finished yet?
[19:20:26] oh, nevermind
[19:21:57] jhathaway: do you have a local puppetmaster in that project? maybe https://phabricator.wikimedia.org/T349937
[19:23:03] I do not, didn't realize that was required, or forgot
[19:24:05] that seems unrelated, the first run should complete on the central puppetserver just fine
[19:24:07] let's see
[19:24:40] nod
[19:25:19] thanks for looking, taavi
[19:25:23] Apr 03 18:43:41 dcl-dev1 cloud-init[1111]: Notice: /Stage[main]/Base::Standard_packages/Package[fzf]/ensure: created
[19:25:23] Apr 03 18:43:43 dcl-dev1 systemd[1]: cloud-final.service: start operation timed out. Terminating.
[19:26:44] andrewbogott: I think our bookworm base image needs a rebuild, the package updates and puppet diff are growing too large to be done without cloud-init timing out
[19:27:23] yep, agreed -- I was hoping to postpone that until after the switch to puppet 7, but I can build one now.
[19:27:39] jhathaway: try logging in now?
[19:27:56] woohoo, thanks taavi
[19:28:15] I assume that wasn't something I could do myself through some other route?
[19:28:22] i.e. kick off a puppet run?
[19:29:58] if you really wanted you could've logged in via the vm console by sshing to the cloudvirt, or added a root key to all the instances.. but at least I think we've failed if our instance provisioning is unreliable enough that you need to do that regularly
[19:30:52] taavi: mostly just curious, thanks again for the help
[19:31:49] taavi: building
[19:35:12] bd808 just sanity-checking the Opensearch cluster stuff. Is `toolhub_tools` the only index you own?
[19:40:54] Not sure the new cluster can handle that 720 bytes of data ;P
[19:43:12] inflatador: toolhub_tools sounds like the right name, yes. And yeah, it should be tiny. The whole index is 3247 very small json documents. I told you I'd run it from a Pod if I had a PVC for storage. :)
[19:44:22] ACK, just working through the 🥳 that is capacity planning
[19:46:47] always fun times!
[19:47:08] * bd808 is glad not to do hardware budgets anymore
[20:46:52] taavi: the culprit seems to be a weird upstream change: https://phabricator.wikimedia.org/T361749
[21:02:48] ouch. Makes you wonder why exactly that change was made. (Probably to hack around something else ;) )
[21:04:14] FWIW, I get that same output for `systemctl show cloud-init.service | grep Timeout` on my local system, where cloud-init.service does not exist
[21:04:19] so I think that’s just the systemd defaults for any unit
[21:04:47] (same on toolforge-dev too – there, `systemctl status cloud-init.service` says that the service is active (exited) but loaded: not-found and systemctl cat can’t show any files for it o_O)
[21:06:19] oh man, you're right, if I do 'systemctl show thisisnotarealservice.service' I get a whole bunch of settings
[21:06:23] that seems broken!
[21:06:38] does `systemctl cat cloud-init.service` show any unit files on that system?
[21:07:34] is there an issue with puppet not running? 'The last Puppet run was at Wed Apr 3 14:27:15 UTC 2024 (394 minutes ago).' for all instances in the copypatrol project
[21:08:05] 'cat' doesn't show it although 'systemctl | grep cloud-init' does
[21:08:32] JJMC89: can you tell me a specific vm to check? I think we may have an expired cert someplace.
[21:08:59] andrewbogott: copypatrolbackenddeploytest04.copypatrol.eqiad1.wikimedia.cloud
[21:10:17] JJMC89: yeah, that's the same breakage I'm seeing elsewhere. I'll look at it but it's not the /next/ thing I need to look at.
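The `systemctl show` behavior above is reproducible on any systemd host and worth knowing: `show` prints compiled-in defaults even for units that do not exist, so the presence of output proves nothing; you have to check LoadState. A quick sketch with a deliberately fake unit name:

    # systemd still answers with a full set of default properties
    systemctl show thisisnotarealservice.service -p LoadState,TimeoutStartUSec
    # LoadState=not-found confirms no unit file was actually loaded;
    # the timeout shown is just the system-wide default, not the unit's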
[21:11:09] on the regular toolforge bastion, `systemctl show cloud-init.service` has all-infinity timeouts, and `systemctl cat cloud-init` shows an existing unit file (`/lib/systemd/system/cloud-init.service`)
[21:11:17] so I’m guessing it’s related that the unit file is somehow gone on some systems
[21:11:27] but I have no idea why, or how systemd can still somehow know a bit about the service
[21:11:40] since `systemctl status cloud-init.service` does show something
[21:12:36] actually... I wonder if that's just the ridiculous behavior when puppet7 clients can't reach the server
[21:13:26] lucaswerkmeister, any reason to think it's not just bullseye vs bookworm?
[21:13:40] I mean, 'some systems' == 'bookworm'
[21:13:45] it probably is, yeah
[21:14:21] `systemctl list-units | grep cloud-init` lists all three as “not-found” btw
[21:14:47] three more if you grep for `not-found`, actually
[21:18:58] on the dev (bookworm) bastion, `dpkg -L cloud-init | grep service` doesn’t list any unit files anymore o_O
[21:19:12] (only an override for `sshd-keygen@.service.d`)
[21:19:32] wait WHAT? they’re in `/etc/init.d`?
[21:19:56] did they go *back* from systemd units to sysvinit scripts in bookworm??
[21:20:10] (and in init scripts you can’t declare timeouts, surprise surprise…)
[21:21:38] (`systemd-sysv-generator` is already deprecated btw)
[21:23:09] sorry, something terrible is happening with DNS so now I'm looking at that :/
[21:23:21] np, I’ll put what I found in the phab task
[21:23:30] thanks!
[21:23:36] dns thing seems to be specific to one host/ip somehow
[21:28:23] left a comment and I think that’s all the investigation I can offer
[21:28:36] I don’t even know where cloud-init comes from :D
[21:28:49] hopefully it was still useful and someone else can pick up from there :)
[21:33:37] this is supposed to be a list of all files in the bookworm package for cloud-init. it does contain unit files under /lib/systemd/system/ there https://packages.debian.org/bookworm/all/cloud-init/filelist
[21:34:42] the unit files are there, but the cloud-init package gets uninstalled at some point and `dpkg -L` only shows files from installed packages???
[21:34:54] `/lib/udev/rules.d/66-azure-ephemeral.rules` also seems to be missing
[21:35:00] I wonder if it’s a usr-merge thing 🤔
[21:35:39] nah, maybe not
[21:35:51] but `dpkg -L cloud-init` actually *only* shows things in `/etc`, all the `/usr` parts are gone too o_O
[21:37:20] that last one sounds like what happens if you apt remove it without using purge
[21:38:11] hmm
[21:38:18] check /var/log/apt/history.log if you are on the machine
[21:38:25] ah, taavi noticed the same on phabricator already
[21:38:52] So the puppet issue is pretty clearly related to this:
[21:38:57] root@cloudinfra-cloudvps-puppetserver-1:~# puppetserver ca list --all
[21:38:57] Error:
[21:38:57] code: 500
[21:38:57] body: Internal Server Error: java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
[21:38:57] Error while getting certificates
[21:39:14] The CA has stopped working entirely
[21:44:49] or... well that's bad but maybe unrelated to the immediate problem
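As a concrete version of taavi's suggestion, a sketch for pinning down when the package was removed, assuming the standard apt log locations; a plain Remove (as opposed to Purge) leaves the /etc conffiles behind, which would match the `dpkg -L` output seen above:

    # Recent apt transactions that touched cloud-init
    grep -B3 -A1 cloud-init /var/log/apt/history.log
    # Rotated logs, in case the removal happened a while ago
    zgrep cloud-init /var/log/apt/history.log.*.gz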
[21:49:17] the DNS issue that just affects one host.. doesn't happen to be a .mil host, does it
[21:55:34] mutante: it's not, and I think it wasn't a real issue, I was just looking for the wrong thing
[21:56:14] andrewbogott: ACK, that question may have sounded weird but I remembered an actual ticket
[21:59:01] right now I am really questioning the wisdom of requiring pki auth for the tool that manages the pki
[22:28:00] bd808: can you confirm that to use --network=host, I have to modify the Dockerfile itself to already have the Apache listening on 3002 instead of 80, and thus require close coordination to deploy the puppet/codesearch.git changes around the same time?
[22:28:23] (if yes, who could I ask for CR in puppet.git around this?)
[22:31:10] I'm not sure the docker image we're using even has a way to change the port. They assume port mapping pretty much
[22:32:00] that is, port mapping is mutually exclusive with host networking
[22:32:57] Krinkle: I think you are correct that you lose the ability to remap exposed ports when attached to the host network directly.
[22:33:32] logically this makes sense as there would not be an SDN layer to do the remapping
[22:34:07] I'm still confused as to why we can't talk to the host from inside the container. Isn't that iptables blocking stuff in a way we could fix? Or is that Hard?
[22:35:03] Afaik there aren't any (intentional) network limitations placed on the container in terms of who it can reach, only what and how it exposes stuff. So if that is iptables, that's presumably coming from "us" accidentally and not docker itself
[22:35:19] it should be a matter of iptables/nftables rules yes. As to how "easy" this is to fix :shrug:
[22:36:55] I can see what appear to be rules that deny it in /etc/, but I have no idea where to start with puppetising a change to it.
[22:37:04] I'm not sure it is rules we add but rather the rules docker adds lacking something, but I think there is also an issue of the Docker overlay network's IP range shadowing some of the Cloud VPS network.
[22:37:38] i've run into a weird thing when rebuilding a cloudvps host that i don't think is related to the chatter here but need some help with.
[22:38:15] Krinkle: a 90 degree turn you could look into is switching from Docker to Podman and setting up a pod for the two containers to live in.
[22:38:19] on a freshly built host, when i try to ssh in it accepts my ssh key and then immediately kicks me off.
[22:39:07] dwisehaupt: can you see anything about the auth failure on the Horizon console for that instance?
[22:39:34] alternately, what instance?
[22:39:40] dwisehaupt: the puppetserver is broken right now so you won't be able to create new VMs. I'm working on it but it will probably not be fixed until tomorrow.
[22:39:49] bd808: it's N x a Hound container + 1 app.py proxy in front of those + 1 php frontend. the app.py and php frontend are public to the world via codesearch. and codesearch-backend. I'm trying to have the php frontend talk to the app.py directly going forward.
[22:39:54] andrewbogott: ah. cool i figured that might be it.
[22:39:57] i can wait.
[22:40:07] (see above mention of requiring pki to work in order to fix pki)
[22:40:19] heh. yeah.
[22:40:27] pki-ception
[22:40:56] bd808: thanks. i'm pretty sure it's what andrew is working on.
[22:41:08] I'm fairly conservative in what I'm changing given I have no place to test this at the moment.
[22:41:17] dwisehaupt: sounds likely, yes
[22:41:28] I guess we could give codesearch a custom puppetmaster to make that part easier.
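On the iptables angle discussed above, the usual shape of the fix is an INPUT rule on the host that accepts traffic arriving from the Docker bridge. This is a hedged sketch only: the interface name, the hound_proxy port (3002 here is a placeholder, not confirmed from the codesearch config), and whether the host's rules are ferm/puppet-managed all need checking first:

    # On the host: let containers on the default docker0 bridge reach a
    # service bound on the host itself (placeholder port 3002)
    iptables -I INPUT -i docker0 -p tcp --dport 3002 -j ACCEPT
    # From inside a container, the host is the bridge gateway (typically
    # 172.17.0.1); confirm with:
    ip route | awk '/^default/ {print $3}'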
[22:41:42] cool. then i'll just shift to a different project for today. good luck!
[22:42:58] Krinkle: I guess I applaud that someone (K.unal I presume) set up Puppet for the project, but that also adds a lot of mess when you don't have an active SRE attached to the project as well.
[22:43:17] yeah..
[22:43:19] and yes, a project-local puppetmaster is the most direct hack to test puppet things
[22:43:49] <3 to get clojure errors in this jruby application
[22:44:25] I personally stopped recommending that anyone use custom Puppet to manage their instances quite a while ago. It is just too hard for mortals.
[22:44:56] so what I'll do is explore these two options: 1) change the frontend/Dockerfile in codesearch.git to expose apache-php on a different port, and time that change with an SRE merging a variant of my puppet patch that uses --network=host and drops the port map. or option 2) find someone who can help me fix the iptables deny rule so that the host IP is actually reachable, and then I can continue as-is without any major changes.
[22:45:09] I'm pretty sure I can get #1 to work.
[22:46:53] Krinkle: you can probably ask here nicely for help with the puppet merge when you are ready. there are nice SRE folks here who can do the needful if they aren't stuck in a mess at that time
[22:47:19] or of course you have other places to find SREs
[22:49:04] I have had good luck using podman instead of Docker for simple deployments -- https://wikitech.wikimedia.org/wiki/Developer.wikimedia.org#Demo_server
[22:49:58] The slightly more complicated toolhub dev server uses docker-compose today, but I want to try https://mohitgoyal.co/2021/04/23/spinning-up-and-managing-pods-with-multiple-containers-with-podman/ when I rebuild it next
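For reference, a minimal sketch of the pod approach bd808 describes, with hypothetical image and port names; containers in a pod share one network namespace, so the frontend can reach the proxy on localhost with no bridge or iptables work, and only the pod itself publishes ports:

    # The pod owns the published port (host 8080 -> pod 80, as an example)
    podman pod create --name codesearch -p 8080:80
    # Both containers join the pod and share its localhost
    podman run -d --pod codesearch --name hound-proxy hound-proxy-image
    podman run -d --pod codesearch --name frontend frontend-image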