[07:25:11] dcaro: I wonder if T373293 is related to T373243
[07:25:12] T373293: [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293
[07:25:12] T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243
[07:26:04] yep, I have a similar suspicion, I'm trying to find a smoking gun on that regard
[07:39:20] found it
[07:39:26] ```{"error":["Error getting quota from Harbor: Head \"https://tools-harbor.wmcloud.org/api/v2.0/projects?project_name=tool-automated-toolforge-tests\": dial tcp: lookup tools-harbor.wmcloud.org: i/o timeout"]}```
[08:18:45] !log tools scale up cordens deployment to 4 replicas
[08:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:38:54] hello i have a cloud vps project calle etytree
[09:39:00] hello i have a cloud vps project called etytree
[09:39:26] I would like to update it as it was running on Buster, which is deprecated
[09:39:55] I want to create a new instance, but I don't remember how to do it
[09:40:15] can anyone help?
[09:40:59] managing instances happens on https://horizon.wikimedia.org/ (log in with your wikitech username + password)
[09:41:24] ok great thanks a lot!
[09:41:47] and then select the right project in the upper left corner (it might default to “bastion” which is not what you need)
[09:41:51] hi lucas, good to hear from you!
[09:43:56] hi 👋
[11:06:56] !log tools manually deleted the coredns pods that had been around for 4d
[11:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:17:35] that did not help xd
[11:46:27] hi, FYI prometheus eqiad can't scrape the openstack metrics, not sure if known already
[11:46:30] Get "http://cloudcontrol1007:12345/metrics": EOF
[11:46:37] I'm looking at this https://prometheus-eqiad.wikimedia.org/cloud/targets?search=#pool-openstack
[12:09:21] godog: that's interesting, I've been seeing issues with the tools-prometheus specific ones, but not openstack, will look, though there's some fires going on
[12:09:45] !status DNS issues inside k8s T373243
[12:09:46] T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243
[12:09:52] dcaro: ack, thanks for taking a look
[12:14:32] it seems to be broken :/
[12:14:32] ERRO[0018] enabling exporter for service identity failed: Missing input for argument [auth_url] source="main.go:169"
[12:14:45] andrewbogot.t: ^ maybe you know why? last upgrade?
[12:17:31] I think that the error is actually ERRO[0003] enabling exporter for service network failed: cloud novaadmin does not exist in clouds.yaml source="main.go:169"
[12:23:46] I'm only here for 10 minutes but I need more context. What's the 'it' in 'it seems to be broken'?
[12:25:48] oh, yep, it's the authentication from the openstack prometheus exporter to openstack itself
[12:26:39] I'm not sure why/when clouds.yaml would've changed but does it really need admin creds to monitor?
[12:27:05] I don't think so, but that's what it's trying to use
[12:27:20] what host is that?
[12:27:24] it loads novaenv and uses /etc/prometheus-openstack-exporter.yaml
[12:27:30] cloudcontrol1007
[12:27:35] the wrapper script has your name xd
[12:28:21] wait no, it's artur.o
[12:28:36] so yep, probably got broken when we moved the cloud.yaml stuff not long ago
[12:28:50] yeah, seems like
[12:56:00] Good morning, can someone reboot the Flickrreview bot? it gets stuck every now and then and reboot fixes it
[12:56:01] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.yifeibot/SAL
[13:05:45] I feel like it’s probably affected by T373243 so I don’t know how much restarting it right now is going to help
[13:39:26] Oh ok, the DNS issue .. got it.
[13:57:51] I can't open phabricator, I get error *Too many requests 429*
[14:09:24] If you are having DNS issues, can you retry now? I just cordoned a few of the worker nodes that were misbehaving, and did some tests, would be good to have more data before declaring the issue 'workedaround'
[14:10:15] !status DNS issues tentatively solved, please report any incidents T373243
[14:10:36] I guess I can try restarting Flickrreview then
[14:10:39] dcaro: where did you see 'cloud novaadmin does not exist in clouds.yaml'?
[14:10:52] sikander: are you still around and can check if it works better? (once I restart it ^^)
[14:11:07] andrewbogott: yep, sorry, lots of stuff at the same time, I was running the command manually like the service would (using the wrapper and such)
[14:11:12] there was nothing in journal
[14:11:28] what command? I'm way behind
[14:11:45] (take your time)
[14:11:46] !log lucaswerkmeister@tools-bastion-13 tools.yifeibot kubectl rollout restart deployment flr # after reports of FlickreviewR 2 not working on IRC
[14:12:01] andrewbogott: /usr/bin/prometheus-openstack-exporter --web.listen-address=":12345" --os-client-config=/etc/prometheus-openstack-exporter.yaml --disable-slow-metrics eqiad1
[14:12:05] thx!
[14:12:30] you might need to stop the current service, as it takes the port (and it just fails and restarts when curled)
[14:13:35] this is going to seem like a weird question, but... what was your cwd when you ran the command?
[14:14:08] ok looks like https://commons.wikimedia.org/wiki/Special:Contributions/FlickreviewR_2 is doing things again (cc sikander)
[14:14:52] stashbot: ping
[14:14:55] oops
[14:15:15] Nice, that's great to see.. thank you!
[14:16:46] !log lucaswerkmeister@tools-bastion-13 tools.stashbot ./bin/stashbot.sh restart # left #wikimedia-cloud, unknown when
[14:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[14:17:19] !log lucaswerkmeister@tools-bastion-13 tools.yifeibot kubectl rollout restart deployment flr # after reports of FlickreviewR 2 not working on IRC [originally logged 14:11 UTC but stashbot was gone]
[14:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.yifeibot/SAL
[14:17:27] there, that’s better
[14:18:01] dcaro: idk if yifeibot and/or stashbot were affected by DNS or something else but they both seem to be behaving better now fwiw
[14:18:28] lucaswerkmeister: thanks! that's good to know, let me know if it goes awry at any point
[14:35:10] https://geohack.toolforge.org/ is timing out, https://recoin.toolforge.org/ is now 404ing and https://k8s-status.toolforge.org/namespaces/tool-recoin/ is a 500
[15:02:02] lookiung
[15:03:07] !log dcaro@tools-bastion-13 tools.geohack restarting webservice due to stuck process
[15:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.geohack/SAL
[15:03:50] I think we are getting overloaded
[15:03:56] (as I took the workers out)
[15:03:57] looking
[16:51:32] Raymond_Ndibe: https://wikicontrib.toolforge.org/ seems to be sad. It looks like contraband is throwing HTTP 500 errors, but there doesn't seem to be anything useful in the uwsgi.log file there.
[22:01:23] !log melos@tools-bastion-13 tools.stewardbots ./stewardbots/StewardBot/manage.sh restart # Disconnected
[22:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL