[06:51:33] greetings
[07:52:22] * dhinus paged
[07:52:41] is toolsdb out of disk space again?
[07:53:07] looking
[07:54:08] no I think that is yesterday's page that was never resolved
[07:54:28] dhinus: ^
[07:54:34] I'll resolve it
[07:55:08] unresolved incidents re-trigger after 24h
[07:55:16] ah phew :)
[07:57:32] disk space is stable at 80% used
[08:02:57] sorry about the page!
[08:11:46] no problem!
[10:58:32] quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/100
[10:59:04] why is one 4k and the other 5k?
[10:59:20] the 5k one was changed yesterday during the incident
[10:59:29] but without updating tofu
[10:59:38] but the new one is 4
[10:59:43] don't you want it at 5?
[10:59:45] for the new instance I'd keep 4k, which is more than enough for now, and then we add the alert for disk space
[10:59:48] ah ok
[11:00:00] sorry about the language mix-up :)
[11:00:12] err_too_many_italians
[11:00:20] :D
[11:00:51] lolz
[11:05:23] * dhinus errand + lunch
[11:25:34] * godog lunch
[14:53:42] andrewbogott: which mysterious alert are you seeing?
[14:54:18] 'Reduced availability for job pdns_rec in cloud@codfw' -- I found an issue that was definitely causing that, fixed it, and the alert persists.
[14:56:07] andrewbogott: https://prometheus-codfw.wikimedia.org/cloud/targets?search=&scrapePool=pdns_rec
[14:56:12] it's failing to scrape
[14:56:34] using ip6 (not sure that's an issue)
[14:57:02] ah, it's failing on both? Because of 'reduced' I was thinking it was working on one and not the other...
[14:57:20] taavi@cloudservices2004-dev ~ $ ss -tulpn | grep 8082
[14:57:20] tcp LISTEN 0 10 127.0.0.1:8082 0.0.0.0:*
[14:57:38] pdns is working though https://prometheus-codfw.wikimedia.org/cloud/targets?search=&scrapePool=pdns
[14:57:48] that's missing the ip6 listen no?
[14:57:52] the listener on that port is bound on localhost
[14:57:59] the actual 'get' listen there is running on a prometheus host, or running right there on the cloudservices host?
[14:58:19] prometheus calls the http on the host
[14:58:44] so you should be able to test it with `curl https://....` from the prometheus host for connectivity
[14:59:01] yeah, there's not even a firewall rule for that port. So I don't know how this ever worked :)
[14:59:34] (that's where I stopped yesterday, noticing that there was no ferm rule and thinking, well, I must be wrong about how this is supposed to work)
[14:59:35] andrewbogott: the previous config format set `webserver-address`, the equivalent seems to be missing from the new yaml file
[14:59:53] there is a firewall rule to allow all traffic from the prometheus hosts
[15:00:12] oh, all traffic, so grepping for the port # doesn't work, d'oh
[15:00:19] ok, great, thanks taavi, that's a straightforward fix.
[15:00:23] oh but now it's meeting time
[18:44:54] * dcaro off
[18:57:49] I kicked off the data rsync for T409287
[18:57:49] T409287: [toolsdb] Destroy tools-db-4 and create new host - https://phabricator.wikimedia.org/T409287
[18:58:38] it should complete in a few hours, tomorrow I will continue with the setup
[18:59:20] no impact or alerts expected for now, ping me if you see anything weird :)
[18:59:23] sgtm
[18:59:33] * dhinus off
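
The check done by hand at 14:57 (is the pdns_rec webserver port bound only on localhost?) can be sketched in Python. This is a minimal sketch, not tooling used in the channel: the function names are hypothetical, and the parsing assumes the six-column `ss -tulpn` layout shown in the log, with the local address in the fifth column.

```python
import ipaddress


def listeners_on_port(ss_output: str, port: int) -> list[str]:
    """Return the local addresses bound to `port` in `ss -tuln`-style output."""
    found = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        # Assumed layout: Netid State Recv-Q Send-Q Local:Port Peer:Port ...
        host, _, port_text = fields[4].rpartition(":")
        if port_text == str(port):
            found.append(host)
    return found


def is_loopback(host: str) -> bool:
    """True if `host` (as printed by ss) is a loopback address."""
    # Drop IPv6 brackets and any interface suffix such as "%lo".
    host = host.strip("[]").split("%", 1)[0]
    if host == "*":
        return False  # wildcard bind, reachable from other hosts
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # not an address we can classify


def loopback_only(ss_output: str, port: int) -> bool:
    """True if something listens on `port` and every listener is loopback."""
    hosts = listeners_on_port(ss_output, port)
    return bool(hosts) and all(is_loopback(h) for h in hosts)


# The line pasted in the log at 14:57:20.
sample = "tcp LISTEN 0 10 127.0.0.1:8082 0.0.0.0:*"
print(listeners_on_port(sample, 8082))  # ['127.0.0.1']
print(loopback_only(sample, 8082))      # True
```

A `True` result matches the diagnosis in the log: with the listener bound to 127.0.0.1 only, a scrape coming from a separate Prometheus host cannot connect, regardless of the firewall rules.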