[00:00:24] !log wikilabels shutting down Buster VMs due to lack of response on T367562
[00:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL
[00:00:26] T367562: Cloud VPS "wikilabels" project Buster deprecation - https://phabricator.wikimedia.org/T367562
[00:04:53] !log schematreerecommender shutting down Buster VMs due to lack of response on T367552
[00:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Schematreerecommender/SAL
[00:04:57] T367552: Cloud VPS "schematreerecommender" project Buster deprecation - https://phabricator.wikimedia.org/T367552
[12:31:26] !log admin upgrade spicerack from 8.5 to 8.8 on cloudcumin*
[12:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:31:53] Can one request to take over tools of deceased contributors in Toolforge? Asking in the context of the passing of JamesR, whose bots are now blocked for various reasons https://en.wikipedia.org/wiki/Wikipedia:Bots/Noticeboard#HBC_AIV_helperbot5_and_AdminStatsBot
[15:31:54] If a takeover is possible, it will smooth the path to reactivating the processes on Wikipedia (albeit under a different bot/username).
[15:45:32] @robertsky there is https://wikitech.wikimedia.org/wiki/Help:Toolforge/Abandoned_tool_policy; I imagine rule 3 (maintainer must not object within 14 days) might be waived in this case
[15:47:32] thanks! I will give it a shot.
[15:47:40] if only the committee responsible for that did their duties
[16:03:51] JJMC89: blame me for inventing a committee but then just hoping that they would fill in the processes to keep things working over time. It would be most excellent if someone showed up with a passion for rebooting that system.
[16:05:45] in the meantime, current toolforge roots will fill in when needed
[16:07:45] IMO, WMCS should remove all the members and appoint a new committee. Then have the membership changes work like the CoCC (self-managed yearly).
[16:08:42] $1$.*+
[16:10:34] whelp. sounds like i should expect a slow response to the request.
[16:17:24] Nemo_bis, can I get a response on https://phabricator.wikimedia.org/T367528
[16:37:34] JJMC89 and pintoch: can I put y'all down as core members for the next iteration? :)
[16:38:01] yes
[16:39:40] awesome! I do think it can all work; it just leaned too hard on a handful of folks who have mostly wandered away as their personal lives changed.
[16:50:31] lucaswerkmeister, do you have any knowledge about the wikidata-dev project? I'm wondering who to follow up with about https://phabricator.wikimedia.org/T360713
[16:52:04] Amir1: same question ^ that project has 34 admins which is almost like it having none :)
[16:52:25] * bd808 taps the "no team based projects" sign
[16:56:42] gehel, bearloga, is shiny-r defunct? Can I delete it? (context: no response on https://phabricator.wikimedia.org/T367553 or on https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2024_Purge)
[16:57:09] so many ghosts in this channel, I guess I need to start working Berlin hours :(
[16:58:51] andrewbogott: replied on task: not used by me or my teams
[16:59:42] for Trove, how do I check how much space a database is actually taking up of the allocated quota?
[17:00:12] gehel: thanks! Can you give me any advice about future communication? It feels like all WMF staff have cloud-announce emails directed straight to trash, which is very frustrating.
[17:01:18] JJMC89: The quota limits how much space you can allocate to DB volumes, and the size of the DB volume is displayed on Horizon. If you're wondering how full (or unfull) a given volume is... I'm not sure :)
[17:01:53] yea, how full the volume is
[17:02:23] andrewbogott: seems like an endemic problem with any kind of communication :/ I'm struggling with the same kind of issues and haven't found a solution yet
[17:04:02] JJMC89: it's a good question. That seems like something the Trove API ought to report but as far as I know it doesn't.
[17:04:16] If you want, I can log into your host and check for you /right now/ but that's not a good general solution
[17:05:04] could you just let me know Monday when you do the maintenance?
[17:05:33] I won't know any more then than I know now :/
[17:05:42] what's the db name? I'll look
[17:07:01] instance: copypatrol-prod-db-01; host: qbyvyo2fbjk.svc.trove.eqiad1.wikimedia.cloud
[17:07:57] /dev/vdb 2.0G 978M 869M 53% /var/lib/mysql
[17:08:13] thanks
[17:08:29] lots of room to grow
[17:14:10] it seems T353010 was fixed sometime since I filed it - not sure if any actions were intentionally taken or just part of some update - anyway, thanks to whoever may have done it
[17:14:11] T353010: Cannot view database instance logs in Horizon - https://phabricator.wikimedia.org/T353010
[17:17:07] MacFan4000: can I delete the shut-down VMs in wm-bot? context: T367567
[17:17:07] T367567: Cloud VPS "wm-bot" project Buster deprecation - https://phabricator.wikimedia.org/T367567
[17:21:17] !log wm-bot deleted unused buster instances
[17:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wm-bot/SAL
[17:21:32] andrewbogott: ^^
[17:21:41] thank you!
[18:06:24] andrewbogott: given the Buster deprecation timeline, OK if I delete the Buster syslog client in auditlogging?
[18:22:11] Southparkfan: Am I remembering correctly that we need it to audit logins on other buster VMs?
[18:22:42] hi. can i somehow find out how an instance in one of my projects was deleted?
[18:22:46] nope, it's merely to verify Buster compatibility
[18:22:59] the actual syslog servers are in cloudinfra, managed by you
[18:23:43] (patchdemo4.catalyst.eqiad1.wikimedia.cloud)
[18:24:12] you could argue the servers are rather important now, because Buster is EOL (and at higher risk) and therefore we have to make sure Buster clients still forward their logs
[18:25:12] but I don't really mind if we shut it down and risk breaking compatibility, if Buster VMs are going to be removed soon :-)
[18:25:35] Southparkfan: let's just leave them until the other Buster VMs are eliminated. Most of them at least.
[18:26:15] MatmaRex: I might be able to dig up some history, will look
[18:26:31] andrewbogott: sure
[18:27:11] so many deployment-prep VMs left...
[18:29:35] yeah
[18:30:32] you could say "whatever" and do an in-place upgrade with apt-get upgrade, apt-get dist-upgrade. could be lucky and find it much easier than creating new ones. but the downsides are it would still always nag that these are based on old images even when they are upgraded.. and it goes against the whole "cattle not pets" thing.. and the puppet run could still break because of missing packages or whatever
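
A minimal sketch of the in-place route described in the message above, assuming a stock Debian Buster-to-Bullseye path; it only illustrates the apt-get upgrade / apt-get dist-upgrade idea and is not a WMCS-recommended procedure (on Cloud VPS the apt sources are often puppet-managed, so the sed step may not apply as written):

    # switch the apt sources from buster to bullseye (the security suite was renamed)
    sudo sed -i -e 's|buster/updates|bullseye-security|g' -e 's|buster|bullseye|g' \
        /etc/apt/sources.list /etc/apt/sources.list.d/*.list
    sudo apt-get update
    sudo apt-get upgrade        # minimal upgrade first, per the Debian release notes
    sudo apt-get dist-upgrade   # then the full upgrade, which may add or remove packages
    sudo reboot
    # afterwards, re-run the agent and watch for the missing-package breakage mentioned above
    sudo puppet agent --test
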
[18:31:54] arguably it is somewhat pragmatic though
[18:32:24] not to mention you don't test whether an initial bootstrap through puppet works on first/second attempt (second if the sole cause is a race condition)
[18:33:27] yea, though "puppet on first attempt" in my world is just a rare cherry on top
[18:33:41] on the second run is already pretty good :)
[18:34:32] but yea, no contest. it's nicest to just apply the role on something new and run puppet and see what happens
[18:34:37] but dist-upgrade is an acceptable last-resort action
[18:36:16] is there an open membership policy for -prep? I cannot finish the whole process on my own, but could help migrate part of the leftovers
[18:37:43] Southparkfan: thanks for checking in about the audit loggers
[18:38:08] no problem :-)
[18:38:08] Southparkfan: I'd want you to coordinate with folks in #wikimedia-releng but I'm sure they would welcome your help.
[18:38:28] (to the degree that anyone owns the project, they're your best bet)
[18:39:24] okay, makes sense to me
[18:39:46] thanks!
[18:57:08] !log commons-corruption-checker shutting down main.commons-corruption-checker.eqiad1.wikimedia.cloud due to no response on T367525 or direct emails
[18:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Commons-corruption-checker/SAL
[18:57:12] T367525: Cloud VPS "commons-corruption-checker" project Buster deprecation - https://phabricator.wikimedia.org/T367525
[19:59:52] andrewbogott: I think you might like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026193 . I added that so that we can avoid some puppet errors for cloud VPS projects that have a local deployment_server and/or local puppetserver. Not sure if you ran into it.. but it's about fixing that "Server Error: Not authorized to call search on /file_metadata/volatile/GeoIP"
[20:00:12] that should be possible now with just Hiera
[20:01:25] bd808: ah yes that sounds fun, thanks for proposing
[20:01:54] good that I wrote random garbage in this chan at the right moment :-D
[20:05:39] fixing the puppet run on deploy1006.devtools .. yay. finally
[20:06:05] that was broken for quite a while and it works now after setting "profile::mediawiki::common::load_geoip_data_from_puppetserver" to False in Hiera
[20:12:58] I got an error in Horizon and disconnected from my session on Cloud VPS just now
[20:13:32] the restricted bastion doesn't let me in right now
[20:14:04] mutante: there are a lot of alerts going off
[20:15:01] mutante: try now?
[20:15:18] RhinosF1: it works again :)
[20:15:48] mutante: there were a lot of recoveries just now
[20:16:34] ack, thanks for pointing out the feed channel
[20:22:44] I have no real theory about what went wrong. Whatever it was was quite widespread
[20:23:16] it feels like a global networking issue
[20:23:18] but only cloud
[20:23:25] I mean, just a blip though
[20:23:51] or maybe DNS was down for a short time
[20:24:54] yeah, could have been dns... was anyone logged into a bastion or VM during that blip? Did you get kicked out?
[20:25:52] mutante: ^
[20:26:21] no, but I saw this on a VPS project: Error connecting to database: SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution
[20:26:53] (it's now working fine again)
[20:27:05] andrewbogott: yes, I got disconnected from the restricted bastion
[20:27:16] ok! So not just dns
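
Since "DNS was down" and "a wider network blip" look similar from inside an instance, a rough sketch of how the two could be told apart the next time this happens; it is only an illustration, the hostname and IP below are placeholders rather than real Cloud VPS names, and the resolver address is read from the instance's own /etc/resolv.conf:

    # does name resolution work via the libc resolver (the same path the failing app used)?
    getent hosts some-instance.some-project.eqiad1.wikimedia.cloud   # placeholder name
    # ask the configured recursor directly, bypassing any local caching
    dig +short some-instance.some-project.eqiad1.wikimedia.cloud \
        @"$(awk '/^nameserver/ {print $2; exit}' /etc/resolv.conf)"
    # if names resolve but connections still fail, test reachability by IP to rule DNS out
    ping -c 3 172.16.0.10   # placeholder IP of a known-good neighbouring instance
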
[20:27:18] andrewbogott: and then for a short time it did talk to me but was like "pubkey denied"
[20:27:45] per tradition, I will wait and see if that happens again :)
[20:28:45] andrewbogott: I am actually not sure if it was disconnected from bastion or the VPS behind it.. tbh.. or timeout. but I know when I tried to get back on my key got denied and then it was all back to normal
[20:28:56] yea
[20:33:10] For anyone following along with the Toolforge standards committee discussion earlier, I have opened T370474 and will be following up to start a new nomination and vetting process for a membership refresh.
[20:33:11] T370474: Refresh membership of Toolforge standards committee - https://phabricator.wikimedia.org/T370474
[20:43:57] I need to step away for a while but please ping me if there's another storm like the last one
[22:05:02] !log deployment-prep add deployment-ircd03 (bullseye) with floating IP and irc-next.beta.wmcloud.org - T369919
[22:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[22:05:07] T369919: Replace deployment-ircd02 with a Bullseye or Bookworm host - https://phabricator.wikimedia.org/T369919
[22:14:20] how long should it take before i'm able to ssh into a newly created instance? i created one 28 minutes ago that's still not accessible. i'm wondering if that's normal or if i did something wrong
[22:14:58] (patchdemo4-production.catalyst.eqiad1.wikimedia.cloud)
[22:16:00] MatmaRex: that sounds like a long time. Have you checked the console logs via Horizon to see if Puppet ran?
[22:16:04] MatmaRex: do you happen to get an NXDOMAIN response for that FQDN?
[22:16:44] the FQDN of my recently created deployment-ircd03 instance doesn't resolve yet, it seems
[22:16:58] https://horizon.wikimedia.org/project/instances/d1818008-69c9-4a7b-b642-baeb313df663/ says "The last Puppet run was at Thu Jul 18 21:46:35 UTC 2024 (0 minutes ago)."
[22:17:21] and i created it at 21:43
[22:17:58] Southparkfan: not sure how to answer that. i can tell you what error i get, one sec
[22:18:22] Host patchdemo4-production.catalyst.eqiad1.wikimedia.cloud not found: 3(NXDOMAIN)
[22:18:26] dns is borked?
[22:18:30] https://phabricator.wikimedia.org/P66837
[22:18:44] (i have some ssh config that may be complicating things)
[22:18:45] bd808: must be it, same for deployment-ircd03
[22:19:07] the addition of an A record to beta.wmcloud.org. is also stuck at Pending
[22:19:08] andrewbogott: if you are about, we may be having some designate problems with new instances
[22:19:24] I am!
[22:19:32] I wonder if that's due to the flap earlier, let me kick things....
[22:23:43] MatmaRex: try now?
[22:24:58] I fear this is a case where having three designate servers means 3x as many failures :( I restarted things and they seem to be working for now
[22:27:28] andrewbogott: yep, works. thank you
[22:28:20] beta.wmcloud.org RR updates also going through now
[22:29:10] andrewbogott: could you check why deployment-irc03 still gives an NXDOMAIN? :/
[23:33:04] Southparkfan: It's working for me, maybe a bad cache entry?
[23:33:06] https://www.irccloud.com/pastebin/j2LPYBn9/
[23:36:35] that's in the /etc/hosts on the host itself :)
[23:36:47] southparkfan@bastion-eqiad1-03:~$ ping deployment-irc03.deployment-prep.eqiad1.wikimedia.cloud
[23:36:47] ping: deployment-irc03.deployment-prep.eqiad1.wikimedia.cloud: Name or service not known
[23:38:33] southparkfan@bastion-eqiad1-03:~$ dig @172.20.255.1 +short deployment-irc03.deployment-prep.eqiad1.wikimedia.cloud fails too
[23:39:16] whereas southparkfan@bastion-eqiad1-03:~$ dig @172.20.255.1 +short deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud gives an IP address, as expected; status is NXDOMAIN for -irc03 - is there an RRset for -irc03?
[23:40:12] thinking about it, I have to go afk rather soon, debugging is welcome
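
To follow up on the "is there an RRset for -irc03?" question, a rough sketch of one way it could be checked: the dig commands reuse the recursor address from the log above and try both spellings that appear in the discussion (deployment-irc03 and deployment-ircd03), while the openstack commands assume the CLI with the Designate plugin plus credentials that can see the relevant zone (for instance records that may require a cloud admin), and the zone name given is only an assumption:

    # compare both spellings against the recursor that was queried above
    dig @172.20.255.1 +short deployment-irc03.deployment-prep.eqiad1.wikimedia.cloud
    dig @172.20.255.1 +short deployment-ircd03.deployment-prep.eqiad1.wikimedia.cloud
    # list the zones visible to these credentials to find the real zone name, then check
    # whether Designate holds a recordset for the host and whether it is ACTIVE or PENDING
    openstack zone list
    openstack recordset list deployment-prep.eqiad1.wikimedia.cloud. | grep -i irc
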