[04:08:29] !log wikisp mars: Updating dependences
[04:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikisp/SAL
[04:14:16] !log wikisp apollo: Updating zammad to 6.1
[04:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikisp/SAL
[04:14:27] Not sure if we are aware, but i'm getting no route to host on one of my WMCS instances
[04:14:57] 172.16.6.23
[04:24:35] https://phabricator.wikimedia.org/T347661
[05:36:44] My instance is also down and I don't know why: https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=mwoffliner&var-instance=mwcurator
[06:43:41] hello, some of my tools are down apparently because they cannot connect to the tools-db MySQL server anymore:
[06:43:48] "Can't connect to MySQL server on 'tools.db.svc.eqiad.wmflabs' (115)"
[06:44:33] connecting manually (with the "mysql" command run as the tool account) gives me "ERROR 2002 (HY000): Can't connect to MySQL server on 'tools-db' (115)"
[06:55:45] I see lots of is down in -feed
[06:55:48] Including tools-db
[06:55:50] !help
[06:55:51] If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-kanban
[06:55:57] * RhinosF1 makes UBN task anyway
[06:57:15] https://phabricator.wikimedia.org/T347665
[07:06:39] hi! looking
[07:07:29] dcaro: thanks
[07:32:17] !status network issues with bullseye based instances
[07:39:51] i can't connect to my instance suddenly, any known issues atm?
[07:42:38] Danny_B: yes, working on it T347665
[07:42:38] T347665: Multiple CloudVPS instances down - https://phabricator.wikimedia.org/T347665
[07:44:35] thanks. any eta?
[08:07:13] we are slowly getting things up, I'd say 1h tops (vms that lost network need to be manually fixed through console)
[08:36:19] !log admin start script to fix networking on broken bullseye instances T347665
[08:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:36:24] T347665: Multiple CloudVPS instances lost their ips (unreachable) - https://phabricator.wikimedia.org/T347665
[08:39:01] tools-db is already up again, yay! thank you all! that seems to be a tricky situation to get out of.
[08:40:47] pintoch: cool, thanks I just fixed the VM
[09:14:50] !status most projects already back online, working on the few left
[09:57:28] Something is odd on sgebastin-11 NFS-Problem?
[09:58:16] yes, we're working on it
[09:58:23] okay
[09:59:21] Just in case … I was saving a file in vi when this happened … is there some backup?
[10:00:23] vim usually writes a temp file that you can use to recover what you were editing, but it might depend on the config
[10:01:29] "usually" … if something goes wrong, it usually goes terribly wrong :-(
[10:03:21] Wurgl: sadly everything has gone terribly wrong today
[10:03:29] All of WMCS bullseye instances lost network
[10:03:35] They are working super hard
[10:06:52] the bastions should be back up at this point
[10:28:53] !log testlabs shutdown testabscookbook-nfs-[1,2]
[10:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Testlabs/SAL
[10:33:01] Lucky me. File is not corrupted
[10:38:24] yay!
[10:49:55] !log cloudvirt-canary force-reboot all servers
[10:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudvirt-canary/SAL
[11:40:43] !status Ok, hopefully
[11:42:10] `toolforge-jobs run` is giving me "502 Server Error: Bad Gateway for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/api/v1/run/". Run from login.toolforge.org.
[11:42:44] yep, just saw that
[11:43:40] try now?
[11:45:54] Worked now, thanks
[11:46:12] !status Ok, hopefully
[12:18:09] @lucaswerkmeister: is it intentional that notwikilambda still has a restart cron job? I thought that tool was retired already
[12:18:18] uh
[12:18:20] nope
[12:18:24] let me remove that then
[12:19:03] thanks
[12:19:12] !log tools.notwikilambda kubectl delete cronjob restart (T314880)
[12:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.notwikilambda/SAL
[12:19:44] now `kubectl get all` only shows pod, service, deployment and replicaset (1 each), all from the webservice I assume
[12:20:15] yeah lexeme-forms has the same resources (except more replicasets because of the rolling restarts I do there)
[12:39:46] thanks for spotting it taavi!
[12:40:14] !log admin taavi@cloudcontrol1005 ~ $ os subnet set a69bdfad-d7d2-4cfa-8231-3d6d3e0074c9 --no-dns-nameservers --dns-nameserver 172.20.255.1
[12:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:28:55] hi folks! One quick question - I have some difficulties to ssh to vms like deployment-kafka-jumbo-9.deployment-prep.eqiad1.wikimedia.cloud, could it be related to today's outage or something different? I noticed that we are recovered but before starting to dig deep into it I wanted to double check :)
[13:29:35] let me look
[13:30:02] <3
[13:30:22] the machine has network and an ip
[13:33:23] elukey: can you try now, I ifdown+ifup the interface
[13:33:31] also what error do you get?
[13:36:45] hmm, I can't ssh either xd let me look
[13:37:06] dcaro: thanks! It basically hangs, afaict when doing
[13:37:07] Executing proxy command: exec ssh -a -W deployment-kafka-jumbo-9.deployment-prep.eqiad1.wikimedia.cloud:22 restricted.bastion.wmcloud.org
[13:39:34] can you ssh to restricted.bastion.wmcloud.org?
[13:40:06] let me guess: that instance is failing to run puppet, and so has an outdated firewall configuration which blocks access from the new restricted bastion
[13:40:25] ha, that's possible
[13:40:45] feels like it yes
[13:40:53] ssh packets arrive, but get dropped
[13:41:18] the old bastion is still running, but I already removed the config in operations/puppet so I don't think we can use it
[13:41:23] seems to be applying puppet
[13:41:38] nope
[13:41:42] ha. not the first time I've seen that on deployment-prep :P
[13:41:43] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Class[Profile::Kafka::Broker]: parameter 'statsd' expects a String value, got Undef (file: /etc/puppet/modules/role/manifests/kafka/jumbo/broker.pp, line: 10, column: 5) on node
[13:41:43] deployment-kafka-jumbo-9.deployment-prep.eqiad1.wikimedia.cloud
[13:41:49] I saw that error before too :)
[13:41:55] elukey: ^
[13:42:53] lol thanks
[13:43:07] to note that I was working on a different vm, not doing any change to kafka atm
[13:43:16] but lemme see if I can fix :D
[13:43:30] I think it has been broken for a while (that's why it did not update the bastion firewall rule)
[13:44:56] elukey: you should be able to ssh to it to debug, added the ferm rule manually, will get properly fixed on the first successful puppet run
[13:45:13] ah wow nice!
[13:46:20] otherwise fixing puppet is going to be a bit harder xd
[13:47:49] but how can you SSH if it doesn't accept connections from the bastion?
[13:48:37] I can ssh now to jumbo-9 thanks
[13:51:16] dhinus: I added the ferm rule to accept connections
[13:51:48] did you ssh through the non-restricted bastions?
[13:59:26] I did not, I used the console
[13:59:51] can you ssh from the non-restricted bastions directly?
let me try
[14:01:13] hmm, does not seem to work
[14:01:20] dcaro@urcuchillay$ ssh -vv -a -W deployment-kafka-jumbo-9.deployment-prep.eqiad1.wikimedia.cloud:22 bast_eqiad
[14:07:09] I think you can do "ssh -J fnegri@bastion.wmcloud.org root@deployment-kafka-jumbo-9.deployment-prep.eqiad1.wikimedia.cloud"
[14:07:20] (s/fnegri/yourusername/)
[14:07:54] jumping through bastion.wmcloud.org instead of restricted.bastion.wmcloud.org
[14:09:25] let me try
[14:09:35] seems to work for me
[14:09:57] oh yes, that works like a charm :), I'll save it in my history for the future
[14:10:15] but I only thought about it because you wrote "I'll use the prod one" in the other channel, speaking of something completely unrelated :D
[14:10:44] but in my mind you made me think "oh yeah we have another bastion!"
[14:11:12] that's the one the cloudcumin uses right?
[14:13:21] nope, cloudcumin uses the restricted one
[14:14:00] oh, what's that one for then?
[14:14:20] https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances#Setup
[14:14:34] according to that page "restricted" is for SREs
[14:14:44] and non-restricted for everyone else
[14:15:02] not sure if we really need that separation
[14:15:44] oh, hmm, yep, seems prone to errors of "this is up for me" when checking connectivity for users
[14:18:01] we actually have two non-restricted ones if you look at the instances in the "bastion" project, I'm not sure if they are load-balanced or not
[14:21:16] they're not, there's just some docs telling that you can use secondary.bastion.wmcloud.org instead if you want
[14:23:01] totally unblocked folks, thanks for the help!
[14:23:52] 👍
[14:24:14] taavi: thanks!
[14:59:37] is there a way to find out whether / when a tool was deleted?
I found a tool that’s linked from a Commons template and has some years-old SAL entries but it no longer has a home dir and doesn’t appear in toolsadmin
[14:59:55] (not sure if I should say which tool it is – maybe it was vanished for a reason, idk 🤐)
[15:00:26] aha, found it in the general tools sal. nevermind :)
[15:01:32] should be there yes
[15:25:23] @lucaswerkmeister: Another thing that can be checked is the /srv/tools/archivedtools directory on tools-nfs.svc.tools.eqiad1.wikimedia.cloud. When tools are removed by the "Disable tool" process the final archive ends up there.
[15:26:23] got it, thanks! (it’s indeed there too)
[16:36:31] Hi, I'm trying to understand the OAUTH procedure following the (outdated) https://wikitech.wikimedia.org/wiki/Help:Toolforge/My_first_Django_OAuth_tool and it recommends testing your OAUTH application on the beta cluster here:
[16:36:32] https://meta.wikimedia.beta.wmflabs.org/wiki/Special:OAuthConsumerRegistration/propose
[16:36:34] But to get access to it, it prompts me to confirm my e-mail address, without ever sending a confirmation code...
[16:36:35] Is there another way?
[18:06:08] @kristabaum: A proposed consumer at meta will work for the user that proposes it before it is approved. That would be the method I personally recommend.
[18:53:11] Good morning. I'm trying to add a user to our Cloud VPS server. They're setting up their Wikitech account, but I can't find any docs on how to add them to the project. I've looked at https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_user_roles_and_rights and https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances.
The project is `mwoffliner`: https://openstack-browser.toolforge.org/project/mwoffliner
[18:53:36] slightly related: I see a list of project "members" on that page, and I'm not one of them, even though I can ssh to one of the machines and deploy code there
[19:01:30] yes, you’re under “Viewers” (collapsed by default), which means you don’t have permission to add more members IIUC
[19:01:44] (more viewers or members, that is. confusing terminology)