[10:07:44] <taavi>	 !log spacemedia created project T329462
[10:07:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Spacemedia/SAL
[10:07:47] <stashbot>	 T329462: Request creation of spacemedia VPS project - https://phabricator.wikimedia.org/T329462
[11:00:10] <Guest96>	 Hello 👋
[12:20:42] <Rook>	 !log paws pywikibot to version 8 16f07db23b6e380a15297605cbbb532f312a3246 T326512
[12:20:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[12:20:46] <stashbot>	 T326512: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T326512
[13:15:52] <arturo>	 !log toolsbeta cordoned & drained k8s workers 4 to 7 to force workload to relocate to 8 (T329378)
[13:15:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[13:15:54] <stashbot>	 T329378: toolforge: latest k8s worker node have networking issues - https://phabricator.wikimedia.org/T329378
[13:32:43] <taavi>	 !log admin re-enable puppet on labstore1004 T329377
[13:32:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:32:46] <stashbot>	 T329377: [bug]  Server does not start - https://phabricator.wikimedia.org/T329377
[14:05:40] <andrewbogott>	 taavi: thank you for the nfs fix
[14:20:01] <Lucas_WMDE>	 did codesearch just go down?
[14:20:21] <Lucas_WMDE>	 getting 502 bad gateway now
[14:35:26] <Lucas_WMDE>	 (codesearch seems to be back now)
[14:48:50] <dcaro>	 We are doing some re-shuffling of the underlying ceph hosts, it seems that it has affected (and is still affecting) some of the VMs
[15:05:05] <taavi>	 !log tools deploy jobs-api updates, improving some status messages
[15:05:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:49:08] <dbb>	 hi all - visiting after a long hiatus.. my login is darkblueb and it appears to be OK
[15:49:45] <dbb>	 my mission this week is building content for FOSS4G 2023 in Prizren,  Balkans
[15:53:46] <taavi>	 hi! are you looking for help for something?
[16:03:39] <taavi>	 !log tools update maintain-kubeusers deployment to use helm
[16:03:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:06:10] <dbb>	 hi @taavi thx for replying .. 
[16:06:24] <dbb>	 I am starting out today .. so no questions yet.. 
[16:06:39] <dbb>	 I am an editor for a linux distro on maps and mapping
[19:09:59] <bd808>	 arturo: interestingly dev.toolforge.org never went down.
[19:10:09] <MacFan4000>	 I got wm-bot running, though somebody somebody will need to make sure that xmlrcs is running
[19:10:31] <MacFan4000>	 as wm-bot had an issue connecting to it
[19:10:31] <bd808>	 MacFan4000: find petan ;)
[19:10:46] <bd808>	 xmlrcs does not start on instance boot
[19:10:55] <bd808>	 it's kind of crappy like that
[19:12:33] <bd808>	 Toolforge k8s cluster still seems sad. `Error from server (Forbidden): pods is forbidden: User "stashbot" cannot list resource "pods" in API group "" in the namespace "tool-stashbot": RBAC: clusterrole.rbac.authorization.k8s.io "tools-user" not found`
[19:12:52] <arturo>	 haproxy still not fully back online
[19:13:30] <taavi>	 bd808: hm, that's not an error I'd expect. I was touching maintain-kubeusers earlier today, so might be fallout from that. looking.
[19:13:49] <ma>	 I can reproduce with stewardbots too
[19:14:07] <arturo>	 ok, blaming maintain-kubeusers then :-P
[19:17:39] <bd808>	 MacFan4000: xmlrcs is up and running now
[19:17:52] <MacFan4000>	 ty
[19:18:50] <arturo>	 taavi: I still think the k8s API is not fully back online
[19:19:11] <taavi>	 arturo: wdym?
[19:19:17] <MacFan4000>	 looks like at this point we're just missing bridgebot and stashbot
[19:19:41] <bd808>	 stashbot still cannot read it's own namespace
[19:19:50] <ma>	 `kubectl get pods` returns an error too
[19:19:54] <taavi>	 try now?
[19:20:15] <bd808>	 works! You put the clusterrole in place I take it?
[19:21:18] <taavi>	 I enabled provisioning that in the maintain-kubeusers configuration.. https://gerrit.wikimedia.org/r/c/labs/tools/maintain-kubeusers/+/888765
[19:23:11] <arturo>	 force-rebooting tools-sgebastion-10
[19:25:23] <bd808>	 taavi: I can't get anything to delete or start in the stashbot namespace. Things are just hanging out in Terminating and Pending states.
[19:25:50] <taavi>	 bd808: that's mostly expected for now
[19:26:10] <taavi>	 there are a lot of pods that are in states that k8s does not expect, so it's trying to fix all of those
[19:26:17] <taavi>	 which in turn slows everything down
[19:26:40] <taavi>	 it'll eventually get better, plus we're rebooting some frozen nodes atm which should help
[19:26:48] <bd808>	 ok. I'll go make a sandwich and see what the world is like after I eat it :)
[19:27:35] <taavi>	 !log tools force reboot tools-k8s-control-1
[19:28:05] <dcaro>	 !log tools hard rebooting k8s workers that are still down (78,76,65,60,54,49,33)
[19:29:01] <dcaro>	 !log tools hard rebooting k8s workers that are still down (continued... 70)
[19:29:46] <wm-bot>	 !log tools.stewardbots <maurelio> Reboot of StewardBot/SULWatcher due to WMCS outage.
[19:33:43] <Deus>	 After every reboot, the lamp puppet crashes. Ex: https://survey.wikisp.org (with lamp)
[19:33:44] <Deus>	 https://apollo.wikisp.org/#login (without lamp)
[19:34:07] <Deus>	 Both in cloudvps
[19:35:48] <Deus>	 I think the best solution is not using lamp puppet
[19:38:49] <RhinosF1>	 Deus: any error?
[19:38:58] <RhinosF1>	 The lamp module is fairly basic
[19:41:14] <Deus>	 Pretty explained here: https://phabricator.wikimedia.org/T321763
[19:41:32] <Deus>	 Same error and apparently same solution (rebuild)
[19:42:18] <RhinosF1>	 Lamp should not cause that
[19:43:08] <mutante>	 yea, that would be pretty unrelated
[19:43:16] <mutante>	 all puppet does is install mariadb 
[19:43:31] <mutante>	 "Table 'mysql.servers' doesn't exist" is another problem
[19:44:07] <arturo>	 all the systems should be back to normal now
[19:44:35] <arturo>	 we got some lazy bootups and some lazy NFS server connections
[19:44:52] <arturo>	 but we managed to avoid data corruption, apparently
[19:45:01] <mutante>	 arturo: I can confirm jenkins voted again, CI up. thank you
[19:45:24] <arturo>	 thanks mutante 
[19:45:49] <Lucas_WMDE>	 arturo: great news, thanks <3
[19:46:02] <ma>	 arturo: care to reboot the webservice for the k8s-status tool? Loading very slowly and not all assets
[19:46:37] <arturo>	 ma: I don't have time for that now, sorry :-(
[19:46:49] <ma>	 no prob
[19:47:34] <Deus>	 15:43 <mutante> yea, that would be pretty unrelated — Yup, it would be strange to explain the fact that two vm's that use simplelamp2 have errors with the database and the other one that does not use it, works correctly after each restart. 
[19:49:22] <wm-bot>	 !log tools.bridgebot <lucaswerkmeister> restart after general Cloud VPS outage (T.329535), deployments/pods seem up but are apparently not working
[19:49:53] <taavi>	 hm, is tools-static down?
[19:51:41] <mutante>	 Deus: what that thing does is tell the system to install mariadb server and if it's already installed it does nothing. so it would indeed be strange to correlate that instead of the storage outage when the error is not finding files under /srv/sqldata/
[19:53:28] <wm-bb>	 <lucaswerkmeister> ok, bridgebot is back up *waves at IRC*
[19:54:14] <MacFan4000>	 still no stashbot but i think bd808 just needs to give it another kick
[19:54:35] <dcaro>	 !log project-proxy reboot proxy-03 due to dns issues
[19:55:33] <taavi>	 ma: k8s-status seems to work better now that we fixed tools-static
[19:55:58] <ma>	 taavi: indeed, thank you
[19:56:15] <taavi>	 anyone have any other tools that are still down?
[19:56:16] <wm-bot>	 !log tools.stashbot <lucaswerkmeister> bin/stashbot.sh start # start back up after T.329535
[19:56:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[19:56:21] <taavi>	 and there's stashbot
[19:56:29] <MacFan4000>	 !log wm-bot full restart following outage
[19:56:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wm-bot/SAL
[19:57:46] <wm-bot>	 !log tools.bridgebot <lucaswerkmeister> restart after general Cloud VPS outage (T.329535) [originally logged 19:49 UTC, relogging now that stashbot is back]
[19:57:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[19:58:53] <wm-bb>	 <MaartenDammers> I hope that rebooting the Wikimedia Cloud is like (finally) rebooting your laptop after a while: Everything seems to be much faster 😉
[20:00:07] <mutante>	 Deus: I think a question is if /srv/sqldata has any data in it at all
[20:03:16] <Titore>	 taavi: thanks! 502 for https://sigma.toolforge.org, is this related with the current outage?
[20:03:27] <taavi>	 let's see
[20:05:39] <taavi>	 try now?
[20:05:55] <taavi>	 looks like grid engine backed webservices might need a restart (or a k8s migration :P)
[20:06:04] <Titore>	 yeah, it works. Thanks again
[20:10:36] <herzog>	 taavi: not before logrotate is available on k8s (or a different logging system is in place) please :)
[20:10:49] <darkblueb>	 ok "exciting" day to return after long vacation (!).. hope things are settling down 
[20:10:52] <mutante>	 Deus: we can't help debugging this way, I will assume you solved your issue in some unrelated way
[20:12:07] <darkblueb>	 .. if anyone cares about #osgeo things or maps tech.. our linux distro is in alpha for the 2023 setup ..  https://download.osgeo.org/livedvd/
[20:13:25] <Deus>	 mutante: It would take me some time to check what the problem is, but I will let you know in the ticket in case there is any future solution beyond rebuilding it.
[20:13:25] <Deus>	 As far as I knew, I assumed it was something with puppet because I installed it independently and it didn't error between restarts to test.
[20:13:34] <Deus>	 Uhoh
[20:14:13] <mutante>	 Deus: sounds good. so I was wondering if you used a different path from /srv/sqldata if the setup is manual
[20:14:44] <mutante>	 because that could be it. that the working dir is simply in another place
[20:15:37] <mutante>	 either way, glad to hear your data is not gone 
[20:15:49] <mutante>	 if rebuilding is possible
[20:16:43] <Deus>	 I don't remember doing a separate configuration, I know that everything was under the puppet route
[20:17:31] <mutante>	 do you see any files in /srv/sqldata now?
[20:17:46] <mutante>	 well, don't worry if you already reimaged
[20:18:19] <mutante>	 Deus: oh, look at this in another channel: 20:17 < zabe> it's now failing since mariadb is dead on deployment-db10, which seems to be the cause because the volume is not correctly mounted
[20:18:24] <mutante>	 that is NOT using the same puppet
[20:18:34] <jgleeson>	 hi folks. I see Status:OK in the subject. Is that up-to-date? 
[20:18:35] <mutante>	 and your errors match that too
[20:18:44] <taavi>	 jgleeson: yes. are you still seeing issues?
[20:19:02] * taavi pokes at deployment-db10
[20:19:03] <mutante>	 Deus: I think it's more like the volume is not mounted as well
[20:19:17] <mutante>	 Deus: same thing taavi does on deployment-db10 is a good guess
[20:19:19] <jgleeson>	 taavi: I just tested one of our proxy links, https://paymentstest2.wmcloud.org/
[20:19:23] <Guest40>	 hi https://hub.paws.wmcloud.org/ is not working
[20:19:33] <jgleeson>	 I'm gonna see if I can dig out the detials to check the console
[20:19:37] <wm-bb>	 <Egon> We had data in https://wikipathways-data.wmcloud.org/ but this is empty now... Related issue?
[20:19:50] <jgleeson>	 coming back with a 502 atm
[20:20:11] <Rook>	 Guest40: thanks for the notice. Most of cloud VPS when down, paws is being stubborn in returning :)
[20:20:28] <mutante>	 seems like a couple users all have the same problem in common
[20:20:32] <taavi>	 andrewbogott: ^ mind having a look at the volume issues? deployment-db10.deployment-prep for example
[20:20:40] <mutante>	 that is after reboot their data in /srv/sqldata is not mounted
[20:20:43] <taavi>	 jgleeson: which project is that in?
[20:20:52] <andrewbogott>	 sure, looking
[20:21:43] <jgleeson>	 taavi: fr-tech-dev I think
[20:22:02] <andrewbogott>	 taavi: there's no entry for that in fstab, is there reason to think that it wasn't just created/mounted by hand and so lost in the reboot?
[20:22:18] <Deus>	 mutante: I mounted a volume on a vm, but not for a database.
[20:22:21] <Deus>	 Let me check what is there and I will update the ticket. It may look similar, but I'm pretty sure it's not the same :)
[20:22:30] <taavi>	 jgleeson: yeah, found it. I don't see anything listening on payments.fr-tech-dev.eqiad1.wikimedia.cloud:8052, so a 502 seems expected
[20:22:41] <taavi>	 andrewbogott: not sure at all, sorry :/
[20:22:58] <mutante>	 Deus: all the other chat around us is also about missing volume
[20:23:23] <Deus>	 #wikimedia-tech ?
[20:23:35] <mutante>	 Deus: no, this channel, what other people are talking about
[20:23:37] <jgleeson>	 taavi: maybe nginx needs restarting. thanks I'll hop on and check
[20:24:13] <Deus>	 Oh, good. My net is very lazy today :(
[20:25:07] <andrewbogott>	 hm, something interesting is happening, going to reboot this host a few more times
[20:25:58] <jgleeson>	 back in business! thanks taavi 
[20:26:52] <andrewbogott>	 hmmmm what's another example taavi?
[20:27:03] <andrewbogott>	 Might just need reboots due to some kind of race with cinder
[20:27:05] <taavi>	 ask Deus or mutante
[20:27:14] <andrewbogott>	 mutante: other examples of missing /srv 
[20:27:15] <andrewbogott>	 ?
[20:28:12] <andrewbogott>	 and/or Deus ?
[20:28:20] <mutante>	 andrewbogott: mars.wikisp.eqiad1.wikimedia.cloud and in /srv/sqldata/ .. is that empty ?
[20:28:44] <mutante>	 the other was ceres-01.wikisp.eqiad1.wikimedia.cloud  but it may have been reimaged
[20:28:46] <taavi>	 mutante: hold on, is that using the wmf mariadb puppetization?
[20:28:51] <mutante>	 both /srv/sqldata seemed empty
[20:30:05] <mutante>	 taavi: yes, the datadir is /srv/sqldata when that is used
[20:30:11] <Guest40>	 Rook thanks for the info
[20:30:18] <andrewbogott>	 mutante: that host has a cinder volume mounted under /nc-data which is working
[20:30:35] <andrewbogott>	 and I don't see any other volumes attached.
[20:30:43] <andrewbogott>	 (this is mars.wikisp.eqiad1.wikimedia.cloud)
[20:30:59] <mutante>	 Deus: is nc-data your database working dir? did you already reimage mars?
[20:31:07] <andrewbogott>	 also there's quite a bit of stuff in /srv/sqldata on mars
[20:31:14] <andrewbogott>	 So I don't know what I'm looking for :/
[20:31:25] <taavi>	 mutante: well that's expected unless you manually ran mariadb_install (or whatever the command is). that puppet module is designed for the DBAs who know what they're doing and what the puppet module does, and don't want any chances that puppet would accidentally disrupt anything
[20:31:57] <Deus>	 mutante: No, I did not rebuild any of them at the moment. That task is from October 27th. I solved it on November 1 by destroying the vm and rebuilding both.
[20:31:58] <RhinosF1>	 Lamp should not be aimed at DBAs
[20:32:06] <mutante>	 taavi: what is the expected part? certainly not that files in the data_dir disppear on reboot?
[20:32:21] <RhinosF1>	 The DBAs are not using LAMP
[20:32:31] <andrewbogott>	 Yeah, if we have lots of users using puppet lamp things we should use upstream packages rather than the wmf dba packages
[20:32:38] <andrewbogott>	 I'm not sure which it defaults to at the moemnt
[20:33:04] <andrewbogott>	 But I still don't know what actual problem we're trying to understand here
[20:33:06] <taavi>	 mutante: no, they're in /var/lib/mysql/ because that's the default. puppet changed that in the configuration, but it wasn't applied until now because no-one restarted mariadb.service manually
[20:33:07] <mutante>	 I don't follow how the way that the mariadb package is installed leads to "file not found" in the data dir after a reboot
[20:34:17] <mutante>	 taavi: aha, now that makes sense. well then the bug is that it's missing the restart. at least we can fix that
[20:34:31] <mutante>	 or we could change the default path
[20:35:04] <taavi>	 you can't do either of those with the current puppetization easily without breaking DBA workflows
[20:35:21] <mutante>	 andrewbogott: you are not looking for anything. it's just that "file not found" seemed like missing mount
[20:35:29] <andrewbogott>	 oh I see
[20:35:34] <mutante>	 DBA is using a different module afaict
[20:35:41] <RhinosF1>	 taavi: why are DBAs going near LAMP?
[20:35:45] <andrewbogott>	 ok, then I'm going to ignore the 'missing volume' conversation until I'm re-pinged.
[20:35:48] <RhinosF1>	 They shouldn’t share
[20:35:52] <RhinosF1>	 At all
[20:36:13] <RhinosF1>	 They use different packages and are for different purposes
[20:36:14] <mutante>	 Deus: we can fix your issue by changing the data dir in the config
[20:36:15] <andrewbogott>	 RhinosF1: it's literally the debian database package that we're talking about
[20:36:20] <taavi>	 RhinosF1: O:simplelamp2 uses P::mariadb::generic_server which uses the mariadb module
[20:36:58] <RhinosF1>	 taavi: then the fix is probably don’t use the same module for DBAs and generic people who just want it to work
[20:37:43] <mutante>	 Rhinos is right, DBA use role(mariadb::core)
[20:37:51] <mutante>	 which uses profile::mariadb::core
[20:38:15] <mutante>	 we do set the path to /srv/ though because that was considered a good thing at some point
[20:39:27] <mutante>	 we can just change the default to /var/lib/mysql or add the service restart
[20:41:11] <mutante>	 one thing uses "mariadb::packages", the other uses "profile::mariadb::packages_wmf"
[20:41:30] <Deus>	 mutante: good. At least I won't have to destroy and re-create the vm
[20:41:47] <taavi>	 I'm applying a temporary fix to mars
[20:43:19] <mutante>	 taavi: a fix would be to set "profile::mariadb::generic_server::datadir" in Hiera to the right data_dir
[20:44:30] <taavi>	 Deus: mariadb is back up on mars
[20:45:26] <Deus>	 Thank you!
[20:46:01] <taavi>	 mutante: no, I just moved it to the expected location to avoid messing with hiera. if we're moving the directory or fixing the puppet code to restart after config changes we should probably do it on all simplelamp instances - could you open a task for that?
[20:46:53] <mutante>	 taavi: you moved the data to /srv? ok
[20:48:41] <mutante>	 I do have different definition which option is more "messing with" (copying data round vs changing the config), but I do agree that we need a ticket to apply something to all of them
[20:54:11] <mutante>	 taavi: yea, done https://phabricator.wikimedia.org/T329571
[20:57:36] <mutante>	 so this class is used in production too, by phabricator, VRTS and parsoid/testreduce.. it's separate from "DBA databases" though
[20:58:00] <mutante>	 profile::mariadb::generic_server  that is.. as opposed to the one with the wmf packages
[21:05:09] <taavi>	 mutante: thanks! and just to clarify: the code in modules/mariadb/ is shared with production databases, the profile is not. so editing the profile is fine, but if you need to modify the mariadb::* classes you need to be careful
[21:05:21] <taavi>	 sorry if I wasn't clear enough earlier
[21:06:27] <mutante>	 taavi: one step beyond that, profile::mariadb::generic_server is also used by a few things, which are not prod databases but also not cloud VPS. but my first suggestion is to change it only in the role
[21:07:17] <mutante>	 like https://gerrit.wikimedia.org/r/c/operations/puppet/+/888800/1/modules/role/manifests/simplelamp2.pp
[21:08:24] <mutante>	 of course we could also just restart the service once.. but only if /srv/ is any better on a cloud VPS
[21:15:11] <audiodude>	 hello. I'm trying to reboot our Docker graph to get our service back up after the outage, but I can't seem to SSH to our machine
[21:15:50] <taavi>	 have you tried (hard) rebooting it from horizon already?
[21:16:00] <audiodude>	 no, I'll try that
[21:24:28] <audiodude>	 so assume for a moment that I've barely ever used Horizon before
[21:24:52] <audiodude>	 I see a bunch of names like bastion-eqiad1-04 but they don't correspond to the hostname I'm used to referring to my machine as
[21:25:30] <taavi>	 you need to switch to the correct project from the top left corner 
[21:25:51] <audiodude>	 great thanks :thumbsu
[21:31:14] <audiodude>	 okay sorry I looked, but I can't find it: where do I go to reboot the instance?
[21:31:34] <audiodude>	 I also search on wikitech wiki
[21:31:45] <audiodude>	 https://wikitech.wikimedia.org/w/index.php?go=Go&search=reboot+cloud+vps&title=Special:Search&ns0=1&ns12=1&ns116=1&ns498=1
[21:32:59] <taavi>	 audiodude: select compute -> instances from the sidebar. and then on the instance, from the "actions" dropdown on the right, select "hard reboot"
[21:36:22] <bd808>	 !log petscan Started a screen as magnus and then ~magnus/petscan/run.sh inside it
[21:36:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Petscan/SAL
[21:37:03] <audiodude>	 actions is not a drop down for me, it's just a button that says "view log"....though while clicking around I also found out that I'm not an admin on this machine, is that the problem?
[21:37:34] <taavi>	 hmm yes, you would need to be a projectadmin to do that
[21:37:37] <Eihel40>	 hello
[21:37:41] <audiodude>	 or rather not an admin on the project, just a user
[21:37:44] <audiodude>	 yes, okay thanks!
[21:37:57] <taavi>	 I can do it for you too, if you tell me the name of the project and instance
[21:38:02] <taavi>	 Eihel40: hello, do you need any help?
[21:39:06] <Eihel40>	 andrewbogott I read the last post on lists.wikimedia.org
[21:39:40] <Eihel40>	 I wanted to access https://guc.toolforge.org/?user=80.125.56.20, but I received the message "503 Service Temporarily Unavailable"
[21:39:51] <andrewbogott>	 Eihel40: do you mean the cloud-announce post?
[21:40:09] <Eihel40>	 yes
[21:40:46] <andrewbogott>	 Is that your tool?  If so I'd suggest a 'webservice restart' as some things on the grid didn't handle the downtime well.
[21:41:11] <bd808>	 that's one of Krinkle's tools. I can take a quick look.
[21:42:04] <bd808>	 !log tools.guc Container in CrashLoopBackOff, investigating.
[21:42:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.guc/SAL
[21:44:00] <wm-bot>	 !log tools.guc <root> Hard restart to resolve LDAP connection issue.
[21:44:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.guc/SAL
[21:44:22] <bd808>	 Eihel40: Give it a try now. I think I got it back to a working state.
[21:44:23] <audiodude>	 taavi: oh yes, that would be very helpful. The project is "mwoffliner" and the instance is "mwcurator"
[21:44:27] <taavi>	 bd808: hm, which node was the pod with the issues running now?
[21:45:12] <Eihel40>	 Whaou nice ! txs
[21:45:36] <bd808>	 taavi: I didn't look at the node. The error in the app logs was "KeyError: 'getpwuid(): uid not found: 51333'".
[21:46:10] <taavi>	 !log mwoffliner reboot mwcurator to fix ldap issues
[21:46:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Mwoffliner/SAL
[21:46:45] <bd808>	 The `kubectl logs` output for guc doesn't have timestamps, so I'm not sure when it gave up on restarting
[21:47:29] <taavi>	 audiodude: try now?
[21:47:47] <audiodude>	 taavi:  ssh working, thank you!
[21:53:00] <audiodude>	 looks like the server processes are indeed setup to start on reboot, it just didn't come up cleanly
[21:53:02] <audiodude>	 thanks again!
[21:58:45] <mutante>	 !log devtools rebooting instance gerrit-prod-1001 which can't be reached T329444
[21:58:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Devtools/SAL
[21:58:48] <stashbot>	 T329444: Issues with Gerrit test instance - https://phabricator.wikimedia.org/T329444
[22:03:24] <mutante>	 !log devtools - re-activating disabled puppet on gerrit-prod-1001 (reason given was 'gerrit deploy' but it was about 17 days ago)
[22:03:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Devtools/SAL
[22:13:27] <Fuzheado>	 PAWS still seems to be down?
[22:16:28] <Rook>	 Yeah it is still down 
[22:18:19] <mutante>	 !log devtools install package python3-certbot-apache on gerrit-prod-1001 - T329444
[22:18:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Devtools/SAL
[22:18:24] <stashbot>	 T329444: Issues with Gerrit test instance - https://phabricator.wikimedia.org/T329444
[22:22:10] <mutante>	 !log devtools certbot renew --apache fixed cert issue - https://ldapauth-gitldap.wmflabs.org/ does not exist unrelatedly - T329444
[22:22:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Devtools/SAL
[22:22:31] <mutante>	 Rook: did something else replace this?  https://ldapauth-gitldap.wmflabs.org/
[22:40:59] <Rook>	 mutante: maybe? I don't know that I've seen that link before
[22:42:08] <mutante>	 Rook: hmm, ok, me neither. maybe it was local to this project but I dont know
[22:54:33] <bd808>	 mutante: it used to be a part of the git project that Paladox took care of based on T227729
[22:54:33] <stashbot>	 T227729: Can't log into gerrit.git.wmflabs.org with account from ldapauth-gitldap.wmflabs.org - https://phabricator.wikimedia.org/T227729
[22:56:07] <bd808>	 mutante: The general issue is that it is against the Cloud VPS TOU to use the real LDAP server for auth in a service like Gerrit or Phabricator. We do this to try and keep folks from accidentally capturing login info that also works in prod.
[22:57:08] <mutante>	 bd808: aha, the old Paladox project. that explains even older docs
[22:57:12] <bd808>	 When Paladox first setup a test gerrit instance in Cloud VPS it was using the real LDAP server. We got him to fix that by setting up his own project local LDAP directory.
[22:57:37] <mutante>	 ACK, this probably means it's not really fixable then
[22:59:09] <bd808>	 It is possible to run a MediaWiki pointed at a private LDAP directory. This might be what the former service was. There is something like that in the Striker project to act as a fake wikitech.
[22:59:42] <mutante>	 thanks! not sure though if it's realistic that we do that again and setup an LDAP server. we should probably ask _why_ they want a login first
[23:00:05] <mutante>	 also would mean yet another quota request
[23:01:15] <mutante>	 bd808: in that case.. the current setup is against TOU again :(
[23:01:21] <mutante>	 because it "just works"
[23:01:28] <bd808>	 I was wondering about that
[23:01:42] <mutante>	 well, now that I brought the instance back up that is
[23:02:43] <mutante>	 the user is like "cool, it works again" .. so now I would have to go back to shutting it down
[23:09:25] <mutante>	 !log devtools - shutting down gerrit-prod-1001
[23:09:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Devtools/SAL
[23:26:27] <JJMC89>	 FYI, getting 503s for grafana.wmcloud.org
[23:28:04] <bd808>	 JJMC89: I'll see if the why is obvious...
[23:32:46] <bd808>	 !log metricsinfra grafana.wmcloud.org offline with db connection error. Investigating.
[23:32:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Metricsinfra/SAL
[23:35:27] <bd808>	 !log metricsinfra metricsinfra-db-1.trove.eqiad1.wikimedia.cloud not responsive to ssh
[23:35:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Metricsinfra/SAL
[23:37:17] <bd808>	 !log metricsinfra metricsinfra-db-1.trove.eqiad1.wikimedia.cloud restarted via Horizon
[23:37:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Metricsinfra/SAL
[23:40:54] <zabe>	 the gerrit copy of cloud/instance-puppet is not getting my changes, is that known?
[23:42:54] <bd808>	 zabe: it looks like the last commit there was 8 hours ago. I wonder if there is a post-outage thing that needs to be restarted somewhere?
[23:43:46] * bd808 can't remember if that is wired into the horizon server or handled on puppetmaster where the enc lives
[23:43:48] <zabe>	 that would be my guess aswell, but I don't know horizon, so I have no idea what that would be
[23:46:20] <bd808>	 It looks like this was recently moved from horizon to the end service on the puppetmaster per T318504
[23:46:20] <stashbot>	 T318504: ENC API should update cloud/instance-puppet.git instead of requiring the caller to do so - https://phabricator.wikimedia.org/T318504
[23:48:40] <bd808>	 zabe: would you mind making a phab task and linking it to T329535?
[23:48:40] <stashbot>	 T329535: Cloud Ceph outage 2023-02-13 - https://phabricator.wikimedia.org/T329535
[23:48:51] <zabe>	 sure
[23:48:59] <zabe>	 can do
[23:59:48] <bd808>	 !log cloudinfra enc-1.cloudinfra.eqiad1.wikimedia.cloud: `service uwsgi-puppet-enc restart` (T329589)
[23:59:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL
[23:59:51] <stashbot>	 T329589: gerrit copy of cloud/instance-puppet stopped replicating - https://phabricator.wikimedia.org/T329589