[12:28:37] !log admin [codfw1dev] add DB grants for cloudbackup2002.codfw.wmnet IP address to the cinder DB (T292546)
[12:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:28:41] T292546: cloud NFS: figure out backups for cinder volumes - https://phabricator.wikimedia.org/T292546
[15:00:29] !log toolsbeta Joining grid node toolsbeta-sgewebgen-09-1.toolsbeta.eqiad1.wikimedia.cloud to the toolsbeta cluster - cookbook ran by dcaro@vulcanus
[15:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[15:01:46] !log toolsbeta Joining grid node toolsbeta-sgewebgen-09-1.toolsbeta.eqiad1.wikimedia.cloud to the toolsbeta cluster - cookbook ran by dcaro@vulcanus
[15:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[15:02:20] !log toolsbeta Joining grid node toolsbeta-sgewebgen-09-1.toolsbeta.eqiad1.wikimedia.cloud to the toolsbeta cluster - cookbook ran by dcaro@vulcanus
[15:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[15:10:32] I'm getting a lot of 500s on my Toolforge webservice; I think the issues started in earnest yesterday after I stopped and started my container (following the wikitech-l announcement)
[15:10:47] tool name?
[15:10:52] bullseye
[15:11:06] they're not showing up in my uwsgi.log, so I don't think they're ever reaching the service itself
[15:12:51] GenNotability: 500s or some other 5xx code?
[15:13:09] majavah: straight-up 500
[15:13:35] it's intermittent, I am occasionally reaching the service successfully
[15:14:20] did you restart it yourself about 7 minutes ago?
[15:14:59] or stop/start
[15:16:14] I did
[15:18:12] hmm, I can reproduce too, but the error doesn't match what usually comes out if our proxy layer is acting up
[15:18:42] and now I am seeing 500s in my logs
[15:18:44] sigh
[15:18:55] okay, I'll look at my end a bit more, thanks for the help
[15:19:24] they're showing in https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&var-namespace=tool-bullseye&refresh=5m&from=now-12h&to=now too, which iirc the proxy-layer ones don't
[15:20:29] neat, didn't know about that tool
[15:21:22] there's also https://k8s-status.toolforge.org/namespaces/tool-bullseye/
[15:22:00] the problem is that there are 3 layers of proxies between the public internet and a webservice running on k8s; one of them speaks only TCP, and the two remaining layers that understand HTTP requests don't currently provide very useful logs/metrics for most troubleshooting
[15:23:20] that graph comes from the layer that directly interacts with your tool's own Kubernetes pod, so if it thinks you're serving 500s, the problem is probably somewhere around your tool
[15:24:35] aha! "User s54856 already has more than 'max_user_connections' active connections"
[15:24:41] guess I have dead connections somewhere
[15:24:42] that'll do it
[15:25:32] any idea what max_user_connections is right now?
[15:26:31] I think it's 10 per user by default
[15:26:33] hmm
[15:26:46] I guess I could hit that if multiple people are running queries...
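The error quoted at 15:24 means MySQL refused a new connection because the tool's database account already held its full quota of simultaneous connections (per the log, believed to be 10 per user by default). A minimal sketch of how one might count the account's currently open connections, assuming pymysql, the standard Toolforge replica.my.cnf credentials file, and an illustrative ToolsDB hostname; none of these details come from the log itself:

```python
# Minimal sketch (not from the log): count how many connections this tool's
# database user currently holds.
# Assumptions: pymysql is available, credentials live in the tool's
# replica.my.cnf, and the ToolsDB hostname below is correct for the environment.
import configparser

import pymysql

cfg = configparser.ConfigParser()
cfg.read("/data/project/bullseye/replica.my.cnf")  # tool home path assumed
user = cfg["client"]["user"]

conn = pymysql.connect(
    host="tools.db.svc.wikimedia.cloud",  # ToolsDB host, assumed
    user=user,
    password=cfg["client"]["password"],
)
try:
    with conn.cursor() as cur:
        # information_schema.processlist lists the sessions this account is
        # allowed to see, i.e. its own connections.
        cur.execute(
            "SELECT COUNT(*) FROM information_schema.processlist WHERE user = %s",
            (user,),
        )
        (open_connections,) = cur.fetchone()
        print(f"open connections for {user}: {open_connections}")
finally:
    conn.close()
```

If the count sits at the limit while the webservice is idle, that points to connections being opened and never closed, which matches the "dead connections" guess above.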
[15:28:13] also note that you should be creating new connections for each request instead of doing any connection pooling for the replicas and ToolsDB; otherwise we would have a gazillion open connections not doing anything but still consuming some resources
[15:31:01] as far as I know I'm not pooling
[15:32:01] afaik I have two cases where I hit the DB - either checking user logins or getting cached API hits
[15:48:16] well, I learned I was being really dumb with how I did caching, so I redid that and now things are good
[20:27:37] *finds docs on making VM disk larger*
[20:45:33] How do I make use of this magical 40GB ephemeral disk?
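The advice at 15:28 (open a fresh database connection for each request and close it when the request finishes, rather than pooling) could look like the following for a Flask + pymysql webservice. This is a sketch under assumptions: the app structure, credentials path, hostname, database name, and table name are illustrative, not details from the log.

```python
# Sketch of the "one connection per request" pattern from the 15:28 note,
# assuming a Flask + pymysql webservice. Hostname, credentials path, database
# and table names are hypothetical.
import configparser

import pymysql
from flask import Flask, g

app = Flask(__name__)

cfg = configparser.ConfigParser()
cfg.read("/data/project/bullseye/replica.my.cnf")  # credentials path assumed


def get_db():
    """Open a connection lazily, scoped to the current request only."""
    if "db" not in g:
        g.db = pymysql.connect(
            host="tools.db.svc.wikimedia.cloud",  # ToolsDB host, assumed
            user=cfg["client"]["user"],
            password=cfg["client"]["password"],
            database=cfg["client"]["user"] + "__cache",  # hypothetical database
        )
    return g.db


@app.teardown_appcontext
def close_db(exc):
    """Close the connection when the request ends so idle sessions never
    accumulate against max_user_connections."""
    db = g.pop("db", None)
    if db is not None:
        db.close()


@app.route("/lookup/<title>")
def lookup(title):
    # hypothetical cached-API-hit lookup, mirroring the use case described at 15:32
    with get_db().cursor() as cur:
        cur.execute("SELECT payload FROM api_cache WHERE title = %s", (title,))
        row = cur.fetchone()
    return (row[0], 200) if row else ("not cached", 404)
```

Closing the connection in the teardown handler bounds the number of simultaneous connections to the number of in-flight requests, which keeps a low-traffic tool well under the per-user limit discussed earlier.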