[06:45:53] greetings
[07:19:23] ok toolsdb is out of space, investigating in T409244
[07:19:24] T409244: Could not connect to database: User s51080 already has more than 'max_user_connections' active connections - https://phabricator.wikimedia.org/T409244
[07:23:49] not sure if/how to grow a cinder volume at least temporarily
[07:30:18] actually flipping to tools-db-6 might be easier? it has a bunch more space
[07:33:55] I'm not confident enough with toolsdb yet to choose which to do tbh
[07:41:37] ok I'll start paging folks
[07:44:38] heh, if I remember how to trigger manual paging for splunk oncall that is
[07:49:25] doh, of course, I created an incident
[07:50:53] dcaro: apologies in advance for waking you up
[07:50:55] Hi!
[07:51:27] I'll be on a laptop in a sec
[07:51:52] ok, tl;dr tools-db-4 is out of disk space, tracking in T409244
[07:51:53] T409244: Could not connect to database: User s51080 already has more than 'max_user_connections' active connections - https://phabricator.wikimedia.org/T409244
[07:53:47] ack
[07:54:09] I see the trend looks kinda continuous
[07:55:08] you mention that there's tools-db-6 that has more space?
[07:55:50] yes, it is the replica afaict
[07:56:43] yep
[07:56:47] can't say I'm confident enough to do a failover though
[07:56:51] https://www.irccloud.com/pastebin/UaoPGduL/
[07:57:53] another option would be to turn off mysql on tools-db-4, extend the volume and start again
[07:58:42] indeed, seems easier to me than https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/ToolsDB#Changing_a_Replica_to_Become_the_Primary
[07:59:09] okok, let's do that then, resize the volume
[08:00:03] ok, would you mind driving, dcaro? I'm happy to assist
[08:00:05] https://wikitech.wikimedia.org/wiki/Help:Adding_disk_space_to_Cloud_VPS_instances#Extend_a_volume
[08:00:30] okou
[08:01:18] godog: can you create the incident document? I'll stop mysql
[08:01:19] https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Incident_response_process
[08:01:27] yes will do dcaro
[08:01:37] thanks!
[08:02:55] https://docs.google.com/document/d/1GebJarU39x_7qnaAuYtZVNtzpNBNON09Abrm5AhOIN8/edit?tab=t.0 filling in details
[08:07:04] thanks! mysql is taking some time to stop, I was looking at the sizes in the meantime
[08:08:23] the ibdata1 file is significantly bigger on the main node, I would guess that's expected though, but not sure
[08:08:55] indeed
[08:11:20] hmm.... there was something growing considerably in the last couple of days though
[08:11:42] (from the graph in the task https://phabricator.wikimedia.org/T409244)
[08:12:10] well, scale is just 1% free xd
[08:12:38] yes, and also over 30d that rate of space used is not unusual
[08:12:49] yep
[08:13:01] do we have it split per db?
[08:13:05] (just curious at this point)
[08:13:13] not from mysqld-exporter afaics
[08:13:25] mysql is still writing to disk before stopping
[08:13:31] though my money is on ibdata1 growing
[08:14:23] https://www.irccloud.com/pastebin/YlyodQ1i/
[08:15:07] feels like it was just growing "slowly" and the ibdata1 probably grows and shrinks, until the base data grew enough for the next ibdata1 growth spurt to exhaust the space
[08:17:36] could be, though on db-6 free space actually goes up
[08:18:59] shutdown is long...
[08:21:32] I could have tried to delete some binary logs before trying to stop mysql
[08:21:37] (now I can't xd)
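
Purging old binary logs is usually the quickest way to claim space back while the server is still running. A rough sketch of what that could have looked like, assuming the replica has already applied the logs being dropped (the 3-day retention window here is arbitrary):

    # on tools-db-6: confirm which primary binlog the replica has executed up to
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Relay_Master_Log_File|Exec_Master_Log_Pos'

    # on tools-db-4: list the binlogs, then drop anything safely behind the replica
    sudo mysql -e "SHOW BINARY LOGS;"
    sudo mysql -e "PURGE BINARY LOGS BEFORE NOW() - INTERVAL 3 DAY;"
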
[08:23:43] hmm... so the growth has been ~10% in the last month
[08:23:56] that means ~400G/month
[08:24:33] it feels like a huge growth :/
[08:26:21] indeed, and I bet that's only ibdata1 growth since db-6 does not show the same trend
[08:28:28] mysql is still writing and such, 18 min now :/, not sure how long to wait
[08:29:16] actually, 28 min since I ran `systemctl stop mariadb`
[08:29:52] not sure either how long to wait tbh
[08:29:56] last log is the one I passed before
[08:30:44] it might get hung by the lack of disk space, would be nice if it was a bit more verbose (or retained the socket open for me to inspect)
[08:31:49] dhinus: let us know when you are around, you might have a better idea of the time it takes
[08:33:01] I'll be around shortly
[08:33:34] also note that free disk space did jump a little bit since the stop, with 24G available now
[08:33:54] if that goes down again, sth else we could do is not reserve blocks for root on the filesystem
[08:35:03] unfortunately mariadb sometimes just refuses to stop and you have to kill -9
[08:36:08] I'll give the kill a shot
[08:37:19] stopped now, umounting to extend the volume
[08:37:30] Disk space should be similar in the replica, if it's smaller there is something odd
[08:37:53] I'll add 1T more, but we should look into what grew
[08:38:09] If you can extend the primary that would be my first call, so we can then compare the replica to make sure it's in sync
[08:38:24] ack
[08:38:36] There was definitely some fast growth, we had more than 1TB free until a few months ago
[08:40:45] resizing... `sudo resize2fs /dev/sdb`
[08:41:02] dhinus: should I reboot the VM instead of manually starting mariadb?
[08:41:25] (so it starts "afresh")
[08:41:51] Probably not required
[08:42:28] https://www.irccloud.com/pastebin/o8HVgZcc/
[08:42:41] okok, I'll start mariadb by hand then
[08:42:41] I would just systemctl start mariadb and see how it goes
[08:42:58] Check the logs with journalctl -u mariadb
[08:43:14] 👍
[08:43:15] starting
[08:43:25] it says started
[08:43:26] Nov 05 08:43:13 tools-db-4 systemd[1]: Started mariadb.service - mariadb database server.
[08:43:46] Looks good
[08:44:43] slave connected too
[08:44:48] *replica
[08:46:00] https://www.irccloud.com/pastebin/oevc803K/
[08:46:09] that looks ok
[08:46:33] https://usercontent.irccloud-cdn.com/file/tLKbme9G/image.png
[08:46:42] https://sample-complex-app.toolforge.org/ looks ok too
[08:46:55] that's trove id :facepalm:
[08:51:24] taking a break, brb
[08:51:54] 👍
[09:03:43] small improvement for the cookbook checking the status: gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1202054
[09:03:47] (adding readonly property)
[09:07:05] we should add an alert for disk almost full
[09:07:13] (to give us some leeway next time)
[09:07:44] dcaro: yes, that alert is super important and I should've really thought about that before
[09:08:13] I was not expecting such rapid growth, but there is nothing preventing it so it can happen
[09:08:28] another option is also to move the ibdata1 to its own volume
[09:08:47] https://tool-db-usage.toolforge.org/ gives you stats by-db but it's not working atm
[09:08:48] as that's the most dynamic thing around, not sure how sound that is though (is it the same risk?)
[09:08:59] * dcaro looking
[09:11:35] there you go
[09:11:36] https://tool-db-usage.toolforge.org/
[09:11:53] it's also very weird that ibdata1 is growing this much
[09:12:19] that is used for transactions?
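
The transaction-related part of ibdata1 is the undo log: with innodb_file_per_table on, that is the piece that balloons when a long-running transaction forces InnoDB to keep old row versions around. A hedged way to watch for that from the primary (plain shell access assumed):

    # "History list length" = undo records not yet purged; a large, steadily
    # growing number usually points at a long-running transaction
    sudo mysql -e "SHOW ENGINE INNODB STATUS\G" | grep 'History list length'

    # oldest open transactions and how many rows they have modified
    sudo mysql -e "SELECT trx_id, trx_started, trx_state, trx_rows_modified
                   FROM information_schema.INNODB_TRX
                   ORDER BY trx_started LIMIT 5;"
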
[09:13:06] https://www.percona.com/blog/why-is-the-ibdata1-file-continuously-growing-in-mysql/
[09:13:28] we could also claim some space back by removing some binlogs
[09:13:29] we do have "innodb_file_per_table", so innodb1 should be metadata (small), change buffer, doublewrite buffer, undo logs
[09:13:42] s/innodb1/ibdata1/
[09:17:52] is there a way to see which db is using that space?
[09:18:23] you can check the db sizes with 'du -hs /srv/labsdb/data/* |sort -h'
[09:18:37] but the extra disk space is in ibdata1, and for that it's not as easy
[09:18:46] according to the percona page above, you can use innochecksum
[09:19:02] but that gives me "Error: Unable to lock file" so we probably need to shut down the db
[09:20:33] anyway I think we need to fail over to the replica because that ibdata1 will never shrink, apparently
[09:20:53] I'm confident the replica is in sync, and the transaction that caused this issue never completed
[09:21:07] wait, are you saying we don't have innodb_file_per_table set ?
[09:21:11] we do
[09:21:16] see the percona post above
[09:21:29] undo_log and other things are still left in the centralized ibdata1
[09:22:01] dhinus: we do from the start or we added it later?
[09:22:03] per-db sizes are normal, the only thing that grew is ibdata1, and on the primary only
[09:22:14] volans: we do since I work here :) but probably not from the very start
[09:23:01] lol, I mean if the data has been imported with a db that already had that set or not, because otherwise you still get the old data in there and yes, it can't shrink, as any ibdata file
[09:23:07] unless you reimport
[09:23:12] the fact that ibdata1 is 115G even on the replica indicates that either initially we didn't have it, or (more likely) we had another similar issue with the undo log growing too much
[09:23:52] I'll shut down the replica for a second and check what innochecksum gives me on the replica
[09:24:51] innochecksum on the replica returns "Fail: page::5 invalid" :/
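
The Percona post's approach boils down to running innochecksum against ibdata1 while the server is stopped and looking at the page-type counts (a large share of undo pages would confirm the undo-log theory). A sketch of that, with the caveat that the flag is taken from the post and, as seen above, the tool can choke on some files depending on the MariaDB version:

    # only works while mariadb is stopped, otherwise "Unable to lock file"
    sudo systemctl stop mariadb
    sudo innochecksum --page-type-summary /srv/labsdb/data/ibdata1
    sudo systemctl start mariadb
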
[09:27:25] before we flip to the replica I think we should extend the replica volume too to be the same size
[09:27:36] yep
[09:27:53] not really, because ibdata1 there is not that big, so we have space
[09:28:05] of course if they try the same query, we'll have the same problem again...
[09:28:15] so maybe we should prepare some extra space
[09:28:37] but if we have the alert, we should hopefully catch it before it uses the full disk
[09:28:49] yes, replica/primary with two different sizes is not a good situation to be in imho
[09:29:39] my plan was to stick with the current size, and nuke tools-db-4 which is the one with a different size
[09:30:23] sure, that works too, as long as they have the same size that works for me
[09:31:09] we can also decide that we're using 5TB everywhere, that's an easy change and will give us more time in case the problem reoccurs
[09:31:10] the replica is not getting the requests the primary gets though, no?
[09:31:31] I'm ok with nuking, as long as we keep a close eye
[09:31:34] it's getting different requests, but that should not affect disk size
[09:31:44] because only write requests affect disk size
[09:31:47] and the replica is read-only
[09:32:07] (just to make sure that the issue is not some queries that are still being run, and will just move to the replica once we swap them)
[09:32:34] yes, I also expect those queries will likely happen again, or maybe are already happening now, let's see if I can find them :)
[09:34:31] there are very long queries on mixnmatch happening right now
[09:34:39] not sure if they are the ones that caused the issue
[09:45:52] I'm checking the journald logs (they are on a different partition so unaffected by the disk being full)
[09:46:21] I see some disk full errors last night at 23:27 UTC, but I don't see any disk full errors after that, how is that possible?
[09:47:50] when I stopped it, mysql was not complaining about disk full either
[09:48:08] (it was full, just not complaining about it)
[09:58:55] as a data point, I found the size of ibdata1 in 2022 (102G) and Feb 2025 (103G) https://phabricator.wikimedia.org/T291782#10545551
[09:59:21] in the replica, it's now 115G
[09:59:46] in the primary, 985G
[10:00:04] 10% increase since Feb, 1% increase in the previous 3 years
[10:00:29] that's 300% increase in the increase for this year xd
[10:31:14] we do have an alert that checks if it can write to toolsdb. why did it not fire?
[10:32:16] maybe it was able to write? I was thinking the same, no pages were triggered until godog paged me and I stopped the db
[10:33:00] I guess we could check the binary logs and see if there were any transactions
[10:33:16] _if_ mariadb could write the binlogs
[10:33:30] indeed I didn't see any toolsdb alert while investigating, I'll add a followup to the incident to investigate said alert
[10:34:47] * godog errand and lunch
[10:39:18] ok, the toolsdb read/write check was removed in T313030
[10:39:18] T313030: [toolforge.infra] Replace Toolschecker alerts with Prometheus based ones - https://phabricator.wikimedia.org/T313030
[10:39:28] because we thought alertmanager alerts were enough
[10:39:59] but they only check the read/write status, they don't try writing
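
For the follow-up on the missing alert: the old toolschecker probe committed an actual write, while the current alerts only look at the reported read/write flag. A minimal sketch of what a replacement write probe could look like (the defaults-file path and the heartbeat database are assumptions, not existing infrastructure):

    #!/bin/bash
    # ToolsDB write probe: exits non-zero if a real INSERT cannot be committed.
    set -euo pipefail
    MYSQL="mysql --defaults-file=/etc/toolsdb-probe.cnf"   # credentials: assumption
    # the timeout does most of the work: on a full disk MariaDB tends to
    # block the write rather than return an error
    timeout 30 $MYSQL -e "
        CREATE DATABASE IF NOT EXISTS heartbeat;
        CREATE TABLE IF NOT EXISTS heartbeat.canary (
            id INT PRIMARY KEY,
            ts TIMESTAMP NOT NULL
        ) ENGINE=InnoDB;
        REPLACE INTO heartbeat.canary (id, ts) VALUES (1, NOW());"
    echo "toolsdb write probe OK"

Using REPLACE keeps the table at a single row, so the probe itself cannot contribute to the disk-space problem it is meant to detect.
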
[10:47:16] I'll go have a shower and some lunch
[10:49:57] I checked the binlogs and there are some transactions that were committed overnight, but much fewer than usual
[10:55:54] so it seems mysql was able to still make do without space, just slowly?
[11:05:23] kinda, but I'm not fully understanding it
[11:23:00] I'm having lunch, then I'll fail over the primary to tools-db-6
[13:17:17] after some thinking, failing over immediately might not be a good idea: if the same thing happens, we would have both hosts with a big "ibdata1"
[13:17:40] I will create another replica first (tools-db-7), from a snapshot of tools-db-6
[13:21:15] *nod* makes sense
[13:25:58] I summarized my plan in T409287
[13:25:59] T409287: [toolsdb] Destroy tools-db-4 and create new host - https://phabricator.wikimedia.org/T409287
[13:29:00] 👍
[13:33:32] alert-wise, there's a bump in the toolforge general board for toolsdb (added yesterday), but I think it's just when I stopped it
[13:33:34] https://usercontent.irccloud-cdn.com/file/igD7IG1a/image.png
[13:39:40] yes, I saw it and I think it's the restart. it was probably reporting down as soon as you tried to stop it
[13:39:49] even if it did not stop cleanly
[13:46:48] yep, it shut down the mysql socket right away, so I wasn't able to check anything either
[14:01:51] guys I'll join in 2 min
[14:02:49] topranks: ok!
[15:00:58] godog: the other MTU thing I was talking about: T409294
[15:00:58] T409294: Fix MTU on single-NIC Ceph nodes - https://phabricator.wikimedia.org/T409294
[15:02:26] taavi: neat, thank you, the cluster interface should have 9000 - 4 mtu, did I get that right ?
[15:02:41] dhinus: Before I go eat breakfast, is there anything I can do to help with the toolsdb issues? (I can't tell from backscroll if things are currently down)
[15:02:41] dhinus: there's only a few grants with ips hardcoded
[15:02:58] MariaDB [mysql]> select * from information_schema.user_privileges;
[15:04:14] for toolsdb I mean
[15:04:22] godog: yeah, I believe we need to either lower the cluster interface mtu by the .1Q size (4 bytes) or raise the public interface MTU by the same amount. the switch side is 9192 so there's room for both options
[15:05:08] andrewbogott: everything is up :), dhinus is working on getting a new full replica to replace the current primary in case it got corrupted, but there's no issues
[15:05:26] great, then I will eat :)
[15:05:46] bon app!
[15:08:15] dcaro: good point about grants, those would need to be updated for ipv6 I guess
[15:08:37] I think I'll stick with the legacy lan for now, and we can enable IPv6 later when things are more stable
[15:09:05] taavi: ok! I think I slightly prefer lowering the cluster interface mtu in the single-nic case, purely from a deployment simplicity POV, meaning we wouldn't be affecting double-nic hosts
[15:09:52] dhinus: at least go for the vxlan/v4only network instead of the legacy one
[15:10:09] godog: good point!
[15:11:03] sweet, I'll update the task
[15:11:09] taavi: ack, will do. can I change it after the VM is created?
[15:11:13] dhinus: no
[15:11:24] ok :)
[15:13:19] ah, the cookbook "create_instance_with_prefix" did output "overriding requested network VLAN/legacy to VXLAN/IPv6-dualstack"
[15:13:28] so it used the ipv6 one
[15:14:25] harsh but fair
[15:16:01] if we don't add DNS for IPv6 addresses, all connections to toolsdb will use IPv4, right? and maybe it's fine if the VM is on VXLAN/IPv6?
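
One way to confirm that clients would keep connecting over IPv4 is to check that the service name only ever resolves to A records. A quick sketch; tools.db.svc.wikimedia.cloud is the usual ToolsDB service name, adjust if needed:

    # the CNAME target plus the A/AAAA answers behind it
    dig +short CNAME tools.db.svc.wikimedia.cloud
    dig +short A     tools.db.svc.wikimedia.cloud
    dig +short AAAA  tools.db.svc.wikimedia.cloud   # empty output = clients stay on v4
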
[15:16:51] indeed, if the service record never has AAAA then all should work as expected
[15:18:20] i guess so, although we currently use a CNAME where you can't selectively only alias v4 queries
[15:18:28] ha, right
[15:19:12] tbh three quarters of the ip-specific grants use source addresses that can't even talk to the hosts and so are 100% obsolete, and the remaining grants seem easy enough to duplicate for v6
[15:19:44] topranks: I created T409309 based on what we just discussed
[15:19:45] T409309: Create new VRF and networks for Toolforge-on-Metal - https://phabricator.wikimedia.org/T409309
[15:20:23] taavi: yeah i was also thinking about it, although it's adding more work, but worst case we'll get someone complaining they can't access toolsdb and we can track down which grant is missing
[15:20:44] we need to do that work at some point anyway
[15:20:54] * dhinus is not sure he wants to go down the rabbit hole of looking at which grants we have :D
[15:21:04] select user, host from mysql.user where host != '%';
[15:21:45] ah yeah they're very few
[15:21:56] all of the 10/8 grants are obsolete, as are the 208.80.154/23 ones
[15:22:25] so that leaves you with the replication account, account creation and the quarry/superset readonly accounts
[15:22:54] which one is used for account creation?
[15:23:29] labsdbadmin
[15:23:46] does it need the 3 different ips?
[15:24:12] it is supposed to have a rule per cloudcontrol. don't ask me why the third one has not been updated for cloud-private :D
[15:24:23] ack :)
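
Duplicating the surviving grants for IPv6 sources should indeed be mostly mechanical once the list above is trimmed down. A sketch with placeholder account names and addresses (none of them real):

    # accounts locked to specific source addresses
    sudo mysql -e "SELECT user, host FROM mysql.user WHERE host != '%';"

    # dump the grants for one of them, then replay each returned GRANT with
    # the IPv6 source swapped in
    sudo mysql -e "SHOW GRANTS FOR 'exampleuser'@'203.0.113.10';"
    sudo mysql -e "CREATE USER 'exampleuser'@'2001:db8::10' IDENTIFIED BY 'samesecret';
                   GRANT SELECT ON *.* TO 'exampleuser'@'2001:db8::10';"
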
[15:31:10] taavi: thanks for that, I tried to braindump a bunch of things in the parent task that we discussed too
[15:31:53] I can imagine getting some pushback on the new vrf idea, so we should detail why it is better than trying to isolate things in other ways (for instance with calico rules on the hosts themselves)
[15:32:11] yeah, I just figured that there's enough work about this subject that needs doing that it was best to start splitting it into smaller things
[15:32:18] I've no problem with it, in my mind it's the cleanest approach
[15:32:25] yep agreed
[15:37:48] bd808: something that didn't cross my mind earlier is that tool-iw isn't the only thing doing custom ingress things, there's about 30 other tools doing the same thing: https://phabricator.wikimedia.org/P84923. I haven't checked that the new validation rules break all of them but I have to assume at least some of them would be affected :/
[16:32:56] taavi: the 6 cluebot entries are doing nothing special, they come from https://github.com/cluebotng/component-configs/blob/main/fabfile.py#L25, it's what was set by `webservice` at the time and is a placeholder until jobs/components support things natively
[17:11:19] topranks: want me to merge 'Cathal Mooney: gnmic config: merge juniper and nokia bgp collection again' ?
[17:11:48] andrewbogott: yep thanks, was just trying to do so but you have the lock
[17:11:49] thanks!
[17:11:59] done
[17:12:04] cheers
[17:30:48] PSA: I was in back-to-back meetings so I haven't started copying data to the new toolsdb host, I'll do that tomorrow morning
[17:32:20] disk usage on tools-db-4 is at 80%, so I don't expect it to fill up overnight (there's almost 1TB free)
[17:33:41] * dhinus off
[19:16:13] taavi: yeah, I'm not completely sure which ones will break but I would be surprised if none of them did. The newer version of the nginx-ingress will not support https://wikitech.wikimedia.org/wiki/User:BryanDavis/Kubernetes#Make_a_tool_redirect_to_another_tool_WITHOUT_running_a_webservice for example, because of the overly restrictive validation that is the default now.
[19:17:16] The quick check would be for any annotation with a label that starts "nginx" and a "$" in the value.
[19:24:20] * dcaro off
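
That quick check can be scripted against the tools cluster; a sketch assuming cluster-admin access and the standard nginx.ingress.kubernetes.io/ annotation prefix:

    # list every ingress annotation whose key starts with "nginx" and whose
    # value contains a "$", i.e. the ones likely to trip the stricter validation
    kubectl get ingress --all-namespaces -o json | jq -r '
      .items[]
      | . as $ing
      | ($ing.metadata.annotations // {})
      | to_entries[]
      | select((.key | startswith("nginx.ingress.kubernetes.io/")) and (.value | contains("$")))
      | "\($ing.metadata.namespace)/\($ing.metadata.name): \(.key)"'
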