[03:04:03] FIRING: PuppetFailure: Puppet has failed on ms-be1088:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:04:03] FIRING: PuppetFailure: Puppet has failed on ms-be1088:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:39:23] Apropos yesterday's discussion, I've put in https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy/-/merge_requests/16 to make the use of versioning in wmfmariadbpy more Debian-standard (and thus hopefully less generally confusing)
[09:56:53] jynus: to test the new transfer.py can I run the cloning cookbook from cumin1002.eqiad.wmnet now?
[10:30:30] <_joe_> Emperor: we usually use native debian package numbers for homegrown software
[10:30:43] <_joe_> but I personally do not care one way or the other
[10:32:35] federico3: yes, the unreleased version of transfer py is on cumin1002
[10:37:47] _joe_: I think I would tend likewise, but I sensed reluctance when I suggested just using 0.12.2 yesterday, and it wasn't a hill I wanted to die on
[11:04:03] FIRING: PuppetFailure: Puppet has failed on ms-be1088:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:14:04] I'm facing performance degradation with a complex root cause
[11:14:04] leaky pipe -> plumber called -> water distribution closed -> no coffee -> tired federico3
[11:14:19] XDDD
[11:14:47] clearly systems that critical need additional redundancy
[11:16:06] at least a backup wat- coffee tank
[11:18:25] no coffee> 😱
[11:45:26] lol
[11:45:41] marostegui: I'm stopping db1176 as a test, then cleaning up the data as per https://phabricator.wikimedia.org/T409159#11339395
[11:46:15] No need to stop it, you can just drop database
[11:47:06] I just wanted to see phabricator running with db1176 down :D
[11:47:14] haha sure :)
[11:47:15] good idea
[11:47:23] (and it did)
[12:01:27] federico3: did the cloning work?
[12:02:00] jynus: not started yet, cleaning up disk space in advance
[12:02:07] ok, thanks
[14:35:03] _joe_: do you think there should be conftool in trixie-wikimedia on apt-staging now? Looking at https://apt-staging.wikimedia.org/wikimedia-staging/dists/trixie-wikimedia/main/binary-amd64/Packages it doesn't seem to be (but also that has a timestamp of 28 Oct, which seems odd). I also see in logs
[14:35:10] Nov 04 10:29:37 apt-staging2001 gitlab-package-puller[2331423]: Skipping conftool_6.0.1+deb13u1_amd64.changes because all packages are skipped!
[14:35:58] Hm, also earlier Nov 04 10:10:38 apt-staging2001 gitlab-package-puller[2330852]: Warning: database 'trixie-wikimedia|main|amd64' was modified but no index file was exported. (maybe because of the error before that?)
[14:37:35] Maybe wmfmariadbpy_0.12.1~wmf6_amd64.changes needs explicitly ditching from incoming?
[14:57:50] jynus: the transfer ran successfully, the cookbook is hanging and it's stuck on starting replication but I don't know if it's a symptom of an issue in the file transfer, probably not
[15:01:08] is transfer.py stuck on starting replication, or is your cookbook?
[15:01:23] can I see the logs?
[15:01:59] it says "transfer succesful" at the end of a successful transfer
[15:14:08] I see the error and it doesn't seem related to transfer.py
[15:14:22] [ERROR] Exception raised while initializing the Cookbook sre.mysql.clone:
[15:14:26] rack_name = nd["rack"]["name"]
[15:14:29] KeyError: 'rack'
[15:15:03] although unsure if it is that execution
[15:15:12] 2025-11-04 14:04:07,558
[15:46:48] that's unrelated, it was a past run
[15:47:39] I ran the cookbook again, I'm seeing mysql not running at the moment on db2030 with the same sequence:
[15:47:54] https://www.irccloud.com/pastebin/gWqARDCB/
[15:48:37] if the STOP and the CHANGE commands were run successfully then mariadb is crashing after CHANGE before START SLAVE. @marostegui
[15:49:04] do you think that is related to the transfer, or do you suspect it is something else?
[15:49:27] Nov 04 13:50:05 db2230 mysqld[474753]: 2025-11-04 13:50:05 6 [ERROR] InnoDB: Error number 17 means 'File exists'
[15:49:50] I have no context on what's being done though
[15:49:56] But I am seeing lots of errors
[15:50:02] 4 13:52:45 db2230 mysqld[475343]: 2025-11-04 13:52:45 0 [ERROR] InnoDB: Error number 2 means 'No such file or directory'
[15:50:02] Nov 04 13:52:45 db2230 mysqld[475343]: 2025-11-04 13:52:45 0 [Note] InnoDB: Some operating system error numbers are described at https://mariadb.com/kb/en/library/operating-system-error-codes/
[15:50:02] Nov 04 13:52:45 db2230 mysqld[475343]: 2025-11-04 13:52:45 0 [ERROR] InnoDB: Cannot open datafile for read-only: './heartbeat/heartbeat.ibd' OS error: 71
[15:50:02] Nov 04 13:52:45 db2230 mysqld[475343]: 2025-11-04 13:52:45 0 [ERROR] InnoDB: Could not find a valid tablespace file for heartbeat/heartbeat. Please refer to https://mariadb.com/kb/en/innodb-data-dictionary-troubleshooting/ for how to resolve the issue.
[15:50:03] etc etc
[15:52:26] I'm running the clone cookbook sourcing from db2230, let me try to reproduce the sequence
[15:52:46] someone created heartbeat.old
[15:52:53] that doesn't seem right to me
[15:53:00] what's that?
[15:53:25] Anyway, I am off already but I will read the backlog tomorrow and/or the task if there's any!
[15:53:30] that was just a backup of the heartbeat dir
[15:53:38] ok thanks
[15:53:52] a backup, how?
[15:54:04] federico3: But that won't work on a mariadb level. Remember that mariadb considers any directory in the datadir a database, so it can potentially mess up the tablespace
[15:54:12] yeah
[15:54:26] there should be nothing under datadir
[15:54:30] even if it's not configured as a database?
[15:54:35] federico3: yep
[15:54:36] that is not in the data dictionary
[15:54:39] ok, deleting it
[15:54:43] thanks
[15:54:45] it may be a bit late
[15:54:52] it might have messed up stuff
[15:55:01] federico3: the problem is that it could've already messed up the tablespace
[15:55:11] But it is not really a problem, you can erase everything and clone again and it should work
[15:55:12] it is possible to do backups like that
[15:55:32] but please talk to me, I am the "backup expert" and can tell you 20 ways to do that properly
[15:55:36] federico3: Everything as in: leave the datadir completely empty
[15:55:40] yeah
[15:55:43] and stop mariadb of course
[15:55:50] I am calling it a day! byeee
[15:56:18] yeah, it is a test host, no worries
[15:56:59] I have a sql file with a mysqldump to recreate the desired dbs/tables only
[15:57:12] yeah, but you need to wipe the datadir first
[15:57:16] yep
[15:57:26] and run mysql_secure_installation
[15:58:19] I can even show you how to copy a db without stopping it
[15:58:39] transfer.py actually can do that with special options
[15:58:40] it's a test host so stopping it is ok
[15:59:00] yeah, what I mean is I can show you how to do more stuff if interested in tweaking
[15:59:08] /opt/wmf-mariadb1011/bin/mysql_secure_installation with some parameters?
[15:59:20] the typical, basedir datadir and socket
[15:59:28] depending on what the defaults are
[16:01:14] also never put backup.sql as part of datadir
[16:01:18] and basedir to /opt/wmf-mariadb1011/ ?
[16:01:34] yep, or the right installed version
[16:01:55] you can use /srv/tmp for temporary stuff or a new subdir inside /srv
[16:02:03] e.g. /srv/backups
[16:02:20] without root password and unix_socket authentication I guess?
[16:02:34] it needs access to root
[16:02:49] whatever is the default installation, normally root:socket
[16:02:52] I mean the configuration that mysql_secure_installation
[16:02:58] ...creates
[16:03:15] just run it as root, precisely it deletes all authentication except that
[16:03:53] yes I'm running it as root, but it's asking if we want unix_socket auth (no), set root password (no), and if we want to remove anonymous users (yes?)
[16:03:55] but first you have to create the dir
[16:04:07] by running mysqld once
[16:04:16] it will start with the default
[16:04:35] you mean /srv/sqldata ? It's there and belongs to mysql:mysql already
[16:04:51] are you running it on db2230 ?
[16:04:56] yep
[16:05:03] the datadir is empty
[16:05:08] it needs mysql started
[16:05:09] yep
[16:05:26] ok starting it
[16:06:53] ok, let's start again
[16:06:56] wipe datadir
[16:07:34] run mariadb-install-db from /opt/wmf-mariadb1011 aka ./scripts/mariadb-install-db
[16:08:09] (mysql runs it automatically, apparently mariadb does not, it depends on the version)
[16:08:42] ah!
[16:09:03] every version or flavour does something differently
[16:09:12] and we almost never initialize a database from 0
[16:09:27] that's why it is not automated or anything, we almost always recover from backups or an existing db
[16:09:48] now start normal (but with insecure credentials)
[16:09:52] and do
[16:09:56] enable socket
[16:10:02] not change root password
[16:10:10] and delete test tables and grants
[16:10:35] mysql does random password as default setup
[16:10:45] I believe mariadb does socket authentication by default
[16:11:33] all good, mysql_secure_installation ran successfully?
[16:11:51] I think not yet, as I can see the test db
[16:12:12] no, I just ran a drop for it
[16:12:18] noooooo
[16:12:21] that's really bad
[16:12:39] and delete test tables and grants <--- I'm referring to this
[16:12:39] now any user can create databases that start with test
[16:12:55] you have to do it with mysql_secure_installation
[16:12:59] not directly
[16:13:05] ah...
[16:13:10] (you may need to start from 0 again)
[16:13:39] yes, I'm writing down the steps
[16:13:42] stop, wipe, and do the same
[16:14:00] but don't run any sql or file handling command
[16:15:02] documentation is at: https://dev.mysql.com/doc/refman/5.7/en/mysql-install-db.html
[16:15:17] and https://dev.mysql.com/doc/refman/5.7/en/mysql-secure-installation.html
[16:15:30] but there is some stuff that mariadb does classically
[16:15:54] perhaps there could be changes in the wmf package?
[16:15:58] nope
[16:16:01] that's on purpose
[16:16:10] we never initialize the data
[16:16:20] so automation is on purpose removed
[16:16:53] you would call me and ask me to restore you a db
[16:17:09] so never do this on production :-D
[16:17:32] but telling you how to do it correctly so you know
[16:17:56] the official package wipes datadir on purge
[16:18:01] which we don't want
[16:18:29] and reruns secure installation, and we also don't want the package touching our grants
[16:18:41] so it is easier on a normal installation
[16:18:48] outside production
[16:19:31] on a normal procedure, I just do transfer --type=decompress and start mariadb and we are done :-D
[16:19:58] I can do that too for any section, if you want, takes 20 minutes
[16:24:03] so, ready to run mysql_secure_installation ?
[16:26:02] there is mariadb specific info @ https://mariadb.com/docs/server/server-management/install-and-upgrade-mariadb/installing-mariadb/installing-system-tables-mariadb-install-db/installing-system-tables-on-unix
[16:26:46] https://mariadb.com/docs/server/clients-and-utilities/deployment-tools/mariadb-secure-installation
[16:27:55] federico3 ?
[16:28:20] I'm trying to get mysql_secure_installation to start reliably
[16:28:30] let me know if I can help
[16:29:01] the test VMs have tiny disks and were meant to have no data but we might have to wipe them often so I might as well get this to work reliably
[16:29:32] do I have to log in and enable the socket before mysql_secure_installation, yes?
[16:29:58] /opt/wmf-mariadb1011/bin/mariadb-secure-installation --basedir=/opt/wmf-mariadb1011/ --socket=/run/mysqld/mysqld.sock
[16:30:08] ^ it works for me with this
[16:30:25] ah, thanks
[16:30:31] just needs mariadb running and a socket location
[16:30:31] see /root/wipe_mariadb.sh :)
[16:30:42] let me check
[16:31:28] unsure if "\nn\nn\nY\nY\nY\nY\n" is right
[16:31:37] I'm updating it now
[16:31:48] Disallow root login remotely? -> n ?
[16:31:52] y
[16:32:02] we will remove everything except root @ socket
[16:32:08] and then apply the production grants
[16:32:31] which I am guessing will be what you wanted: fully initialize a host from 0
[16:32:54] not needed if you are going to recover a backup
[16:33:08] (the production grants, I mean)
[16:33:32] well I have /root/test_data.sql
[16:33:34] no need to change root password (because there is none)
[16:33:51] let me check the grants
[16:34:00] let me check, most likely it will be not production-ready if you just did "mysqldump"
[16:34:52] uhm there's some stray data in the db table
[16:35:23] no, that doesn't work
[16:35:45] ok, let me propose you something, this is actually my domain (recovery) but it is a bit longer to teach
[16:35:58] what about I tell you the proper version tomorrow ?
[16:36:05] so you know everything
[16:36:10] on a meet?
[16:36:39] I know what you are trying to do, but that "stray data" are system tables, you cannot just drop and recreate :-D
[16:36:42] yes, that would be useful
[16:37:01] it won't be long, but it will take more time than a few minutes here on chat :-D
[16:37:03] some are test tables
[16:37:09] not really
[16:37:22] db is a system table you should never touch with sql
[16:37:31] not test as in `test*` but tables from phabricator used for testing
[16:37:45] ah, yeah, not worried about the content
[16:37:55] I mean your backup is not useful, as I supposed
[16:38:05] but I can tell you how to make it useful :-D
[16:38:30] just make sure at the end of the day the instance is stopped, so the alarm about grants doesn't go off
[16:38:41] because of too generous grants
[16:39:35] but running that backup against a production (a real one) would just corrupt the db
[16:39:42] no worries here
[16:39:55] I will teach you several ways to do that
[16:40:28] ok for now I'm stopping it
[16:41:44] look at the good side, transfer.py didn't break the db, you did :-D
[16:41:53] I can guess some of the grants come from ./modules/profile/templates/mariadb/grants/production.sql.erb and similar files
[16:41:59] yep
[16:42:07] but there are specialized tools for that
[16:42:12] aka pt-show-grants
[16:42:27] that's how you export and import data in the mysql db, you never touch it otherwise
[16:42:53] and that's why we don't backup mysql dir/db with backups, they're useless
[16:43:07] I sent you an invite to tell you the basics
[16:43:11] and then you can experiment
[16:43:16] but in a safe way
[16:43:48] moved it to wednesday, let me know if the invite works for you
[16:44:28] I cannot help you with many things, but I am literally in charge of backups and recoveries
[16:44:32] ok
[16:44:54] we also don't use mysqldump generally
[16:45:04] it needs some tuning to do the right thing
[16:45:18] plus it is blocking by default
[16:46:09] if you are happy with transfer.py (now we know why it failed, you overwrote system tables), I will do a release tomorrow
[16:52:41] so far transfer.py seemed to work well every time
[16:53:13] in any case, this will be just a milestone and we can always tune stuff afterwards
[16:53:19] the dump/restore was not meant to be used to wipe the datadir entirely but only cleanup/shrink the db files
[16:53:33] yeah, no worries
[16:53:49] this will be so you know everything so you can be more confident
[16:54:00] plus the recommended procedures
[16:54:21] there is not a single way to do stuff, but there are ways that are not recommended
[17:19:48] FIRING: PuppetFailure: Puppet has failed on ms-be1088:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[21:20:03] FIRING: PuppetFailure: Puppet has failed on ms-be1088:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
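
Editor's note: the from-scratch re-initialization walked through between 16:06 and 16:32 (stop, wipe the datadir, run mariadb-install-db, start, then run the secure-installation script rather than dropping anything by hand) can be written down as a dry-run plan. This is only a sketch for a test host, not the contents of /root/wipe_mariadb.sh and not a production procedure. The paths come from the conversation; the systemd unit name `mariadb`, the `find ... -delete` wipe, the `--datadir`/`--user=mysql` options passed to mariadb-install-db, and the helper name `reinit_plan` are assumptions made for illustration.

```python
import shlex

# Paths quoted in the chat; the basedir depends on the installed version.
BASEDIR = "/opt/wmf-mariadb1011"
DATADIR = "/srv/sqldata"
SOCKET = "/run/mysqld/mysqld.sock"


def reinit_plan(basedir=BASEDIR, datadir=DATADIR, socket=SOCKET, unit="mariadb"):
    """Return the ordered commands for a from-scratch re-init (nothing is executed).

    `unit` is an assumed systemd unit name; adjust it to the host.
    """
    return [
        # 1. Stop the server before touching any file under the datadir.
        ["systemctl", "stop", unit],
        # 2. Leave the datadir completely empty (no stray dirs, no backup.sql).
        ["find", datadir, "-mindepth", "1", "-delete"],
        # 3. Recreate the system tables; per the chat this is not automated.
        [
            f"{basedir}/scripts/mariadb-install-db",
            f"--basedir={basedir}",
            f"--datadir={datadir}",
            "--user=mysql",  # assumed: datadir is owned by mysql:mysql
        ],
        # 4. Start the server with the default, still insecure, grants.
        ["systemctl", "start", unit],
        # 5. Interactive step: removes the test db/grants and anonymous users
        #    instead of dropping them with SQL, leaving root via socket.
        [
            f"{basedir}/bin/mariadb-secure-installation",
            f"--basedir={basedir}",
            f"--socket={socket}",
        ],
    ]


if __name__ == "__main__":
    # Print the plan so it can be reviewed before anything is run by hand.
    for cmd in reinit_plan():
        print(shlex.join(cmd))
```

Printing the plan instead of executing it keeps the destructive wipe a deliberate manual step, in line with the advice above never to do this on production.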