[01:10:12] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 42 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:10:32] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 21.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:12:02] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:12:22] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[04:36:01] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 6.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[04:37:09] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[07:48:59] one option could be https://jira.mariadb.org/browse/MDEV-29988
[07:49:28] Yeah
[07:49:33] That is the only one that could be related
[07:49:42] I saw it, but it wasn't clear really
[07:49:50] indeed
[07:50:02] So I decided not to guess by myself on the bug
[07:50:18] I talked to mariadb people and they recommended filing it and they can search for related bugs and close as duplicate if needed
[07:50:41] there is also https://jira.mariadb.org/browse/MDEV-29368 but that is even more far fetched
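The alert format above reads as "measured value ge threshold": the replicas went critical because the sustained lag value hit 42 and 21.6 against a critical threshold of 2, and recovered once it dropped back under the warning threshold of 1. The production "sustained replica lag" check is driven by the monitoring stack behind the linked Grafana dashboard; the snippet below is only a minimal sketch of the same threshold logic applied directly to a replica, with placeholder connection details.

```python
import pymysql

WARNING, CRITICAL = 1, 2  # matching the "(C)2 ge (W)1" thresholds in the alerts

def replica_lag(host):
    """Read Seconds_Behind_Master from one replica (placeholder credentials)."""
    conn = pymysql.connect(host=host, user="monitor", password="secret",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            return status["Seconds_Behind_Master"]
    finally:
        conn.close()

lag = replica_lag("db1117.eqiad.wmnet")
if lag is None:
    print("CRITICAL: replication not running")
elif lag >= CRITICAL:
    print(f"CRITICAL: {lag} ge {CRITICAL}")
elif lag >= WARNING:
    print(f"WARNING: {lag} ge {WARNING}")
else:
    print(f"OK: lag {lag}s")
```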
[07:59:25] jynus: when could I stop db1117:3315 to clone another host?
[07:59:35] any time now
[07:59:41] And tomorrow?
[07:59:49] In case I don't get to do it today due to DC switch
[08:00:34] m2 backups usually take a bit, but no more than Tuesday from 0 to 12 UTC
[08:00:57] you can check it on icinga or on backupmon if it has finished for that week
[08:01:17] Yeah I normally check on zarcillo, but I don't remember when they are scheduled
[08:01:25] It is m5, not m2 :)
[08:01:36] Which I think is quite tiny
[08:02:17] this link will show you the list time of each backup: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=dump+of misc dumps start on tuesdays at 0h
[08:02:25] *last time
[08:02:49] Last dump for m5 at codfw (db2160) taken on 2023-02-28 04:38:20 (21 GiB, +0.1 %)
[08:03:09] yeah, but my issue is that I don't know when the next one is scheduled
[08:03:12] or with the command "ssh -L 8000:localhost:8000 backupmon1001.eqiad.wmnet"
[08:03:32] it will show you all ongoing backups
[08:03:47] I can also show you the schedules
[08:04:00] it is on puppet, let me get you the link
[08:04:08] yep, that's what I need
[08:05:34] this is where it is defined: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/dbbackups/mydumper.pp#104
[08:06:01] for snapshots: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/dbbackups/transfer.pp#50
[08:06:02] would it be hard to add that to pampinus?
[08:06:09] what do you mean?
[08:06:20] adding the next scheduled backup run
[08:06:31] ie: s1 last backup taken XX-XX-XX next backup scheduled for YY-YY-YY
[08:07:47] it is, with the current model as it is defined on puppet, but we can change the model (the answer is: currently it would be difficult)
[08:07:55] but if it helps we can work on that
[08:08:04] just an idea :)
[08:08:17] But it might be useful to have a full picture for this kind of thing or for maintenances
[08:08:31] like: let's not drop this today, let's wait for YY-YY-YY when the next automatic backup will run
[08:08:47] the thing is I would solve the issue by checking if a backup is ongoing right now -> wait or ask, if not -> proceed
[08:08:48] just add it to the wishlist for now I think
[08:09:00] your maintenance having precedence
[08:09:21] if backups fail at start it is not a big deal
[08:09:24] yeah, but what if the backup finished like 24h ago and I am like: "ok, let's start this!" and then the next backup runs in 2h and I break it?
[08:09:48] it is only when they fail after running for 30 hours that it is more impacting
[08:09:53] Anyways, just an idea, to have an overview on when the next one will start. Nothing to do for now, just add it to the list of ideas
[08:09:58] it is ok in that case for things to fail because no harm was done
[08:10:17] it should just be retried
[08:10:34] ^I think this is important, your maintenance has priority
[08:10:51] Not a big deal for now, just keep it in mind for future work/okrs/wishlists
[08:11:07] but only in the case of long-running backups do I ask you to wait if possible (mostly es)
[08:11:16] but only if ongoing
[08:11:57] if there is maintenance and a backup fails it should be retried, that's all
[08:13:44] also, again, with the exception of es backups, they should be mostly finished by the time you wake up (that's the intention)
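As a sketch of the "next scheduled backup" idea floated above: given a schedule as simple as "misc dumps start on tuesdays at 0h", the next run can be computed from the current time. The schedule table below is invented for illustration; the real source of truth is the puppet manifests linked above (mydumper.pp / transfer.pp), and extracting the schedule from that model is the part that makes this non-trivial today.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical schedule table; the real definitions live in puppet.
SCHEDULES = {
    # section: (weekday, hour UTC) of the dump start; 1 = Tuesday
    'm5': (1, 0),
}

def next_scheduled_run(section, now=None):
    """Return the next scheduled dump start for a section, in UTC."""
    weekday, hour = SCHEDULES[section]
    now = now or datetime.now(timezone.utc)
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    # Advance day by day until we land on the right weekday, in the future.
    while candidate.weekday() != weekday or candidate <= now:
        candidate += timedelta(days=1)
    return candidate

print(next_scheduled_run('m5'))  # e.g. 2023-03-07 00:00:00+00:00
```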
[08:15:49] marostegui: ok for me to stop and test recovery at db2184?
[08:16:12] jynus: yeah!
[08:16:20] doing
[08:24:17] for Em., I saw ms-be2067 complaining a few minutes ago, maybe a RAID issue again
[08:27:56] my intention for db2184 is to clone its datadir, drop it from a live instance and test that the recovery process is well documented and the commands "just work" as described on wikitech
[08:28:12] sounds good to me
[10:43:38] are we all set for this afternoon db-wise?
[10:44:29] yes
[10:51:11] awesome
[10:51:13] <3
[12:37:17] jynus: https://phabricator.wikimedia.org/T330861 can be done tomorrow or do you need a specific window?
[12:37:45] in this graph you can see how the clever rearrangement of load order maximizes the server performance: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=All&var-server=db2184&var-port=9104&from=1677660277289&to=1677673103866 this will be particularly useful for things like the revision or text tables, which myloader loads last
[12:38:28] db1204 is at the moment under heavy load, updating the media backups
[12:38:48] db2183 was idle precisely for the network and this maintenance
[12:38:57] maybe I can do db2183 now?
[12:39:00] I would prefer if we did codfw at any time
[12:39:07] but waited a bit for the eqiad one
[12:39:09] ok, let me do db2183 now
[12:39:17] (I will make sure to not leave it running while away)
[12:39:35] sure
[12:42:02] jynus: db2183 migrated
[12:42:30] that was fast!
[12:42:40] Cause stopping it was fast!
[12:42:57] so that way I can test it before fully committing on the other datacenter
[12:43:06] all yours now
[12:43:11] They are both dowtimed for 2h
[12:43:13] as well as we don't interrupt the current process
[12:43:15] downtimed
[12:43:34] this is a bit of a weird case because there was a backlog of 10 million files
[12:43:56] while the intention is to make backup have way less latency now
[12:44:02] *media backups
[12:47:29] ah, the other reason why I didn't start it on codfw is because we were going to start write there, and I hoped to not create any additional load or anything just in case
[12:47:41] *swift
[13:34:42] taavi: https://phabricator.wikimedia.org/T330502#8648057
[13:35:06] yeah I saw that, still waiting for a final +2 on the extension patch before proceeding with the tables
[13:35:18] sounds good
[13:35:24] from our side we are fully done
[14:13:23] I'm using transferpy (from a python script, so w/o the CLI) to transfer some files around, and it works nicely but is logging a bunch of ERRORS (something about .lock files); should I worry?
[14:17:02] ERRORs?
[14:17:27] lock files basically prevent using the same port and checksum among different transfers
[14:17:48] they should not happen unless you schedule several jobs at the exact same second
[14:18:13] I can take a look later
[14:19:51] in the cleaning up stage, AFAICT
[14:21:00] jynus: https://phabricator.wikimedia.org/P44914 is the relevant noise
[14:21:17] oh, I see
[14:21:22] (like I say, the transfer completes OK, I'm just a bit concerned)
[14:21:23] those are cumin ERRORs
[14:21:29] which sadly I cannot remove
[14:21:50] because I do checks where I expect return values <> 0 and cumin considers that an error
[14:22:03] you can't do the equivalent of cumin -x ?
[14:22:04] you can tell cumin which exit codes are ok for you
[14:22:18] volans: see, I remembered for once! :-D
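What volans is describing, sketched with cumin's Python API (the Command.ok_codes attribute linked just below); the host query, path and command here are made up for the example.

```python
import cumin
from cumin import query, transport, transports

config = cumin.Config()
hosts = query.Query(config).execute('db2160.codfw.wmnet')
worker = transport.Transport.new(config, transports.Target(hosts))

# 'test -f' exits 1 when the file is missing; listing 1 as an ok code keeps
# cumin from logging that expected outcome as an ERROR.
worker.commands = [transports.Command('test -f /srv/backups/dump.sql.gz',
                                      ok_codes=[0, 1])]
worker.handler = 'sync'
rc = worker.execute()

for nodes, output in worker.get_results():
    print(nodes, output.message().decode())
```

If memory serves, an empty ok_codes list tells cumin to accept any exit code, which is roughly what transfer.py's "check if a file exists" probes would want; that is what the ticket filed below (T330882) ends up tracking.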
[14:22:19] I guess I could do that :-D
[14:22:25] but I didn't :-D
[14:22:57] I don't care about the return value, those are commands to "check if a file exists"
[14:23:08] https://doc.wikimedia.org/cumin/master/api/cumin.transports.html#cumin.transports.Command.ok_codes
[14:23:15] honestly, when I implemented that, cumin didn't have most of those functionalities I think
[14:27:16] I will try to make it better, but as of now, yes, I check when cumin fails and act appropriately (e.g. aborting the transfer if the file about to be written already exists)
[14:27:38] you should focus on the errors from transfer.py itself
[14:29:26] Reminder, no DB maintenance till Monday :)
[14:30:40] Emperor: if used for transferring files, it does a checksum of the file on origin and on destination just to be 100% sure the transfer works well, and that is the most important part (which is what decides the final return code)
[14:35:35] jynus: I will just ignore these cumin errors for now, but if you could see your way to making transfer.py tell cumin not to be unhappy about them it would make my logs (and thus me) happier :)
[14:35:58] yeah, thanks to the improvements that should be possible now
[14:36:08] note that transfer.py predates cumin!
[14:37:40] so there was some weirdness, and features I had been requesting are now there. but yeah, my fault, because normally I run it in CLI mode and in non-verbose mode, so that had very little priority
[14:37:58] could you file a ticket so I don't forget?
[14:38:09] as I am about to go on vacation
[14:38:15] sure; how would you like it tagged?
[14:38:38] data-persistence-backups database-backups
[14:39:03] as transfer.py was primarily built for xtrabackup streams (db backups)
[14:39:37] can I ask what you are using it for? that will help me serve you better
[14:40:21] (you can put that on the ticket)
[14:42:42] as in some cases I would discourage its usage (e.g. it is not a good alternative for rsync-like transfers)
[14:43:08] jynus: T330882
[14:43:09] T330882: transferpy should take advantage of cumin's ok_codes to avoid spurious ERRORs - https://phabricator.wikimedia.org/T330882
[14:43:22] that's great- that should be fixed anyway
[14:43:55] using it for> I'm making a cookbook to deal with the ghost objects in swift; to do this I need to copy container dbs from a bunch of nodes onto the cumin node so I can do some analysis on them (calculate the set of ghost objects, etc.)
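Swift container DBs are sqlite files, so once they are copied over, the analysis can be plain Python. The sketch below only illustrates pulling listings out of the copied replicas and comparing them; the actual ghost-object calculation in the cookbook is more involved, and the sketch assumes the standard swift container schema (an object table with name and deleted columns) plus an invented directory layout.

```python
import sqlite3
from pathlib import Path

def object_names(db_path):
    """Object names listed in one copied container DB replica."""
    conn = sqlite3.connect(str(db_path))
    try:
        rows = conn.execute("SELECT name FROM object WHERE deleted = 0")
        return {name for (name,) in rows}
    finally:
        conn.close()

# Hypothetical layout: one container DB per storage host, copied to the cumin node.
replicas = sorted(Path("/srv/ghost-analysis").glob("*.db"))
listings = {db.name: object_names(db) for db in replicas}

in_all = set.intersection(*listings.values())
in_any = set.union(*listings.values())
print(f"{len(in_any - in_all)} object names are missing from at least one replica")
```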
[14:44:20] so are those like very large and many files?
[14:44:30] there will typically be 6 files
[14:44:39] and in size?
[14:44:46] (one per host) O(100M)
[14:45:06] yeah, then it should work well
[14:45:20] as in, it is a good use case for it
[14:45:34] other than the alarming log noise, it does seem to be doing the business, indeed :)
[14:45:36] (how it works is another story given the bug :-D)
[14:46:06] yeah, as I said, transfer predates cumin and I adapted it to use cumin remote execution a bit in a rush
[14:46:23] so it can be improved- this seems like an easy fix
[14:46:53] excellent :)
[14:49:01] oh, forgot one thing- the CLI defaults to not using encryption
[14:49:17] make sure that is enabled if transferring cross-dc
[14:49:27] I don't remember what the default is in the library
[14:49:37] the default on the cumin nodes is off
[14:49:52] per /etc/transferpy/transferpy.conf
[14:50:00] that is because for dbs we normally only do transfers within a dc
[14:50:15] and we gain some cycles on old servers
[14:50:28] but it should be enabled if transferring cross-dc
[14:50:41] (and on newer hosts it has 0 overhead)
[15:01:53] jynus: presumably for occasional uses it's OK to just enable it (rather than having my script think about whether the target system is local or not?)
[15:02:34] yeah, if your transfers can happen between dcs, just enable it
[15:02:40] all the time
[15:02:55] ta
[15:03:07] it is just on 30-hour, 20TB transfers where it is mostly noticed
[15:03:31] 100M should be done quite quickly no matter the config
[15:05:24] another thing is, if reliability is very important, make sure to do retries on your script- I didn't do that transfer.py itself because I handle that at a higher level
[15:06:06] *in transfer.py
[16:35:53] hey Emperor - would you mind if I did a test pooling of thumbor tomorrow? I imagine the 5xx swift errors will recur but I just want to test some behaviours while it's doing that.
[16:36:39] hnowlan: you could do; you might (or might not) want to wait until we've got the new frontend capacity online first, though?
[16:39:37] I'm oncall this week, so as long as nothing else breaks we're good :)
[16:39:38] oh, I didn't realise - sounds fine. when would that be?
[16:42:42] "hopefully soon" (if that's too vague, you might not want to wait); the h/w is I think now basically ready to go.
[16:43:11] I want to offer u.random the opportunity to do some of the addition work as cross-training
[16:45:15] (and have been somewhat procrastinating making such changes while we were mid-DC-switch)
[16:47:59] Ah yeah, understandable
[16:48:16] If it's okay with you I might try to do a brief one tomorrow, just to try to trace a request through
[16:53:03] OK; I'd prefer before 15:00 UTC if that's OK?
[17:02:49] for sure, I'll try to do it before 1300 UTC
[17:03:20] If there's anything swift-side you can think of I could/should observe please lemme know
[17:05:25] Emperor: spoiler alert: u.random does want the opportunity :)
[17:16:50] :)
[17:18:53] hnowlan: other than the thing I noted on the phab item ( https://phabricator.wikimedia.org/T328033#8609677 ) and that if we see a spike again it'd be worth checking the spike in 5xx isn't "just" swift reflecting an increase in 5xx from thumbor, I don't think so
[17:30:46] Emperor: grand. I'm fairly certain it's not the file handles issue as we don't see the concurrent spike in 5xx on the thumbor pod haproxy status codes, but I'll try to have visibility into that at the time
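Following the retry advice above (15:05), a minimal retry wrapper the calling script could put around each transfer; the transfer itself is passed in as a callable because the exact transferpy entry point the script uses isn't shown in the log.

```python
import logging
import time

logger = logging.getLogger("transfer-retries")

def run_with_retries(transfer, attempts=3, backoff=60):
    """Run a transfer callable, retrying on failure.

    `transfer` is any zero-argument callable returning 0 on success
    (a placeholder for however the calling script drives transferpy).
    """
    for attempt in range(1, attempts + 1):
        rc = transfer()
        if rc == 0:
            return 0
        logger.warning("transfer failed (rc=%s), attempt %d/%d", rc, attempt, attempts)
        if attempt < attempts:
            time.sleep(backoff)
    return rc
```

Combined with enabling encryption unconditionally for the occasional cross-DC copy, as agreed above, this keeps the calling script simple.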