[01:08:26] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 13.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:10:08] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 14 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:11:40] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:48] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[04:15:38] I'm this close to making a patch ignoring m1 replag
[07:05:13] no need, I will soon move out bacula, which is the cause of it
[07:08:45] marostegui: thanks to the new host, backups are now 4 hours faster, so I am ready when you are
[07:10:27] jynus: for m1?
[07:11:04] yes, I only need to shut down bacula and some alerts depending on it when you tell me, in advance
[07:12:21] sure, let me prepare the patches
[07:13:13] I will start the downtimes, for 2 hours ok?
[07:13:39] Amir1: do you think I can proceed with https://phabricator.wikimedia.org/P43355?
[07:13:57] yeah, it's beta cluster only
[07:14:03] ok, proceeding
[07:15:10] marostegui: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/883621/4#message-d8c56af691d3b074efc53f09560bcab2d3e806a9
[07:15:29] thanks :*
[07:15:32] jynus: sure
[07:18:18] Can I get a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/883703/
[07:20:12] doing, I am ready on my side
[07:20:19] thanks
[07:20:24] I am getting on etherpad in case it needs a restart
[07:42:02] ok, starting bacula up again
[08:31:29] bacula job alert did not recover, having a look
[08:32:27] OSError: [Errno 24] Too many open files on the prometheus exporter, restarting it
[08:32:33] jesus
[08:34:10] it raises an exception and freezes, but apparently doesn't get killed when there is no bacula to monitor
[08:37:38] that worked, we may have a gap in backup metrics, but it is not as if that requires minute resolution
[08:39:24] snapshots now only take 7 hours to run, so I am wondering whether to delay them until 0 hours
[08:40:15] what's the benefit of doing that?
[08:40:46] so the idea is to run backups when there are no humans around
[08:41:08] both to avoid load during peak time and also to make sure that when humans are around fresh backups are available
[08:41:24] e.g. when you wake up, you have all backups freshly created
[08:41:51] yeah
[08:41:58] Up to you then :)
[08:42:27] as a minor thing, it will make it easier to identify "backups from the 26th of Jan"
[08:42:37] true yeah
[08:42:48] as they will all have a timestamp on the 26th, while now some are taken on the 25th and some on the 26th
[08:49:47] jynus: ok to drop racktables db?
[08:49:55] ok for me
[08:49:58] ok
[08:55:18] marostegui: e.g. I think this is simpler to understand: https://gerrit.wikimedia.org/r/c/operations/puppet/+/883834/1/modules/profile/manifests/dbbackups/transfer.pp
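A minimal sketch of the scheduling point, assuming snapshot names are derived from the dump's start time (as the snapshot.sN.2023-01-25--19-00-02 names quoted below suggest); the dates and the 7-hour duration are illustrative only:

    $ date -d '2023-01-26 00:00:01' '+snapshot.s1.%Y-%m-%d--%H-%M-%S'   # name carries the day the backup belongs to
    snapshot.s1.2023-01-26--00-00-01
    $ date -d '2023-01-26 00:00:01 + 7 hours' '+%Y-%m-%d %H:%M:%S'      # a ~7h dump still finishes on the 26th
    2023-01-26 07:00:01
    $ date -d '2023-01-25 19:00:02 + 7 hours' '+%Y-%m-%d %H:%M:%S'      # the evening start crosses midnight
    2023-01-26 02:00:02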
[08:57:12] I would even put: 00:00:01
[08:57:14] to make it clearer
[08:57:39] normally the backups are actually generated with that, it takes a minute for things to start
[08:57:47] ah cool
[08:58:00] snapshot.s8.2023-01-25--19-00-04
[08:58:04] snapshot.s1.2023-01-25--19-00-02
[09:21:59] I am going to switch over x2 codfw
[09:22:02] which is going to be fun
[09:25:20] interesting definition of fun
[09:25:46] es data check is ongoing, it is at hewiki now, enwiki done
[09:26:03] Going to continue with the password rotation of wikiuser
[09:26:15] on second thought, it can wait
[09:26:23] I'll let the switchovers finish
[09:26:35] yeah
[09:26:38] give me a sec to make sure x2 is ok
[09:29:22] topology looks good
[09:29:49] Amir1: can you double check if x2 is working fine too?
[09:29:53] from a MW point of view
[09:30:01] let me check
[09:30:10] if edits are flowing, it should be fine
[09:30:16] the main use case is edit stash
[09:30:47] do you know if we still use redis for something?
[09:31:00] or did session store + x2 remove it completely?
[09:31:07] it was the file lockmanager for uploads but I think that is also gone
[09:31:35] I actually like redis
[09:31:40] I can ask service ops, I knew that was the goal but I am not sure if it was completed
[09:32:00] I don't think redis is the problem, but maintaining 300 different stores
[09:32:37] or more accurately, not maintaining 298 of them 0:-D
[09:32:38] Amir1: confirm all good?
[09:35:00] yup
[09:35:02] no error
[09:35:17] sweet thanks
[09:35:20] is switchmaster good btw marostegui ?
[09:35:57] not for x2
[09:37:51] sad
[09:38:18] I should publish it and then try it
[09:48:29] I will double check the backup config
[09:50:09] for what it's worth, db2141 is back and all looks fine
[09:50:46] yeah, but there must be some double issue: not only backups starting now, but also overloading the servers
[10:01:30] what I think happened is that after the reorganization, because codfw backup sources are set up differently than eqiad, I may have started 2 backups at the same time on the same hosts
[10:01:45] which is not supposed to happen
[10:02:02] leading to s1 and s2 backups at high speed on the same host, saturating the network
[10:02:04] will fix that
[10:02:16] *s1 and s6
[10:02:43] so it was good that backups accidentally ran now, because it errored early
[10:22:42] marostegui: shall I continue with the rest of the rotation work?
[10:22:51] yeah
[10:22:54] all good from my side
[10:24:54] awesome
[10:38:04] marostegui: If I'm missing something, let me know, especially since I haven't done wikiadmin before https://wikitech.wikimedia.org/wiki/MariaDB/Changing_user_passwords
[10:44:45] sure
[10:45:06] thank you!
[10:58:13] marostegui: due to a mistake of mine, I configured 2 simultaneous backups on the same host; with the new config, that should not happen again: https://gerrit.wikimedia.org/r/c/operations/puppet/+/883857
[11:37:45] marostegui: to confirm, the event update is only needed on s1-s8?
[11:37:58] query killer (part of the password rotation)
[11:40:37] and esX
[11:40:55] just check pc too, just in case
[11:41:32] OK, it seems that if you have an object that appears in a swift container listing (but does not in fact exist), then when you DELETE it, the server says 404 but does in fact successfully delete it. I suspect that's not strictly spec-compliant (a successful DELETE should return 200/202/204) [DELETE is meant to be idempotent, which might mean you should be able to issue DELETE for the same resource multiple times]
[12:00:14] okay
[12:00:54] * Emperor is reporting this as a bug to upstream
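A hedged sketch of how the DELETE behaviour above could be checked against the plain Swift API; the endpoint, token and object name are placeholders, not real values:

    TOKEN=...                                                  # auth token placeholder
    URL=https://swift.example.org/v1/AUTH_account/container    # placeholder account/container
    # the object still shows up in the container listing...
    curl -s -H "X-Auth-Token: $TOKEN" "$URL" | grep ghost-object
    # ...but deleting it answers 404 instead of the expected 204 No Content
    curl -s -o /dev/null -w '%{http_code}\n' -X DELETE -H "X-Auth-Token: $TOKEN" "$URL/ghost-object"
    # despite the 404, the listing entry is gone afterwards
    curl -s -H "X-Auth-Token: $TOKEN" "$URL" | grep ghost-object || echo gone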
[12:11:10] marostegui: before I run it, does this look okay? https://phabricator.wikimedia.org/P43406 I tested it with echo instead of actually calling the db and it looked okay
[12:12:51] sorry I'm asking a lot, I hope this automates it and makes it much easier
[12:44:37] Amir1: that should work I think yeah, I would try with a LIMIT 1 first, to see if it works fine
[13:04:38] sounds good, gonna do it
[13:47:33] I'm getting some "not allowed" errors for the new wikiuser account, but they are quite a small number and I checked: the grants and the user all exist https://logstash.wikimedia.org/goto/3f6a7faf4f266d839322b9b473ed128d
[13:47:43] Am I missing something?
[13:48:53] my guess is that "flush privileges" might cause a small window of not allowing connections, but I can't say for sure
[13:49:44] I can check in a bit, but have you tried a manual connection from that same host to see if it also fails?
[13:49:55] like from mw2312 to db2108 for instance
[13:50:40] I can try but the sql one is wikiadmin
[13:50:54] it's possible to do wikiuser, just more complicated
[13:50:55] ok, I can try in a bit
[13:51:01] thanks
[13:51:13] I do see wikiuser2023 connected to db2108
[13:51:17] so maybe a stalled connection?
[13:51:47] possibly
[13:52:02] it all started when I started dropping the old wikiuser
[13:52:36] but my strongest guess is that since the drop is like this
[13:52:37] 2023-01-26 13:52:25.033625 db-mysql db2151:3306 -e "set session sql_log_bin=0; DROP USER \`wikiuser202206\`@\`208.80.155.117\`; FLUSH PRIVILEGES;"
[13:54:42] I'll wait a bit and if it's gone, then it means it was a blip caused by FLUSH PRIVILEGES
[13:55:44] I don't see how that could cause a blip though
[13:56:47] I don't know either :/ it just correlates well time-wise
[13:57:04] the errors seem to be gone now?
[13:57:25] it started when I started the script to drop the old user and stopped when the script finished
[13:57:37] interesting
[13:59:00] https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-1-7.0.0-1-2023.01.26?id=Tada7oUBTudCKUGpSCyw
[13:59:05] look at the timestamp
[13:59:07] and now this
[13:59:15] 2023-01-26 13:52:51.645143 db-mysql db1168:3306 -e "set session sql_log_bin=0; DROP USER \`wikiuser202206\`@\`208.80.155.117\`; FLUSH PRIVILEGES;"
[13:59:22] matches up to the second
[14:02:15] it's 35 errors in total. Meh
[19:35:19] Amir1: I'm seeing a weird error on labtestwikitech. I just ran update.php so it shouldn't be a schema issue...
[19:35:21] https://www.irccloud.com/pastebin/vSlrzzAy/
[19:35:30] Any suggestions?
[19:37:31] (I also recreated the 2023 users on the db server but I think that was an unrelated issue)
[19:38:02] (well, created, not recreated)
[19:41:43] andrewbogott: see PM
[19:43:33] (for the record, that token isn't live due to the write having crashed)
[21:02:42] I don't think you needed FLUSH PRIVILEGES
[21:02:59] you were using DROP USER, not touching mysql.user directly
[21:03:44] Emperor: that DELETE behavior doesn't seem that wrong to me
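On the FLUSH PRIVILEGES point just above: CREATE USER, GRANT and DROP USER reload the grant tables by themselves, so a flush is only needed after editing the mysql.* tables directly. A hedged sketch of what the drop step could look like without it, reusing the command shape already quoted in this log:

    # account-management statements take effect immediately, no flush required
    db-mysql db2151:3306 -e "set session sql_log_bin=0; DROP USER \`wikiuser202206\`@\`208.80.155.117\`;"
    # FLUSH PRIVILEGES would only matter after direct edits such as:
    # db-mysql db2151:3306 -e "UPDATE mysql.user SET ... ; FLUSH PRIVILEGES;"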