[05:27:49] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[05:29:11] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[09:11:47] marostegui, Amir1: kill any running schema changes, please. dbctl is broken
[09:12:49] done
[09:13:47] argh. so i can kill Amir1's schema changes, but i can't know if it's safe to do so :/
[09:14:17] kormat: hi! mariadb104-test1.mariadb104-test.eqiad1.wikimedia.cloud is coming up on my list of last vms without an ldap config change applied. looks like it has had puppet disabled by you back in september, any chance it could be re-enabled?
[09:16:43] taavi: i quickly enabled puppet and tried to run it, but it fails. i'll get back to you on that, as there's currently a larger production issue going on
[09:17:00] thank you!
[09:19:05] marostegui: i'm just going to kill Amir1's changes. auto_schema ignores errors returned by dbctl, so i can't be sure it's going to do something reasonable.
[09:19:20] sounds good
[09:19:25] done
[10:04:46] Amir1: can you paste the list of those 72 hosts reported on the first line of https://phabricator.wikimedia.org/T298590?
[10:07:50] marostegui: might be visible on https://people.wikimedia.org/~ladsgroup/omg/
[10:08:46] ah yes, indeed
[10:08:50] just got the list
[10:08:53] thanks
[10:08:56] np :)
[10:47:11] kormat, marostegui: can I do a quick test for dbctl on cumin2002?
[10:47:26] like running dbctl instance es1029 depool (without committing it ofc)
[10:47:32] volans: sure
[10:47:35] you could, but,
[10:47:42] it's no longer the master, so that won't do anything. ;)
[10:48:04] volans: i have a whole bunch of processes running doing dbctl stuff at the moment
[10:48:16] so i'd prefer if you didn't do testing right now
[10:48:17] ok then, another time
[10:57:51] * kormat declares mwmaint1002 to be her mortal enemy
[11:45:08] good morning
[11:45:11] what a day
[11:45:33] kormat: thanks for the link and killing stuff :P
[11:45:54] marostegui: let me know if I can help on anything
[11:50:33] Amir1: Just a question, how often omg refresh its data?
[11:50:51] marostegui: If someone asks, I can run it
[11:50:56] Amir1: Please do :)
[11:51:09] on it!
[11:51:20] thanks
[11:51:45] ftr, if you want to do it as well, run omg.py in my home directory and move the omg.json there
[11:52:14] I want to put the source code somewhere but not sure if it should be gitlab, gerrit, github
[11:52:15] ugh
[11:52:39] put the omg.json where? XD
[11:53:09] omg.json in people1003 in that directory
[11:53:14] ah cool
[11:53:17] let me do that then
[11:55:02] cool. I go make coffee
[12:08:30] So transfer.py doesn't work from cumin1001 to people1003? I guess FW in between?
[12:08:34] I will use netcat XD
[12:10:02] marostegui: I use scp/sftp
[12:10:07] whatever works :P
[12:10:54] and the path is /home/ladsgroup/public_html/omg/omg.json
[12:11:05] yeah, it is done now :)
[12:11:18] transfer.py worked in the end, but it took like 2 minutes
[12:11:23] I can puppetize it in a systemd timer in cumin and sync it but that's for later :D
[12:12:05] yeah, not worth now
[12:47:21] marostegui: for when you have time, now the dumpers should have been started. Do you want to check if T138208 is resolved?
[12:47:22] T138208: Connections to all db servers for wikidata as wikiadmin from snapshot, terbium - https://phabricator.wikimedia.org/T138208
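
On the manual omg refresh discussed above (around 11:51-12:12): a minimal sketch of scripting the regenerate-and-publish step with scp, assuming omg.py lives in the home directory mentioned in the log and writes omg.json next to itself. The wrapper, the omg.py/omg.json paths and the people1003 FQDN are illustrative assumptions, not the real tooling (which, per the log, may later become a puppetized systemd timer on the cumin host).

```python
#!/usr/bin/env python3
"""Sketch of the manual omg.json refresh: regenerate the report, then copy it
to the public_html path quoted in the log. Paths and host FQDN are assumed."""
import subprocess

OMG_SCRIPT = "/home/ladsgroup/omg.py"   # assumed location ("my home directory")
OMG_JSON = "/home/ladsgroup/omg.json"   # assumed output file next to the script
DEST = "people1003.eqiad.wmnet:/home/ladsgroup/public_html/omg/omg.json"  # FQDN assumed


def refresh_and_publish() -> None:
    # Regenerate the report, then push it with scp (the approach Amir1
    # mentions); transfer.py also worked here, it just took a couple of minutes.
    subprocess.run(["python3", OMG_SCRIPT], check=True)
    subprocess.run(["scp", OMG_JSON, DEST], check=True)


if __name__ == "__main__":
    refresh_and_publish()
```
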
[12:47:32] Amir1: i am currently blocked on proceeding by maintenance scripts running against: s6, s7, es2, es3
[12:47:54] kormat: okay, let's fix that
[12:48:07] I have nothing to do against es AFAIK
[12:48:23] Amir1: sure
[12:48:43] Amir1: ah. at least some of the es stuff is from snapshot1008
[12:48:59] I restarted my s6 one
[12:49:18] kormat: we don't have dump group on es :/
[12:49:36] so dumps just run against all replicas? :/
[12:49:40] Amir1: for now I only see them connected to s8's vslow/dump host, so that's good
[12:49:43] But let's give it some more time
[12:51:08] Amir1: extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php is running against db1168
[12:51:16] s7 is also restarted
[12:51:19] Amir1: I just saw snapshot1011.eqiad.wmnet. connecting to db1177 which is not vslow
[12:51:31] marostegui: sigh
[12:51:45] but 1011 is misc dumps
[12:51:49] Amir1: I haven't caught it doing a select, but with two connections open
[12:51:50] iirc
[12:52:39] marostegui: i have been doing reboots of sanitarium masters, which might be related
[12:54:15] It is definitely creating new connections to db1177
[12:54:20] but not sure what they are doing
[12:54:27] kormat: wrt refreshLinkRecommendations, I have been asking them to reduce the time of their scripts, so it should be done in less than an hour
[12:54:48] But snapshot1011 is running a dump against wikidata
[12:55:12] actually 1011 is not misc, 1008 is misc
[12:55:15] Amir1: so i should depool, and wait an hour, and then come back?
[12:55:24] kormat: yeah :(
[12:55:29] ffs
[12:55:57] Amir1: As I said, I haven't been able to catch what the connection is actually doing but it is connecting to db1177, I don't see dump SELECTs like I do see on the vslow, so that's a win
[12:56:07] kormat: I have some ideas how to handle this for good T305016
[12:56:08] T305016: Think about rdbms reconnection logic - https://phabricator.wikimedia.org/T305016
[12:56:40] Amir1: i'd kill for the ability to say "ok, all maintenance scripts are idempotent, and we can just kill the connection anytime we need to"
[12:56:57] marostegui: Yeah, less work
[12:57:10] kormat: so the plan is to make maint script avoid reconnecting
[12:58:12] currently it has reconnect logic, we are removing that if perf team gets on doing it. I might pick it up and do it with them
[12:58:30] Amir1: what's the impact for us?
[12:58:57] so auto schema or your depool bash can simply kill all connections after let's say five/ten minutes
[12:59:08] ahhh. i love it.
[12:59:11] and mw takes notice and won't connect to it again
[13:47:23] kormat: can I restart my schema changes or dbctl is still broken?
[13:57:23] Amir1: go for it
[13:58:11] thanks
[14:11:11] Amir1: re: T303927#7868116, i've been rebooting all candidate primaries
[14:11:12] T303927: Switchover s8 master (db1109 -> db1104) - https://phabricator.wikimedia.org/T303927
[14:12:08] Awesome
[14:12:23] I just wanted to double check, if it's missing, we failover tomorrow and then oops
[14:12:35] 👍
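
On the T305016 idea discussed above (around 12:56-12:59: once the rdbms reconnect logic is removed, a depool script can simply kill leftover MediaWiki/maintenance connections after a grace period): a hedged sketch of what such a cleanup step could look like. This is not the actual auto_schema or dbctl code; the account names, grace period, credentials handling and the target host are assumptions.

```python
#!/usr/bin/env python3
"""Sketch: kill lingering MediaWiki connections on a freshly depooled replica
after a grace period, relying on the planned removal of reconnect logic so
they do not come back (T305016). All specifics here are assumed."""
import os
import pymysql

GRACE_SECONDS = 600  # "five/ten minutes" from the discussion; ten chosen here
MW_USERS = ("wikiuser", "wikiadmin")  # assumed MediaWiki / maintenance accounts


def kill_stale_mw_connections(host: str) -> None:
    # Assumes ~/.my.cnf provides an account with enough privileges to see and
    # kill other users' threads.
    conn = pymysql.connect(
        host=host, read_default_file=os.path.expanduser("~/.my.cnf")
    )
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT ID, USER, TIME FROM information_schema.PROCESSLIST "
                "WHERE USER IN (%s, %s) AND TIME > %s",
                (*MW_USERS, GRACE_SECONDS),
            )
            for thread_id, user, age in cur.fetchall():
                try:
                    # Once the connection is gone, MediaWiki should notice the
                    # depool and not reconnect (the T305016 plan).
                    cur.execute("KILL %s", (thread_id,))
                except pymysql.MySQLError:
                    pass  # the thread may already have exited
    finally:
        conn.close()


if __name__ == "__main__":
    kill_stale_mw_connections("db1177.eqiad.wmnet")  # host name taken from the log
```
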
[14:37:26] Emperor: do you have an rclone config for s3 or swift handy I could use to start the copy at https://phabricator.wikimedia.org/T306424#7867679 ?
[14:40:48] I have a swift one (for ms, details will be similar for thanos) in ~mvernon/.config/rclone/rclone.conf on ms-fe1012; although you might want to go via S3 as that what the user is using?
[14:42:20] yeah I'll S3
[14:42:23] ok thanks!
[14:42:24] I have an S3 rclone config from $JOB[-1] I could file the serial numbers off if you like, but the docs are pretty good - https://rclone.org/s3/
[14:46:15] *nod* seems straightforward enough
[16:27:24] thanks for doing the transfer godog!
[16:27:55] hnowlan: sure np! might take a while tho :
[16:27:57] :|
[16:30:14] not really a way around it afaics though, regenerating the tiles would be more expensive
[16:30:32] yeah, makes sense
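
On the rclone/S3 copy for T306424 discussed above: a hedged sketch of starting such a transfer, assuming credentials and endpoint are injected via rclone's documented RCLONE_CONFIG_<REMOTE>_* environment overrides rather than an rclone.conf. The remote name, bucket, endpoint and destination path are placeholders, not the task's real values.

```python
#!/usr/bin/env python3
"""Sketch: kick off an rclone copy from an S3-compatible source, defining the
remote on the fly through environment variables. All names are placeholders."""
import os
import subprocess

env = dict(os.environ)
env.update(
    {
        # Define an ad-hoc S3 remote called "src" without touching rclone.conf.
        "RCLONE_CONFIG_SRC_TYPE": "s3",
        "RCLONE_CONFIG_SRC_PROVIDER": "Other",
        "RCLONE_CONFIG_SRC_ACCESS_KEY_ID": os.environ.get("S3_ACCESS_KEY", ""),
        "RCLONE_CONFIG_SRC_SECRET_ACCESS_KEY": os.environ.get("S3_SECRET_KEY", ""),
        "RCLONE_CONFIG_SRC_ENDPOINT": "https://s3.example.org",  # placeholder endpoint
    }
)

# A long-running transfer, as the end of the log notes ("might take a while").
subprocess.run(
    ["rclone", "copy", "--progress", "src:tiles-bucket", "/srv/tiles-copy"],
    env=env,
    check=True,
)
```
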