[07:34:12] I have restored pc1012 back as pc2 master
[07:36:03] Thanks! And also thanks for the help with the switchover of s4
[07:45:29] Amir1: on db2094:3318: Last_SQL_Error: Column 2 of table 'wikidatawiki.templatelinks' cannot be converted from type 'bigint' to type 'varbinary(255)'
[07:46:01] ugh
[07:46:04] let me fix it
[07:46:19] I haven't checked what is going on, as I am busy with some other stuff
[07:47:04] marostegui: don't worry, I got this one
[07:47:16] let me do some work from time to time, you're making me lazy
[07:49:45] replication flowing now
[07:50:23] this is a combination of a couple of issues, mostly caused by changes to auto schema to make it work with multidc; I'll make a patch for each one
[07:51:39] good, thanks!
[07:56:10] Amir1: are you running the schema change for https://phabricator.wikimedia.org/T312160 ?
[07:58:22] marostegui: si
[07:58:43] cool, remember that it cannot be done on the master
[07:58:47] it needs a switchover
[07:59:01] yup, I leave that part to you mwhaha
[07:59:21] note that we now need to do switchovers on codfw and eqiad
[08:00:30] what do you mean?
[08:00:39] x1 codfw master isn't used
[08:00:42] it can be run there
[08:02:10] I think mediawiki.org is already multidc
[08:02:28] and if you hit mw.org, you hit x1 (echo, etc.)
[08:03:09] if mw.org is not multidc, it'll be soon, so we should hurry :D
[08:03:19] oh, I thought only x2 would be affected by that
[08:03:20] :(
[08:03:56] marostegui: sorry to interrupt the conversation: m1 has no blocker, so I can proceed to shut down bacula, right?
[08:04:09] nope, if you hit a page in basically any wiki, you get db reads on the section of that wiki + s8 + s4 + s7 + x1 + x2 + pcs
[08:04:14] jynus: yeah, anytime you want, the switchover is scheduled for .30 though
[08:04:22] Amir1: :)
[08:04:23] ok
[08:04:28] that is good to know
[08:06:18] it is ok; as it has a lot of different alerts, I can start downtiming early
[08:06:48] btw, I'm almost done with migrating the replication lag alert to alertmanager (off icinga), heads up
[08:07:08] probably will be done today
[08:07:12] see _security
[08:07:14] jesus christ
[08:07:21] we were almost done :(
[08:10:04] at least I have an excuse to automate this
[08:10:11] *now I have
[08:17:40] I don't remember exactly what the procedure was for the non-backup services of m1: did etherpad require a restart, or was killing old connections enough?
[08:17:56] It depends, sometimes it does
[08:18:01] Other times it doesn't
[08:19:34] ah, it is actually nicely documented: "Normally systemd takes care of it and restarts it instantly. However if the maintenance window takes long enough, systemd will back out and stop trying to restart, in which case a systemctl restart etherpad-lite will be required."
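
A minimal sketch of that documented recovery step, assuming shell access to the host running the etherpad-lite unit (the unit name comes from the quote above; checking is-active is just one way to tell whether systemd gave up):

    # After a long maintenance window, systemd may have backed off and stopped
    # retrying; restart the unit manually only if it is no longer active.
    if ! systemctl is-active --quiet etherpad-lite; then
        sudo systemctl restart etherpad-lite
    fi
    systemctl status etherpad-lite --no-pager   # verify the service is running again
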
[08:29:49] I am going to start the m1 failover
[08:31:41] etherpad seems to be working without a restart
[08:31:45] same
[08:32:13] will wait for the other services to confirm they are fine before restarting the backup ones (I will restart them too)
[08:33:05] everything seems fine
[08:33:08] librenms, rt..
[08:33:36] Can I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/826223 ?
[08:33:41] I was about to ask
[08:33:45] +1 from me
[08:34:16] merged
[08:34:39] will run the failed x1 backup when I am done with bacula, for testing purposes
[08:53:01] I believe m1 performance could have been slightly affected by the reboot; nothing serious or a blocker, my guess is it will take a few minutes to get back to the previous level
[08:53:18] yeah, probably
[08:54:16] as some bacula monitoring process ran slower than usual
[08:56:33] Can I get a quick IP review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/826506
[08:56:34] ?
[08:59:56] thanks!
[09:01:37] so I have a few pending things regarding m1: re-enable and force-run "Content database backups remote long term backups" (Es-rw) and rerun the x1 codfw snapshots
[09:02:44] anything I can do to help?
[09:03:15] not really, just confirming there is no further maintenance expected on m1?
[09:03:35] as es backups take many hours to complete
[09:03:51] nope
[09:03:52] nothing
[09:05:15] jynus: regarding m2 in eqiad, can I stop db1117:3322 to clone another host?
[09:05:23] one last question, for any of you: I believe you mentioned ongoing x1 maintenance on codfw?
[09:05:41] marostegui: yeah, I only intend to work on x1 today
[09:05:53] jynus: I am not doing anything on x1 today or tomorrow, no
[09:06:31] I am asking because those are not in a hurry (unlike es), and maintenance would explain why they failed a couple of times
[09:06:53] so I can delay x1 if there are ongoing alters or heavy writing, no issue
[09:07:01] there are big alters going on
[09:07:03] but the hosts won't be down
[09:07:08] ah, I see
[09:07:32] xtrabackup is a bit sensitive to that kind of load, even if in theory it should work
[09:07:40] then yeah, it might take a while
[09:07:44] but Amir1 is the one doing that one
[09:07:48] so maybe he can estimate better
[09:07:59] which is a good thing, because it means there is no programming issue elsewhere
[09:08:10] * Amir1 reads up
[09:09:00] yeah, it's 93GB, I think it'll take around ten hours for each replica
[09:09:12] so is that done serially?
[09:09:29] in that case, another thing I can do is switch to another replica
[09:09:45] that has either already completed or not been touched yet
[09:10:16] or just wait; the eqiad backup is ok, and it is only a couple of days old, not a big deal
[09:12:10] yeah, all schema changes are done serially
[09:12:37] I think it will be easier to just wait; just ping me, if you can, when it finishes on db2101 (the backup source)
[09:14:18] schema changes > backups being perfectly up to date at all times
[09:15:02] and if I am complaining because a redundant backup is "1 day, 6 hours old", we are in a great place :-D
[09:47:03] I have rebooted all standby proxies + codfw proxies
[12:02:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (db1194:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[12:02:49] ^ me
[12:07:16] (PrometheusMysqldExporterFailed) resolved: Prometheus-mysqld-exporter failed (db1194:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[12:19:25] jynus: The script I'm writing for reboots doesn't really distinguish between a backup source and another replica. I wonder how I can make sure it doesn't try to reboot a backup source while it's actually working. Can you give me an idea on this?
[12:20:04] do you want to make the distinction, or prevent it from restarting while a dump is ongoing?
[12:20:20] yup
[12:20:24] the latter
[12:20:38] if the latter, I already told Stevie Beth how to monitor for ongoing backups: a dump user is connected
[12:21:07] ah, thanks
[12:21:20] but I would like to handle backups at my own pace, to be fair; I would use your script, but I would prefer to control when to restart them myself
[12:22:11] let me think, because that would work for es hosts, but I'm not sure about snapshots
[12:22:30] as I believe a root user is used for xtrabackup
[12:22:51] (as it doesn't use mysql except for checking replication status; the copy is of the filesystem)
[12:23:11] for that, you will probably want to check for a lock file set by transfer.py
[12:25:29] as for how to distinguish them, either with the puppet role (mariadb::backup_source) or zarcillo
[12:27:11] is kormat ok? they haven't been active (anywhere I can see) in a while.
[12:28:08] pmed you
[12:28:53] yes, please let's not discuss people's private lives in publicly logged channels
[12:35:28] Amir1: so there will be a specific lock dir under /tmp for snapshots
[12:35:42] I am trying to remember the name
[12:35:56] it exists only during the transfer
[12:36:01] jynus: is that for ES or core? For core I assume the dump user would be enough
[12:36:05] yeah
[12:36:07] I'm not planning to run it on ES
[12:36:20] so the dump user is for dumps
[12:36:36] and the /tmp subdir is for snapshots, only on backup sources
[12:37:24] the first check should probably also be done for other users (webrequest user, admin user), so it shouldn't be a huge overhead
[12:37:50] I'm trying to find you the docs or source code for the second one
[12:38:08] if there are any backups running, I can go and poke around :D
[12:38:28] mmm, let me see if x1 finished
[12:38:52] I gave it a try expecting it to fail
[12:40:15] actually it finished, and correctly
[12:40:23] do you want me to run one somewhere?
[12:41:40] nah, let's wait
[12:41:42] this is not urgent
[12:41:50] it is ok, I am curious myself
[12:41:50] just ping me if something starts
[12:41:56] it is just one command
[12:42:00] haha sure
[12:42:04] actually I want to show you how to run it
[12:42:09] go to cumin2002
[12:42:22] start a screen session (or tmux)
[12:42:43] and run "remote-backup-mariadb s5"
[12:43:14] (it should be super easy to start a backup!)
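
A sketch of that sequence, for reference (s5 is the example section used in this conversation, the screen session name is arbitrary, and the usual cumin access is assumed):

    ssh cumin2002                # the backup orchestration host mentioned above
    screen -S backup-s5          # any session name; tmux works just as well
    remote-backup-mariadb s5     # snapshots s5 from its backup source (db2101.codfw.wmnet here)
    # detach with ctrl-a d; a transfer can take hours, so don't keep the terminal hostage
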
[12:43:34] sure
[12:44:01] started
[12:45:04] hmm, I don't see replication stopped on the backup sources of s5
[12:45:05] that should be running on db2101.codfw.wmnet
[12:45:39] oh, it depends on the configuration; I only stop it based on load, see the config at /etc/wmfbackups/remote_backups.cnf
[12:45:48] I see
[12:45:53] I think I stop it on s1, s8 and s3
[12:46:11] show processlist doesn't show the dump user from what I'm seeing
[12:46:14] because otherwise it writes too much and fails more often (plus it makes it faster)
[12:46:26] yeah, that is because it is a snapshot; it copies from the filesystem
[12:47:10] but there is /tmp/trnsfr_dbprov2003.codfw.wmnet_4400 on the source
[12:47:58] if a /tmp/trnsfr_* dir exists, it is transferring (that way I lock against 2 processes running at the same time on the same port)
[12:48:24] the rest of the name is where it is being sent and through which port, but you don't care about that :-D
[12:48:52] 4400 is the default, but it will use 4401, 4402, etc. if the others are busy
[12:49:17] so checking for the existence of /tmp/trnsfr_* should be enough
[12:49:36] and for dumps, the dump user
[12:51:57] and note a transfer or dump can take multiple hours; es dumps now take 27+ hours due to their low concurrency
[12:53:40] PROBLEM - MariaDB sustained replica lag on s6 on db2114 is CRITICAL: 542 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2114&var-port=9104
[13:05:21] That was me, it should be back now
[13:08:44] yeah, if I see it, I just skip it
[13:12:05] RECOVERY - MariaDB sustained replica lag on s6 on db2114 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2114&var-port=9104
[13:17:41] jynus: SELECT * FROM information_schema.processlist WHERE User like '%dump%' - does this look good to you?
[13:18:03] we can simulate this too, let me see
[13:18:29] note we prefer sys.processlist as it is non-blocking
[13:18:51] but it should be ok if it's not used during an outage
[13:19:45] but yeah, that should be ok --> if it returns rows, wait/abort
[13:20:14] try to make the user configurable, you may know why :-D
[13:24:06] another potential way to detect snapshots would be "pgrep xtrabackup"
[16:00:47] I think I can finally run the delayed es long-term backups
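
Pulling those hints together, one way the reboot script's "is a backup touching this host?" check could look; this is only a sketch: the dump user pattern is a placeholder that should be made configurable as suggested above, sys.processlist is used because it is non-blocking, and the mysql client is assumed to pick up its credentials from the default option files.

    #!/bin/bash
    # Exit 1 (busy) if any backup activity is detected on this host, 0 otherwise.
    DUMP_USER="${DUMP_USER:-dump}"   # placeholder pattern; make this configurable

    # 1) dumps: a connection from the dump user shows up in the processlist
    rows=$(mysql --batch --skip-column-names \
        -e "SELECT conn_id FROM sys.processlist WHERE user LIKE '%${DUMP_USER}%'")
    if [ -n "$rows" ]; then
        echo "dump in progress"; exit 1
    fi

    # 2) snapshots: transfer.py keeps a lock dir like /tmp/trnsfr_<destination>_<port>
    #    on the backup source for the duration of the transfer
    if compgen -G '/tmp/trnsfr_*' > /dev/null; then
        echo "snapshot transfer in progress"; exit 1
    fi

    # 3) belt and braces: an xtrabackup process is still running locally
    if pgrep -x xtrabackup > /dev/null; then
        echo "xtrabackup running"; exit 1
    fi

    exit 0

The reboot script could run this per host and skip or postpone the reboot while it exits non-zero, which also covers backup sources in the middle of a snapshot transfer.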