[09:25:44] marostegui: fyi, i've claimed T303174, and am doing an audit of the hosts listed, as there's definitely some that are wrongly classified
[09:39:28] kormat: did you reboot es2022 yesterday?
[09:39:42] jynus: yes
[09:39:59] is it special?
[09:40:08] do you have a script or did you do it manually?
[09:40:14] i have a script
[09:40:43] could the script have something like- if a dump is ongoing- pause and ask or something?
[09:40:58] ohh, that's a backup source?
[09:41:05] yep
[09:41:39] currently the script waits for wikiuser/wikiadmin conns to go away
[09:41:44] i guess i should add 'dumps' to that, too?
[09:42:06] the thing is, it is ok to kill it- as long as we get notified
[09:42:16] dumps take 22 hours to run
[09:42:52] because they are on purpose very slow, to not affect production traffic
[09:43:22] ok. is the relevant user 'dumps'?
[09:43:27] dump without s
[09:43:41] ok, thanks. i'll make a change.
[09:44:31] ideally, we coordinate- so I can either delay the backup or you delay the reboot
[09:45:11] oh.. crap. we _renamed_ the 'wikiuser' account? 😬
[09:45:30] he he
[09:45:36] Amir1: ^?
[09:45:51] kormat: yup, we renamed it
[09:45:54] what's up
[09:45:55] well shit
[09:46:04] Amir1: the query killer isn't updated, at least in git
[09:46:10] so it's still looking at the old/wrong user
[09:46:15] haha nice
[09:46:26] so there you have your answer to reuven
[09:46:30] that would explain why it didn't catch an issue that caused a page last week
[09:46:33] technically the query killer is not needed as we have that inside mw these days
[09:46:39] but that's beside the point
[09:47:49] modules/profile/files/mariadb/db_kill.py is also now wrong
[09:48:02] kormat: let me make a patch, at least now it's hierazed
[09:50:12] please update the incident doc, too, if you can- this is quite valuable information
[09:50:59] or the ticket, or something, cannot remember where that was handled
[09:53:03] Amir1: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CommonSettings.php#L116 also looks a bit concerning
[09:53:42] kormat: that gets overwritten in private
[09:53:58] otherwise, we would have a full outage
[09:54:03] 🤦‍♀️ i see
[09:54:12] Amir1: should it be updated or on purpose kept secret?
[09:54:30] jynus: I don't care either way
[09:54:35] :-)
[09:54:41] better documentation maybe
[09:54:49] https://gerrit.wikimedia.org/r/c/operations/puppet/+/777754
[09:55:04] kormat: where is the query killer?
[09:55:16] on cumin hosts
[09:55:31] Amir1: software/dbtools/events_coredb_master.sql
[09:55:35] and same with s/master/slave/
[09:55:50] i'm not sure it has ever been executed live, to be fair
[09:56:30] it was built for non-dbas at their request, but I am not sure they used it :-(
[09:57:13] hmm, I have seen that script in switchovers
[10:00:17] Amir1: was wikiadmin also renamed?
[10:00:25] not yet
[10:00:30] planning to at some point
[10:00:41] good to know! 😅
[10:01:16] tbh, that procedure went way smoother than I expected even considering this snafu
[10:01:51] also we are way healthier than we used to be if it took weeks for an outage
[10:02:24] in the past, things would have gone way worse without it (e.g. when someone forgot to run the script on a host)
[10:02:58] jynus: that's because most of the query killing is now happening inside mw for most special pages
[10:03:10] yeah, and I appreciate that <3
[10:03:29] but I am guessing the work there will still take time to cover everything, right?
[10:04:47] I don't know. We will add more places for sure but I can't make it cover everything
[10:04:57] as some queries are by nature slow, e.g. maint scripts
[10:05:07] or the ones updating special page stats, etc.
[10:05:40] sure
[10:05:58] I think we will keep the query killer as another layer of defense, but not in this shape. It needs to be much smarter
[10:06:04] yeah
[10:06:47] jynus: i'm looking at https://wikitech.wikimedia.org/wiki/Incident_documentation/2022-03-27_wdqs_outage, which i think is the relevant incident. but i'm not sure where in that page this issue would belong
[10:06:48] one thing the mw layer doesn't have that the db query killer does is a more aggressive mode under stress (even if done badly)
[10:07:03] kormat: I don't believe that is the one
[10:07:14] yup
[10:07:15] sorry, https://wikitech.wikimedia.org/wiki/Incident_documentation/2022-03-31_api_errors
[10:07:25] again, I think it should use a PID controller
[10:07:51] and setting weights should also use PID controllers, on it
[10:08:06] kormat: literally just add a comment anywhere (e.g. actionables) to credit your work on this
[10:08:18] e.g. linking to the patch
[10:08:35] "X was discovered. Fixed in this patch"
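
(An illustrative aside on the "needs to be much smarter" and PID-controller remarks above: a discrete PID loop is small enough to sketch. The snippet below is not code from the query killer; the choice of input (active connections), the output (a kill threshold in seconds), the gains, and the class name are all assumptions made for the example.)

    class PIDController:
        """Minimal discrete PID controller; illustrative only."""
        def __init__(self, kp, ki, kd, setpoint):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.setpoint = setpoint
            self.integral = 0.0
            self.prev_error = None

        def update(self, measurement, dt):
            """Return a correction given the latest measurement and the time step dt."""
            error = self.setpoint - measurement
            self.integral += error * dt
            derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
            self.prev_error = error
            return self.kp * error + self.ki * self.integral + self.kd * derivative

    # Hypothetical wiring: aim for ~200 active connections by nudging the kill
    # threshold (queries older than N seconds get killed) up or down each cycle.
    pid = PIDController(kp=0.05, ki=0.005, kd=0.0, setpoint=200)
    kill_after = 60.0  # seconds; starting point is arbitrary
    for active in (180, 250, 400, 320, 210):  # fake load samples
        kill_after = max(5.0, min(300.0, kill_after + pid.update(active, dt=10)))
        print("{} active connections -> kill queries older than {:.0f}s".format(active, kill_after))

(When the server is over the setpoint the correction is negative, so the threshold drops and the killer gets more aggressive; the clamp keeps it in a sane range.)
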
[10:11:05] I just want to use this opportunity to emphasize how much I hate sql triggers/events/procedures or anything that should be a program instead
[10:11:16] :D
[10:11:23] yeah
[10:11:48] let me tell you the reasoning I was given- which makes sense to some extent
[10:12:27] "when connections get saturated, the query killer cannot log in either"
[10:12:53] which doesn't make a lot of sense- in fact the query killer I did uses the extra port for that
[10:13:18] plus I believe an interpreted procedure will be much slower than an external app
[10:14:02] as sql on mysql is purely interpreted
[10:14:26] terrible performance, no debugging tools :-(
[10:15:10] https://gerrit.wikimedia.org/r/c/operations/software/+/777760
[10:19:35] so I will relaunch the backup process and hopefully in the afternoon I can actually do some reboots of mine :-)
[10:20:55] jynus: i have a fix out to make sure i don't screw up dumps again :) https://gerrit.wikimedia.org/r/c/operations/software/+/777756
[10:21:54] the root check will catch snapshots, I think- although that process is mostly file-based
[10:22:08] root hasn't been an issue so far, IME
[10:22:58] my only question is if it is ok to wait for 22 hours to restart an es* host?
[10:23:50] I don't need you to wait, a ping would be ok, as I can wait for the reboot and restart the backup afterwards (that is why I called for coordination- I don't want to take preference)
[10:24:21] (alternatively, you give me permission to make backups faster 0:-))
[10:25:23] basically, I am open to alternative solutions, just let me know!
[10:26:10] also, dumps run with the wikiuser2022 user, and backups with the dump user, that is in no way confusing!
[10:27:21] oh lol
[10:27:42] jynus: for the reboots i'm doing, leaving it for another day is totally fine.
[10:27:56] cool with me
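
(Another illustrative aside, tying back to the reboot script that waits for wikiuser/wikiadmin connections to go away and the fix above that adds the dump user: the drain check could look roughly like the sketch below. This is not the actual script from operations/software; the account names, credentials file, port, and polling interval are assumptions.)

    import time
    import pymysql

    # Accounts whose connections the reboot script waits on, per the discussion
    # above; 'dump' is the backup user and 'wikiuser2022' the renamed MW user.
    # The exact list, host and credentials here are assumptions for the sketch.
    WATCHED_USERS = ("wikiuser2022", "wikiadmin", "dump")

    def active_connections(host, port=3306):
        """Return {user: connection count} for the watched accounts on one host."""
        conn = pymysql.connect(host=host, port=port,
                               read_default_file="/root/.my.cnf")  # assumed auth setup
        try:
            with conn.cursor() as cur:
                placeholders = ", ".join(["%s"] * len(WATCHED_USERS))
                cur.execute(
                    "SELECT user, COUNT(*) FROM information_schema.processlist "
                    "WHERE user IN ({}) GROUP BY user".format(placeholders),
                    WATCHED_USERS,
                )
                return dict(cur.fetchall())
        finally:
            conn.close()

    def wait_until_drained(host, poll_seconds=60):
        """Block until none of the watched accounts hold a connection on the host."""
        while True:
            counts = active_connections(host)
            if not counts:
                return
            print("{}: still busy ({}), sleeping {}s".format(host, counts, poll_seconds))
            time.sleep(poll_seconds)
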
[10:45:53] root@db1110:/srv/sqldata/cebwiki# ls -Ssh | head
[10:45:53] total 339G
[10:45:53] 281G templatelinks.ibd
[10:45:53] 19G externallinks.ibd
[10:45:53] 9.2G pagelinks.ibd
[10:45:53] 6.7G categorylinks.ibd
[10:46:22] that determines which wiki will get the normalization patches first
[10:48:49] testwiki to make sure nothing breaks? XD
[10:50:35] I run it on beta first, but it's interesting to see this wiki has basically 80% of it just templatelinks, if we clean that up, it'll be much much much smaller
[10:59:53] kormat: cool, I will get the hosts at T305469 ready for 11th April
[10:59:53] T305469: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469
[11:00:02] sobanski: ^ does that work for you? Just saw your reply on the task
[11:00:09] I can have them ready by that date, no problem
[11:19:01] Sure thing, I was just checking how flexible Papaul is with this request
[11:19:28] I just realised the times are in central time
[11:19:37] That's confusing, anyways, I can work to get them ready
[11:26:14] kormat: added you as secondary for https://phabricator.wikimedia.org/T304933
[12:09:07] marostegui: your kindness is noted, and will be repaid in full
[12:10:41] Amir1: are you taking care of deploying the query killer fix?
[12:11:57] not yet. I'll try to do it soon
[12:12:05] (lunch first)
[12:12:18] ok cool. i'd say it's a fairly high priority, but _obviously_ not as high as lunch. :)
[12:12:50] marostegui: again, sorry- the templatelinks thingy might keep connections open for depool, ping me if you want me to reload it
[12:13:12] Amir1: I am not altering templatelinks
[12:13:29] Ah, you mean the scripts
[12:13:32] yeah
[12:13:33] Will do yes
[12:13:37] Are you done with s1 btw?
[12:13:56] it'll take days but it's idempotent so let me just kill it
[12:14:07] No no
[12:14:11] I mean with the schema change
[12:14:19] I know, the user fixes
[12:14:33] No rush, no worries
[12:14:37] I just need to alter a couple of hosts
[12:15:09] killed it already :P
[12:15:23] XD
[12:15:28] ok, let me deploy there then
[13:37:17] marostegui: i've just noticed that you're doing a schema change on s2/s3. i'm currently rebooting db hosts in codfw. i'm not touching the primaries yet, which means i _think_ it won't affect your stuff
[13:37:42] primaries as in masters?
[13:38:05] yes
[13:38:24] Yeah, codfw ones are done via them so it shouldn't be an issue for now
[13:38:57] grand.
[13:46:46] kormat: marostegui: to double check, in order to deploy the new query killer, I just need to run the script we run in switchovers, correct?
[13:47:07] curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | db-mysql {}".format(
[13:47:22] on replicas and primaries
[13:48:59] Amir1: yes, for slaves the one with slave and for masters the one with master
[13:49:11] yeah
[13:49:40] marostegui: stupid q. by master it's only eqiad/codfw masters, not sanitarium masters?
[13:49:57] yep
[13:50:24] Thanks!
[14:11:53] running everywhere
[14:12:06] eqiad done,
[14:14:14] codfw masters done
[14:14:24] now I need to make sure codfw replicas get it as well
[16:37:18] PROBLEM - MariaDB sustained replica lag on s5 on db2094 is CRITICAL: 35.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2094&var-port=13315
[16:37:53] ^maintenance?
[16:38:30] looks like it, probably something on the codfw primary
[16:38:56] RECOVERY - MariaDB sustained replica lag on s5 on db2094 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2094&var-port=13315
[16:39:09] nah, it recovered
[16:39:44] sorry, yeah, that was me.
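
(A final illustrative aside on the deployment step quoted at 13:47: wrapped in Python, the loop over hosts could look roughly like the sketch below. Only the gitiles URL pattern, the curl | base64 -d | db-mysql pipeline, and the master/slave file split come from the log; the function name and the placeholder host lists are assumptions, not the real switchover tooling.)

    import subprocess

    GITILES = ("https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/"
               "+/refs/heads/master/dbtools/{}?format=TEXT")

    def apply_query_killer(host, is_master):
        """Fetch the master or slave event file from gitiles and pipe it into db-mysql."""
        sql_file = "events_coredb_master.sql" if is_master else "events_coredb_slave.sql"
        cmd = "curl -sS '{}' | base64 -d | db-mysql {}".format(GITILES.format(sql_file), host)
        subprocess.run(cmd, shell=True, check=True)

    # Placeholder host names; in practice the lists would come from tooling, and
    # per the 13:49 exchange only eqiad/codfw section masters get the master file
    # (sanitarium masters do not).
    for master in ["dbXXXX-master-placeholder"]:
        apply_query_killer(master, is_master=True)
    for replica in ["dbYYYY-replica-placeholder"]:
        apply_query_killer(replica, is_master=False)
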