[05:36:48] filed T312829 for the jitter
[05:36:49] T312829: Add jitter to BagOStuff TTLs - https://phabricator.wikimedia.org/T312829
[05:53:23] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2142&var-port=9104&refresh=1m&viewPanel=6&from=now-2d&to=now (courtesy of jynus pointing this one out in -sre).
[05:53:28] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=x2&var-role=All&from=now-2d&to=now
[06:01:36] Note to self - check the binlogs to see what the write traffic is made up of, and talk with the DBAs to determine whether this is a "problem" or not.
[06:01:56] If we need to cut down write traffic, we could probably get rid of the changeTTL complexity: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/813120
[06:02:20] Also, I filed "Avoid x2-mainstash replica connections (ChronologyProtector)" https://phabricator.wikimedia.org/T312809
[06:02:24] .. while investigating the above
[06:02:33] cc TimStarling fyi :)
[06:02:49] Tomorrow I'll get on the session debugging I was going to do last week.
[06:02:56] signing off for tonight now
[17:32:08] Krinkle: https://phabricator.wikimedia.org/T299417#8073150 ^^
[17:38:49] Amir1: nice nice
[17:42:10] I don't have ssh on those hosts, so yeah, might need help there
[17:42:15] unless there's a way for wikiadmin to read them over SQL
[18:09:54] binlog from db2142 (x2 codfw), grouped by SqlBagOStuff method: https://phabricator.wikimedia.org/P31020
[18:29:23] Amir1: do we have different replication modes intra vs inter-dc? e.g. they're both statement-based, right?
[18:29:30] (for x2)
[18:30:06] let me check
[18:30:17] both are row-based
[18:33:48] and the statements are there for debugging only?
[18:34:23] for mainstash we need statement-based replication to ensure eventual consistency.
[18:34:48] I guess we're fine right now since no writes come in from codfw yet
[18:35:24] but in terms of lag, I thought row-based would generally reduce spikes in lag by being more trivial to apply on the receiving end
[18:37:15] in terms of correlation, I do see now that while not directly correlating with the overall write traffic, there is a correlation with the UPDATE query stats (which I assume counts INSERT .. ON DUPLICATE KEY UPDATE as an update if the local outcome was an update?)
[18:37:50] although that doesn't narrow it down, since all adds, sets, cas, deletes are like that basically
[18:43:43] okay, I guess with 15min lag and no cross-row complex queries this should have been obvious..
[18:43:44] but...
[18:43:49] I don't think the query complexity is the issue
[18:43:49] https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db2142&var-datasource=thanos&var-cluster=mysql&from=now-2d&to=now&viewPanel=6
[18:43:55] The disk is saturated, it can't write
[18:53:17] * Krinkle copies to -sre for input
[19:43:55] Krinkle: I was afk. I can take a look once I'm back from dinner
[19:47:10] ack, I've reverted the config change for now. bytes_received and query rate drop sharply as expected. Eqiad disk saturation is still fairly high at 90+%, no longer ~100%. I guess even the eqiad one was backlogged for the binlog? The codfw lag is still increasing slowly, so this might take a while.
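For context on the T312829 idea ("Add jitter to BagOStuff TTLs") filed at 05:36: a minimal sketch, assuming a standalone helper rather than the actual MediaWiki patch, of what downward TTL jitter could look like. The function name, the 10% jitter fraction, and the use of mt_rand() are illustrative assumptions; the point is only to spread expiries so that rows written in the same burst do not all expire, and get rewritten, together.

<?php
// Illustrative sketch only, not the actual BagOStuff change from T312829.
// Subtracting a small random fraction from a TTL spreads out expiry times,
// so a burst of writes does not turn into a synchronized burst of expiries
// (and re-writes) later on.

/**
 * Apply up to $maxFraction of downward jitter to a TTL in seconds.
 * A TTL of 0 (meaning "no expiry") is left untouched.
 */
function jitterTtl( int $ttl, float $maxFraction = 0.1 ): int {
	if ( $ttl <= 0 ) {
		return $ttl;
	}
	$jitter = (int)round( $ttl * $maxFraction * ( mt_rand() / mt_getrandmax() ) );
	return max( 1, $ttl - $jitter );
}

// Example: a nominal one-day TTL (86400s) lands somewhere in [77760, 86400].
$jittered = jitterTtl( 86400 );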
[19:47:26] I'm cutting out renew() meanwhile, and making it safer to backport as its own patch
[21:39:48] I'm home now, let me check https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db2142&var-datasource=thanos&var-cluster=wmcs&from=now-1d&to=now&viewPanel=6
[21:40:04] how it went from 90% to 30%
[21:41:24] Amir1: https://gerrit.wikimedia.org/r/813296
[21:41:35] basically, I turned it off. Postpones the problem, doesn't solve it.
[21:41:43] Back to writing core.module_deps on GET.
[21:41:53] I know, I assumed it would be the binlogs but it's not
[21:42:43] Amir1: okay - what ruled it out from your pov?
[21:43:05] binlogs stay around
[21:43:20] let me log in and check
[21:43:27] at least in eqiad it's 90%
[21:43:53] right, and we stopped writing binlogs as much, so it dropped back to where it was.
[21:44:22] eqiad went from 100 to 40%
[21:44:31] db1151
[21:44:40] I assume I'm wrong, just trying to catch up :)
[21:44:41] the thing is that it shouldn't recover this fast
[21:45:12] unless the TTL for binlogs is set to something really short
[21:45:15] in eqiad saturation went from 100% to 90% for about 30min at first, and only then to 40%
[21:45:24] okay, so I think you are mixing up what I thought you were
[21:45:28] saturation isn't utilization
[21:45:33] for pc it's one day, for core dbs it's 30 days
[21:46:31] or rather... ugh, I can't get used to these terms
[21:46:40] disk usage != filesystem usage, apparently those are the terms we use
[21:46:51] in other words, this is writes we do right now, like CPU usage basically, not space use
[21:47:26] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1151&var-datasource=thanos&var-cluster=mysql&from=now-1d&to=now&viewPanel=28
[21:47:37] space hasn't gone back down
[21:47:42] (as expected)
[21:48:40] I'm trying to see what the capacity of sda is
[21:49:52] I could be very wrong, but probably something has been mounted in the wrong place
[21:50:09] or writes going to a place they shouldn't
[21:51:41] mounting looks okay
[21:51:44] https://www.irccloud.com/pastebin/3q8z5jtj/
[21:52:32] Amir1: btw, the binlog you shared with me, was that from the eqiad primary or codfw?
[21:52:41] codfw primary
[21:52:46] k
[22:29:20] the server has literally cooled back down https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db2142&var-datasource=thanos&var-cluster=mysql&from=now-2d&to=now&viewPanel=25
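On the "cutting out renew()" remark at 19:47: the patch itself isn't shown here, and the change linked at 21:41 was, per the chat, just turning the behaviour off. As a generic illustration of reducing write traffic from TTL renewal, the hypothetical helper below only renews a row once a threshold fraction of its lifetime has elapsed, so hot keys read on every request are not also rewritten on every request. The function name, parameters, and threshold are assumptions, not MediaWiki code.

<?php
// Hypothetical sketch, not the actual patch: skip TTL renewal unless the
// entry has already consumed a sizeable share of its lifetime.

/**
 * Decide whether a cache row's TTL is worth renewing.
 *
 * @param int $expiryTs Unix timestamp at which the row expires
 * @param int $originalTtl TTL the row was written with, in seconds
 * @param float $threshold Renew only after this fraction of the TTL has elapsed
 */
function shouldRenew( int $expiryTs, int $originalTtl, float $threshold = 0.5 ): bool {
	if ( $originalTtl <= 0 ) {
		// Non-expiring rows never need renewal.
		return false;
	}
	$remaining = $expiryTs - time();
	// Elapsed fraction >= threshold  <=>  remaining <= (1 - threshold) * TTL
	return $remaining <= ( 1 - $threshold ) * $originalTtl;
}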