[05:36:48] filed T312829 for the jitter
[05:36:49] T312829: Add jitter to BagOStuff TTLs - https://phabricator.wikimedia.org/T312829
[05:53:23] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2142&var-port=9104&refresh=1m&viewPanel=6&from=now-2d&to=now (courtesy of jynus pointing this one out in -sre).
[05:53:28] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=x2&var-role=All&from=now-2d&to=now
[06:01:36] Note to self - check the binlogs to see what the write traffic is made up of, and talk with the DBAs to determine whether this is a "problem" or not.
[06:01:56] If we need to cut down write traffic, we could probably get rid of the changeTTL complexity: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/813120
[06:02:20] Also, I filed "Avoid x2-mainstash replica connections (ChronologyProtector)" https://phabricator.wikimedia.org/T312809
[06:02:24] .. while investigating the above
[06:02:33] cc TimStarling fyi :)
[06:02:49] Tomorrow I'll get on the session debugging I was going to do last week.
[06:02:56] signing off for tonight now
[17:32:08] Krinkle: https://phabricator.wikimedia.org/T299417#8073150 ^^
[17:38:49] Amir1: nice nice
[17:42:10] I don't have ssh on those hosts, so yeah, might need help there
[17:42:15] unless there's a way for wikiadmin to read them over SQL
[18:09:54] binlog from db2142 (x2 codfw), grouped by SqlBagOStuff method: https://phabricator.wikimedia.org/P31020
[18:29:23] Amir1: do we have different replication modes intra vs inter-dc? e.g. they're both statement-based, right?
[18:29:30] (for x2)
[18:30:06] let me check
[18:30:17] both are row-based
[18:33:48] and the statements are there for debugging only?
[18:34:23] for mainstash we need statement-based replication to ensure eventual consistency.
[18:34:48] I guess we're fine right now since no writes come in from codfw yet
[18:35:24] but in terms of lag, I thought row-based would generally reduce spikes in lag by being more trivial to apply on the receiving end
[18:37:15] in terms of correlation, I do see now that while not directly correlating with the overall write traffic, there is a correlation with the UPDATE query stats (which I assume counts INSERT .. ON DUPLICATE KEY UPDATE as an update if the local outcome was an update?)
[18:37:50] although that doesn't narrow it down, since all adds, sets, cas, deletes are like that basically
[18:43:43] okay, I guess with 15min lag and no cross-row complex queries this should have been obvious..
[18:43:44] but...
[18:43:49] I don't think the query complexity is the issue
[18:43:49] https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db2142&var-datasource=thanos&var-cluster=mysql&from=now-2d&to=now&viewPanel=6
[18:43:55] The disk is saturated, it can't write
[18:53:17] * Krinkle copies to -sre for input
[19:43:55] Krinkle: I was afk. I can take a look once I'm back from dinner
[19:47:10] ack, I've reverted the config change for now. bytes_received and query rate drop sharply as expected. Eqiad disk saturation is still fairly high at 90+%, no longer ~100%. I guess even the eqiad one was backlogged for the binlog? The codfw lag is still increasing slowly, so this might take a while.
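For context on the T312829 idea ("Add jitter to BagOStuff TTLs") filed at 05:36: a minimal sketch, assuming a standalone helper rather than the actual MediaWiki patch, of what downward TTL jitter could look like. The function name, the 10% jitter fraction, and the use of mt_rand() are illustrative assumptions; the point is only to spread expiries so that rows written in the same burst do not all expire, and get rewritten, together.

<?php
// Illustrative sketch only, not the actual BagOStuff change from T312829.
// Subtracting a small random fraction from a TTL spreads out expiry times,
// so a burst of writes does not turn into a synchronized burst of expiries
// (and re-writes) later on.

/**
 * Apply up to $maxFraction of downward jitter to a TTL in seconds.
 * A TTL of 0 (meaning "no expiry") is left untouched.
 */
function jitterTtl( int $ttl, float $maxFraction = 0.1 ): int {
	if ( $ttl <= 0 ) {
		return $ttl;
	}
	$jitter = (int)round( $ttl * $maxFraction * ( mt_rand() / mt_getrandmax() ) );
	return max( 1, $ttl - $jitter );
}

// Example: a nominal one-day TTL (86400s) lands somewhere in [77760, 86400].
$jittered = jitterTtl( 86400 );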
[19:47:26] I'm cutting out renew() meanwhile, and making it safer to backport as its own patch
[21:39:48] I'm home now, let me check https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db2142&var-datasource=thanos&var-cluster=wmcs&from=now-1d&to=now&viewPanel=6
[21:40:04] how it went from 90% to 30%
[21:41:24] Amir1: https://gerrit.wikimedia.org/r/813296
[21:41:35] basically, I turned it off. Postpones the problem, doesn't solve it.
[21:41:43] Back to writing core.module_deps on GET.
[21:41:53] I know, I assumed it would be the binlogs but it's not
[21:42:43] Amir1: okay - what ruled it out from your pov?
[21:43:05] binlogs stay around
[21:43:20] let me log in and check
[21:43:27] at least in eqiad it's 90%
[21:43:53] right, and we stopped writing binlogs as much, so it dropped back to where it was.
[21:44:22] eqiad went from 100 to 40%
[21:44:31] db1151
[21:44:40] I assume I'm wrong, just trying to catch up :)
[21:44:41] the thing is that it shouldn't recover this fast
[21:45:12] unless the TTL for binlogs is set to something really short
[21:45:15] in eqiad saturation went from 100% to 90% for about 30min at first, and only then to 40%
[21:45:24] okay, so I think you are mixing up what I thought you were
[21:45:28] saturation isn't utilization
[21:45:33] for pc it's one day, for core dbs it's 30 days
[21:46:31] or rather... ugh, I can't get used to these terms
[21:46:40] disk usage != filesystem usage, apparently those are the terms we use
[21:46:51] in other words, this is writes we do right now, like CPU usage basically, not space use
[21:47:26] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1151&var-datasource=thanos&var-cluster=mysql&from=now-1d&to=now&viewPanel=28
[21:47:37] space hasn't gone back down
[21:47:42] (as expected)
[21:48:40] I'm trying to see what the capacity of sda is
[21:49:52] I could be very wrong, but probably something has been mounted in the wrong place
[21:50:09] or writes going to a place they shouldn't
[21:51:41] mounting looks okay
[21:51:44] https://www.irccloud.com/pastebin/3q8z5jtj/
[21:52:32] Amir1: btw, the binlog you shared with me, was that from the eqiad primary or codfw?
[21:52:41] codfw primary
[21:52:46] k
[22:29:20] the server has literally cooled back down https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db2142&var-datasource=thanos&var-cluster=mysql&from=now-2d&to=now&viewPanel=25
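On the "cutting out renew()" remark at 19:47: the patch itself isn't shown here, and the change linked at 21:41 was, per the chat, just turning the behaviour off. As a generic illustration of reducing write traffic from TTL renewal, the hypothetical helper below only renews a row once a threshold fraction of its lifetime has elapsed, so hot keys read on every request are not also rewritten on every request. The function name, parameters, and threshold are assumptions, not MediaWiki code.

<?php
// Hypothetical sketch, not the actual patch: skip TTL renewal unless the
// entry has already consumed a sizeable share of its lifetime.

/**
 * Decide whether a cache row's TTL is worth renewing.
 *
 * @param int $expiryTs Unix timestamp at which the row expires
 * @param int $originalTtl TTL the row was written with, in seconds
 * @param float $threshold Renew only after this fraction of the TTL has elapsed
 */
function shouldRenew( int $expiryTs, int $originalTtl, float $threshold = 0.5 ): bool {
	if ( $originalTtl <= 0 ) {
		// Non-expiring rows never need renewal.
		return false;
	}
	$remaining = $expiryTs - time();
	// Elapsed fraction >= threshold  <=>  remaining <= (1 - threshold) * TTL
	return $remaining <= ( 1 - $threshold ) * $originalTtl;
}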