[03:47:47] (SessionStoreOnNonDedicatedHost) firing: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[07:22:47] (SessionStoreOnNonDedicatedHost) resolved: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[07:27:34] * Emperor wonders if that's going to fire again in 20 minutes
[13:36:02] oh.
[13:36:05] hrmm
[13:36:59] I guess it didn't (happen again in 20 minutes) :)
[13:39:38] indeed not
[13:39:43] <-- wrong again ;)
[13:41:20] PROBLEM - MariaDB sustained replica lag on m1 on db2132 is CRITICAL: 13.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[13:41:56] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 216 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[13:42:32] RECOVERY - MariaDB sustained replica lag on m1 on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[13:43:20] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 222 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[13:49:46] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[13:55:54] So the sessionstore/cassandra upgrade in codfw is done, eqiad today...
[13:56:04] there is a bit of a latency regression
[13:56:29] that's a bit sad, you'd like performance to improve with a newer version
[13:56:29] https://grafana.wikimedia.org/d/000001590/sessionstore?from=now-2d&orgId=1&to=now&var-container_name=kask-production&var-dc=thanos&var-prometheus=k8s&var-service=sessionstore&var-site=codfw&refresh=5m&viewPanel=50
[13:56:48] I made one config change over 3.11.14 that's worth looking at
[13:56:57] the use of heap allocation instead of direct
[13:57:29] would it be easy to change in codfw and see if that impacts the latency increase?
[13:57:45] the latter has been a problematic part of the code, you're basically end-arounding Java's memory management
[13:58:08] I'm kind of hesitant to mess with it in this state
[13:58:16] when the cluster is mixed-version
[13:58:23] too many variables in play
[13:58:43] fair enough!
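(Editor's note: the heap-vs-direct change discussed above is presumably Cassandra's `memtable_allocation_type` knob in cassandra.yaml, where `heap_buffers` keeps memtables on the JVM heap and `offheap_buffers`/`offheap_objects` use direct memory. The sketch below shows what inspecting and flipping that setting could look like; the config path is an assumption, and in this environment the file is Puppet-managed, so this is illustrative only.)

```python
#!/usr/bin/env python3
"""Minimal sketch: report (and optionally flip) Cassandra's memtable
allocation mode. Assumes the change under discussion is the standard
cassandra.yaml `memtable_allocation_type` setting; paths are hypothetical."""

import sys
import yaml  # PyYAML

CONFIG = "/etc/cassandra/cassandra.yaml"  # assumed location
VALID = {"heap_buffers", "unslabbed_heap_buffers", "offheap_buffers", "offheap_objects"}


def main() -> None:
    with open(CONFIG) as f:
        conf = yaml.safe_load(f)

    # heap_buffers is the stock default when the key is commented out
    current = conf.get("memtable_allocation_type", "heap_buffers")
    print(f"memtable_allocation_type is currently: {current}")

    if len(sys.argv) > 1:
        target = sys.argv[1]
        if target not in VALID:
            sys.exit(f"unknown allocation type: {target}")
        conf["memtable_allocation_type"] = target
        # Note: safe_dump rewrites the file without preserving comments,
        # which is another reason to do this via configuration management.
        with open(CONFIG, "w") as f:
            yaml.safe_dump(conf, f, default_flow_style=False)
        print(f"wrote {target}; each node needs a restart to pick it up")


if __name__ == "__main__":
    main()
```

Either value only takes effect after restarting a node, which lines up with the reluctance above to touch it while the cluster is mixed-version.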
[13:59:13] I don't think the regression would be enough to wave off an upgrade, so I think I'll proceed, see what it looks like afterward
[13:59:43] 👍
[14:00:06] honestly, the GC graphs here probably explain it: https://grafana.wikimedia.org/d/000000418/cassandra?from=now-2d&orgId=1&to=now&var-cluster=sessionstore&var-datasource=codfw%20prometheus%2Fservices&var-keyspace=sessions&var-quantile=99p&var-table=values
[14:00:18] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[14:01:38] which again could be mem v. direct
[14:01:46] but is probably tunable either way
[14:46:50] Emperor: not sure how much longer you'll be around, but I tagged you on a bunch of related gerrits
[14:47:21] should just be basic sanity checks, they're basically the equiv of the last round, but for eqiad this time
[14:59:04] 👀
[15:15:24] all LGTM
[15:16:00] I wonder if this looks less 90s? https://phab.wmfusercontent.org/file/data/nkw56245yskqrmcsxjbo/PHID-FILE-7r57626rykv44jvld45l/Screenshot_20230608_171448.png
[15:16:12] ^Amir1
[15:16:32] noice
[15:19:09] I will extend the downtime until tuesday and CC manuel there just in case
[15:19:55] remember I won't be around tomorrow or on monday
[15:21:59] as sadly the notifications enabled won't work until puppet runs there, due to a race condition
[15:23:06] actually, I will just do it manually to avoid paging
[15:36:48] I checked the logs and they didn't have anything else other than what you posted btw
[15:37:24] so I guess getting the memory changed and moving on :-(
[16:13:08] Umm... so somehow latency is better in codfw when it's shouldering all of the traffic(?)
[16:13:17] that's...not what I would have expected
[16:14:16] interesting
[16:15:06] it could make sense under some conditions - sometimes outlier queries are more common when there is no "real" or high traffic
[16:16:00] I suppose... but you'd expect to see that if the traffic was very low, no?
[16:16:27] I mean, even with both DCs pooled, codfw sees ~400 reqs/sec
[16:17:32] yeah, not saying that explains it, more like it is a possibility
[16:18:25] I have another idea
[16:18:43] both DCs see the same writes, everything is replicated
[16:18:50] but codfw sees a subset of all the reads
[16:19:09] so page cache maybe?
[16:19:40] we must graph this somewhere...
[16:20:50] I think I do need to change memtable allocation back, from heap to direct and see what that does
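(Editor's note: on the page-cache idea, a rough way to eyeball it on a single node before the Grafana panel is tracked down. The data directory path below is an assumption, and node_exporter very likely already exposes the memory numbers; this is just a local sanity check, not the author's method.)

```python
#!/usr/bin/env python3
"""Quick sanity check for the page-cache theory: how much of a node's RAM
is currently page cache, versus how much Cassandra data sits on disk.
Rough sketch only; DATA_DIR is a guess at the layout."""

import os

DATA_DIR = "/srv/cassandra/data"  # hypothetical data directory


def meminfo_kb(field: str) -> int:
    """Return a /proc/meminfo field (e.g. 'Cached', 'MemTotal') in kB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)


def dir_size_kb(path: str) -> int:
    """Total size of regular files under path, in kB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished (compaction), skip it
    return total // 1024


if __name__ == "__main__":
    cached_kb = meminfo_kb("Cached")
    total_kb = meminfo_kb("MemTotal")
    data_kb = dir_size_kb(DATA_DIR)
    print(f"page cache  : {cached_kb // 1024} MiB ({100 * cached_kb / total_kb:.1f}% of RAM)")
    print(f"data on disk: {data_kb // 1024} MiB")
    if data_kb:
        print(f"cache could hold at most ~{100 * min(cached_kb, data_kb) / data_kb:.0f}% of the data set")
```

If codfw really is serving only a subset of the reads, its hot set would fit in page cache more comfortably than eqiad's, which is one way the better latency there could make sense.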