[03:47:47] (SessionStoreOnNonDedicatedHost) firing: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[07:22:47] (SessionStoreOnNonDedicatedHost) resolved: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[07:27:34] * Emperor wonders if that's going to fire again in 20 minutes
[13:36:02] oh.
[13:36:05] hrmm
[13:36:59] I guess it didn't (happen again in 20 minutes) :)
[13:39:38] indeed not
[13:39:43] <-- wrong again ;)
[13:41:20] PROBLEM - MariaDB sustained replica lag on m1 on db2132 is CRITICAL: 13.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[13:41:56] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 216 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[13:42:32] RECOVERY - MariaDB sustained replica lag on m1 on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[13:43:20] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 222 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[13:49:46] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[13:55:54] So the sessionstore/cassandra upgrade in codfw is done, eqiad today...
[13:56:04] there is a bit of a latency regression
[13:56:29] that's a bit sad, you'd like performance to improve with a newer version
[13:56:29] https://grafana.wikimedia.org/d/000001590/sessionstore?from=now-2d&orgId=1&to=now&var-container_name=kask-production&var-dc=thanos&var-prometheus=k8s&var-service=sessionstore&var-site=codfw&refresh=5m&viewPanel=50
[13:56:48] I made one config change over 3.11.14 that's worth looking at
[13:56:57] the use of heap allocation instead of direct
[13:57:29] would it be easy to change in codfw and see if that impacts the latency increase?
[13:57:45] the latter has been a problematic part of the code, you're basically end-arounding Java's memory management
[13:58:08] I'm kind of hesitant to mess with it in this state
[13:58:16] when the cluster is mixed-version
[13:58:23] too many variables in play
[13:58:43] fair enough!
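(Editor's note: the heap-vs-direct change discussed above is presumably Cassandra's `memtable_allocation_type` knob in cassandra.yaml, where `heap_buffers` keeps memtables on the JVM heap and `offheap_buffers`/`offheap_objects` use direct memory. The sketch below shows what inspecting and flipping that setting could look like; the config path is an assumption, and in this environment the file is Puppet-managed, so this is illustrative only.)

```python
#!/usr/bin/env python3
"""Minimal sketch: report (and optionally flip) Cassandra's memtable
allocation mode. Assumes the change under discussion is the standard
cassandra.yaml `memtable_allocation_type` setting; paths are hypothetical."""

import sys
import yaml  # PyYAML

CONFIG = "/etc/cassandra/cassandra.yaml"  # assumed location
VALID = {"heap_buffers", "unslabbed_heap_buffers", "offheap_buffers", "offheap_objects"}


def main() -> None:
    with open(CONFIG) as f:
        conf = yaml.safe_load(f)

    # heap_buffers is the stock default when the key is commented out
    current = conf.get("memtable_allocation_type", "heap_buffers")
    print(f"memtable_allocation_type is currently: {current}")

    if len(sys.argv) > 1:
        target = sys.argv[1]
        if target not in VALID:
            sys.exit(f"unknown allocation type: {target}")
        conf["memtable_allocation_type"] = target
        # Note: safe_dump rewrites the file without preserving comments,
        # which is another reason to do this via configuration management.
        with open(CONFIG, "w") as f:
            yaml.safe_dump(conf, f, default_flow_style=False)
        print(f"wrote {target}; each node needs a restart to pick it up")


if __name__ == "__main__":
    main()
```

Either value only takes effect after restarting a node, which lines up with the reluctance above to touch it while the cluster is mixed-version.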
[13:59:13] I don't think the regression would be enough to wave off an upgrade, so I think I'll proceed, see what it looks like afterward
[13:59:43] 👍
[14:00:06] honestly, the GC graphs here probably explain it: https://grafana.wikimedia.org/d/000000418/cassandra?from=now-2d&orgId=1&to=now&var-cluster=sessionstore&var-datasource=codfw%20prometheus%2Fservices&var-keyspace=sessions&var-quantile=99p&var-table=values
[14:00:18] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[14:01:38] which again could be mem v. direct
[14:01:46] but is probably tunable either way
[14:46:50] Emperor: not sure how much longer you'll be around, but I tagged you on a bunch of related gerrits
[14:47:21] should just be basic sanity checks, they're basically the equiv of the last round, but for eqiad this time
[14:59:04] 👀
[15:15:24] all LGTM
[15:16:00] I wonder if this looks less 90s? https://phab.wmfusercontent.org/file/data/nkw56245yskqrmcsxjbo/PHID-FILE-7r57626rykv44jvld45l/Screenshot_20230608_171448.png
[15:16:12] ^Amir1
[15:16:32] noice
[15:19:09] I will extend the downtime until tuesday and CC manuel there just in case
[15:19:55] remember I won't be around tomorrow or on monday
[15:21:59] as sadly the notifications enabled won't work until puppet runs there, due to a race condition
[15:23:06] actually, I will just do it manually to avoid paging
[15:36:48] I checked the logs and they didn't have anything else other than what you posted btw
[15:37:24] so I guess getting the memory changed and moving on :-(
[16:13:08] Umm... so somehow latency is better in codfw when it's shouldering all of the traffic(?)
[16:13:17] that's...not what I would have expected
[16:14:16] interesting
[16:15:06] it could make sense under some conditions - sometimes outlier queries are more common when there is no "real" or high traffic
[16:16:00] I suppose... but you'd expect to see that if the traffic was very low, no?
[16:16:27] I mean, even with both DCs pooled, codfw sees ~400 reqs/sec
[16:17:32] yeah, not saying that explains it, more like it is a possibility
[16:18:25] I have another idea
[16:18:43] both DCs see the same writes, everything is replicated
[16:18:50] but codfw sees a subset of all the reads
[16:19:09] so page cache maybe?
[16:19:40] we must graph this somewhere...
[16:20:50] I think I do need to change memtable allocation back, from heap to direct and see what that does
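(Editor's note: on the page-cache idea, a rough way to eyeball it on a single node before the Grafana panel is tracked down. The data directory path below is an assumption, and node_exporter very likely already exposes the memory numbers; this is just a local sanity check, not the author's method.)

```python
#!/usr/bin/env python3
"""Quick sanity check for the page-cache theory: how much of a node's RAM
is currently page cache, versus how much Cassandra data sits on disk.
Rough sketch only; DATA_DIR is a guess at the layout."""

import os

DATA_DIR = "/srv/cassandra/data"  # hypothetical data directory


def meminfo_kb(field: str) -> int:
    """Return a /proc/meminfo field (e.g. 'Cached', 'MemTotal') in kB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)


def dir_size_kb(path: str) -> int:
    """Total size of regular files under path, in kB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished (compaction), skip it
    return total // 1024


if __name__ == "__main__":
    cached_kb = meminfo_kb("Cached")
    total_kb = meminfo_kb("MemTotal")
    data_kb = dir_size_kb(DATA_DIR)
    print(f"page cache  : {cached_kb // 1024} MiB ({100 * cached_kb / total_kb:.1f}% of RAM)")
    print(f"data on disk: {data_kb // 1024} MiB")
    if data_kb:
        print(f"cache could hold at most ~{100 * min(cached_kb, data_kb) / data_kb:.0f}% of the data set")
```

If codfw really is serving only a subset of the reads, its hot set would fit in page cache more comfortably than eqiad's, which is one way the better latency there could make sense.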