[01:18:10] serviceops, Cloud-VPS: OOM livelock stalls - https://phabricator.wikimedia.org/T358634 (tstarling)
[03:26:29] serviceops: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636 (Scott_French)
[03:26:53] serviceops: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636#9582361 (Scott_French) p:Triage→Medium
[07:05:41] serviceops, Cloud-VPS: OOM livelock stalls - https://phabricator.wikimedia.org/T358634#9582468 (Joe) Focusing on the swap part of the problem, for posterity: I think it's a valid point for backend/async processing systems or systems that have a lot of noisy neighbours and are not latency-critical. I do...
[07:05:47] serviceops, Cloud-VPS: OOM livelock stalls - https://phabricator.wikimedia.org/T358634#9582471 (Joe) I also want to note that on kubernetes memory is mostly managed by the k8s scheduler on top of the kernel one, so that we never have overflowing use of memory and we OOM containers (which are nothing more...
[08:51:35] serviceops: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636#9582669 (Volans) That etcdmirror is mirroring only the `/conftool` keys it's totally news to me, I assumed it was replicating the whole content of etcd. But indeed it does not: ` $ sudo etcdctl --end...
[10:35:22] serviceops, Cloud-VPS: OOM livelock stalls - https://phabricator.wikimedia.org/T358634#9582894 (dcaro) @tstarling thanks for the task! :), I was sitting a couple of desks away from Cris in London when he wrote that post xd, it circulated widely among production engineering To clarify, this task is to re...
[12:44:18] serviceops, Content-Transform-Team-WIP, Data-Persistence, RESTBase Sunsetting: Wikifeeds increase on 500 errors after switchover to core page HTML - https://phabricator.wikimedia.org/T354657#9583356 (Jgiannelos) Untagging #content-transform-team-wip since it looks more like a data persistence rel...
[12:44:41] serviceops, Data-Persistence, RESTBase Sunsetting, Wikifeeds: Wikifeeds increase on 500 errors after switchover to core page HTML - https://phabricator.wikimedia.org/T354657#9583358 (Jgiannelos)
[12:50:55] serviceops, Data-Engineering, WMF-JobQueue, Patch-For-Review, and 3 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745#9583374 (gmodena) Hey @Clement_Goubert , I was on PTO last week and trying to piece together wh...
[15:18:25] serviceops, Data-Persistence, RESTBase Sunsetting, Wikifeeds: Wikifeeds increase on 500 errors after switchover to core page HTML - https://phabricator.wikimedia.org/T354657#9584008 (Joe) I don't think this is a data persistence issue, but rather it's much more probable this is actually a restbas...
[15:27:01] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9584108 (Jdforrester-WMF)
[15:33:06] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9584162 (Jdforrester-WMF)
[15:33:50] serviceops: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636#9584184 (Scott_French)
[15:44:02] serviceops: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636#9584215 (Joe) >>! In T358636#9582669, @Volans wrote: > That etcdmirror is mirroring only the `/conftool` keys it's totally news to me, I assumed it was replicating the whole content of etcd. But indee...
[15:46:36] serviceops: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636#9584228 (Scott_French) Thanks, Riccardo. Yes, indeed - this particular issue should generally not happen if the entire keyspace is mirrored (IIRC, there are non-keyspace events that can advance the in...
[15:46:40] serviceops: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636#9584229 (Volans) But a second instance wouldn't prevent the current issue, right?
[15:53:35] claime: hey how are we set for rack b6 migration today ?
[15:55:03] <_joe_> topranks: claime is out, but jayme is following it
[15:55:40] _joe_: ah cool thanks
[15:56:38] topranks: sorry, did not mention that explicitely. :) As said, from my POV we're good
[15:56:41] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9584263 (Jdforrester-WMF)
[15:57:25] jayme: absolutely you did, me getting mixed up nevermind :)
[16:00:25] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9584257 (Jdforrester-WMF)
[16:13:01] jayme: all done our side you can repool those hosts
[16:13:02] thanks!
[16:13:13] wilco, thanks
[16:33:56] serviceops, Data-Persistence, RESTBase Sunsetting, Wikifeeds: Wikifeeds increase on 500 errors after switchover to core page HTML - https://phabricator.wikimedia.org/T354657#9584394 (Jgiannelos) Indeed the problem looks like to be in RESTBase and more specifically in `restbase-mod-table-cassandra...
[16:35:54] serviceops, RESTBase, RESTBase Sunsetting, RESTBase-Cassandra, Wikifeeds: Wikifeeds increase on 500 errors after switchover to core page HTML - https://phabricator.wikimedia.org/T354657#9584397 (Jgiannelos)
[16:36:37] serviceops, RESTBase, RESTBase Sunsetting, Wikifeeds: Wikifeeds increase on 500 errors after switchover to core page HTML - https://phabricator.wikimedia.org/T354657#9584401 (Jgiannelos)
[17:49:08] serviceops, DC-Ops, SRE, ops-codfw: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#9584776 (JMeybohm)
[18:05:45] serviceops: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636#9584855 (Scott_French) Rather than trying to pull access logs off the conf hosts (as they're rather large, and I'd like to avoid stressing them), I just ended up looking at the etcd grafana dashboard...
[18:33:47] serviceops, Data-Engineering, Data-Platform-SRE, SRE, Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#9585000 (BTullis) I believe that this ticket will be invalidated by the approach that that has tested and agreed upon in {T331894}. There...