[10:12:37] this week's rclone sync fixed up the corrupted images from last week, so I've closed T381891 and T381893
[10:12:37] T381891: Interieur - 's-Gravenhage - 20085391 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T381891
[10:12:38] T381893: Interieur - 's-Gravenhage - 20089866 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T381893
[16:13:51] Yay
[18:00:06] 12:55:12 <+jinxer-wm> FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[18:00:08] is this known?
[18:12:47] Hi folks. I found what I would call perplexing behaviour in MariaDB. I'm not sure if it qualifies as a bug, expected behaviour, or both. I pasted debug information in https://phabricator.wikimedia.org/P71715, and in the first comment there's a super-simple testcase. Basically, by making a DELETE query more specific with an IN operator, over a certain number of records it goes rogue and locks the supremum record (and by extension, essentially the whole table). This seems wrong to me, so I'm sure I'm missing something obvious; I don't know what, though. Is this something you could take a look at? Can I file a task and tag it with #DBA?
[18:16:35] To clarify, I'm not sure what the best venue for this would be, as it's mostly just me not understanding something. I would like to document this in https://wikitech.wikimedia.org/wiki/MediaWiki_Engineering/Guides/Backend_performance_practices#Transactions though, once I figure it out. It doesn't seem like intuitive behaviour at all to me.
[18:18:37] cdanis: I was just looking at that, and it looks like most of the swift-stats timers have been disabled on ms-fe2009, including `swift-container-stats_mw-media.timer`, which I believe is the one that produces the underlying metric
[18:18:50] ah
[18:19:31] last run for that one was at `2024-12-16 07:10:00 UTC`, so that aligns with when the stats start to diverge
[18:20:53] Daimona: is there a ticket? I'm outside. I might be able to look into it a bit later
[18:21:30] Amir1: nope, I've just been pasting stuff in that paste for now. I can make a task now.
[18:21:40] Thanks
[18:26:18] hmmm ... I don't see anything in the puppet agent logs that would explain this, and indeed the state in puppet would suggest these should be running (`profile::swift::stats_reporter_host` is indeed ms-fe2009)
[18:59:00] Whenever I begin thinking that the world makes sense, I run "EXPLAIN" on a random SQL query, just to be proven wrong. I think I understood the problem, if by "understood" we mean "realize that the inexplicable behaviour is a perfectly reasonable consequence of an even more inexplicable behaviour".
[18:59:55] (that is: https://phabricator.wikimedia.org/P71715#287504)
[19:03:46] https://phabricator.wikimedia.org/P71715#287505 I think I need to go touch grass. I might file a task later, if I first understand what's wrong.
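A rough way to poke at that kind of locking behaviour locally is sketched below. This is not the testcase from P71715: the scratch database `locktest`, the table, and the values are invented for illustration, and the locks actually taken will depend on the plan the optimizer picks. With `innodb_status_output_locks` enabled, `SHOW ENGINE INNODB STATUS` lists the record locks held by every open transaction, so a lock on the `supremum` pseudo-record shows up explicitly.

```bash
# Hypothetical reproduction sketch -- NOT the P71715 testcase; database, table
# and values are invented. Assumes a local MariaDB and enough privileges to
# SET GLOBAL.
mysql -e "SET GLOBAL innodb_status_output_locks = ON;"
mysql -e "CREATE DATABASE IF NOT EXISTS locktest;"
mysql locktest <<'SQL'
CREATE TABLE t (id INT PRIMARY KEY, v INT, KEY (v)) ENGINE=InnoDB;
-- seq_1_to_1000 is MariaDB's built-in SEQUENCE engine table
INSERT INTO t (id, v) SELECT seq, seq % 10 FROM seq_1_to_1000;
SQL

# Which plan does the DELETE get? (MariaDB supports EXPLAIN for DELETE.)
mysql locktest -e "EXPLAIN DELETE FROM t WHERE v = 3 AND id IN (3, 13, 23, 33, 43);"

# Session 1: run the DELETE inside an open transaction and hold it for a minute.
mysql locktest -e "START TRANSACTION;
                   DELETE FROM t WHERE v = 3 AND id IN (3, 13, 23, 33, 43);
                   SELECT SLEEP(60);" &

# Session 2: while that transaction is open, dump its lock list and look for
# the supremum pseudo-record (the "end of index" lock described above).
sleep 2
mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -B3 -A1 supremum
```

The EXPLAIN step is there because of the point raised at 18:59: InnoDB generally locks the index records it scans, so which locks appear (including the supremum) tends to follow which index and range the chosen plan actually walks.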
[19:40:55] got distracted for a bit ... so, actually, I failed to notice that the services triggered by the "stuck" timers are actually all running, and have been since various times in the 6:00 - 7:00 UTC range on 12/16
[19:42:12] stracing one of them (the container-stats one) shows it sitting in recvfrom on an fd that's presumably the connection to swift
[19:48:29] yes, confirmed that it's the TCP connection to swift (the LVS address for ms-fe.svc.codfw.wmnet)
[19:50:17] I'm tempted to bounce one of the services to see if it unsticks on the next trigger, but I'm also a bit concerned there's something subtle going on here
[19:56:19] Emperor: on Wednesday, could you please take a look at the above? tl;dr - a number of timer-triggered systemd services that run swift-account-stats, swift-container-stats, and swift-dispersion-report on ms-fe2009 seem to have become "stuck" within a ~1h window on 12/16
[19:56:19] I'm kind of out of ideas other than brute force, which doesn't seem like an ideal solution :)
[20:46:03] PROBLEM - MariaDB sustained replica lag on s1 on db1232 is CRITICAL: 10.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1232&var-port=9104
[20:48:03] RECOVERY - MariaDB sustained replica lag on s1 on db1232 is OK: (C)10 ge (W)5 ge 2.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1232&var-port=9104
[22:18:27] PROBLEM - MariaDB sustained replica lag on s1 on db1207 is CRITICAL: 10.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1207&var-port=9104
[22:25:27] RECOVERY - MariaDB sustained replica lag on s1 on db1207 is OK: (C)10 ge (W)5 ge 2.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1207&var-port=9104
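For the stuck ms-fe2009 timers discussed between 18:18 and 19:56, a sketch of the checks described (when the timer last fired, what the triggered service is blocked in, and which TCP peer the stuck fd points at) might look like the following. Only `swift-container-stats_mw-media.timer` is named in the log; the matching `.service` name and the `swift-*` glob are assumptions to verify on the host.

```bash
# Sketch of the stuck-timer checks described above; run on ms-fe2009.
# The .service name is assumed to mirror the .timer name -- check locally.
unit=swift-container-stats_mw-media

# When did the timers last fire, and is the triggered service still active?
systemctl list-timers 'swift-*' --all
systemctl status "${unit}.service"

# Grab the main PID and see which syscall it is blocked in
# (the report above found it parked in recvfrom).
pid=$(systemctl show -p MainPID --value "${unit}.service")
sudo strace -f -p "$pid" -e trace=network

# Map the process's open TCP sockets to their peers, to confirm the stuck fd
# really is the connection to the swift frontend (ms-fe.svc.codfw.wmnet LVS).
ss -tnp | grep "pid=$pid,"
```

The trailing comma in the final grep keeps `pid=123` from also matching `pid=1234` in the `ss` output.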