[05:01:45] topranks: so my preferred order for the switches maintenance would be: row d, c, b, a (again, this is what would work best for dbas, as it would give us more time to work on row a and row b replacements)
[08:56:31] Any bets on how long altering the image table (371G) is going to take?
[08:56:37] Amir1: you are not allowed to bet :p
[09:08:34] "a while"
[09:08:49] :(
[09:09:16] Another bet: "How long it takes to remove that data"
[09:34:02] Amir1: you can tell your guess to me, I'll bet that, and then we can split the prize :D
[09:34:30] oh has that started now? ouch
[09:34:41] majavah: you should have told me in a private message, cheating 101 :D
[10:46:08] marostegui: _joe_: Soooo, we found something that should have been deleted years ago. I'm going to make a patch and get it deployed, and the result would be that lots of wikidata jobs, refreshlinks, parsing, etc. will be dropped, the wbc_entity_usage table in client wikis will get smaller (I assume a big difference in commons), and recentchanges too
[10:46:21] The problem is that this change will be very slow to take effect
[10:46:28] majavah: hahahaha
[10:46:47] Amir1: slow meaning?
[10:47:07] requires parsing of each client wiki to update the aspect
[10:47:13] <_joe_> Amir1: "get it deployed" you mean next week right?
[10:47:27] _joe_: whatever you prefer
[10:47:43] the change itself is straightforward
[10:47:50] just removing three lines of code
[10:48:08] context: https://phabricator.wikimedia.org/T283040#7190050
[15:43:20] Amir1: why would it reduce RC and parse/jobs load? I'm probably missing something, but it looks like "other" is added redundantly right now, but only when there is already at least one other kind of subscription, right? I assume we dedupe RC and change prop/parse, so apart from a smaller table, how does it change behaviour?
[15:43:51] Maybe it's because the other subscriptions are more specific and won't be triggered for (other) edits to the same entity?
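A minimal sketch of the usage-aspect mechanism under discussion, as one way to picture Krinkle's guess. It assumes a simplified model of wbc_entity_usage rows (eu_page_id, eu_entity_id, eu_aspect) with made-up page ids, and is illustrative Python, not the actual Wikibase dispatch code:

```python
# Simplified model of aspect-based change dispatch -- illustrative only, not
# the actual Wikibase code. Rows mirror wbc_entity_usage: (eu_page_id,
# eu_entity_id, eu_aspect); an aspect like "C.P123" means "uses statement
# P123" and "O" is the "other" aspect.
usage_rows = [
    (1001, "Q42", "C.P123"),  # page 1001 uses statement P123 of Q42
    (1001, "Q42", "O"),       # ...and also carries the redundant "O" row (the bug)
    (1002, "Q42", "C.P123"),  # page 1002 only uses the statement
]

def affected_pages(entity_id, changed_aspects, rows):
    """Pages that get re-parsed / injected into RC for a change touching
    the given aspects of entity_id."""
    return {page for page, eid, aspect in rows
            if eid == entity_id and aspect in changed_aspects}

# An alias-only edit to Q42 only touches the "O" aspect: with the redundant
# row it still fans out to page 1001; without it, nothing needs re-parsing.
print(affected_pages("Q42", {"O"}, usage_rows))       # {1001}
print(affected_pages("Q42", {"C.P123"}, usage_rows))  # {1001, 1002}
```

In this picture, dropping the redundant "O" rows means "other"-only edits stop fanning out to pages that only care about specific statements.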
[15:44:05] Krinkle: exactly
[15:44:12] I was about to write that
[15:44:21] Cool :)
[15:44:48] basically if you subscribe with C.P123, you shouldn't get a new change for O edits, but you get it with this bug
[15:45:26] O changes (alias changes) account for 5% of all edits in wikidata, but it trickles down really big as 192M pages are subscribed in O mode now
[15:46:46] Ah, I thought O/Other is a wildcard subscription for when we don't know which properties were used
[15:47:09] But it's actually its own subset of properties
[15:47:29] That makes it even more strange that it was there
[15:47:59] the reason being that basically we had one or two aspects and the rest slowly got developed out of O
[15:48:06] same for description
[15:48:22] the docs still say we subscribe with description usage
[16:53:25] kormat: marostegui: fyi - I've replaced the "disk use derivative (avg per second)" with "disk use delta (since -24h)" at https://grafana-rw.wikimedia.org/d/000000106/parser-cache?orgId=1&from=now-7d&to=now&forceLogin=true
[16:53:47] so for any given data point, it represents the difference with the same time 24h earlier
[16:54:24] Krinkle: Ah, that's useful
[16:56:08] total since we finished the spare rotation: https://grafana-rw.wikimedia.org/d/000000106/parser-cache?orgId=1&from=1623931200000&to=1625504100000&forceLogin=true
[16:56:41] it seems since we started doing concurrent purging / purging in <1 day, the delta went from +300 or +400 GB to ~+200 GB
[16:56:44] but still growing nonetheless
[17:00:35] back in March this year, the daily delta was actually lower, around +180GB, but there were also many days then where it was negative by many hundreds of GBs.
[17:00:40] https://grafana-rw.wikimedia.org/d/000000106/parser-cache?viewPanel=9&orgId=1&from=1614759518903&to=1619164888579&forceLogin=true
[17:01:16] I'm confused as to why that was... assuming there weren't optimizations going on then... how would pc server disk use drop by that much so regularly, given mariadb holds on to unused space?
[17:01:33] and also how come the delta was better then, even on a bad day, than now?
[17:22:52] Krinkle: keep in mind we purge binlogs every day and they can be around 200G total
[17:22:53] or more
[17:23:35] Ah, interesting. Right, that includes blobs for inserts.
[17:23:45] marostegui: how many days is that offset by?
[17:24:19] Krinkle: not sure what you mean, they expire after 24h
[17:24:25] okay, within 24h
[17:24:30] I thought maybe multiple days.
[17:24:50] no, we expire them after 24h, otherwise we would fill up the disk in no time
[17:24:52] so we actually pay twice for the inserted blobs, but half of it is reclaimed within 24h
[17:25:10] unfortunately mariadb doesn't allow expiration based on minutes or seconds
[17:25:15] MySQL does
[17:25:20] and that's nice for parsercache
[17:25:21] do we write blobs to binlogs even if the insert came from the replication stream rather than a direct query?
[17:26:01] the slaves don't have binlogs, they have relay logs that get deleted after they are processed
[17:26:01] in other words, is binlog size consumption notably different in behaviour based on eqiad vs codfw being primary?
[17:26:07] nope
[17:26:10] it is the same
[17:26:18] marostegui: is the binlog expiry of 24h just for pc, or is it also 24h for the core cluster?
[17:26:34] pc
[17:26:37] core is 30 days
[17:26:51] I'm on my phone, need to go!
[17:26:54] bye!
[17:27:09] marostegui: so if replicas don't keep binlogs for the remote-received inserts, that means disk size will behave differently in eqiad now vs normally, right?
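A rough model of the binlog overhead marostegui describes above, with a made-up write rate. The purge granularity (one big daily removal vs a rolling trim) is exactly what gets questioned just below, so here it is simply modeled as a rolling hourly purge to show the steady-state overhead:

```python
# Sketch, with made-up numbers, of why the primary "pays twice" for inserted
# blobs: every insert lands in the table *and* in the binlog, and the binlog
# copy is only reclaimed once it is older than the 24h retention.
from collections import deque

RETENTION_HOURS = 24

table_bytes = 0
binlog_segments = deque()  # one entry per hour: bytes written in that hour

def hour_tick(inserted_bytes):
    """Advance the model by one hour and return total disk use in bytes."""
    global table_bytes
    table_bytes += inserted_bytes           # blobs stored in the table itself
    binlog_segments.append(inserted_bytes)  # same bytes logged again for replication
    while len(binlog_segments) > RETENTION_HOURS:
        binlog_segments.popleft()           # drop segments older than 24h
    return table_bytes + sum(binlog_segments)

# Assume ~8 GiB of parser cache inserts per hour (made-up rate).
for hour in range(48):
    total_disk = hour_tick(8 * 2**30)
print(f"retained binlogs: {sum(binlog_segments) / 2**30:.0f} GiB "
      f"of {total_disk / 2**30:.0f} GiB total")
```

At that made-up rate the retained binlogs level off around 192 GiB, in the same ballpark as the ~200G figure mentioned above.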
[17:27:29] since they now only get inserts in the replication stream, which are kept less long, it sounds like
[17:27:42] makes sense. Thanks
[17:28:06] I don't recall if we have log-slave-updates enabled, will need to check later
[17:28:07] that would explain why the disk size was more irregular before the switchover, since now it doesn't have the extra 200 GB file add/remove every day.
[17:28:12] if we do, it is the same disk space
[17:28:26] if we don't, then no :)
[17:28:58] If I understand correctly, these are now trimmed on the go, but like file rotation, so we remove a full day's worth of binlog all at once, right? Or is it managed on-the-go in smaller intervals?
[17:29:04] now*not*
[18:22:44] apergos: hey, if you feel like it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/702933 clearly not urgent
[18:26:13] ah yeah, I meant to do that this morning and then forgot about it
[18:26:16] so it will be tomorrow
[18:26:31] got sucked into the hell of docker images instead
[18:28:10] that sounds like the worst ring of hell :(
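Circling back to the Grafana change from earlier: a minimal sketch, with a made-up 1h sample step and made-up growth rate, of what the "disk use delta (since -24h)" panel computes. The real panel is a Grafana/Prometheus expression; this just spells out the arithmetic:

```python
# Each point of the panel is the current disk-used reading minus the reading
# taken 24 hours earlier. Step size and sample data below are made up.
STEP_HOURS = 1
WINDOW = 24 // STEP_HOURS            # samples per 24h

def delta_24h(samples):
    """samples: disk-used readings (GB) taken every STEP_HOURS hours."""
    return [samples[i] - samples[i - WINDOW] for i in range(WINDOW, len(samples))]

# Smooth growth of ~200 GB/day, sampled hourly over three days (made-up data),
# shows up as a flat ~+200 GB line.
growth_per_hour = 200 / 24
samples = [i * growth_per_hour for i in range(3 * 24)]
print(delta_24h(samples)[:3])        # ~[200.0, 200.0, 200.0]
```

Under that view, a host that also drops ~200 GB of binlogs in a single daily purge would show roughly the same average delta but a much more jagged curve, which would fit the "more irregular before the switchover" observation above.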