[03:38:18] https://grafana.wikimedia.org/d/000000106/parser-cache?viewPanel=6&orgId=1&var-contentModel=wikibase_item&var-dc=codfw&var-source_ops=codfw%20prometheus%2Fops&from=now-24h&to=now
[03:38:49] marostegui: morning. Something truly bizarre is happening 😳 with ParserCache ^
[03:40:20] My estimation is that it will go up to 30TB free by end of the month
[03:50:17] Amir1: I'm about to get in the pool for training. I will check once I get out, but could it just be because we added new hosts with larger disks and the total free space has increased?
[03:50:36] although I don't get why it started yesterday, as those hosts were added last week
[03:50:42] something being purged?
[03:53:36] marostegui: sorry, I should have been clearer. I did it
[03:53:44] 🤪🤪
[03:54:02] https://phabricator.wikimedia.org/T285987
[03:54:13] Enjoy the training!
[03:55:02] jesus you scared me!!!
[03:56:08] I was like: oh no pc again...
[03:56:12] Sorry, I told you I was about to do it 😁😁
[03:56:59] This is going to go up for a full month!!!!
[03:57:15] Until all expire
[03:57:22] <3
[03:57:39] maybe it will even fill in the pool size!
[07:07:00] jynus: I am going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/714704 if that's ok with you. No other stretch hosts are left in s4 eqiad
[07:09:07] ok
[08:29:43] Amir1: looking at the last 14 days, i don't think that current trend has anything to do with your changes
[08:32:12] (or at least no clear relation)
[08:35:28] marostegui: mariadb segfaulted last night on db2118 while `mysqlcheck --all-databases` was running :(
[08:36:16] anything on the MySQL log?
[08:36:53] `InnoDB: Failed to read file './metawiki/content.ibd' at offset 49350: Page read from tablespace is corrupted.`
[08:38:01] damn
[08:38:08] replication is stopped right?
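(Editor's note: a hedged way to triage the corruption reported above is to scan the server error log for InnoDB page-corruption messages and list the affected tablespace files. A minimal sketch; the log path, timestamp, and file contents below are fabricated stand-ins, not db2118's actual log:)

```shell
# Sketch only: /tmp/mysql-error.log stands in for the real MariaDB error log.
cat > /tmp/mysql-error.log <<'EOF'
2021-08-26  3:12:01 0 [ERROR] InnoDB: Failed to read file './metawiki/content.ibd' at offset 49350: Page read from tablespace is corrupted.
EOF
# List every tablespace file InnoDB reported as corrupted, deduplicated.
grep -o "'\./[^']*\.ibd'" /tmp/mysql-error.log | sort -u
```

If more than one `.ibd` file shows up, the per-table dump approach stops being attractive and a full reclone (as discussed below) is the safer call.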
[08:38:12] yep
[08:38:34] I would mysqldump that table, drop it (with replication stopped) and create it from the dump with replication disabled
[08:39:03] the full error: https://phabricator.wikimedia.org/P17086
[08:40:06] Pff
[08:40:27] I am sure there will be more tables like that, I would reclone it entirely
[08:42:06] i checked for h/w errors in `ipmi-sel`, just in case, but it's clean
[08:42:46] You can try the dump approach just for the content table, but I think it is safer to reclone it entirely and maybe even just from logical backups to be 100% sure it is fine
[08:43:03] (as it is a candidate)
[08:43:26] safer == gooder
[08:45:43] kormat: if you go for the snapshot reclone, I would still run a mysqlcheck just to be sure
[08:45:57] yeah absolutely
[08:46:05] currently looking at the process for a logical restore
[08:46:14] btw, very nice write-up on the parsercache investigation, very good job!
[08:46:21] thanks! 💜
[08:46:36] Amir1 (mostly) helped, a lot
[08:46:41] kormat: essentially drop all the databases, create them empty and reload the latest logical dump :)
[08:46:52] Amir1 helping? That's new!
[08:48:31] kormat: https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Recovering_a_logical_backup
[08:50:00] reminder: check the grants before pooling the host back in
[08:50:20] marostegui: that's what i was looking at. it doesn't say anything about dropping/recreating databases
[08:51:23] i'm now looking in horror at the "Enabling replication on the recovered server" section
[08:51:24] kormat: I don't remember whether mydumper dumps stuff with DROP TABLE or DROP DATABASE, it might
[08:51:47] you can check the schema creation files and if it does, then you don't need to drop anything yourself
[08:52:11] kormat: enabling replication is easy, as you'd have the coordinates in the metadata file mydumper generates
[08:52:26] marostegui: the doc talks about manually recreating the heartbeat table, and populating it by hand
[08:52:37] my face: 🙀
[08:52:54] if you don't drop the heartbeat database/table then you don't need to do that :)
[08:53:09] I think that doc is for a totally empty server
[08:54:05] if myloader will take care of the DROP TABLE, then you don't have to issue anything else yourself manually - I simply don't recall if it does or not by default
[08:54:23] jaime might
[08:57:18] this doesn't seem right. https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/templates/dbbackups/dbprov2002.cnf.erb says that it's responsible for s7, but there's no 'latest' dump, and in the archive the most recent is from >3 weeks ago
[08:58:54] By looking at the db, s7 logical lives in dbprov2003.codfw.wmnet
[08:58:56] for codfw
[08:59:06] sorry, dbprov2001.codfw.wmnet
[08:59:19] | 13061 | dump.s7.2021-08-24--00-00-02 | finished | db2100.codfw.wmnet:3317 | dbprov2001.codfw.wmnet | dump | s7 | 2021-08-24 00:00:02 | 2021-08-24 03:07:00 | 146162822416 |
[08:59:42] 🤯
[09:00:13] i can confirm there's a dump there
[09:01:01] but.. how the hell are the configs wrong, then
[09:01:05] jynus: hi :)
[09:01:50] ohh
[09:01:52] https://phabricator.wikimedia.org/rOPUP44fa656c4c0d5ab9fc3db03920f0311af8a177da ?
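(Editor's note: the replication coordinates mentioned above live in the `metadata` file mydumper writes next to the dump. A sketch of turning them into a `CHANGE MASTER` statement; the file contents and binlog name below are a fabricated sample, and a real statement would also need `MASTER_HOST` and credentials:)

```shell
# Fabricated sample of a mydumper metadata file (the real one sits in the dump dir).
cat > /tmp/metadata <<'EOF'
Started dump at: 2021-08-24 00:00:02
SHOW MASTER STATUS:
        Log: db2100-bin.002934
        Pos: 81130785
Finished dump at: 2021-08-24 03:07:00
EOF
log=$(awk -F': ' '/Log:/ {print $2; exit}' /tmp/metadata)
pos=$(awk -F': ' '/Pos:/ {print $2; exit}' /tmp/metadata)
# MASTER_HOST and credentials are deliberately omitted from this sketch.
echo "CHANGE MASTER TO MASTER_LOG_FILE='${log}', MASTER_LOG_POS=${pos};"
```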
[09:02:00] https://gerrit.wikimedia.org/r/c/operations/puppet/+/710981 - it's from earlier this month, but was merged yesterday
[09:02:01] modules/profile/templates/dbbackups/cumin2001.cnf.erb points to dbprov2002
[09:02:12] Hahaha I was about to point that out
[09:02:21] so a dump simply hasn't run since that change was merged
[09:02:26] yeah
[09:02:28] god that's confusing
[09:02:31] ok, thanks folks :)
[09:02:34] I think they run on Tuesdays
[09:03:10] WTB tool that given a section and a DC tells you where the backups are
[09:03:36] 🐎
[09:03:52] it doesn't help that git history shows the date a commit was last modified, _not_ when it was merged
[09:03:56] [err, I realise that may be a joke idiom too niche]
[09:05:05] kormat: disadvantage of not having a merge-based workflow :-/
[09:05:30] Emperor: wellll. merges are evil, and to be shunned where possible.
[09:07:06] !
[09:08:27] [I won't side-track you from fixing databases right now]
[09:10:39] * kormat quietly takes down the 'squash+rebase 4 life!!' campaign banners
[09:13:09] marostegui: the logical dumps contain things like `frwiktionary-schema-create.sql.gz` which has `create database frwiktionary`
[09:13:18] i don't see anything about dropping databases, though
[09:13:28] so i guess i'll drop the non-hb ones by hand first
[09:14:12] the wiki page for `recover-dump` doesn't specify either
[09:14:12] non-hb?
[09:14:19] non-heartbeat
[09:14:22] ah yeah
[09:14:28] didn't even realise i had abbreviated that :)
[09:14:29] only *wik should be needed
[09:14:50] if I want to update a CR from prod, is the correct workflow to change to the review/ branch generated by git review -d ; git rebase production and then git review -R again?
[09:14:52] you can try, it will fail fast if it cannot do it for you
[09:14:54] marostegui: plus everyone's favourite complication: centralauth
[09:15:03] oh yeah... good catch
[09:15:12] Emperor: what do you mean "from prod"?
[09:15:30] there's a centralauth dump too in those files, right?
[09:15:41] marostegui: could be!
[09:15:44] (yes, there is :)
[09:15:55] great!
[09:16:28] kormat: the "production" branch has been updated since I started working on my CR; I want e.g. tests run on my CR to reflect those changes
[09:16:40] Emperor: ah ok. `git rebase production $yourbranch; git review`
[09:17:10] `git review -d` is for when you want to check out someone else's CR
[09:17:19] you don't need it for your own, you already have the branch locally
[09:18:55] kormat: oh. https://www.mediawiki.org/wiki/Gerrit/Tutorial/tl;dr in the "reviewer asks you to make a change" bit says to run git review -d
[09:19:32] that's crazy. ignore the hell out of that.
[09:19:47] (I'm not sure how the Change-Id gets into the commit message if you only use your own branch?)
[09:19:58] Emperor: the change-id gets added by a local git hook
[09:20:14] if it doesn't, git review will refuse to push it
[09:20:36] Huh. TIL and all that
[09:30:53] sorry, I was on a break
[09:31:10] indeed, the canonical location of backups is written to the database
[09:31:27] we may have in the future a dashboard to make it easier to see, but it needs work
[09:31:33] I might edit that tl;dr page
[09:33:12] kormat: regarding the procedure to recover a dump, I am blocked on the grants and source-of-truth work owned by the DBAs (remember those things I usually complain about in meetings 0:-))
[09:39:55] jynus: i don't know what to tell you. we have so much work to do, and so many more important things, that grant management is waaay down the list. i can't see it getting tackled before next FY, at the earliest.
[09:40:04] in the meantime, zarcillo is the source of truth
[09:40:09] and likely to stay that way
[09:40:21] not complaining, just explaining why the recovery is not more automated :-)
[10:29:36] Amir1: Do you think flaggedrevs_stats can be dropped? https://phabricator.wikimedia.org/T289050#7311461 - not sure who I can ping about this
[10:31:43] marostegui: looks like they were abandoned in 2011
[10:31:46] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/FlaggedRevs/+/848ef073fa89036c40c440016a8092690ddcf56b%5E%21/schema/postgres/patch-flaggedrevs_statistics.sql
[10:34:09] RhinosF1: but still named at https://gerrit.wikimedia.org/g/mediawiki/extensions/FlaggedRevs/+/a3d50d189ba943a751c98e24b4933c9e2578a8f4/frontend/specialpages/reports/ValidationStatistics.php
[10:34:44] I love this codebase
[10:36:13] XDD
[10:38:10] marostegui: the more you look at that codebase, the more you find that wows me
[10:38:28] That's why I don't look!
[10:38:32] It's one where I keep reminding people that it wasn't ever designed
[10:40:34] marostegui: it doesn't exist on any Miraheze c4 wikis so nothing should blow up without it. I assume that it simply predates us.
[10:41:03] c4?
[10:41:12] you mean s4?
[10:41:46] It doesn't exist everywhere, ie: just a bunch of s3 wikis, but definitely not in the 900... it is not on frwiki or jawiki (only ruwiki on s6), and it is not on s4 (commons) or s8 (wikidata)
[10:42:20] marostegui: no, I checked Miraheze wikis too, as we run flaggedrevs on some
[10:42:44] I guess I can just rename the table on a live enwiki slave and see if we notice something
[10:43:28] That sounds sensible
[10:45:20] It doesn't have writes at all
[10:45:37] So the only thing reading it might be some weird cronjobs just querying it somehow
[11:13:40] I will do that on Monday I think, as I am not working tomorrow
[13:05:26] sorry I was afk
[13:05:27] reading up
[13:05:46] Amir1: don't apologise for being afk. apologise for coming back.
[13:05:53] (<3)
[13:06:15] kormat: regarding "looking at the last 14 days, i don't think that current trend has anything to do with your changes". Yes. It's the cleaner, but look at the derivative, especially when it got deployed
[13:07:31] marostegui: it sounds right, I can double check the table and let you know.
[13:07:39] kormat: :((((
[13:07:45] * kormat giggles
[13:09:02] kormat: back to PC. Basically what I did was that I stopped a lot of crap being introduced. I didn't clean anything, so when things expire, the cleaner cleans way more.
[13:09:04] Amir1: I have also pinged aaron as he was the contact point on the main task for dropping tables
[13:09:50] Amir1: the time doesn't line up
[13:09:57] the cleaner starts at 01:00
[13:10:08] how long does it take?
[13:10:17] say 9-10h
[13:10:24] i can check if you like
[13:10:41] ah, it can also be binlog
[13:10:52] ah, less. it's currently taking about 6h
[13:11:12] Amir1: ahh. ok, that might make more sense
[13:12:04] with the huge amount of binlogs we generate on parsercache, any small change in the amount of things we write can make a huge difference on that too
[13:12:18] in a good way, that is
[13:12:50] my sorta worry is that if I reduce writes and the size of PC, does it make the table smaller? It shouldn't, unless we optimize it. Right?
[13:13:44] yeah
[13:13:44] as an example, pc2011 currently has 243G of binlogs
[13:14:42] and that's only keeping 24h of binlogs
[13:18:31] so what should we do to get back some space? Do you think it makes sense to optimize the tables?
[13:18:45] (after twenty days when these are fully cleaned up)
[13:20:04] optimizing the tables is an intensive operation
[13:20:20] if you just want to see how much less space is being used, we could do it on a replica
[13:21:10] Amir1: I don't think we need it for now, it is a very time-consuming task
[13:21:12] (we went through a cycle of optimizing all PC tables back in june. it _sucked_. and caused terrible cache hit rates)
[13:21:39] Luckily with the new hosts we have lots more space
[13:21:54] We could optimize eqiad if we wanted, and then codfw once we are back in eqiad as primary
[13:22:07] yeah, okay
[13:22:11] whatever you prefer
[13:22:48] I suggest doing it as late as possible so these cleanups can have some effect, otherwise let it happen "naturally"
[13:23:01] nature is healing *cries*
[13:23:11] 🥀
[13:23:42] IWBNI puppet-lint noticed syntax errors :-/
[13:23:46] I have some ideas for the future, especially regarding commons. I feel it would be around 20% more cleanups, but let's wait until I have proper time for it
[13:24:48] IWBNI?
[13:24:55] it would be nice if
[13:24:55] It Would Be Nice If (sorry)
[13:24:58] i googled it
[13:25:03] we need an Emperor-translator
[13:25:05] jeez
[13:25:19] <-- overly-fond of acronyms, it seems
[13:25:30] haha that reminds me of this
[13:25:33] * marostegui searches for it
[13:25:37] i'm surprised that wasn't `OFOAIS` :P
[13:25:48] https://gist.github.com/klaaspieter/12cd68f54bb71a3940eae5cdd4ea1764
[13:25:48] 🐟
[13:27:06] jesus what is OFOAIS? XD
[13:27:33] I think we should switch to Spanish as the channel language
[13:27:52] marostegui: kormat's "joke": Overly-Fond Of Acronyms, It Seems
[13:27:54] ;p
[13:28:04] * marostegui quits
[13:28:11] \o/ 🎉 \o/
[13:28:50] kormat: I guess you are happy that you need to complete https://phabricator.wikimedia.org/T167973 then?
[13:29:01] touché
[13:29:15] marostegui: i figure you're the last roadblock to just switching everything to sqlite
[13:29:48] Yeah, me and my family name: mediawiki
[13:29:57] kormat: making friends and influencing people, I see
[13:33:18] marostegui: hahah
[13:33:22] Emperor: that's me!
[13:39:05] I just noticed db2089:3315 having events disabled, and a query running for a day
[13:39:07] Nice...
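(Editor's note: the events check behind "I will double check all the events again" amounts to asking each host's `information_schema.events` for anything not ENABLED. A sketch with a stubbed client function so the loop runs standalone; the host list, event name, and output format are all made up for illustration:)

```shell
# Stub standing in for a real per-host query such as:
#   SELECT event_schema, event_name, status
#   FROM information_schema.events WHERE status <> 'ENABLED';
# Hostnames and the event name below are illustrative only.
mysql_events() {
  if [ "$1" = "db2089:3315" ]; then
    printf 'ops\tsome_watchdog_event\tDISABLED\n'
  fi
}
for host in db2088:3315 db2089:3315; do
  out=$(mysql_events "$host")
  [ -n "$out" ] && echo "$host has non-enabled events: $out"
done
```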
[13:39:17] I am going to get that fixed and I will double check all the events again
[13:39:43] which reminds me of https://phabricator.wikimedia.org/T254738
[13:41:13] done and query killed
[13:42:43] Thinking mostly about backup sources - should I give a bump in priority to setting up regular mysqlcheck/compare there? Thoughts?
[13:43:49] Not following you
[13:44:13] worried about the data corruption you found recently
[13:44:25] I think a compare wouldn't catch it
[13:44:41] As it looks related to the innodb internal tablespace
[13:44:48] didn't mysqlcheck catch it in this case? :-)
[13:44:57] maybe only the check for now?
[13:44:59] mysqlcheck made it crash
[13:45:02] he
[13:45:23] It definitely doesn't hurt to have a compare running of course, but I don't think it would have caught this particular issue
[13:45:38] so maybe not worth it for now, thanks, that's exactly the feedback I wanted
[13:45:39] I remember we had a conversation about running a recurrent mysqlcheck on backup sources, but it was a lot harder than we thought
[13:45:43] yeah
[13:46:08] so something to do more long term, for all hosts, as a larger project, I guess
[13:46:16] yeah definitel
[13:46:18] definitely
[13:46:22] so many things to do and so little time!
[13:46:26] :-/
[13:50:20] I have reviewed all hosts in production and it was just 2089:3315 without them enabled
[13:57:23] I am going to go offline, I am not working tomorrow, but I will take care of starting replication on db1138 (eqiad master) once the tables check finishes. I think it will finish tonight or sometime during the EU night.
[13:58:52] say hi to L for me!
[13:59:38] btw flaggedtemplates clean up is still ongoing for ruwiki, arwiki and dewiki (for days now...)
[14:10:07] Amir1: every time i read 'r-u-w-i-k-i' i tense up, but then relax when it's not followed by 'n-e-w-s'
[14:10:36] lol
[14:10:56] there are more wikis to add to your collection, just wait a couple of years :D
[14:42:39] hello! I'm currently looking at some improvements to our cassandra cluster and just wondering if there's prior art on this - do ye manage mariadb grants programmatically? How are they stored?
[14:44:19] he he
[14:44:24] hnowlan: we do not! :)
[14:46:45] three days since the ruwiki clean up started and so far 300 million rows have been deleted 🤦
[14:47:00] (out of 1.9B)
[14:47:44] hnowlan: wrong question! :D
[14:51:23] kormat: grand, thanks!
[14:59:38] ugh, puppet.
[14:59:53] * Emperor tries to resist the temptation of a stupid hack to work around puppet being annoying
[15:08:51] FYI there are a lot of duplicated definitions in Icinga wrt DBs; only one of each will effectively be monitored
[15:08:54] they are all "Duplicate definition found for service 'MariaDB sustained replica lag'"
[15:09:12] could it be related to multiinstance, which might need the instance name in the alert description?
[15:10:49] from a first look, seems so
[15:10:50] # --PUPPET_NAME-- db2138 mariadb-prolonged-lag-s2
[15:10:51] # --PUPPET_NAME-- db2138 mariadb-prolonged-lag-s4
[15:14:04] volans: could i convince you to file a task? i'll reluctantly look into it tomorrow
[15:16:58] kormat: https://gerrit.wikimedia.org/r/c/operations/puppet/+/715043
[15:19:23] I'll leave it to you for any potential side effects I might not be aware of
[16:22:43] kormat: I've sent you another pile of brain-dump about my ongoing puppet woes. I know it's COB on Thursday, so don't feel you have to attend to it tomorrow.
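(Editor's note: the duplicate-definition problem above can be reproduced with plain text tools: two instances of a multiinstance host emitting the identical service description collapse to one check, and appending the section, as suggested, makes them unique. A sketch with fabricated data; the description strings are illustrative, not the actual puppet output:)

```shell
# Two instances on db2138 emitting the same description: Icinga keeps only one.
cat > /tmp/services.txt <<'EOF'
db2138 MariaDB sustained replica lag
db2138 MariaDB sustained replica lag
EOF
sort /tmp/services.txt | uniq -d     # prints the duplicated description

# With the section appended per instance, the descriptions no longer collide.
cat > /tmp/services-fixed.txt <<'EOF'
db2138 MariaDB sustained replica lag s2
db2138 MariaDB sustained replica lag s4
EOF
sort /tmp/services-fixed.txt | uniq -d   # prints nothing: no duplicates
```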