[05:17:15] DBA: Please optimize logging table in dewiki - https://phabricator.wikimedia.org/T287344 (Ladsgroup)
[05:17:47] \o/
[05:21:11] just finished :D
[05:37:42] excellent, I will do eqiad
[05:42:57] random blog I found https://engineering.fb.com/2021/07/22/data-infrastructure/mysql/
[05:48:14] haha yeah, kinda well known
[05:48:53] DBA: Please optimize logging table in dewiki - https://phabricator.wikimedia.org/T287344 (Marostegui) p:Triage→Medium
[05:49:48] DBA: Please optimize logging table in dewiki - https://phabricator.wikimedia.org/T287344 (Marostegui)
[05:49:57] oh the image table clean up is 40% done now
[06:09:24] ^ 300GB cleaned (uncompressed)
[07:07:41] DBA: Please optimize logging table in dewiki - https://phabricator.wikimedia.org/T287344 (Marostegui)
[07:09:05] Amir1: <3 <3 <3 <3
[07:09:29] marostegui: so much left :(((
[07:09:51] btw, let me know on how much dewiki size change if possible
[07:10:07] Yeah, I am running it now
[07:10:12] it is 52G on eqiad master...so we'll see
[07:11:37] I'm doing plwiki (s2) atm but I'm not sure it's worth optimizing once done, it's small, one third of dewiki https://phabricator.wikimedia.org/P16839)
[07:11:47] unless s2 is under stress
[07:11:53] yeah, s2 should be ok
[07:12:00] let me see how big it is
[07:12:12] yeah, it is 12G
[07:20:31] Amir1: from 52G to 14G XD
[07:20:38] ^^
[07:20:50] DBA: Please optimize logging table in dewiki - https://phabricator.wikimedia.org/T287344 (Marostegui) It went from 52G to 14G
[07:20:53] very nice!
[07:20:59] I would have demanded bribe if I knew sooner
[07:21:19] Too late, I am now going to remove all your access
[07:21:26] :'(
[07:21:36] s5 is not that under load though
[07:21:56] yeah, but it is nice to get such a clean up
[07:22:03] especially on eqiad, with just one command
[07:22:32] yeah, flagged revs is one of the worst things we have in production, really needs an overhaul
[07:22:56] dropped around 10K lines of code from it already but it's REALLY bad
[07:34:57] DBA: Please optimize logging table in dewiki - https://phabricator.wikimedia.org/T287344 (Marostegui)
[07:35:26] DBA: Please optimize logging table in dewiki - https://phabricator.wikimedia.org/T287344 (Marostegui) Waiting for the switch back to do codfw.
[07:56:42] fs usage of s5 is adorable https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&from=now-6h&orgId=1&refresh=5m&to=now&var-server=db1130&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql
[07:59:16] I thought s3's inode would be much worse (that's why we started creating wikis in s5) but I might be missing something: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&from=now-6h&orgId=1&refresh=5m&to=now&var-server=db1157&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql
[08:02:42] Amir1: it is more the fact that for instance taking a backup from s3 is insane, or just running a mysql_upgrade after an upgrade, with the amount of tables it has in total, it makes it veeeery hard to deal with
[08:02:50] But not all the files are opened at the same time necessarily
[08:03:22] aha, I see
[08:03:36] I thought we are reaching inode limit
[08:04:05] We might have had crashes with it in the past, I don't remember exactly, but it is not a common issue
[08:08:27] Amir1: what you using to delete flagged revs logs?
Although I'm sure it'll be useless until 1.37
[08:08:52] manual sql query :D
[08:09:36] specially since it's not using the right index and I have to give it timestamp so it doesn't scan millions of rows
[08:09:42] Amir1: can you give me said query
[08:09:54] `delete from logging where log_type = 'review' and log_action = 'approve-a' and log_timestamp like '2018%' limit 10000`
[08:13:52] Ty
[08:14:09] There is now a task for documenting 1.37 work done by you
[08:14:11] https://phabricator.miraheze.org/T7696
[08:15:05] oh that list will get waaaaay longer
[08:15:11] I know
[08:15:14] :D
[08:15:42] At least I have the script wrapper
[08:15:50] And hopefully upgrade cookbooks
[08:15:55] So less stress
[08:16:33] Just don't let it come out Christmas week
[08:16:49] I mean in December tbh
[08:16:54] Because testing takes time
[08:17:44] https://usercontent.irccloud-cdn.com/file/cGMBrFpQ/image.png
[08:17:47] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=codfw&var-group=core&var-shard=s2&var-role=All
[08:36:00] Data-Persistence-Backup, SRE: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (LSobanski)
[09:04:10] o/ I have trouble accessing parsercache dbs (need it for T285987). I tried so many things and nothing worked
[09:04:11] T285987: Do not generate full html parser output at the end of Wikibase edit requests - https://phabricator.wikimedia.org/T285987
[09:04:37] https://www.irccloud.com/pastebin/wiEl5ZOM/
[09:04:38] Amir1: what do you mean accessing?
[09:04:49] to query PC
[09:04:54] let me check
[09:05:10] I copied the password from private settings
[09:05:41] Can you try from mwmaint1002?
[09:05:45] for whatever reason, I can't make mediawiki (mysql.php) connect to it
[09:05:49] I think that's the issue, we do not have grants for mwmaint2002
[09:05:49] sure
[09:06:25] I should fix that anyways, but just to confirm
[09:06:34] sure
[09:06:44] mm actually we do have 10.192 granted
[09:07:44] I need to make sure the fingerprint is correct (mwmaint1002 got reimaged)
[09:08:29] yeah, this is quite old: https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/mwmaint1002.eqiad.wmnet
[09:08:37] moritzm: ^
[09:09:22] someone gave an updated list in deploy1001 but I keep forgetting where it is
[09:09:34] we have fingerprints published automatically to https://config-master.wikimedia.org/
[09:09:40] The hashes for the wikiuser pass are the same for 10.64. and 10.192
[09:09:51] majavah: oh sweet
[09:10:06] wohaaa nice
[09:10:33] although that's from 22nd july
[09:10:36] when was the host reimaged?
[09:11:44] and an auto upgrade script that also deals with CNAMEs: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/wmf-sre-laptop/+/refs/heads/master/scripts/wmf-update-known-hosts-production
[09:12:27] ugh, the system is giving me SHA256, none of these are sha256
[09:12:40] marostegui: mutante reimaged it last week or so
[09:13:14] no, it's ECDSA
[09:24:08] Amir1: it works for me from 2002
[09:24:30] https://phabricator.wikimedia.org/P16890
[09:24:31] marostegui: what is the command?
I'm sure I'm doing it wrong
[09:25:37] hmm, maybe I'm copying the wrong password
[09:25:38] Amir1: https://phabricator.wikimedia.org/P16890#86511
[09:26:39] ahaaa
[09:26:55] found out what was wrong, I was using the password for wikiadmin instead of wikiuser
[09:27:04] I'm inside now
[09:27:14] * marostegui revokes all his credentials so he's got no more problems
[09:27:53] drop `drop table pc246;` what can go wrong
[09:28:00] haha
[15:59:27] still cleaning plwiki's logging table :(((
[16:05:14] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (MoritzMuehlenhoff)
[16:06:21] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (MoritzMuehlenhoff)
[16:14:51] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Bstorm)
[16:17:06] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Bstorm) Cloud team has decided we have too much in this row, and since breakage is possible if we freeze the cloud intentionally, we are going...
[16:18:45] Data-Persistence-Backup, SRE, bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (jcrespo) a:jcrespo
[17:23:48] Data-Persistence-Backup, SRE, bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (jcrespo) p:Triage→High
[17:25:13] Data-Persistence-Backup, database-backups, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['dbprov1002.eqiad.wmnet'] ` The l...
[17:28:30] I updated https://wikitech.wikimedia.org/w/index.php?title=Help%3ASSH_Fingerprints%2Fmwmaint1002.eqiad.wmnet&type=revision&diff=1919694&oldid=1804776
[17:30:00] lego, maybe you wanted to write in other channel?
[17:30:53] no, I was replying to m.arostegui from earlier who complained the fingerprints were out of date
[17:31:03] oh, sorry, I didn't have the context
[17:31:09] but I should probably post there too :)
[17:35:25] db2147 is having a weird pattern, from the graph it looks as if it is leaking both memory and disk space, but from processlist and SHOW ENGINE INNODB STATUS doesn't look like anything obvious
[17:37:25] non-linear disk space utilization: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&var-server=db2147&var-datasource=thanos&var-cluster=mysql&from=1619545015924&to=1627321015924
[17:49:34] Data-Persistence-Backup, database-backups, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov1002.eqiad.wmnet'] ` and were **ALL** successful.
[17:54:22] jynus: on db2147 it seems the mysql process, thread ID 2151
[17:54:47] is the one that wrote the most
[18:00:01] mmmh, no THREAD_OS_ID in performance_schema.threads, that would have been too easy :)
[18:03:02] volans, where did you get the thread #?
[18:03:15] there is sometimes confusion between innodb ids and mysql ids
[18:03:30] that's OS thread ID from iotop
[18:03:53] and I was looking to match it to mysql thread ID to see what is doing
[18:04:03] ah, if was by "number of bytes written" probably a false detection
[18:04:16] it might be the replica thread ofc
[18:04:18] it is expected a single thread to write all data on a replica (the replication sql thread) :-)
[18:04:27] yeah
[18:04:35] not that it couldn't be that, ofc
[18:05:15] I didn't do any deep research, honestly, I was finding something to do while I waited for a reimage
[18:05:18] and saw the alert
[18:05:29] and just wanted to confirm it as a real issue
[18:05:50] but thanks for the help, maybe someone with more time can dig deeper
[18:06:50] of if it is an internal innodb thread it doesn't give us much clues (why is there things not being released -eg.
temp data)
[18:06:53] **or
[18:07:47] right
[18:08:27] sometimes debugging not worth it, and just needs a restart, who knows :-(
[18:09:02] eheheh
[18:09:03] could be
[18:09:20] BTW, thanks for reimage script
[18:09:36] it worked nicely for dbprov1002, which is always nail-biting
[18:09:50] I feel that I only complain to you when it fails :-D
[18:09:51] yw, glad it worked :)
[18:10:02] I wanted to congratulate you when it worked
[18:10:11] ahaah that's true, I usually only hear when it fails :)
[18:10:14] thanks :D
[18:10:14] for once in a lifetime
[18:10:19] /jk
[18:12:03] :)
[18:22:38] Data-Persistence-Backup, database-backups, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (jcrespo)
[18:23:15] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (jcrespo)
[18:24:46] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (jcrespo) dbprov1002 has been successfully reimaged to buster, with no issues. I cannot discard I could had made some mistakes on backup reorganization, but those should not affect the following steps-...
[22:48:51] Data-Persistence-Backup, SRE, bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (TJones) Any idea when someone might have time to look at this? I'm trying to avoid having to recreate code that I had on mwmaint1002, but I have another ticket that's...
[23:50:43] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Legoktm) p:Triage→Medium
[23:53:15] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Legoktm) p:Triage→Medium
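
Note on the manual cleanup query pasted at 08:09:54: it removes at most 10000 rows per run, so it has to be repeated until nothing matches. A rough sketch of that loop follows. It is not the script anyone in the channel actually used. Everything in it is an assumption for illustration: it runs against an in-memory SQLite stand-in rather than MariaDB, the function names and demo rows are made up, and the `log_id` subquery replaces the plain `DELETE ... LIMIT` form because stock SQLite builds often lack that syntax.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory stand-in for the logging table (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE logging ("
        " log_id INTEGER PRIMARY KEY,"
        " log_type TEXT, log_action TEXT, log_timestamp TEXT)"
    )
    rows = [("review", "approve-a", "20180101000000")] * 25
    rows += [("review", "approve-a", "20190101000000")] * 5
    rows += [("delete", "delete", "20180101000000")] * 3
    conn.executemany(
        "INSERT INTO logging (log_type, log_action, log_timestamp)"
        " VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def delete_in_batches(conn, year_prefix, batch_size=10000):
    """Delete matching review/approve-a rows batch by batch; return the total removed."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM logging WHERE log_id IN ("
            " SELECT log_id FROM logging"
            " WHERE log_type = 'review' AND log_action = 'approve-a'"
            " AND log_timestamp LIKE ? LIMIT ?)",
            (year_prefix + "%", batch_size),
        )
        conn.commit()  # commit per batch so each transaction stays small
        if cur.rowcount == 0:
            break  # nothing left to delete for this prefix
        total += cur.rowcount
    return total


if __name__ == "__main__":
    demo = make_demo_db()
    print(delete_in_batches(demo, "2018", batch_size=10))
```

Committing after every batch keeps each transaction short, which is presumably why the original query carries a `limit`. On a real replicated cluster one would likely also pause between batches and watch replication lag, which this sketch does not attempt.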