[05:17:15] <wikibugs>	 DBA: Please optimize logging table in dewiki - https://phabricator.wikimedia.org/T287344 (Ladsgroup)
[05:17:47] <marostegui>	 \o/
[05:21:11] <Amir1>	 just finished :D
[05:37:42] <marostegui>	 excellent, I will do eqiad
[05:42:57] <Amir1>	 random blog I found https://engineering.fb.com/2021/07/22/data-infrastructure/mysql/
[05:48:14] <marostegui>	 haha yeah, kinda well known
[05:48:53] <wikibugs>	 DBA: Please optimize logging table in dewiki - https://phabricator.wikimedia.org/T287344 (Marostegui) p:Triage→Medium
[05:49:48] <wikibugs>	 DBA: Please optimize logging table in dewiki - https://phabricator.wikimedia.org/T287344 (Marostegui)
[05:49:57] <Amir1>	 oh the image table clean up is 40% done now
[06:09:24] <Amir1>	 ^ 300GB cleaned (uncompressed)
[07:07:41] <wikibugs>	 DBA: Please optimize logging table in dewiki - https://phabricator.wikimedia.org/T287344 (Marostegui)
[07:09:05] <marostegui>	 Amir1: <3 <3 <3 <3
[07:09:29] <Amir1>	 marostegui: so much left :(((
[07:09:51] <Amir1>	 btw, let me know how much dewiki's size changes, if possible
[07:10:07] <marostegui>	 Yeah, I am running it now
[07:10:12] <marostegui>	 it is 52G on eqiad master...so we'll see
[07:11:37] <Amir1>	 I'm doing plwiki (s2) atm but I'm not sure it's worth optimizing once done, it's small, one third of dewiki (https://phabricator.wikimedia.org/P16839)
[07:11:47] <Amir1>	 unless s2 is under stress
[07:11:53] <marostegui>	 yeah, s2 should be ok
[07:12:00] <marostegui>	 let me see how big it is
[07:12:12] <marostegui>	 yeah, it is 12G
[07:20:31] <marostegui>	 Amir1: from 52G to 14G XD
[07:20:38] <Amir1>	 ^^
[07:20:50] <wikibugs>	 DBA: Please optimize logging table in dewiki - https://phabricator.wikimedia.org/T287344 (Marostegui) It went from 52G to 14G
[07:20:53] <marostegui>	 very nice!
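To put the 52G to 14G figure in proportion, here is a minimal arithmetic sketch. The helper name `reduction` is made up for illustration; the only inputs are the two sizes quoted above.

```python
def reduction(before_gb, after_gb):
    """Return (GB reclaimed, percent of the original size reclaimed)."""
    saved = before_gb - after_gb
    return saved, 100.0 * saved / before_gb


# Sizes quoted in the channel for the dewiki logging table on the eqiad master.
saved, pct = reduction(52, 14)
print(f"reclaimed {saved}G, about {pct:.1f}% of the original size")
```

By the same arithmetic, the table ended up at a bit over a quarter of its previous on-disk size.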
[07:20:59] <Amir1>	 I would have demanded bribe if I knew sooner
[07:21:19] <marostegui>	 Too late, I am now going to remove all your access 
[07:21:26] <Amir1>	 :'(
[07:21:36] <Amir1>	 s5 is not that under load though 
[07:21:56] <marostegui>	 yeah, but it is nice to get such a clean up
[07:22:03] <marostegui>	 especially on eqiad, with just one command
[07:22:32] <Amir1>	 yeah, flagged revs is one of the worst things we have in production, really needs an overhaul 
[07:22:56] <Amir1>	 dropped around 10K lines of code from it already but it's REALLY bad
[07:34:57] <wikibugs>	 DBA: Please optimize logging table in dewiki - https://phabricator.wikimedia.org/T287344 (Marostegui)
[07:35:26] <wikibugs>	 DBA: Please optimize logging table in dewiki - https://phabricator.wikimedia.org/T287344 (Marostegui) Waiting for the switch back to do codfw.
[07:56:42] <Amir1>	 fs usage of s5 is adorable https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&from=now-6h&orgId=1&refresh=5m&to=now&var-server=db1130&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql
[07:59:16] <Amir1>	 I thought s3's inode would be much worse (that's why we started creating wikis in s5) but I might be missing something: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&from=now-6h&orgId=1&refresh=5m&to=now&var-server=db1157&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql
[08:02:42] <marostegui>	 Amir1: it is more the fact that for instance taking a backup from s3 is insane, or just running a mysql_upgrade after an upgrade, with the amount of tables it has in total, it makes it veeeery hard to deal with
[08:02:50] <marostegui>	 But not all the files are opened at the same time necessarily 
[08:03:22] <Amir1>	 aha, I see
[08:03:36] <Amir1>	 I thought we are reaching inode limit
[08:04:05] <marostegui>	 We might have had crashes with it in the past, I don't remember exactly, but it is not a common issue
[08:08:27] <RhinosF1>	 Amir1: what you using to delete flagged revs logs? Although I'm sure it'll be useless until 1.37
[08:08:52] <Amir1>	 manual sql query :D
[08:09:36] <Amir1>	 especially since it's not using the right index and I have to give it a timestamp so it doesn't scan millions of rows
[08:09:42] <RhinosF1>	 Amir1: can you give me said query
[08:09:54] <Amir1>	 `delete from logging where log_type = 'review' and log_action = 'approve-a' and log_timestamp like '2018%' limit 10000`
[08:13:52] <RhinosF1>	 Ty
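The query above removes at most 10000 rows per run, so it has to be repeated until nothing matches. A minimal sketch of that loop follows, with loud assumptions: Python's built-in sqlite3 stands in for the production MariaDB, the helper name `delete_in_batches` and the toy rows are invented for illustration, and the `like '2018%'` filter is written as an explicit range, which should select the same rows if timestamps are 14-digit `YYYYMMDDHHMMSS` strings. SQLite builds do not always support `DELETE ... LIMIT`, hence the subquery on `log_id`.

```python
import sqlite3


def delete_in_batches(conn, year, batch_size=10000):
    """Delete matching 'review' log rows for one year in bounded batches.

    Returns the total number of rows deleted.
    """
    lo = f"{year}0101000000"       # first timestamp of the year
    hi = f"{year + 1}0101000000"   # first timestamp of the next year
    total = 0
    while True:
        cur = conn.execute(
            """
            DELETE FROM logging
            WHERE log_id IN (
                SELECT log_id FROM logging
                WHERE log_type = 'review'
                  AND log_action = 'approve-a'
                  AND log_timestamp >= ? AND log_timestamp < ?
                LIMIT ?
            )
            """,
            (lo, hi, batch_size),
        )
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


# Toy data: 25 rows that should go, 2 that should stay.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logging (log_id INTEGER PRIMARY KEY, log_type TEXT,"
    " log_action TEXT, log_timestamp TEXT)"
)
rows = [("review", "approve-a", f"2018010100{i:04d}") for i in range(25)]
rows += [("review", "approve-a", "20190101000000"),
         ("delete", "delete", "20180601000000")]
conn.executemany(
    "INSERT INTO logging (log_type, log_action, log_timestamp) VALUES (?, ?, ?)",
    rows,
)
print(delete_in_batches(conn, 2018, batch_size=10))
print(conn.execute("SELECT COUNT(*) FROM logging").fetchone()[0])
```

Committing after every batch keeps each transaction short; on a real replicated setup one would presumably also pause between batches, which this sketch leaves out.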
[08:14:09] <RhinosF1>	 There is now a task for documenting 1.37 work done by you
[08:14:11] <RhinosF1>	 https://phabricator.miraheze.org/T7696
[08:15:05] <Amir1>	 oh that list will get waaaaay longer
[08:15:11] <RhinosF1>	 I know
[08:15:14] <Amir1>	 :D
[08:15:42] <RhinosF1>	 At least I have the script wrapper
[08:15:50] <RhinosF1>	 And hopefully upgrade cookbooks
[08:15:55] <RhinosF1>	 So less stress
[08:16:33] <RhinosF1>	 Just don't let it come out Christmas week
[08:16:49] <RhinosF1>	 I mean in December tbh
[08:16:54] <RhinosF1>	 Because testing takes time
[08:17:44] <Amir1>	 https://usercontent.irccloud-cdn.com/file/cGMBrFpQ/image.png
[08:17:47] <Amir1>	 https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=codfw&var-group=core&var-shard=s2&var-role=All
[08:36:00] <wikibugs>	 Data-Persistence-Backup, SRE: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (LSobanski)
[09:04:10] <Amir1>	 o/ I have trouble accessing parsercache dbs (need it for T285987). I tried so many things and nothing worked
[09:04:11] <stashbot>	 T285987: Do not generate full html parser output at the end of Wikibase edit requests - https://phabricator.wikimedia.org/T285987
[09:04:37] <Amir1>	 https://www.irccloud.com/pastebin/wiEl5ZOM/
[09:04:38] <marostegui>	 Amir1: what do you mean accessing?
[09:04:49] <Amir1>	 to query PC
[09:04:54] <marostegui>	 let me check
[09:05:10] <Amir1>	 I copied the password from private settings 
[09:05:41] <marostegui>	 Can you try from mwmaint1002?
[09:05:45] <Amir1>	 for whatever reason, I can't make mediawiki (mysql.php) connect to it
[09:05:49] <marostegui>	 I think that's the issue, we do not have grants for mwmaint2002
[09:05:49] <Amir1>	 sure
[09:06:25] <marostegui>	 I should fix that anyways, but just to confirm
[09:06:34] <Amir1>	 sure
[09:06:44] <marostegui>	 mm actually we do have 10.192 granted
[09:07:44] <Amir1>	 I need to make sure the fingerprint is correct (mwmaint1002 got reimaged)
[09:08:29] <marostegui>	 yeah, this is quite old: https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/mwmaint1002.eqiad.wmnet
[09:08:37] <marostegui>	 moritzm: ^
[09:09:22] <Amir1>	 someone gave an updated list in deploy1001 but I keep forgetting where it is
[09:09:34] <majavah>	 we have fingerprints published automatically to https://config-master.wikimedia.org/
[09:09:40] <marostegui>	 The hashes for the wikiuser pass are the same for 10.64. and 10.192
[09:09:51] <marostegui>	 majavah: oh sweet
[09:10:06] <Amir1>	 wohaaa nice
[09:10:33] <marostegui>	 although that's from 22nd july
[09:10:36] <marostegui>	 when was the host reimaged?
[09:11:44] <majavah>	 and an auto upgrade script that also deals with CNAMEs: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/wmf-sre-laptop/+/refs/heads/master/scripts/wmf-update-known-hosts-production
[09:12:27] <Amir1>	 ugh, the system is giving me SHA256, none of these are sha256
[09:12:40] <moritzm>	 marostegui: mutante reimaged it last week or so
[09:13:14] <Amir1>	 no, it's ECDSA
[09:24:08] <marostegui>	 Amir1:  it works for me from 2002
[09:24:30] <marostegui>	 https://phabricator.wikimedia.org/P16890
[09:24:31] <Amir1>	 marostegui: what is the command? I'm sure I'm doing it wrong
[09:25:37] <Amir1>	 hmm, maybe I'm copying the wrong password
[09:25:38] <marostegui>	 Amir1: https://phabricator.wikimedia.org/P16890#86511
[09:26:39] <Amir1>	 ahaaa
[09:26:55] <Amir1>	 found out what was wrong, I was using the password for wikiadmin instead of wikiuser
[09:27:04] <Amir1>	 I'm inside now
[09:27:14] * marostegui revokes all his credentials so he's got no more problems
[09:27:53] <Amir1>	 `drop table pc246;` what can go wrong
[09:28:00] <marostegui>	 haha
[15:59:27] <Amir1>	 still cleaning plwiki's logging table :(((
[16:05:14] <wikibugs>	 DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (MoritzMuehlenhoff)
[16:06:21] <wikibugs>	 DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (MoritzMuehlenhoff)
[16:14:51] <wikibugs>	 DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Bstorm)
[16:17:06] <wikibugs>	 DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Bstorm) Cloud team has decided we have too much in this row, and since breakage is possible if we freeze the cloud intentionally, we are going...
[16:18:45] <wikibugs>	 Data-Persistence-Backup, SRE, bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (jcrespo) a:jcrespo
[17:23:48] <wikibugs>	 Data-Persistence-Backup, SRE, bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (jcrespo) p:Triage→High
[17:25:13] <wikibugs>	 Data-Persistence-Backup, database-backups, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['dbprov1002.eqiad.wmnet'] ` The l...
[17:28:30] <legoktm>	 I updated https://wikitech.wikimedia.org/w/index.php?title=Help%3ASSH_Fingerprints%2Fmwmaint1002.eqiad.wmnet&type=revision&diff=1919694&oldid=1804776
[17:30:00] <jynus>	 lego, maybe you wanted to write in other channel?
[17:30:53] <legoktm>	 no, I was replying to m.arostegui from earlier who complained the fingerprints were out of date
[17:31:03] <jynus>	 oh, sorry, I didn't have the context
[17:31:09] <legoktm>	 but I should probably post there too :)
[17:35:25] <jynus>	 db2147 is having a weird pattern, from the graph it looks as if it is leaking both memory and disk space, but from processlist and SHOW ENGINE INNODB STATUS doesn't look like anything obvious
[17:37:25] <jynus>	 non-linear disk space utilization: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&var-server=db2147&var-datasource=thanos&var-cluster=mysql&from=1619545015924&to=1627321015924
[17:49:34] <wikibugs>	 Data-Persistence-Backup, database-backups, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov1002.eqiad.wmnet'] `  and were **ALL** successful.
[17:54:22] <volans>	 jynus: on db2147 it seems the mysql process, thread ID 2151 
[17:54:47] <volans>	 is the one that wrote the most
[18:00:01] <volans>	 mmmh, no THREAD_OS_ID in performance_schema.threads, that would have been too easy :)
[18:03:02] <jynus>	 volans, where did you get the thread #?
[18:03:15] <jynus>	 there is sometimes confusion between innodb ids and mysql ids
[18:03:30] <volans>	 that's OS thread ID from iotop
[18:03:53] <volans>	 and I was looking to match it to mysql thread ID to see what is doing
[18:04:03] <jynus>	 ah, if it was by "number of bytes written" probably a false detection
[18:04:16] <volans>	 it might be the replica thread ofc
[18:04:18] <jynus>	 it is expected a single thread to write all data on a replica (the replication sql thread) :-)
[18:04:27] <jynus>	 yeah
[18:04:35] <jynus>	 not that it couldn't be that, ofc
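Ranking a process's OS threads by bytes written, as done with iotop above, can be approximated from the per-thread counters Linux exposes under /proc. A minimal sketch follows, with assumptions stated up front: the helper names `parse_io` and `top_writers` are made up, reading another user's io files generally needs elevated privileges, and the counters are cumulative since thread start rather than a rate. It does not map an OS thread to a MariaDB thread, which is the piece that was missing in the conversation.

```python
import os


def parse_io(text):
    """Parse the 'key: value' lines of a /proc/<pid>/task/<tid>/io file."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        value = value.strip()
        if value.isdigit():
            stats[key.strip()] = int(value)
    return stats


def top_writers(pid, limit=5):
    """Return [(tid, write_bytes)] for the threads of pid, biggest first."""
    rows = []
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/io") as fh:
                stats = parse_io(fh.read())
        except OSError:
            continue  # thread exited in the meantime, or access was denied
        rows.append((int(tid), stats.get("write_bytes", 0)))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]


# Demo on the current process; only meaningful where /proc is available.
if os.path.isdir("/proc/self/task"):
    for tid, written in top_writers(os.getpid()):
        print(tid, written)
```

On a replica, a single thread dominating this list is the expected picture, since the replication SQL thread applies all writes.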
[18:05:15] <jynus>	 I didn't do any deep research, honestly, I was finding something to do while I waited for a reimage
[18:05:18] <jynus>	 and saw the alert
[18:05:29] <jynus>	 and just wanted to confirm it as a real issue
[18:05:50] <jynus>	 but thanks for the help, maybe someone with more time can dig deeper
[18:06:50] <jynus>	 of if it is an internal innodb thread it doesn't give us much clues (why is there things not being released -eg. temp data)
[18:06:53] <jynus>	 **or
[18:07:47] <volans>	 right
[18:08:27] <jynus>	 sometimes debugging not worth it, and just needs a restart, who knows :-(
[18:09:02] <volans>	 eheheh
[18:09:03] <volans>	 could be
[18:09:20] <jynus>	 BTW, thanks for reimage script
[18:09:36] <jynus>	 it worked nicely for dbprov1002, which is always nail-biting
[18:09:50] <jynus>	 I feel that I only complain to you when it fails :-D
[18:09:51] <volans>	 yw, glad it worked :)
[18:10:02] <jynus>	 I wanted to congratulate you when it worked
[18:10:11] <volans>	 ahaah that's true, I usually only hear when it fails :)
[18:10:14] <volans>	 thanks :D
[18:10:14] <jynus>	 for once in a lifetime
[18:10:19] <jynus>	  /jk
[18:12:03] <volans>	 :)
[18:22:38] <wikibugs>	 Data-Persistence-Backup, database-backups, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (jcrespo)
[18:23:15] <wikibugs>	 DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (jcrespo)
[18:24:46] <wikibugs>	 DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (jcrespo) dbprov1002 has been successfully reimaged to buster, with no issues.  I cannot discard I could had made some mistakes on backup reorganization, but those should not affect the following steps-...
[22:48:51] <wikibugs>	 Data-Persistence-Backup, SRE, bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (TJones) Any idea when someone might have time to look at this?  I'm trying to avoid having to recreate code that I had on mwmaint1002, but I have another ticket that's...
[23:50:43] <wikibugs>	 DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Legoktm) p:Triage→Medium
[23:53:15] <wikibugs>	 DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Legoktm) p:Triage→Medium