[01:07:08] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 5.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:08:24] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:09:04] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 7.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:10:20] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[07:37:36] heads up: !log start of rolling restart of backup hosts
[08:08:53] db2183 and db2184 are identified as Infrastructure Foundations owned. Most likely this is because they are "insetup"
[11:18:53] jynus: wrt raid rebuild. I'm on my phone. It'd be great if you could point me to docs so I can learn
[11:19:05] To avoid bothering you
[11:31:15] disk rebuild is automatic (that is why we buy hosts with hw raid)
[11:31:33] *hollow laughing noises*
[11:31:41] there is a guide of common tasks at: https://wikitech.wikimedia.org/wiki/MegaCli
[11:32:56] one thing I saw, however, is that learning cycles seem to be enabled, and that causes performance issues - you may want to discuss with manuel whether to tune that
[11:33:27] if you mean data provisioning, I can show you that too
[11:35:07] Emperor: there is nothing wrong with what I said, I don't have to copy blocks of data manually for the raid; now if you want to discuss how often the raid controller fails or doesn't do the right thing, that is another story... 0:-P
[11:35:29] :)
[11:35:52] I'd say that for db hosts it usually fails less than 5-10% of the time that a disk fails
[11:37:41] you inherited a JBOD which definitely makes things more involved at the app layer
[11:51:26] except it's not really a JBOD, it's instead a series of 1-drive arrays, which feels like the worst of both worlds :)
[11:59:14] I'm going to lunch, but feel free to ask any questions for when I come back
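(For reference, the routine checks from the MegaCli guide linked above look roughly like the sketch below. The flags are standard MegaCli; the binary name, adapter number and enclosure:slot value are illustrative and vary per host.)

    # Logical drive state (Optimal/Degraded) and physical drive summary on adapter 0
    sudo megacli -LDInfo -Lall -a0
    sudo megacli -PDList -a0 | grep -E 'Slot Number|Firmware state'
    # Rebuild progress for one physical drive (enclosure:slot is illustrative)
    sudo megacli -PDRbld -ShowProg -PhysDrv '[32:4]' -a0
    # BBU status, including the learn cycles mentioned above (while a learn
    # cycle runs, the controller drops to write-through and performance suffers)
    sudo megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -i learn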
[13:31:59] fe reboots done; starting on the be, which are often a PITA (not helped by the IPMI on be1040 seemingly being very slow)
[13:44:48] jynus: I mean this: https://phabricator.wikimedia.org/T320786#8335125
[13:44:55] > @Ladsgroup do you want me to recover data to this host?
[13:49:56] yeah, with that I meant the data recovery/provision workflow. let me point you to the docs
[13:56:37] Amir1: https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Provision_a_precompressed_and_prepared_snapshot_(preferred)
[14:07:32] ugh, swift-account won't start on ms-be1042 and doesn't say why not :(
[14:25:34] Emperor: I want to get some numbers on thumbnail visits in swift (context: T211661), can I ask a couple of questions about where they are stored? :D
[14:25:35] T211661: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661
[14:26:12] sure
[14:26:24] I don't promise to know the answers, mind...
[15:33:28] jynus: another question: what exactly is needed for tickets like this? T319190
[15:33:29] T319190: Prepare and check storage layer for bnwikiquote - https://phabricator.wikimedia.org/T319190
[15:33:36] (from DBA side)
[15:33:45] I tried looking up documentation but can't find any
[15:33:47] there are 2 phases
[15:34:30] I guess this is the second phase? https://wikitech.wikimedia.org/wiki/Add_a_wiki#Best_method
[15:34:32] before wiki deploy, for things that should not be replicated - add them to the list of tables/columns to ignore
[15:34:59] the second is to purge existing data and set up triggers on sanitarium
[15:35:17] there is a script/procedure for that, done twice, once for each datacenter
[15:35:33] then cloud takes care of view updates/creation
[15:35:48] in the future I will also have to take care of adding them to image backups
[15:36:29] that is what wmcs does
[15:36:29] > there is a script/procedure for that, done twice, once for each datacenter
[15:36:29] What is it? :sweat_smile:
[15:36:51] manuel will know, I haven't done that in maybe 5 years
[15:36:58] do you want me to search it?
[15:38:07] hmm, if it's done on sanitarium hosts, I can take a look at his bash history
[15:38:10] https://wikitech.wikimedia.org/wiki/MariaDB/Sanitarium_and_Labsdbs
[15:38:38] so if the wiki is private, on the first phase it just has to be banned from replication
[15:38:49] if it is public, it has to be redacted
[15:39:22] I think this is the script installed on both sanitariums https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/files/mariadb/redact_sanitarium.sh?as=source&blame=off
[15:39:34] remember it has to be done for both eqiad and codfw
[15:39:39] Manuel's bash history in sanitarium:
[15:39:42] https://www.irccloud.com/pastebin/mj5Oxf5g/
[15:39:55] check root log then
[15:40:00] I know :D
[15:40:49] one sanity check I did is making sure no password hashes were on cloud after redaction
[15:41:12] as wikireplicas are not considered a safe environment
[15:41:34] for i in guwwiktionary pcmwiki bjnwiktionary; do echo $i; redact_sanitarium.sh -d $i -S /run/mysqld/mysqld.s5.sock | mysql -S /run/mysqld/mysqld.s5.sock $i ; done
[15:41:42] he loves bash loops
[15:41:48] then there is: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/files/mariadb/filtered_tables.txt
[15:42:06] that should have all the filtered and unfiltered columns of a wiki
[15:42:51] e.g. "user,user_password,F" (I am guessing F for filtering)
[15:43:34] there is monitoring for this
[15:43:52] so if you break it, root@ will receive an email (it won't alert publicly for obvious reasons)
[15:44:22] with something like "private data found on cloud", even if it is not leaked
[15:44:39] does that help?
[15:45:45] if it was private, it would be added to the replication filters and filtered fully, requiring a mariadb restart
[15:48:28] Yeah
[15:48:32] Thanks!
[15:48:53] let me know if you want me to check - I may not remember the procedure well
[15:49:10] but it is easy to check if there is something weird on clouddbs afterwards
[15:49:35] e.g. running https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/files/mariadb/check_private_data.py?as=source&blame=off manually
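(As a concrete illustration of that sanity check, a quick manual spot-check after running redact_sanitarium.sh could look like the sketch below. The wiki names and socket path are taken from the loop quoted above; the query itself is an assumption - the check_private_data.py script just linked is the thorough version, driven by filtered_tables.txt.)

    # Illustrative only: confirm redaction left no password hashes visible.
    # check_private_data.py covers every table/column in filtered_tables.txt;
    # this just spot-checks one well-known filtered column.
    for db in guwwiktionary pcmwiki bjnwiktionary; do
      echo "$db"
      sudo mysql -S /run/mysqld/mysqld.s5.sock "$db" \
        -e "SELECT COUNT(*) AS unredacted FROM user WHERE user_password <> '';"
    done
    # Expect unredacted = 0 for every wiki; anything else means filtering failed.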
[15:49:37] nah, don't worry
[15:50:43] the replication filtering + cloud provides 2 filtering methods for a reason 0:-D
[16:02:12] we let wmcs/analytics handle the views because that way they can handle the depooling on their own to minimize actual user impact
[16:02:51] (it used to be a hot process, but now it sometimes creates huge metadata locking contention issues if hosts are not depooled, depending on the change)
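(The metadata locking contention mentioned above shows up in the processlist. A minimal sketch of the kind of check that motivates depooling first - the socket path is illustrative; the table, columns and state string are standard MariaDB:)

    # Sketch: look for sessions stuck behind a table metadata lock while views
    # are being (re)created; long waits here mean user queries are blocked.
    sudo mysql -S /run/mysqld/mysqld.s5.sock \
      -e "SELECT id, time, state, LEFT(info, 80) AS query
          FROM information_schema.processlist
          WHERE state LIKE '%metadata lock%'
          ORDER BY time DESC;"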