[01:07:08] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 5.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:08:24] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:09:04] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 7.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:10:20] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[07:37:36] heads up: !log start of rolling restart of backup hosts
[08:08:53] db2183 and db2184 are identified as Infrastructure Foundations owned. Most likely this is because they are "insetup"
[11:18:53] jynus: wrt raid rebuild. I'm on my phone. It'd be great if you could point me to docs so I can learn
[11:19:05] To avoid bothering you
[11:31:15] disk rebuild is automatic (that is why we buy hosts with hw raid)
[11:31:33] *hollow laughing noises*
[11:31:41] there is a guide of common tasks at: https://wikitech.wikimedia.org/wiki/MegaCli
[11:32:56] one thing I saw, however, is that learning cycles seem to be enabled, and that causes performance issues - you may want to discuss with manuel whether to tune that
[11:33:27] if you mean data provisioning, I can show you that too
[11:35:07] Emperor: there is nothing wrong with what I said, I don't have to copy blocks of data manually for the raid; now if you want to discuss how often the raid controller fails or doesn't do the right thing, that is another story... 0:-P
[11:35:29] :)
[11:35:52] I'd say that for db hosts it usually fails less than 5-10% of the time that a disk fails
[11:37:41] you inherited a JBOD which definitely makes things more involved at the app layer
[11:51:26] except it's not really a JBOD, it's instead a series of 1-drive arrays, which feels like the worst of both worlds :)
[11:59:14] I'm going to lunch, but feel free to ask any questions for when I come back
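(For reference, the routine checks from the MegaCli guide linked above look roughly like the sketch below. The flags are standard MegaCli; the binary name, adapter number and enclosure:slot value are illustrative and vary per host.)

    # Logical drive state (Optimal/Degraded) and physical drive summary on adapter 0
    sudo megacli -LDInfo -Lall -a0
    sudo megacli -PDList -a0 | grep -E 'Slot Number|Firmware state'
    # Rebuild progress for one physical drive (enclosure:slot is illustrative)
    sudo megacli -PDRbld -ShowProg -PhysDrv '[32:4]' -a0
    # BBU status, including the learn cycles mentioned above (while a learn
    # cycle runs, the controller drops to write-through and performance suffers)
    sudo megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -i learn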
[13:31:59] fe reboots done; starting on the be, which are often a PITA (not helped by the IPMI on be1040 seemingly being very slow)
[13:44:48] jynus: I mean this: https://phabricator.wikimedia.org/T320786#8335125
[13:44:55] > @Ladsgroup do you want me to recover data to this host?
[13:49:56] yeah, with that I meant the data recovery/provision workflow. let me point you to the docs
[13:56:37] Amir1: https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Provision_a_precompressed_and_prepared_snapshot_(preferred)
[14:07:32] ugh, swift-account won't start on ms-be1042 and doesn't say why not :(
[14:25:34] Emperor: I want to get some numbers on thumbnail visits in swift (context: T211661), can I ask a couple of questions about where they are stored? :D
[14:25:35] T211661: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661
[14:26:12] sure
[14:26:24] I don't promise to know the answers, mind...
[15:33:28] jynus: another question: what exactly is needed for tickets like this? T319190
[15:33:29] T319190: Prepare and check storage layer for bnwikiquote - https://phabricator.wikimedia.org/T319190
[15:33:36] (from DBA side)
[15:33:45] I tried looking up documentation but can't find any
[15:33:47] there are 2 phases
[15:34:30] I guess this is the second phase? https://wikitech.wikimedia.org/wiki/Add_a_wiki#Best_method
[15:34:32] before wiki deploy, for things that should not be replicated - add them to the list of tables/columns to ignore
[15:34:59] the second is to purge existing data and set up triggers on sanitarium
[15:35:17] there is a script/procedure for that, done twice, once for each datacenter
[15:35:33] then cloud takes care of view updates/creation
[15:35:48] in the future I will also have to take care of adding them to image backups
[15:36:29] that is what wmcs does
[15:36:29] > there is a script/procedure for that, done twice, once for each datacenter
[15:36:29] What is it? :sweat_smile:
[15:36:51] manuel will know, I haven't done that in maybe 5 years
[15:36:58] do you want me to search it?
[15:38:07] hmm, if it's done on sanitarium hosts, I can take a look at his bash history
[15:38:10] https://wikitech.wikimedia.org/wiki/MariaDB/Sanitarium_and_Labsdbs
[15:38:38] so if the wiki is private, on the first phase it just has to be banned from replication
[15:38:49] if it is public, it has to be redacted
[15:39:22] I think this is the script installed on both sanitariums https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/files/mariadb/redact_sanitarium.sh?as=source&blame=off
[15:39:34] remember it has to be done for both eqiad and codfw
[15:39:39] Manuel's bash history in sanitarium:
[15:39:42] https://www.irccloud.com/pastebin/mj5Oxf5g/
[15:39:55] check root log then
[15:40:00] I know :D
[15:40:49] one sanity check I did is making sure no password hashes were on cloud after redaction
[15:41:12] as wikireplicas are not considered a safe environment
[15:41:34] for i in guwwiktionary pcmwiki bjnwiktionary; do echo $i; redact_sanitarium.sh -d $i -S /run/mysqld/mysqld.s5.sock | mysql -S /run/mysqld/mysqld.s5.sock $i ; done
[15:41:42] he loves bash loops
[15:41:48] then there is: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/files/mariadb/filtered_tables.txt
[15:42:06] that should have all the filtered and unfiltered columns of a wiki
[15:42:51] e.g. "user,user_password,F" (I am guessing F for filtering)
[15:43:34] there is monitoring for this
[15:43:52] so if you break it, root@ will receive an email (it won't alert publicly for obvious reasons)
[15:44:22] with something like "private data found on cloud", even if it is not leaked
[15:44:39] does that help?
[15:45:45] if it was private, it would be added to the replication filters and filtered fully, requiring a mariadb restart
[15:48:28] Yeah
[15:48:32] Thanks!
[15:48:53] let me know if you want me to check - I may not remember the procedure well
[15:49:10] but it is easy to check if there is something weird on clouddbs afterwards
[15:49:35] e.g. running https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/files/mariadb/check_private_data.py?as=source&blame=off manually
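(As a concrete illustration of that sanity check, a quick manual spot-check after running redact_sanitarium.sh could look like the sketch below. The wiki names and socket path are taken from the loop quoted above; the query itself is an assumption - the check_private_data.py script just linked is the thorough version, driven by filtered_tables.txt.)

    # Illustrative only: confirm redaction left no password hashes visible.
    # check_private_data.py covers every table/column in filtered_tables.txt;
    # this just spot-checks one well-known filtered column.
    for db in guwwiktionary pcmwiki bjnwiktionary; do
      echo "$db"
      sudo mysql -S /run/mysqld/mysqld.s5.sock "$db" \
        -e "SELECT COUNT(*) AS unredacted FROM user WHERE user_password <> '';"
    done
    # Expect unredacted = 0 for every wiki; anything else means filtering failed.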
[15:49:37] nah, don't worry
[15:50:43] the replication filtering + cloud provides 2 filtering methods for a reason 0:-D
[16:02:12] we let wmcs/analytics handle the views because that way they can handle the depooling on their own to minimize actual user impact
[16:02:51] (it used to be a hot process, but now it sometimes creates huge metadata locking contention issues if hosts are not depooled, depending on the change)
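(The metadata locking contention mentioned above shows up in the processlist. A minimal sketch of the kind of check that motivates depooling first - the socket path is illustrative; the table, columns and state string are standard MariaDB:)

    # Sketch: look for sessions stuck behind a table metadata lock while views
    # are being (re)created; long waits here mean user queries are blocked.
    sudo mysql -S /run/mysqld/mysqld.s5.sock \
      -e "SELECT id, time, state, LEFT(info, 80) AS query
          FROM information_schema.processlist
          WHERE state LIKE '%metadata lock%'
          ORDER BY time DESC;"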