[01:09:02] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 17.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:09:56] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 12 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:11:42] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:12:36] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[06:21:47] marostegui: we have like gazillion core drifts in core. Happy Monday
[06:21:52] *codfw
[06:22:48] \o/
[06:43:48] marostegui: I know it's a bit large and scary but is it possible to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/883961?
[06:43:58] PCC seems to be happy about it in most dbs
[06:43:59] Amir1: yeah, I have it in my todo for today
[06:44:05] Thanks
[09:36:40] megacli no longer works on a recent host, do you know if there is a different cli for monitoring a RAID in newer hosts?
[09:37:35] you need to use perccli64
[09:37:42] I see, thank you
[09:37:59] I will link it on the other wiki page
[09:38:11] the syntax is different too, and also a pain
[09:38:28] well, not that the previous one was great :-D
[09:38:35] but something > nothing
[09:45:46] I will update the wiki in some other places too, such as https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Hardware_Troubleshooting_Runbook#Dell_Hardware_Raid_Information_Gathering
[09:46:00] so other people with the same question get the new link
[09:49:22] cool
[09:49:52] I will ping dc ops as technically it is their documentation and they may want to do a larger refactoring of docs
[09:50:10] but megacli keeps working on many many hosts
[09:50:15] So it can't just be replaced on the doc
[09:50:26] I know
[09:50:44] that is why I will let them refactor it better - I just added the link
[09:50:49] yep
[09:50:52] e.g. maybe they will want to remove the hp section
[09:50:59] when no more hp are around
[09:55:39] I love when ugly problems end up quite easy to solve - 2 lvm partitions had been created with the same name on install and that caused all kinds of weird problems
[09:57:07] and this was probably because the new perc controller recognized disks in a different order
[10:04:14] es2020 came data-clean, and also equal to eqiad, including wikidatawiki
[10:04:47] that's great!
[10:06:07] after the partitioning woes, s2 and s3 backups on codfw are now rerunning
[10:06:29] and checking now es bacula jobs, which look stuck
[10:17:31] another interesting but easy fix: https://gerrit.wikimedia.org/r/c/operations/puppet/+/884831
[10:18:19] work on binlogs next quarter should make es backups much smaller
[10:30:50] nice: backup1002.eqiad.wmnet-Weekly-Thu-EsRwCodfw-mysql-srv-backups-dumps-latest is running
[10:30:58] and thanks for the +1, marostegui
[11:01:18] what is the cumin alias for all mariadb hosts?
[11:02:47] I think it is db-all
[11:04:18] about to find out
[11:04:24] disabled everywhere
[11:10:50] marostegui: okay, should I run sudo run-puppet-agent -f?
[11:10:59] on random hosts
[11:11:09] I would try on a db, pc, es
[11:11:22] yeah, I'm on pc2011 rn
[11:11:32] one of the db in core, one multi instance and one in misc
[11:13:16] looks fine in pc2011
[11:13:24] moving on to a couple other hosts
[11:13:35] ok
[11:15:49] es2024 looks fine
[11:18:01] ran on a master and replica and a multiinstance, all look good
[11:18:05] nice
[11:18:11] try on db1195 and db1117
[11:18:12] (misc)
[11:19:18] done
[11:19:22] noop
[11:19:27] nice
[11:20:24] marostegui: sudo cumin 'A:db-all' 'run-puppet-agent -e "Rotating mediawiki db password (T326802)"' ?
[11:20:24] T326802: Rotate wikiuser and wikiadmin passwords - https://phabricator.wikimedia.org/T326802
[11:20:40] Amir1: sounds good
[11:20:40] also will check https://puppetboard.wikimedia.org/
[11:21:00] started
[11:22:38] backupmon is failing but it doesn't seem related https://puppetboard.wikimedia.org/report/backupmon1001.eqiad.wmnet/acb761f1e5e636ec4726bbae43b3875cb7377863
[11:22:48] (checking puppetboard)
[11:23:47] no failure
[11:27:12] marostegui: one last thing, can I remove this from the private repo? It doesn't seem to be used anywhere https://codesearch.wmcloud.org/search/?q=wikiuser2_pass&i=nope&files=&excludeFiles=&repos=
[11:27:43] Amir1: yeah it is probably ok
[11:29:10] added in 2012 https://gerrit.wikimedia.org/r/c/labs/private/+/5792
[11:54:32] Amir1: the alter in db2140 is done, right?
[11:54:48] let me check, which section?
[11:54:53] s4
[11:55:00] I need to switchover s4 codfw master :)
[11:55:04] it should be, it's catching up
[11:55:07] yeah I know
[11:55:13] But I mean, nothing else pending right?
[11:55:22] yup, nothing else
[11:55:25] okay thanks
[11:55:32] don't forget to repool it once done
[11:55:36] yeah
[11:55:40] do you want me to stop the script?
[11:55:47] for s4 codfw yes
[11:56:02] for the rest of sections I don't care
[11:56:02] it was just this host
[11:56:09] stopped
[11:56:12] thanks
[11:56:18] I will let you know once I am done with s4 codfw
[11:56:55] cool
[12:02:19] replication to sanitarium of s3 in codfw is broken because the sanitarium didn't have the drifts the master had
[12:02:24] le sigh
[12:02:31] classic
[13:00:26] it was in > 500 wikis. I had to write a bash script that, in a loop, reads the replication error, feeds it to the alter table dropping the pk and restarts it
[13:00:41] it's done now
[15:31:09] marostegui: FYI, as of next week, you won't need to add user-notice or anything like that for eqiad switchovers, you can just do it in the designated time. https://meta.wikimedia.org/w/index.php?title=Tech%2FNews%2F2023%2F06&diff=24455737&oldid=24455720
[15:31:43] oh cool!
[15:34:06] ^_^