[06:34:00] 10Blocked-on-schema-change, 10DBA: Rename name_title index on page to page_name_title - https://phabricator.wikimedia.org/T284375 (10Marostegui)
[06:34:06] 10Blocked-on-schema-change, 10DBA: Schema change for renaming several indexes in change_tag table - https://phabricator.wikimedia.org/T284619 (10Marostegui)
[08:23:50] I'm running some write queries in production for T279761
[08:23:50] T279761: When reviewing pending changes, raw message ID "⧼revreview-hist-quality⧽" shown instead of human readable string - https://phabricator.wikimedia.org/T279761
[08:33:46] PROBLEM - MariaDB sustained replica lag on db1098 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1098&var-port=13317
[08:33:48] PROBLEM - MariaDB sustained replica lag on db2150 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2150&var-port=9104
[08:33:48] PROBLEM - MariaDB sustained replica lag on db2118 is CRITICAL: 6.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2118&var-port=9104
[08:34:58] PROBLEM - MariaDB sustained replica lag on db1181 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1181&var-port=9104
[08:36:08] PROBLEM - MariaDB sustained replica lag on db2107 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2107&var-port=9104
[08:38:00] RECOVERY - MariaDB sustained replica lag on db2107 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2107&var-port=9104
[08:38:40] RECOVERY - MariaDB sustained replica lag on db1181 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1181&var-port=9104
[08:39:20] RECOVERY - MariaDB sustained replica lag on db1098 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1098&var-port=13317
[08:39:22] RECOVERY - MariaDB sustained replica lag on db2118 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2118&var-port=9104
[08:39:22] RECOVERY - MariaDB sustained replica lag on db2150 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2150&var-port=9104
[09:04:29] 10DBA: Deploy wmfmariadbpy 0.7.1 - https://phabricator.wikimedia.org/T284819 (10Kormat)
[09:05:43] 10DBA: Deploy wmfmariadbpy 0.7.1 - https://phabricator.wikimedia.org/T284819 (10Kormat)
[09:07:39] 10DBA: Deploy wmfmariadbpy 0.7.1 - https://phabricator.wikimedia.org/T284819 (10Kormat)
[09:08:39] 10DBA: Deploy wmfmariadbpy 0.7.1 - https://phabricator.wikimedia.org/T284819 (10Kormat)
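For context on the sustained-replica-lag alerts above: the check compares the lag value against warning/critical thresholds of 1 and 2 seconds. Below is a minimal, hedged way to eyeball the same lag by hand on one of the listed instances; the host and port come from the alert URL, and the query is a generic one, not the alerting code itself.

    # Hedged sketch: check replication lag directly on the db1098:13317 instance named in the alert.
    mysql -h db1098.eqiad.wmnet -P 13317 -e "SHOW SLAVE STATUS\G" \
      | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'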
[09:19:03] 10DBA: Deploy wmfmariadbpy 0.7.1 - https://phabricator.wikimedia.org/T284819 (10Kormat)
[09:19:54] 10DBA: Deploy wmfmariadbpy 0.7.1 - https://phabricator.wikimedia.org/T284819 (10Kormat) 05Open→03Resolved All done.
[09:25:48] 10DBA, 10MediaWiki-Parser, 10Performance-Team, 10Parsoid (Tracking), 10Patch-For-Review: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 (10Kormat) TODO: [] Make pc1009 the pc3 primary again: https://gerrit.wikimedia.org/r/c/operations/medi...
[09:41:57] 10Data-Persistence-Backup, 10database-backups: Put db2100 back into service after hardware maintenance - https://phabricator.wikimedia.org/T284980 (10jcrespo) Forcing a backup rerun to validate the installation before closing the ticket (in case it crashes during it, like it happened before, under pressure).
[10:16:01] 10Blocked-on-schema-change, 10DBA: Rename name_title index on page to page_name_title - https://phabricator.wikimedia.org/T284375 (10Marostegui)
[10:17:34] 10Blocked-on-schema-change, 10DBA: Schema change for renaming several indexes in change_tag table - https://phabricator.wikimedia.org/T284619 (10Marostegui) Codfw is fully done, so now waiting for the DC switch to finish eqiad.
[10:17:40] 10Blocked-on-schema-change, 10DBA: Rename name_title index on page to page_name_title - https://phabricator.wikimedia.org/T284375 (10Marostegui) Codfw is fully done, so now waiting for the DC switch to finish eqiad.
[10:18:33] 10Blocked-on-schema-change, 10DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (10Marostegui) a:03Marostegui
[10:18:50] 10Blocked-on-schema-change, 10DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (10Marostegui) I am going to try to get this done in codfw entirely before the switch. First I will run it on eqiad s6 hosts to make sure it is all good.
[10:19:56] 10Blocked-on-schema-change, 10DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (10Marostegui) s6 eqiad [] dbstore1005 [] db1180 [] db1173 [] db1168 [] db1165 [] db1155 [] db1140 [] db1131 [] db1113 [] db1098 [] db1096 [] clouddb1021 [] clouddb1019 [] cloud...
[10:20:53] Amir1: to confirm, cu_changes isn't deployed on all wikis right?
[10:21:01] I am not seeing it on frwiki, jawiki, ruwiki
[10:21:05] (s6)
[10:21:22] root@db1096:/srv/sqldata# find . | grep -i changes
[10:21:22] root@db1096:/srv/sqldata#
[10:22:43] marostegui: it should be there
[10:22:54] checkuser is deployed on, I think, all wikis
[10:23:08] ah wait, I am stupid
[10:23:11] it is there
[10:23:36] nevermind! thanks :)
[10:44:05] "Certificate 'dbtree.wikimedia.org' expires in 10 day(s)" known? I think some of you worked with valentin to do some changes there
[10:46:46] I will ping him
[10:50:01] I also remember arturo having issues with cert renewal, but cannot remember the context
[10:51:02] valentin is taking a look, as he made some changes related to dbtree a few weeks ago
[10:52:26] I am trying to finally put to rest the db2100 saga
[11:00:18] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10Jelto) @Dzahn I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/697850 is not merged and deployed, so the fileset for GitLab doesn't exist....
[11:25:16] John has fixed the certificate issue, just fyi
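An aside on the cuc_id schema change (T283093) tracked above: the log only says the column becomes unsigned and that it will be applied host by host (hence the s6 eqiad checklist). The sketch below is purely illustrative of that kind of per-host change; the exact column definition, the target wiki database, and the use of sql_log_bin=0 to keep the change off the binlog are assumptions, not the task's actual DDL.

    # Hedged sketch of a per-host schema change; the column definition is an assumption.
    mysql -h db1096.eqiad.wmnet frwiki <<'SQL'
    SET SESSION sql_log_bin = 0;  -- assumed rollout style: applied directly on each host, not replicated
    ALTER TABLE cu_changes MODIFY cuc_id int unsigned NOT NULL AUTO_INCREMENT;
    SQL

A quicker check than grepping the datadir (the find on db1096 above) for whether the table exists on the s6 wikis:

    mysql -h db1096.eqiad.wmnet -e "SELECT table_schema FROM information_schema.tables
      WHERE table_name = 'cu_changes' AND table_schema IN ('frwiki', 'jawiki', 'ruwiki')"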
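On the dbtree.wikimedia.org certificate question above: the alert is only reporting days until expiry. A generic, hedged way to double-check what the live endpoint is actually serving (this is not the icinga check itself):

    # Hedged sketch: print the subject and expiry date of the certificate currently served.
    echo | openssl s_client -connect dbtree.wikimedia.org:443 -servername dbtree.wikimedia.org 2>/dev/null \
      | openssl x509 -noout -subject -enddate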
[11:25:38] ok that's fixed. there were two issues, fixed via 3 patches. the first issue was a race condition between the check and the automatic renewal, fixed here https://gerrit.wikimedia.org/r/c/operations/puppet/+/700051 and corrected here https://gerrit.wikimedia.org/r/c/operations/puppet/+/700053
[11:25:38] the second issue was that apache was not configured to refresh; should be fixed here, but will need to keep an eye out for next time the cert renews https://gerrit.wikimedia.org/r/c/operations/puppet/+/700052
[12:37:58] marostegui: puppet disabled on all db machines, CR merged, running puppet manually on s6/codfw primary (db2129)
[12:38:06] cool
[12:38:24] good news, it didn't stop pt-heartbeat ;)
[12:38:31] kormat: you might want to try also a misc section in codfw and an es one too
[12:38:38] running puppet on alert1001 to see what monitoring thinks
[12:38:52] hahaha that page scared me
[12:38:57] uff. unrelated page happening at just this time... not fair
[12:38:58] arrived one second after you said that
[12:39:05] it's a conspiracy
[12:40:21] marostegui: the CR also tells puppet to monitor the service, every 10 or 15s i think
[12:40:29] so i want to make sure that's being sane
[12:40:35] ah cool yeah
[12:41:08] that way we'll get "pt-heartbeat isn't running" at the same time as all the "lag on replica" alerts, so at least it'll be easier to understand
[12:41:51] icinga sees it as "Check unit status of pt-heartbeat-wikimedia"
[12:42:35] and that will page?
[12:43:00] ah - i set it so that it would alert in #wikimedia-databases,
[12:43:02] but not page anyone
[12:43:14] ah cool
[12:43:15] at least for the beginning
[12:43:20] maybe also alert on -operations
[12:43:42] going to manually stop the service on db2129, and see how long it takes to notice
[12:43:45] goddamn pages
[12:43:56] haha
[12:45:49] huuh
[12:45:53] it's not alerting
[12:46:49] 🤦‍♀️
[12:46:58] ok, it will only alert if the systemd service is in 'failed' state
[12:47:19] and that's not the case after a mysql or host reboot?
[12:47:42] systemd is configured to immediately restart the service if it fails
[12:48:13] so.. if mysql is stopped (either manually, or post boot), systemd will probably reach the maximum number of restarts for the service very quickly,
[12:48:16] at which point the alert would fire
[12:48:33] ah cool
[12:48:42] can you make it alert on -operations too? so others are aware
[12:48:47] i'd prefer to test that outside of prod though
[12:49:04] probably, but let me finish rolling this out first :P
[12:49:08] you can try with the test cluster
[12:49:12] it has a master and a slave
[12:50:04] 👍
[12:50:44] oh, look at that.
[12:50:53] pt-heartbeat doesn't actually fail if it can't talk to mariadb
[12:50:59] Jun 03 12:30:49 db1124 pt-heartbeat-wikimedia[19509]: Can't connect to local MySQL server through socket '/run/mysqld/mysqld.sock' (2)
[12:51:03] ^ been running for 3 weeks
[12:51:10] wtf XDDDDD
[12:51:44] we can add the `--fail-successive-errors` flag to change that
[12:52:41] if not it keeps trying forever?
[12:52:46] apparently so!
[12:53:23] going to make that change manually on db1124 and see
[12:54:07] ok
[12:56:29] good news/bad news
[12:56:44] good news: it does see the service as failed now, and this shows up on icinga
[12:56:57] bad news: it's failing because it says it can't parse its cmdline
[12:57:07] `Unknown option: fail-successive-errors`
[12:57:10] ??
[12:57:46] any chance our manpage for pt-heartbeat doesn't match the version of pt-heartbeat-wikimedia we have deployed?
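A note on the 'failed' state discussed above: with a Restart= policy, systemd restarts the unit on exit and only marks it failed once the start-rate limit is exhausted; and, as the log shows, pt-heartbeat-wikimedia keeps retrying internally rather than exiting, so the unit never even gets that far. A hedged way to inspect this on a host (the unit name is from the log; the actual limits are whatever the deployed unit sets, not values asserted here):

    # What restart policy and rate limit does the deployed unit have?
    systemctl show pt-heartbeat-wikimedia -p Restart -p StartLimitBurst -p StartLimitIntervalUSec
    # The icinga check only complains when this reports 'failed'.
    systemctl is-failed pt-heartbeat-wikimedia
    systemctl status pt-heartbeat-wikimedia --no-pager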
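For reference on `--fail-successive-errors`: in upstream Percona Toolkit pt-heartbeat this option makes the updater exit with an error after N consecutive database errors instead of retrying forever, which is the behaviour change wanted here. The invocation below is a generic upstream-style sketch, not the wikimedia wrapper's real command line (the --help output further down confirms the deployed fork predates the option).

    # Hedged sketch: an upstream-style pt-heartbeat updater that gives up after 10 consecutive DB errors,
    # letting systemd see a real failure instead of endless retries.
    pt-heartbeat --update --daemonize -D heartbeat \
      --socket=/run/mysqld/mysqld.sock \
      --fail-successive-errors=10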
[12:57:49] root@db1124:~# pt-heartbeat-wikimedia --help | grep fail
[12:57:49] root@db1124:~#
[12:57:51] yeah :(
[12:58:07] we're bad, and should feel bad.
[12:59:00] ok. we should probably look at rebasing pt-heartbeat-wikimedia on top of a more modern version
[12:59:04] but that's for another task
[12:59:37] yeah
[13:00:14] ok, so for now the monitoring of the service is basically a no-op
[13:00:22] this isn't any _worse_ than what we had already
[13:00:26] it's just useless until we fix it.
[13:00:35] so i'm going to proceed to test on a few more dbs
[13:01:56] so, it won't start pt-heartbeat after reboot/restart right?
[13:02:06] it will; that's the whole point of this CR
[13:02:23] on every puppet run, it'll check to see if pt-heartbeat should be running or not,
[13:02:24] then I don't get the no-op comment
[13:02:28] and change the current state as appropriate
[13:02:40] marostegui: the no-op is about the _monitoring_ of the systemd service
[13:02:47] ah gotcha :)
[13:02:51] that can only fire if the service is in 'failed' mode, which currently isn't going to happen
[13:03:02] because pt-heartbeat-wikimedia doesn't GAF about issues
[13:04:05] ok cool - started pt-hb on db1125 (test-s4 replica), ran puppet, puppet stops it.
[13:04:28] sweet
[13:04:37] doing the inverse test on db1124 (test-s4 primary)
[13:04:55] `changed 'stopped' to 'running'`
[13:04:59] yep, behaving correctly
[13:05:03] \o/
[13:05:31] ok, i'll pick some more machines in codfw to roll out to by hand
[13:05:40] try to pick one in misc
[13:05:45] just in case those behave differently
[13:05:53] one mw replica, a primary+replica in misc, primary+replica in es
[13:06:18] that is good!
[13:06:26] making that a multiinstance mw replica
[13:06:51] oh, hah. they don't have pt-heartbeat services
[13:06:53] we don't have multi instance hosts running pt-heartbeat I think
[13:06:55] yeah
[13:06:59] ok grand
[13:10:26] hah, nvm - all misc replicas are multiinstance too
[13:10:28] onto es then
[13:10:45] the codfw masters aren't multi instance
[13:10:48] misc ones I mean
[13:10:57] done one of those already
[13:11:02] nice
[13:13:37] tested on a live es section (es4) and also on an archive one (es1)
[13:13:58] ok, i think at this point we're good to press Go
[13:16:41] good!
[13:16:58] (i have pressed Go. i'm now stepping away from the computer for a few minutes for, uh, safety reasons)
[13:17:08] XD
[13:56:11] PROBLEM - Check unit status of pt-heartbeat-wikimedia on db2151 is CRITICAL: CRITICAL: Status of the systemd unit pt-heartbeat-wikimedia https://wikitech.wikimedia.org/wiki/MariaDB/pt-heartbeat
[13:56:49] jynus: that's also mediabackup :)
[13:56:51] ^
[13:57:20] yes, I am doing the alter table
[13:57:26] thanks!
[13:57:37] those seem to be non-trivial, ya know!
[13:57:42] 0:-)
[13:59:41] I think it should be back soon
[14:00:15] RECOVERY - Check unit status of pt-heartbeat-wikimedia on db2151 is OK: OK: Status of the systemd unit pt-heartbeat-wikimedia https://wikitech.wikimedia.org/wiki/MariaDB/pt-heartbeat
[14:02:27] so I think thanks to kormat's patch, this went from a silent bug (pt-heartbeat not working with sections > 10 characters) to a reported one
[14:02:54] happily pt-heartbeat wasn't used for production on those servers
[14:03:14] and the lag check falls back to using seconds behind master to check lag
[14:03:33] actually, in this case as it didn't have replication, it wasn't checked at all
[14:03:55] thanks kormat for the improvements!
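The manual verification described above (start the unit on the test replica and let puppet stop it, then the inverse on the test primary) boils down to the following hedged checklist; `puppet agent --test` stands in for whatever wrapper is used on these hosts, and the expected output lines are paraphrased from the log, not taken from the CR.

    # On a replica (e.g. db1125): puppet should stop a manually started unit.
    systemctl start pt-heartbeat-wikimedia
    puppet agent --test                           # expect: ensure changed 'running' to 'stopped'
    systemctl is-active pt-heartbeat-wikimedia    # expect: inactive

    # On a primary (e.g. db1124): the inverse, puppet should (re)start it.
    systemctl stop pt-heartbeat-wikimedia
    puppet agent --test                           # expect: ensure changed 'stopped' to 'running'
    systemctl is-active pt-heartbeat-wikimedia    # expect: active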
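On "the lag check falls back to using seconds behind master": the two signals being compared are the timestamp pt-heartbeat writes into the heartbeat table versus the replication thread's own Seconds_Behind_Master. A rough, hedged way to look at both on a replica; the heartbeat.heartbeat table and its ts/shard columns follow the usual pt-heartbeat layout and are an assumption about the local setup.

    # Lag according to pt-heartbeat: how old is the newest heartbeat row per section?
    mysql -e "SELECT shard, MAX(ts) AS last_heartbeat, UTC_TIMESTAMP(6) AS now_utc
              FROM heartbeat.heartbeat GROUP BY shard"
    # Lag according to the replication thread itself (the fallback signal).
    mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master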
[14:09:12] hurray for accidental progress :)
[14:11:29] marostegui: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=^\(db\|es\|pc\)[12]&style=detail&servicestatustypes=29
[14:12:01] 54 pending service checks. 😅
[14:12:31] screw it, scheduling them all now
[14:12:36] let it burn
[14:15:59] haha
[14:25:10] icinga says db1110 didn't successfully run puppet last time. checking.
[14:25:25] aaand it worked fine v0v
[16:31:59] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10jbond) >>! In T274463#7159360, @Jelto wrote: > However there is a already existing backup configuration in [modules/gitlab/manifests/backup.pp](https:/...
[17:07:45] 10DBA, 10DiscussionTools, 10Performance-Team, 10Editing-team (FY2020-21 Kanban Board), and 2 others: Reduce parser cache retention temporarily for DiscussionTools - https://phabricator.wikimedia.org/T280605 (10Krinkle) 05Open→03Resolved
[20:32:14] 10DBA, 10Data-Services: Prepare and check storage layer for shiwiki - https://phabricator.wikimedia.org/T284928 (10LSobanski) p:05Triage→03Medium Thanks, let us know when the database is created, so we can sanitize it.
[22:16:01] 10DBA, 10SRE, 10Datacenter-Switchover: Check "Days in advance preparation" for databases before DC switchover - https://phabricator.wikimedia.org/T285069 (10Legoktm)