[06:34:00] 10Blocked-on-schema-change, 10DBA: Rename name_title index on page to page_name_title - https://phabricator.wikimedia.org/T284375 (10Marostegui)
[06:34:06] 10Blocked-on-schema-change, 10DBA: Schema change for renaming several indexes in change_tag table - https://phabricator.wikimedia.org/T284619 (10Marostegui)
[08:23:50] I'm running some write queries in production for T279761
[08:23:50] T279761: When reviewing pending changes, raw message ID "⧼revreview-hist-quality⧽" shown instead of human readable string - https://phabricator.wikimedia.org/T279761
[08:33:46] PROBLEM - MariaDB sustained replica lag on db1098 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1098&var-port=13317
[08:33:48] PROBLEM - MariaDB sustained replica lag on db2150 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2150&var-port=9104
[08:33:48] PROBLEM - MariaDB sustained replica lag on db2118 is CRITICAL: 6.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2118&var-port=9104
[08:34:58] PROBLEM - MariaDB sustained replica lag on db1181 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1181&var-port=9104
[08:36:08] PROBLEM - MariaDB sustained replica lag on db2107 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2107&var-port=9104
[08:38:00] RECOVERY - MariaDB sustained replica lag on db2107 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2107&var-port=9104
[08:38:40] RECOVERY - MariaDB sustained replica lag on db1181 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1181&var-port=9104
[08:39:20] RECOVERY - MariaDB sustained replica lag on db1098 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1098&var-port=13317
[08:39:22] RECOVERY - MariaDB sustained replica lag on db2118 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2118&var-port=9104
[08:39:22] RECOVERY - MariaDB sustained replica lag on db2150 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2150&var-port=9104
[09:04:29] 10DBA: Deploy wmfmariadbpy 0.7.1 - https://phabricator.wikimedia.org/T284819 (10Kormat)
[09:05:43] 10DBA: Deploy wmfmariadbpy 0.7.1 - https://phabricator.wikimedia.org/T284819 (10Kormat)
[09:07:39] 10DBA: Deploy wmfmariadbpy 0.7.1 - https://phabricator.wikimedia.org/T284819 (10Kormat)
[09:08:39] 10DBA: Deploy wmfmariadbpy 0.7.1 - https://phabricator.wikimedia.org/T284819 (10Kormat)
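For context on the sustained-replica-lag alerts above: the check compares the lag value against warning/critical thresholds of 1 and 2 seconds. Below is a minimal, hedged way to eyeball the same lag by hand on one of the listed instances; the host and port come from the alert URL, and the query is a generic one, not the alerting code itself.

    # Hedged sketch: check replication lag directly on the db1098:13317 instance named in the alert.
    mysql -h db1098.eqiad.wmnet -P 13317 -e "SHOW SLAVE STATUS\G" \
      | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'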
[09:19:03] 10DBA: Deploy wmfmariadbpy 0.7.1 - https://phabricator.wikimedia.org/T284819 (10Kormat)
[09:19:54] 10DBA: Deploy wmfmariadbpy 0.7.1 - https://phabricator.wikimedia.org/T284819 (10Kormat) 05Open→03Resolved All done.
[09:25:48] 10DBA, 10MediaWiki-Parser, 10Performance-Team, 10Parsoid (Tracking), 10Patch-For-Review: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 (10Kormat) TODO: [] Make pc1009 the pc3 primary again: https://gerrit.wikimedia.org/r/c/operations/medi...
[09:41:57] 10Data-Persistence-Backup, 10database-backups: Put db2100 back into service after hardware maintenance - https://phabricator.wikimedia.org/T284980 (10jcrespo) Forcing a backup rerun to validate the installation before closing the ticket (in case it crashes during it, like it happened before, under pressure).
[10:16:01] 10Blocked-on-schema-change, 10DBA: Rename name_title index on page to page_name_title - https://phabricator.wikimedia.org/T284375 (10Marostegui)
[10:17:34] 10Blocked-on-schema-change, 10DBA: Schema change for renaming several indexes in change_tag table - https://phabricator.wikimedia.org/T284619 (10Marostegui) Codfw is fully done, so now waiting for the DC switch to finish eqiad.
[10:17:40] 10Blocked-on-schema-change, 10DBA: Rename name_title index on page to page_name_title - https://phabricator.wikimedia.org/T284375 (10Marostegui) Codfw is fully done, so now waiting for the DC switch to finish eqiad.
[10:18:33] 10Blocked-on-schema-change, 10DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (10Marostegui) a:03Marostegui
[10:18:50] 10Blocked-on-schema-change, 10DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (10Marostegui) I am going to try to get this done in codfw entirely before the switch. First I will run it on eqiad s6 hosts to make sure it is all good.
[10:19:56] 10Blocked-on-schema-change, 10DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (10Marostegui) s6 eqiad [] dbstore1005 [] db1180 [] db1173 [] db1168 [] db1165 [] db1155 [] db1140 [] db1131 [] db1113 [] db1098 [] db1096 [] clouddb1021 [] clouddb1019 [] cloud...
[10:20:53] Amir1: to confirm, cu_changes isn't deployed on all wikis right?
[10:21:01] I am not seeing it on frwiki, jawiki, ruwiki
[10:21:05] (s6)
[10:21:22] root@db1096:/srv/sqldata# find . | grep -i changes
[10:21:22] root@db1096:/srv/sqldata#
[10:22:43] marostegui: it should be there
[10:22:54] checkuser is deployed on, I think, all wikis
[10:23:08] ah wait, I am stupid
[10:23:11] it is there
[10:23:36] nevermind! thanks :)
[10:44:05] "Certificate 'dbtree.wikimedia.org' expires in 10 day(s)" known? I think some of you worked with valentin to do some changes there
[10:46:46] I will ping him
[10:50:01] I also remember arturo having issues with cert renewal, but cannot remember the context
[10:51:02] valentin is taking a look, as he made some changes related to dbtree a few weeks ago
[10:52:26] I am trying to finally put to rest the db2100 saga
[11:00:18] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10Jelto) @Dzahn I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/697850 is not merged and deployed, so the fileset for GitLab doesn't exist....
[11:25:16] John has fixed the certificate issue, just fyi
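An aside on the cuc_id schema change (T283093) tracked above: the log only says the column becomes unsigned and that it will be applied host by host (hence the s6 eqiad checklist). The sketch below is purely illustrative of that kind of per-host change; the exact column definition, the target wiki database, and the use of sql_log_bin=0 to keep the change off the binlog are assumptions, not the task's actual DDL.

    # Hedged sketch of a per-host schema change; the column definition is an assumption.
    mysql -h db1096.eqiad.wmnet frwiki <<'SQL'
    SET SESSION sql_log_bin = 0;  -- assumed rollout style: applied directly on each host, not replicated
    ALTER TABLE cu_changes MODIFY cuc_id int unsigned NOT NULL AUTO_INCREMENT;
    SQL

A quicker check than grepping the datadir (the find on db1096 above) for whether the table exists on the s6 wikis:

    mysql -h db1096.eqiad.wmnet -e "SELECT table_schema FROM information_schema.tables
      WHERE table_name = 'cu_changes' AND table_schema IN ('frwiki', 'jawiki', 'ruwiki')"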
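On the dbtree.wikimedia.org certificate question above: the alert is only reporting days until expiry. A generic, hedged way to double-check what the live endpoint is actually serving (this is not the icinga check itself):

    # Hedged sketch: print the subject and expiry date of the certificate currently served.
    echo | openssl s_client -connect dbtree.wikimedia.org:443 -servername dbtree.wikimedia.org 2>/dev/null \
      | openssl x509 -noout -subject -enddate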
[11:25:38] ok that's fixed. there were two issues, fixed via 3 patches. the first issue was a race condition between the check and the automatic renewal, fixed here https://gerrit.wikimedia.org/r/c/operations/puppet/+/700051 and corrected here https://gerrit.wikimedia.org/r/c/operations/puppet/+/700053
[11:25:38] the second issue was that apache was not configured to refresh; should be fixed here, but will need to keep an eye out for next time the cert renews https://gerrit.wikimedia.org/r/c/operations/puppet/+/700052
[12:37:58] marostegui: puppet disabled on all db machines, CR merged, running puppet manually on s6/codfw primary (db2129)
[12:38:06] cool
[12:38:24] good news, it didn't stop pt-heartbeat ;)
[12:38:31] kormat: you might want to try also a misc section in codfw and an es one too
[12:38:38] running puppet on alert1001 to see what monitoring thinks
[12:38:52] hahaha that page scared me
[12:38:57] uff. unrelated page happening at just this time... not fair
[12:38:58] arrived one second after you said that
[12:39:05] it's a conspiracy
[12:40:21] marostegui: the CR also tells puppet to monitor the service, every 10 or 15s i think
[12:40:29] so i want to make sure that's being sane
[12:40:35] ah cool yeah
[12:41:08] that way we'll get "pt-heartbeat isn't running" at the same time as all the "lag on replica" alerts, so at least it'll be easier to understand
[12:41:51] icinga sees it as "Check unit status of pt-heartbeat-wikimedia"
[12:42:35] and that will page?
[12:43:00] ah - i set it so that it would alert in #wikimedia-databases,
[12:43:02] but not page anyone
[12:43:14] ah cool
[12:43:15] at least for the beginning
[12:43:20] maybe also alert on -operations
[12:43:42] going to manually stop the service on db2129, and see how long it takes to notice
[12:43:45] goddamn pages
[12:43:56] haha
[12:45:49] huuh
[12:45:53] it's not alerting
[12:46:49] 🤦‍♀️
[12:46:58] ok, it will only alert if the systemd service is in 'failed' state
[12:47:19] and that's not the case after a mysql or host reboot?
[12:47:42] systemd is configured to immediately restart the service if it fails
[12:48:13] so.. if mysql is stopped (either manually, or post boot), systemd will probably reach the maximum number of restarts for the service very quickly,
[12:48:16] at which point the alert would fire
[12:48:33] ah cool
[12:48:42] can you make it alert on -operations too? so others are aware
[12:48:47] i'd prefer to test that outside of prod though
[12:49:04] probably, but let me finish rolling this out first :P
[12:49:08] you can try with the test cluster
[12:49:12] it has a master and a slave
[12:50:04] 👍
[12:50:44] oh, look at that.
[12:50:53] pt-heartbeat doesn't actually fail if it can't talk to mariadb
[12:50:59] Jun 03 12:30:49 db1124 pt-heartbeat-wikimedia[19509]: Can't connect to local MySQL server through socket '/run/mysqld/mysqld.sock' (2)
[12:51:03] ^ been running for 3 weeks
[12:51:10] wtf XDDDDD
[12:51:44] we can add the `--fail-successive-errors` flag to change that
[12:52:41] if not it keeps trying forever?
[12:52:46] apparently so!
[12:53:23] going to make that change manually on db1124 and see
[12:54:07] ok
[12:56:29] good news/bad news
[12:56:44] good news: it does see the service as failed now, and this shows up on icinga
[12:56:57] bad news: it's failing because it says it can't parse its cmdline
[12:57:07] `Unknown option: fail-successive-errors`
[12:57:10] ??
[12:57:46] any chance our manpage for pt-heartbeat doesn't match the version of pt-heartbeat-wikimedia we have deployed?
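A note on the 'failed' state discussed above: with a Restart= policy, systemd restarts the unit on exit and only marks it failed once the start-rate limit is exhausted; and, as the log shows, pt-heartbeat-wikimedia keeps retrying internally rather than exiting, so the unit never even gets that far. A hedged way to inspect this on a host (the unit name is from the log; the actual limits are whatever the deployed unit sets, not values asserted here):

    # What restart policy and rate limit does the deployed unit have?
    systemctl show pt-heartbeat-wikimedia -p Restart -p StartLimitBurst -p StartLimitIntervalUSec
    # The icinga check only complains when this reports 'failed'.
    systemctl is-failed pt-heartbeat-wikimedia
    systemctl status pt-heartbeat-wikimedia --no-pager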
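For reference on `--fail-successive-errors`: in upstream Percona Toolkit pt-heartbeat this option makes the updater exit with an error after N consecutive database errors instead of retrying forever, which is the behaviour change wanted here. The invocation below is a generic upstream-style sketch, not the wikimedia wrapper's real command line (the --help output further down confirms the deployed fork predates the option).

    # Hedged sketch: an upstream-style pt-heartbeat updater that gives up after 10 consecutive DB errors,
    # letting systemd see a real failure instead of endless retries.
    pt-heartbeat --update --daemonize -D heartbeat \
      --socket=/run/mysqld/mysqld.sock \
      --fail-successive-errors=10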
[12:57:49] root@db1124:~# pt-heartbeat-wikimedia --help | grep fail
[12:57:49] root@db1124:~#
[12:57:51] yeah :(
[12:58:07] we're bad, and should feel bad.
[12:59:00] ok. we should probably look at rebasing pt-heartbeat-wikimedia on top of a more modern version
[12:59:04] but that's for another task
[12:59:37] yeah
[13:00:14] ok, so for now the monitoring of the service is basically a no-op
[13:00:22] this isn't any _worse_ than what we had already
[13:00:26] it's just useless until we fix it.
[13:00:35] so i'm going to proceed to test on a few more dbs
[13:01:56] so, it won't start pt-heartbeat after reboot/restart right?
[13:02:06] it will; that's the whole point of this CR
[13:02:23] on every puppet run, it'll check to see if pt-heartbeat should be running or not,
[13:02:24] then I don't get the no-op comment
[13:02:28] and change the current state as appropriate
[13:02:40] marostegui: the no-op is about the _monitoring_ of the systemd service
[13:02:47] ah gotcha :)
[13:02:51] that can only fire if the service is in 'failed' mode, which currently isn't going to happen
[13:03:02] because pt-heartbeat-wikimedia doesn't GAF about issues
[13:04:05] ok cool - started pt-hb on db1125 (test-s4 replica), ran puppet, puppet stops it.
[13:04:28] sweet
[13:04:37] doing the inverse test on db1124 (test-s4 primary)
[13:04:55] `changed 'stopped' to 'running'`
[13:04:59] yep, behaving correctly
[13:05:03] \o/
[13:05:31] ok, i'll pick some more machines in codfw to roll out to by hand
[13:05:40] try to pick one in misc
[13:05:45] just in case those behave differently
[13:05:53] one mw replica, a primary+replica in misc, primary+replica in es
[13:06:18] that is good!
[13:06:26] making that a multiinstance mw replica
[13:06:51] oh, hah. they don't have pt-heartbeat services
[13:06:53] we don't have multi instance hosts running pt-heartbeat I think
[13:06:55] yeah
[13:06:59] ok grand
[13:10:26] hah, nvm - all misc replicas are multiinstance too
[13:10:28] onto es then
[13:10:45] the codfw masters aren't multi instance
[13:10:48] misc ones I mean
[13:10:57] done one of those already
[13:11:02] nice
[13:13:37] tested on a live es section (es4) and also on an archive one (es1)
[13:13:58] ok, i think at this point we're good to press Go
[13:16:41] good!
[13:16:58] (i have pressed Go. i'm now stepping away from the computer for a few minutes for, uh, safety reasons)
[13:17:08] XD
[13:56:11] PROBLEM - Check unit status of pt-heartbeat-wikimedia on db2151 is CRITICAL: CRITICAL: Status of the systemd unit pt-heartbeat-wikimedia https://wikitech.wikimedia.org/wiki/MariaDB/pt-heartbeat
[13:56:49] jynus: that's also mediabackup :)
[13:56:51] ^
[13:57:20] yes, I am doing the alter table
[13:57:26] thanks!
[13:57:37] those seem to be non-trivial, ya know!
[13:57:42] 0:-)
[13:59:41] I think it should be back soon
[14:00:15] RECOVERY - Check unit status of pt-heartbeat-wikimedia on db2151 is OK: OK: Status of the systemd unit pt-heartbeat-wikimedia https://wikitech.wikimedia.org/wiki/MariaDB/pt-heartbeat
[14:02:27] so I think thanks to kormat's patch, this went from a silent bug (pt-heartbeat not working with sections > 10 characters) to a reported one
[14:02:54] happily pt-heartbeat wasn't used for production on those servers
[14:03:14] and the lag check falls back to using seconds behind master to check lag
[14:03:33] actually, in this case as it didn't have replication, it wasn't checked at all
[14:03:55] thanks kormat for the improvements!
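The manual verification described above (start the unit on the test replica and let puppet stop it, then the inverse on the test primary) boils down to the following hedged checklist; `puppet agent --test` stands in for whatever wrapper is used on these hosts, and the expected output lines are paraphrased from the log, not taken from the CR.

    # On a replica (e.g. db1125): puppet should stop a manually started unit.
    systemctl start pt-heartbeat-wikimedia
    puppet agent --test                           # expect: ensure changed 'running' to 'stopped'
    systemctl is-active pt-heartbeat-wikimedia    # expect: inactive

    # On a primary (e.g. db1124): the inverse, puppet should (re)start it.
    systemctl stop pt-heartbeat-wikimedia
    puppet agent --test                           # expect: ensure changed 'stopped' to 'running'
    systemctl is-active pt-heartbeat-wikimedia    # expect: active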
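On "the lag check falls back to using seconds behind master": the two signals being compared are the timestamp pt-heartbeat writes into the heartbeat table versus the replication thread's own Seconds_Behind_Master. A rough, hedged way to look at both on a replica; the heartbeat.heartbeat table and its ts/shard columns follow the usual pt-heartbeat layout and are an assumption about the local setup.

    # Lag according to pt-heartbeat: how old is the newest heartbeat row per section?
    mysql -e "SELECT shard, MAX(ts) AS last_heartbeat, UTC_TIMESTAMP(6) AS now_utc
              FROM heartbeat.heartbeat GROUP BY shard"
    # Lag according to the replication thread itself (the fallback signal).
    mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master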
[14:09:12] hurray for accidental progress :)
[14:11:29] marostegui: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=^\(db\|es\|pc\)[12]&style=detail&servicestatustypes=29
[14:12:01] 54 pending service checks. 😅
[14:12:31] screw it, scheduling them all now
[14:12:36] let it burn
[14:15:59] haha
[14:25:10] icinga says db1110 didn't successfully run puppet last time. checking.
[14:25:25] aaand it worked fine v0v
[16:31:59] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10jbond) >>! In T274463#7159360, @Jelto wrote: > However there is a already existing backup configuration in [modules/gitlab/manifests/backup.pp](https:/...
[17:07:45] 10DBA, 10DiscussionTools, 10Performance-Team, 10Editing-team (FY2020-21 Kanban Board), and 2 others: Reduce parser cache retention temporarily for DiscussionTools - https://phabricator.wikimedia.org/T280605 (10Krinkle) 05Open→03Resolved
[20:32:14] 10DBA, 10Data-Services: Prepare and check storage layer for shiwiki - https://phabricator.wikimedia.org/T284928 (10LSobanski) p:05Triage→03Medium Thanks, let us know when the database is created, so we can sanitize it.
[22:16:01] 10DBA, 10SRE, 10Datacenter-Switchover: Check "Days in advance preparation" for databases before DC switchover - https://phabricator.wikimedia.org/T285069 (10Legoktm)