[04:08:58] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 3.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[04:14:24] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[04:29:02] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[04:32:24] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) @razzi thanks - let me know when I can proceed
[04:34:30] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[04:43:08] 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): wikitech database has almost all of its varbinary fields wrong - https://phabricator.wikimedia.org/T269348 (10Marostegui)
[05:00:12] For awareness if you use dbctl and noticed !log isn't working: https://phabricator.wikimedia.org/T284123
[05:01:57] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (10Marostegui) s4 eqiad [x] dbstore1004 [] db1183 [] db1160 [x] db1155 [x] db1150 [] db1149 [x] db1148 [] db1147 [x] db1146 [x] db1145 [] db1144 [] db1143 [] db1142 [] db1141 [] d...
[05:01:59] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (10Marostegui) s4 eqiad [x] dbstore1004 [] db1183 [] db1160 [x] db1155 [x] db1150 [] db1149 [x] db1148 [] db1147 [x] db1146 [x] db1145 [] db1144 [] db1143 [] db1142 [] db1141 [] d...
[05:02:04] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (10Marostegui) s4 eqiad [x] dbstore1004 [] db1183 [] db1160 [x] db1155 [x] db1150 [] db1149 [x] db1148 [] db1147 [x] db1146 [x] db1145 [] db1144 [] db1143 [] db1142 [] db1141 [] d...
[05:28:30] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: `db1183.eqiad.wmnet` - db1183.eqiad.wmnet (**PASS**) - Downtimed...
[06:24:25] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts: ` db1183.eqiad.wmnet ` The log can be found in `/var/log...
[06:29:34] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1183.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1183.eqiad.wmnet'] `
[06:39:05] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts: ` dbstore1007.eqiad.wmnet ` The log can be found in `/va...
[06:59:16] 10DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10Marostegui)
[06:59:24] 10DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10Marostegui) p:05Triage→03Medium
[07:01:57] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbstore1007.eqiad.wmnet'] ` and were **ALL** successful.
[07:41:52] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10jcrespo) > My understanding is that bacula runs daily and copies new backups. Bacula has not been set up at all. You should have a "latest" directory w...
[08:03:04] 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): wikitech database has almost all of its varbinary fields wrong - https://phabricator.wikimedia.org/T269348 (10Ladsgroup) If we change charsets to binary, this would be fixed automatically. Shall I do it?
[08:40:02] 10DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10Kormat) a:03Kormat
[09:38:11] 10Data-Persistence-Backup, 10SRE, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10cmooney) Thanks for the info @jcrespo that should help. I did a lot of tests yesterday in relation to th...
[10:01:29] 10DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kormat@cumin1001 for hosts: `dbstore1006.eqiad.wmnet` - dbstore1006.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found physical host - Down...
[10:18:41] 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): wikitech database has almost all of its varbinary fields wrong - https://phabricator.wikimedia.org/T269348 (10Marostegui) @Ladsgroup if you want to take care of it, that's good!
[10:51:17] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) Thanks @razzi - could you or @elukey let me know if I can stop this host? (Given it is the start of the month, not sure if it is being used)
[11:32:13] An online schema change is taking me one day on s4
[11:32:19] A non-online one is taking me 2 weeks
[11:32:23] And I haven't even finished it :(
[11:38:38] which one?
[11:39:04] I am having large problems scaling dumps of image
[11:42:27] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (10Marostegui)
[11:42:41] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (10Marostegui)
[11:43:00] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (10Marostegui)
[12:04:03] 10DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10Marostegui) Dropped dbstore1006 from tendril and zarcillo
[12:06:17] marostegui: cheers :)
[12:06:28] it was messing up my schema change lists :`
[12:06:30] :p
[12:06:32] lol
[12:06:34] even better
[12:07:06] grrrrr
[12:18:18] 10DBA, 10Patch-For-Review: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db1125.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202106021218_k...
[12:43:39] 10DBA, 10Patch-For-Review: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db1125.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202106021242_k...
[12:59:37] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10elukey) No go for this week :(
[13:00:36] 10DBA, 10Patch-For-Review: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1125.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1125.eqiad.wmnet'] `
[13:03:24] 10DBA, 10Patch-For-Review: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db1125.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202106021303_k...
[13:24:29] 10DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1125.eqiad.wmnet'] ` and were **ALL** successful.
[13:37:02] 10DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10Kormat) Current status: - db1125 has been renamed, wiped, and reimaged - It still needs to be re-added to tendril/zarcillo, and have an s4 snapshot deployed on it.
[14:29:26] 10DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10Marostegui) @Kormat no need to add s4 data to it, just make it a replica of db1124 :)
[14:45:20] now we have the timestamp of the report and the time it took to build it for drifts https://drift-tracker.toolforge.org/report/core/
[14:47:34] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) No worries - I will ping again on Monday next week
[15:43:46] 10DBA, 10cloud-services-team (Kanban): Use mariabackup instead of xtrabackup for galera backups? (Or possibly for all maria backups?) - https://phabricator.wikimedia.org/T284157 (10Andrew)
[15:45:10] 10DBA, 10cloud-services-team (Kanban): Use mariabackup instead of xtrabackup for galera backups? (Or possibly for all maria backups?) - https://phabricator.wikimedia.org/T284157 (10Bstorm) At very least, it is clear that Galera SST processes are supported by mariabackup. I'm not sure about xtrabackup, but it...
[15:51:28] 10DBA, 10cloud-services-team (Kanban): Use mariabackup instead of xtrabackup for galera backups? (Or possibly for all maria backups?) - https://phabricator.wikimedia.org/T284157 (10Bstorm)
[15:52:50] 10DBA, 10cloud-services-team (Kanban): Use mariabackup instead of xtrabackup for galera backups? (Or possibly for all maria backups?) - https://phabricator.wikimedia.org/T284157 (10jcrespo) the WMF package tries to standardize this for future extensibility. This is just FYI, as you probably use a different pac...
[15:54:07] 10Data-Persistence-Backup, 10database-backups, 10cloud-services-team (Kanban): Use mariabackup instead of xtrabackup for galera backups? (Or possibly for all maria backups?) - https://phabricator.wikimedia.org/T284157 (10jcrespo)
[16:01:47] 10Data-Persistence-Backup, 10database-backups, 10cloud-services-team (Kanban): Use mariabackup instead of xtrabackup for galera backups? (Or possibly for all maria backups?) - https://phabricator.wikimedia.org/T284157 (10Bstorm) Looks like for cloud (since we are using upstream), we should use https://packa...
[16:02:08] 10Data-Persistence-Backup, 10database-backups, 10cloud-services-team (Kanban): Use mariabackup instead of xtrabackup for galera backups? (Or possibly for all maria backups?) - https://phabricator.wikimedia.org/T284157 (10Bstorm) Not sure what the puppet bacula thing is doing :)
[16:42:12] 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists, 10bacula, 10Patch-For-Review: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (10jcrespo)
[16:42:43] 10Data-Persistence-Backup, 10SRE, 10bacula, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo)
[16:59:36] kormat: pc1010 is approaching full disk
[16:59:42] is it serving?
[17:00:07] not from what I can see
[17:00:52] no, not yet
[17:00:59] it's replicating from pc1008
[17:01:31] so it is probably getting full cause all the keys that belonged to pc1 were removed (as it is no longer replicating from that master)
[17:01:37] *were not removed
[17:01:54] right
[17:01:59] shall we just set it quietly on fire?
[17:02:03] +1
[17:02:09] 🔥
[17:02:44] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&var-server=pc1010&var-datasource=thanos&var-cluster=mysql&from=1621406166823&to=1622653346554
[17:05:38] marostegui: i'm guessing i probably should just truncate all the tables
[17:05:47] yeah, that's ok
[17:05:59] it won't have pc2 keys anyways when it is time for it to serve in pc2
[17:06:10] when are you planning to do so?
[17:06:13] marostegui: it will have _some_, as it's replicating from pc2 primary
[17:06:16] i can do it in a few
[17:06:26] No, I mean depooling pc1008
[17:06:29] And placing pc1010
[17:06:29] ahh
[17:06:37] tomorrow probably
[17:06:50] Sounds good, so at least it will have some hours of keys
[17:07:05] But normally when we do it it has none, as we just move it from pc1 to pc2 or pc3, whatever is needed
[17:07:10] so some hours is still better than that :)
[17:08:14] truncate running
[17:08:20] \o/
[17:08:46] host downtimed just in case?
[17:09:25] uhh
[17:10:01] sure! 😅
[17:10:04] XDDD
[17:10:29] truncate finished
[17:10:35] sweet
[17:38:50] 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (10jcrespo) Followup to our conversation, Re: T280979#7119805 @Marostegui Recently changed/setup/rebuilt hosts: * db2097:s1 and s2 * db2098:s7 and s8 * db2100:s7...
[17:39:45] 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (10Marostegui) Cool - I will check them tomorrow and will come back to you here
[18:42:28] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10Dzahn) There are 2 things needed to make Bacula backups work on a given host. First is to add the "backup::host" puppet class to the host (but not dir...
[18:55:26] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10wkandek) gitlab is configured to store backups in /srv/gitlab-backup grep backup_path /etc/gitlab/gitlab.rb # The directory where Gitlab backups will...
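[editor's note] The pc1010 cleanup above ("truncate all the tables", run while the host is not serving) amounts to issuing a TRUNCATE per parsercache shard table. A minimal sketch of generating those statements, assuming the usual convention of 256 shard tables named pc000 through pc255 — that naming is an assumption, not stated in this log, so verify against SHOW TABLES before running anything:

```python
# Sketch only: emit TRUNCATE statements for the parsercache shard tables.
# The pc000..pc255 naming is assumed; check the live schema first.
def truncate_statements(count: int = 256) -> list[str]:
    return [f"TRUNCATE TABLE pc{i:03d};" for i in range(count)]

for stmt in truncate_statements():
    print(stmt)
```

In practice the output would be piped to the mysql client on the depooled replica only, never on a host that is serving traffic.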
[18:56:39] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10Dzahn) https://puppet-compiler.wmflabs.org/compiler1002/29784/gitlab1001.wikimedia.org/index.html
[19:04:12] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10Dzahn) We now have the Bacula client (fd for file daemon) running on gitlab1001. It ` [gitlab1001:~] $ ps aux | grep bacula root 11735 0.0 0.0...
[19:24:16] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10Dzahn) all interested, please take a look at the latest Gerrit link above and feel free to review/comment there. or just let me know here.. that is a...
[20:08:37] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[20:12:07] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[21:57:28] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[21:58:45] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[22:28:47] 10DBA, 10wikitech.wikimedia.org, 10User-Ladsgroup, 10cloud-services-team (Kanban): wikitech database has almost all of its varbinary fields wrong - https://phabricator.wikimedia.org/T269348 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Done
[22:28:57] 10DBA, 10Datasets-General-or-Unknown, 10Patch-For-Review, 10Sustainability (Incident Followup), and 2 others: Detect object, schema and data drifts between mediawiki HEAD, production masters and replicas - https://phabricator.wikimedia.org/T104459 (10Ladsgroup)
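[editor's note] The "MariaDB sustained replica lag" alerts throughout this log encode their thresholds inline: "CRITICAL: 3.2 ge 2" means the measured lag (3.2 s) is greater than or equal to the critical threshold (2 s), and recoveries print both thresholds as "(C)2 ge (W)1 ge 0.4". A minimal sketch of that comparison, with the thresholds taken from the alert text rather than from the actual check's configuration:

```python
# Sketch only: mirror the threshold logic implied by the alert messages.
# warn/crit defaults (1 s / 2 s) are read off the "(C)2 ge (W)1" text,
# not from the real monitoring config.
def classify_lag(lag_seconds: float, warn: float = 1.0, crit: float = 2.0) -> str:
    if lag_seconds >= crit:
        return "CRITICAL"
    if lag_seconds >= warn:
        return "WARNING"
    return "OK"

print(classify_lag(3.2))  # the 04:08:58 PROBLEM -> CRITICAL
print(classify_lag(0.4))  # the 04:14:24 RECOVERY -> OK
```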