[04:08:58] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 3.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[04:14:24] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[04:29:02] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[04:32:24] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) @razzi thanks - let me know when I can proceed
[04:34:30] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[04:43:08] 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): wikitech database has almost all of its varbinary fields wrong - https://phabricator.wikimedia.org/T269348 (10Marostegui)
[05:00:12] For awareness if you use dbctl and noticed !log isn't working: https://phabricator.wikimedia.org/T284123
[05:01:57] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (10Marostegui) s4 eqiad [x] dbstore1004 [] db1183 [] db1160 [x] db1155 [x] db1150 [] db1149 [x] db1148 [] db1147 [x] db1146 [x] db1145 [] db1144 [] db1143 [] db1142 [] db1141 [] d...
[05:01:59] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (10Marostegui) s4 eqiad [x] dbstore1004 [] db1183 [] db1160 [x] db1155 [x] db1150 [] db1149 [x] db1148 [] db1147 [x] db1146 [x] db1145 [] db1144 [] db1143 [] db1142 [] db1141 [] d...
[05:02:04] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (10Marostegui) s4 eqiad [x] dbstore1004 [] db1183 [] db1160 [x] db1155 [x] db1150 [] db1149 [x] db1148 [] db1147 [x] db1146 [x] db1145 [] db1144 [] db1143 [] db1142 [] db1141 [] d...
[05:28:30] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: `db1183.eqiad.wmnet` - db1183.eqiad.wmnet (**PASS**) - Downtimed...
[06:24:25] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts: ` db1183.eqiad.wmnet ` The log can be found in `/var/log...
[06:29:34] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1183.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1183.eqiad.wmnet'] `
[06:39:05] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts: ` dbstore1007.eqiad.wmnet ` The log can be found in `/va...
[06:59:16] 10DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10Marostegui)
[06:59:24] 10DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10Marostegui) p:05Triage→03Medium
[07:01:57] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbstore1007.eqiad.wmnet'] ` and were **ALL** successful.
[07:41:52] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10jcrespo) > My understanding is that bacula runs daily and copies new backups. Bacula has not been set up at all. You should have a "latest" directory w...
[08:03:04] 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): wikitech database has almost all of its varbinary fields wrong - https://phabricator.wikimedia.org/T269348 (10Ladsgroup) If we change charsets to binary, this would be fixed automatically. Shall I do it?
[08:40:02] 10DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10Kormat) a:03Kormat
[09:38:11] 10Data-Persistence-Backup, 10SRE, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10cmooney) Thanks for the info @jcrespo that should help. I did a lot of tests yesterday in relation to th...
[10:01:29] 10DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kormat@cumin1001 for hosts: `dbstore1006.eqiad.wmnet` - dbstore1006.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found physical host - Down...
[10:18:41] 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): wikitech database has almost all of its varbinary fields wrong - https://phabricator.wikimedia.org/T269348 (10Marostegui) @Ladsgroup if you want to take care of it, that's good!
[10:51:17] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) Thanks @razzi - could you or @elukey let me know if I can stop this host? (Given it is the start of the month, not sure if it is being used)
[11:32:13] An online schema change is taking me one day on s4
[11:32:19] A non-online one is taking me 2 weeks
[11:32:23] And I haven't even finished it :(
[11:38:38] which one?
[11:39:04] I am having large problems scaling dumps of image
[11:42:27] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (10Marostegui)
[11:42:41] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (10Marostegui)
[11:43:00] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (10Marostegui)
[12:04:03] 10DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10Marostegui) Dropped dbstore1006 from tendril and zarcillo
[12:06:17] marostegui: cheers :)
[12:06:28] it was messing up my schema change lists :`
[12:06:30] :p
[12:06:32] lol
[12:06:34] even better
[12:07:06] grrrrr
[12:18:18] 10DBA, 10Patch-For-Review: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db1125.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202106021218_k...
[12:43:39] 10DBA, 10Patch-For-Review: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db1125.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202106021242_k...
[12:59:37] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10elukey) No go for this week :(
[13:00:36] 10DBA, 10Patch-For-Review: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1125.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1125.eqiad.wmnet'] `
[13:03:24] 10DBA, 10Patch-For-Review: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db1125.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202106021303_k...
[13:24:29] 10DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1125.eqiad.wmnet'] ` and were **ALL** successful.
[13:37:02] 10DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10Kormat) Current status: - db1125 has been renamed, wiped, and reimaged - It still needs to be re-added to tendril/zarcillo, and have an s4 snapshot deployed on it.
[14:29:26] 10DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (10Marostegui) @Kormat no need to add s4 data to it, just make it a replica of db1124 :)
[14:45:20] now we have the timestamp of the report and the time it took to build it for drifts https://drift-tracker.toolforge.org/report/core/
[14:47:34] 10DBA, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) No worries - I will ping again on Monday next week
[15:43:46] 10DBA, 10cloud-services-team (Kanban): Use mariabackup instead of xtrabackup for galera backups? (Or possibly for all maria backups?) - https://phabricator.wikimedia.org/T284157 (10Andrew)
[15:45:10] 10DBA, 10cloud-services-team (Kanban): Use mariabackup instead of xtrabackup for galera backups? (Or possibly for all maria backups?) - https://phabricator.wikimedia.org/T284157 (10Bstorm) At very least, it is clear that Galera SST processes are supported by mariabackup. I'm not sure about xtrabackup, but it...
[15:51:28] 10DBA, 10cloud-services-team (Kanban): Use mariabackup instead of xtrabackup for galera backups? (Or possibly for all maria backups?) - https://phabricator.wikimedia.org/T284157 (10Bstorm)
[15:52:50] 10DBA, 10cloud-services-team (Kanban): Use mariabackup instead of xtrabackup for galera backups? (Or possibly for all maria backups?) - https://phabricator.wikimedia.org/T284157 (10jcrespo) the WMF package tries to standardize this for future extensibility. This is just FYI, as you probably use a different pac...
[15:54:07] 10Data-Persistence-Backup, 10database-backups, 10cloud-services-team (Kanban): Use mariabackup instead of xtrabackup for galera backups? (Or possibly for all maria backups?) - https://phabricator.wikimedia.org/T284157 (10jcrespo)
[16:01:47] 10Data-Persistence-Backup, 10database-backups, 10cloud-services-team (Kanban): Use mariabackup instead of xtrabackup for galera backups? (Or possibly for all maria backups?) - https://phabricator.wikimedia.org/T284157 (10Bstorm) Looks like for cloud (since we are using upstream), we should use https://packa...
[16:02:08] 10Data-Persistence-Backup, 10database-backups, 10cloud-services-team (Kanban): Use mariabackup instead of xtrabackup for galera backups? (Or possibly for all maria backups?) - https://phabricator.wikimedia.org/T284157 (10Bstorm) Not sure what the puppet bacula thing is doing :)
[16:42:12] 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists, 10bacula, 10Patch-For-Review: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (10jcrespo)
[16:42:43] 10Data-Persistence-Backup, 10SRE, 10bacula, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo)
[16:59:36] kormat: pc1010 is approaching full disk
[16:59:42] is it serving?
[17:00:07] not from what I can see
[17:00:52] no, not yet
[17:00:59] it's replicating from pc1008
[17:01:31] so it is probably getting full cause all the keys that belonged to pc1 were removed (as it is no longer replicating from that master)
[17:01:37] *were not removed
[17:01:54] right
[17:01:59] shall we just set it quietly on fire?
[17:02:03] +1
[17:02:09] 🔥
[17:02:44] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&var-server=pc1010&var-datasource=thanos&var-cluster=mysql&from=1621406166823&to=1622653346554
[17:05:38] marostegui: i'm guessing i probably should just truncate all the tables
[17:05:47] yeah, that's ok
[17:05:59] it won't have pc2 keys anyways when it is time for it to serve in pc2
[17:06:10] when are you planning to do so?
[17:06:13] marostegui: it will have _some_, as it's replicating from pc2 primary
[17:06:16] i can do it in a few
[17:06:26] No, I mean depooling pc1008
[17:06:29] And placing pc1010
[17:06:29] ahh
[17:06:37] tomorrow probably
[17:06:50] Sounds good, so at least it will have some hours of keys
[17:07:05] But normally when we do it it has none, as we just move it from pc1 to pc2 or pc3, whatever is needed
[17:07:10] so some hours is still better than that :)
[17:08:14] truncate running
[17:08:20] \o/
[17:08:46] host downtimed just in case?
[17:09:25] uhh
[17:10:01] sure! 😅
[17:10:04] XDDD
[17:10:29] truncate finished
[17:10:35] sweet
[17:38:50] 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (10jcrespo) Followup to our conversation, Re: T280979#7119805 @Marostegui Recently changed/setup/rebuilt hosts: * db2097:s1 and s2 * db2098:s7 and s8 * db2100:s7...
[17:39:45] 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (10Marostegui) Cool - I will check them tomorrow and will come back to you here
[18:42:28] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10Dzahn) There are 2 things needed to make Bacula backups work on a given host. First is to add the "backup::host" puppet class to the host (but not dir...
[18:55:26] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10wkandek) gitlab is configured to store backups in /srv/gitlab-backup grep backup_path /etc/gitlab/gitlab.rb # The directory where Gitlab backups will...
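[editor's note] The pc1010 cleanup above ("truncate all the tables", run while the host is not serving) amounts to issuing a TRUNCATE per parsercache shard table. A minimal sketch of generating those statements, assuming the usual convention of 256 shard tables named pc000 through pc255 — that naming is an assumption, not stated in this log, so verify against SHOW TABLES before running anything:

```python
# Sketch only: emit TRUNCATE statements for the parsercache shard tables.
# The pc000..pc255 naming is assumed; check the live schema first.
def truncate_statements(count: int = 256) -> list[str]:
    return [f"TRUNCATE TABLE pc{i:03d};" for i in range(count)]

for stmt in truncate_statements():
    print(stmt)
```

In practice the output would be piped to the mysql client on the depooled replica only, never on a host that is serving traffic.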
[18:56:39] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10Dzahn) https://puppet-compiler.wmflabs.org/compiler1002/29784/gitlab1001.wikimedia.org/index.html
[19:04:12] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10Dzahn) We now have the Bacula client (fd for file daemon) running on gitlab1001. It ` [gitlab1001:~] $ ps aux | grep bacula root 11735 0.0 0.0...
[19:24:16] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10Dzahn) all interested, please take a look at the latest Gerrit link above and feel free to review/comment there. or just let me know here.. that is a...
[20:08:37] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[20:12:07] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[21:57:28] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[21:58:45] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[22:28:47] 10DBA, 10wikitech.wikimedia.org, 10User-Ladsgroup, 10cloud-services-team (Kanban): wikitech database has almost all of its varbinary fields wrong - https://phabricator.wikimedia.org/T269348 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Done
[22:28:57] 10DBA, 10Datasets-General-or-Unknown, 10Patch-For-Review, 10Sustainability (Incident Followup), and 2 others: Detect object, schema and data drifts between mediawiki HEAD, production masters and replicas - https://phabricator.wikimedia.org/T104459 (10Ladsgroup)
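[editor's note] The "MariaDB sustained replica lag" alerts throughout this log encode their thresholds inline: "CRITICAL: 3.2 ge 2" means the measured lag (3.2 s) is greater than or equal to the critical threshold (2 s), and recoveries print both thresholds as "(C)2 ge (W)1 ge 0.4". A minimal sketch of that comparison, with the thresholds taken from the alert text rather than from the actual check's configuration:

```python
# Sketch only: mirror the threshold logic implied by the alert messages.
# warn/crit defaults (1 s / 2 s) are read off the "(C)2 ge (W)1" text,
# not from the real monitoring config.
def classify_lag(lag_seconds: float, warn: float = 1.0, crit: float = 2.0) -> str:
    if lag_seconds >= crit:
        return "CRITICAL"
    if lag_seconds >= warn:
        return "WARNING"
    return "OK"

print(classify_lag(3.2))  # the 04:08:58 PROBLEM -> CRITICAL
print(classify_lag(0.4))  # the 04:14:24 RECOVERY -> OK
```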