[00:52:08] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 27.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[01:01:04] RECOVERY - MariaDB sustained replica lag on pc2009 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[01:38:32] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 8.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[01:42:10] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[01:43:52] RECOVERY - MariaDB sustained replica lag on pc2009 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[01:43:56] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[04:38:14] Data-Persistence-Backup, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (Marostegui) db2097:s2 checked and it is all good from the schema changes point of view.
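The replica lag alerts in this log follow a fixed shape: a PROBLEM line reports `<value> ge <threshold>` (for example `27.2 ge 2`, measured lag at or above the critical threshold of 2), and a RECOVERY line reports `(C)<critical> ge (W)<warning> ge <value>`. A minimal sketch of pulling those numbers out of a log line follows. The regular expressions are inferred only from the lines seen here, the function name `parse_alert` is hypothetical, and the unit (presumably seconds) is an assumption.

```python
import re

# Patterns inferred from the alert lines in this log; other checks may
# use a different message layout.
PROBLEM_RE = re.compile(
    r"PROBLEM - (?P<check>.+?) on (?P<host>\S+) is (?P<state>\w+): "
    r"(?P<value>[\d.]+) ge (?P<threshold>[\d.]+)"
)
RECOVERY_RE = re.compile(
    r"RECOVERY - (?P<check>.+?) on (?P<host>\S+) is OK: "
    r"\(C\)(?P<crit>[\d.]+) ge \(W\)(?P<warn>[\d.]+) ge (?P<value>[\d.]+)"
)


def parse_alert(line):
    """Return a dict describing a lag alert line, or None if it is not one."""
    m = PROBLEM_RE.search(line)
    if m:
        return {
            "kind": "PROBLEM",
            "host": m["host"],
            "state": m["state"],
            "value": float(m["value"]),
            # Threshold that the measured value met or exceeded.
            "threshold": float(m["threshold"]),
        }
    m = RECOVERY_RE.search(line)
    if m:
        return {
            "kind": "RECOVERY",
            "host": m["host"],
            "critical": float(m["crit"]),
            "warning": float(m["warn"]),
            "value": float(m["value"]),
        }
    return None
```

Feeding it every line of the log and grouping by `host` would give the PROBLEM/RECOVERY pairs per parsercache replica (pc2007 through pc2010 here); non-alert lines simply return `None`.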
[08:41:00] Data-Persistence-Backup, Patch-For-Review, good first task: Improve filename regex in cli/recover-dump - https://phabricator.wikimedia.org/T277754 (jcrespo) I will be checking this contribution soon as part of my job maintaining backups, and hopefully, getting it merged.
[08:43:11] Data-Persistence-Backup, Patch-For-Review, good first task: Check we are preparing (xtrabackup --prepare) with the same package version as the server version of which the backup was taken - https://phabricator.wikimedia.org/T253959 (jcrespo) I will be checking this contribution soon as part of my job...
[08:44:46] Data-Persistence-Backup, Google-Summer-of-Code (2021), Patch-For-Review, good first task: Make recover-dump show the time taken - https://phabricator.wikimedia.org/T277160 (jcrespo) I will be checking this contribution soon as part of my job maintaining backups, and hopefully, getting it merged.
[08:45:39] Data-Persistence-Backup, Google-Summer-of-Code (2021), Patch-For-Review, good first task: transfer.py argument parsing exception - https://phabricator.wikimedia.org/T268258 (jcrespo) I will be checking this contribution soon as part of my job maintaining backups, and hopefully, getting it merged.
[08:46:48] Data-Persistence-Backup, Google-Summer-of-Code (2021), Patch-For-Review, good first task: recover-mariadb should use logging (logger) to indicate actions taken - https://phabricator.wikimedia.org/T277162 (jcrespo) I will be checking this contribution soon as part of my job maintaining backups, an...
[11:28:27] DBA, Analytics-Clusters, Analytics-Kanban, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) @razzi do you have an ETA on when do you will resume this work? Thanks!
(host is 87% now)
[11:40:40] Blocked-on-schema-change, DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (Marostegui)
[11:40:42] Blocked-on-schema-change, DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (Marostegui)
[11:40:45] Blocked-on-schema-change, DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (Marostegui)
[11:42:03] Blocked-on-schema-change, DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (Marostegui)
[11:42:06] Blocked-on-schema-change, DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (Marostegui)
[11:42:09] Blocked-on-schema-change, DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (Marostegui)
[11:56:41] DBA: db2094:3318 (sanitarium on codfw) needs recloning - https://phabricator.wikimedia.org/T283793 (Kormat) >>! In T283793#7124520, @Marostegui wrote: > See email - s8 reported some tables that need to be dropped Done. Looking at `redact_sanitarium.sh`, it doesn't do anything with `private_tables`. I guess...
[11:58:04] DBA: db2094:3318 (sanitarium on codfw) needs recloning - https://phabricator.wikimedia.org/T283793 (Marostegui) Normally what we do is: `redact_sanitarium.sh -d wikidatawiki -S socket_path | mysql -S socket_path wikidatawiki`
[12:01:32] DBA: db2094:3318 (sanitarium on codfw) needs recloning - https://phabricator.wikimedia.org/T283793 (Kormat) >>! In T283793#7125715, @Marostegui wrote: > Normally what we do is: `redact_sanitarium.sh -d wikidatawiki -S socket_path | mysql -S socket_path wikidatawiki` My point is that will only act on `modul...
[12:03:11] DBA: db2094:3318 (sanitarium on codfw) needs recloning - https://phabricator.wikimedia.org/T283793 (Marostegui) Ah yes, I misunderstood you. Yes, indeed, that's why we run check_private_data after data sanitization on new wikis, so we can also get those private tables deleted.
[12:10:44] DBA: db2094:3318 (sanitarium on codfw) needs recloning - https://phabricator.wikimedia.org/T283793 (Kormat) Open→Resolved >>! In T283793#7125719, @Marostegui wrote: > Ah yes, I misunderstood you. Yes, indeed, that's why we run check_private_data after data sanitization on new wikis, so we can also ge...
[12:13:24] DBA: db2094:3318 (sanitarium on codfw) needs recloning - https://phabricator.wikimedia.org/T283793 (jcrespo) To avoid redundancies, I think we should deprecate "redact_sanitarium.sh" and use the same script (check_private_data.py) for both checking and redacting. check_private_data.py can do almost everythin...
[12:15:46] DBA, SRE: wmf-auto-reinstall fails on hosts that run pt-heartbeat - https://phabricator.wikimedia.org/T252528 (LSobanski) A stub document capturing this is at https://wikitech.wikimedia.org/wiki/MariaDB/Rebooting_a_host.
[12:20:50] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[12:23:55] Blocked-on-schema-change, DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (Marostegui) s8 eqiad [x] dbstore1005 [x] db1178 [x] db1177 [x] db1172 [x] db1167 [x] db1154 [x] db1126 [x] db1116 [x] db1114 [x] db1111 [x] db1109 [x] db1104 [x] db1101 [x] db10...
[12:23:59] Blocked-on-schema-change, DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (Marostegui) s8 eqiad [x] dbstore1005 [x] db1178 [x] db1177 [x] db1172 [x] db1167 [x] db1154 [x] db1126 [x] db1116 [x] db1114 [x] db1111 [x] db1109 [x] db1104 [x] db1101 [x] db10...
[12:24:01] Blocked-on-schema-change, DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (Marostegui) s8 eqiad [x] dbstore1005 [x] db1178 [x] db1177 [x] db1172 [x] db1167 [x] db1154 [x] db1126 [x] db1116 [x] db1114 [x] db1111 [x] db1109 [x] db1104 [x] db1101 [x] db10...
[12:24:17] Blocked-on-schema-change, DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (Marostegui)
[12:24:37] Blocked-on-schema-change, DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (Marostegui)
[12:24:54] Blocked-on-schema-change, DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (Marostegui)
[12:25:52] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[12:29:54] Blocked-on-schema-change, DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (Marostegui)
[12:29:57] Blocked-on-schema-change, DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (Marostegui)
[12:29:59] Blocked-on-schema-change, DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (Marostegui)
[12:53:11] Data-Persistence-Backup, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2002.codfw.wmnet for hosts: ` ['db2098.codfw.wmnet'] ` The log can be found in `/var/l...
[13:16:17] Data-Persistence-Backup, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2098.codfw.wmnet'] ` and were **ALL** successful.
[13:30:51] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[13:32:09] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[13:55:28] DBA, SRE, ops-codfw: codfw: db2079 memory issue on DIMM B8 - https://phabricator.wikimedia.org/T283743 (Papaul) @Marostegui hello you can go ahead and depool the server i will be on site in about an hour. Thanks
[13:56:04] DBA, Data-Persistence-Backup, ops-codfw: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (Papaul) @jcrespo hello you can go ahead and depool the server i will be on site in about an hour to swap the DIMM.
Thanks
[13:56:16] DBA, SRE, ops-codfw: codfw: db2079 memory issue on DIMM B8 - https://phabricator.wikimedia.org/T283743 (Marostegui) Excellent, thanks @Papaul
[14:00:17] DBA, SRE, ops-codfw: codfw: db2079 memory issue on DIMM B8 - https://phabricator.wikimedia.org/T283743 (Marostegui) db2079 is off and ready for you @Papaul
[14:37:37] I've restarted a few prometheus-mysqld-exporter processes that seemed to be stuck on the dashboard
[14:37:52] jynus: thanks
[14:38:47] Data-Persistence-Backup, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (jcrespo)
[14:44:07] DBA, Data-Persistence-Backup, ops-codfw: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (jcrespo) Shutting it down, will comment again when downtimed to prevent unwanted alerts and fully down.
[14:48:28] Data-Persistence-Backup, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (jcrespo)
[15:01:30] DBA, Data-Persistence-Backup, ops-codfw: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (jcrespo) @Papaul The host should be down already and has been downtime'd for a day- it is all yours. Just reboot it after you are done and comment here...
[15:08:47] Data-Persistence-Backup, SRE, netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (jcrespo) FYI, cross-dc backups are now in a "normal state" meaning we should only have those a few hours...
[15:26:25] DBA, SRE, ops-codfw: codfw: db2079 memory issue on DIMM B8 - https://phabricator.wikimedia.org/T283743 (Papaul) Open→Resolved Swapped DIMM B8 with DIMM A8 we will see if we do see the issue on DIMM A8 . If we do, I will use one of the DIMM from one if the Decom servers . Resolving this task...
[15:38:09] DBA, SRE, ops-codfw: codfw: db2079 memory issue on DIMM B8 - https://phabricator.wikimedia.org/T283743 (Marostegui) Resolved→Open On boot, we are hitting T216240, @Papaul let's get firmware and bios upgraded please
[16:16:13] DBA, SRE, ops-codfw: codfw: db2079 memory issue on DIMM B8 - https://phabricator.wikimedia.org/T283743 (Papaul) Open→Resolved Firmware upgrade complete
[16:16:36] DBA, SRE, ops-codfw: codfw: db2079 memory issue on DIMM B8 - https://phabricator.wikimedia.org/T283743 (Marostegui) MySQL started - thanks Papaul!
[16:17:55] DBA, Data-Persistence-Backup, ops-codfw: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (Papaul) Open→Resolved swapped P1 DIMM5 with P2 DIMM5 . Server is back online. closing is issue is seen on P2 DIMM 5. I will request a DIMM repla...
[16:27:27] Data-Persistence-Backup, Wikimedia-Mailing-lists: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (Legoktm) >>! In T282303#7118984, @Ladsgroup wrote: > ` > root@lists1001:/var/tmp/bacula-restores/var/lib/mailman/archives/private/cloud-announce.mbox# cmp cloud-announce.mbox /var...
[16:58:01] Data-Persistence-Backup, Wikimedia-Mailing-lists: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (Ladsgroup) >>! In T282303#7126689, @Legoktm wrote: {meme, src="such-data"} Then it's good. Let's clean up 🧹
[17:02:15] Data-Persistence-Backup, Wikimedia-Mailing-lists: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (jcrespo) This helps clarify it was certainly not some bit-flipping-on-the-wire kind of corruption in our backup system, which would impact all of bacula jobs. Thanks for looking i...
[18:51:33] Data-Persistence-Backup, GitLab (Initialization), Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (wkandek) I think that is something that Jelto (our new SRE - starts June 7) can handle, i.e. add a second disk with the right dimensions. In the mean...
[19:17:07] Data-Persistence-Backup, GitLab (Initialization), Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (Sergey.Trofimovsky.SF) Yes, this is reasonable. These two variables need to be updated accordingly (and Ansible redeployed): `gitlab_backup_keep_time:...
[19:25:29] DBA, Analytics-Clusters, Analytics-Kanban, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (razzi) @Marostegui I'll reimage db1183 today, should be set for you to work on it tomorrow.
[19:29:56] DBA, Data-Persistence-Backup, Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (Cmjohnson)
[19:29:59] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Cmjohnson)
[19:34:08] DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (Marostegui)
[19:37:04] hello! am trying to follow how mariadb::instance works. I see there is a mariadb@ systemd template service. does that mean that the concreted mariadb services are not managed by puppet?
[19:37:16] do users just do systemctl start mariadb@
[19:37:17] ?
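On the systemd side of the question above: a template unit file named `mariadb@.service` is not started itself. Each instance is addressed as `mariadb@<instance>.service`, and `%i` inside the unit file expands to the instance name. The log does not say whether Puppet manages those instances, so the sketch below covers only the naming rule. The helper names are hypothetical and the instance name `s8` is an illustrative assumption.

```python
def instance_unit(template, instance, suffix="service"):
    """Build the unit name for one instance of a template unit."""
    return f"{template}@{instance}.{suffix}"


def split_unit(unit):
    """Split a unit name into (template, instance, suffix).

    instance is None for a plain unit and "" for the bare template
    (e.g. "mariadb@.service"). A missing suffix is treated as
    "service", which is what systemctl assumes as well.
    """
    name, dot, suffix = unit.rpartition(".")
    if not dot:
        name, suffix = unit, "service"
    template, at, instance = name.partition("@")
    if not at:
        return template, None, suffix
    return template, instance, suffix
```

With these helpers, `instance_unit("mariadb", "s8")` yields the name one would pass to `systemctl start`, and `split_unit` distinguishes the bare template from a concrete instance.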
[20:51:16] DBA, wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (bd808)
[21:46:10] Data-Persistence-Backup, Wikimedia-Mailing-lists, Patch-For-Review: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (Legoktm) Now that the mailman2 package is gone, if we need to unpickle a config file to look at it we'll need to install MM2 in a container locally or someth...
[21:59:13] Data-Persistence-Backup, Wikimedia-Mailing-lists, Patch-For-Review: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (Ladsgroup) Maybe with virtualenv?
[22:01:26] Data-Persistence-Backup, Wikimedia-Mailing-lists, Patch-For-Review: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (Ladsgroup) >>! In T282303#7127705, @Ladsgroup wrote: > Maybe with virtualenv? for example from the source code but that'll be "fun"