[03:50:35] DBA, MediaWiki-extensions-SecurePoll, Performance-Team, Platform Engineering, and 2 others: Creating an election with "all wikis" can give a DBTransactionSizeError - https://phabricator.wikimedia.org/T287859 (tstarling) The code is already making a reasonable attempt at splitting up the transacti...
[04:39:37] DBA, serviceops, User-fgiunchedi, cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (Marostegui) I have disabled puppet on the active dbproxies: * dbproxy1013 * dbproxy1014 * dbproxy1020
[04:42:17] DBA, ParserFunctions, MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), good first task: Image links from #ifexist:Media:... are not being registered properly on tawiktionary - https://phabricator.wikimedia.org/T245965 (Marostegui) I know that @Ladsgroup is thinking about ideas on how to improve the *li...
[04:53:49] DBA, Toolhub, serviceops, User-bd808: Discuss database needs with the DBA team - https://phabricator.wikimedia.org/T271480 (Marostegui) >>! In T271480#7252163, @bd808 wrote: >>>! In T271480#7251095, @Marostegui wrote: >>>>! In T271480#7225145, @bd808 wrote: >>> * toolhub: user with CRUD rights on...
[04:54:50] DBA, MediaWiki-extensions-SecurePoll, Performance-Team, Platform Engineering, and 3 others: Creating an election with "all wikis" can give a DBTransactionSizeError - https://phabricator.wikimedia.org/T287859 (tstarling) With the patch above, the transactions are short, and the write queries are n...
[05:04:53] DBA, serviceops, Patch-For-Review, User-fgiunchedi, cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (Marostegui) The above patch is ready to be merged and deployed once the standby dbproxies are done.
[05:13:27] DBA: Switchover s2 from db2107 to db2104 - https://phabricator.wikimedia.org/T287454 (Marostegui)
[05:54:23] DBA, ParserFunctions, MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), good first task: Image links from #ifexist:Media:... are not being registered properly on tawiktionary - https://phabricator.wikimedia.org/T245965 (doctaxon) @Marostegui "delicate position"? Let me know about problems, if there're a...
[06:02:23] DBA, ParserFunctions, MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), good first task: Image links from #ifexist:Media:... are not being registered properly on tawiktionary - https://phabricator.wikimedia.org/T245965 (Marostegui) In general the *links tables are reaching quite considerable on-disk siz...
[06:18:25] Last dump for s4 at codfw (db2099.codfw.wmnet:3314) taken on 2021-08-03 00:00:02 is 176 GB, but previous one was 255 GB, a change of 30.7%
[06:19:01] that must be the image clean up!
[06:19:03] nice
[06:20:25] DBA, Commons, MediaWiki-File-management, MW-1.37-notes (1.37.0-wmf.14; 2021-07-12), and 4 others: Address "image" table capacity problems by storing pdf/djvu text outside file metadata - https://phabricator.wikimedia.org/T275268 (jcrespo) ` root@db1159.eqiad.wmnet[dbbackups]> nopager; select star...
[06:20:31] ^ Amir1
[06:21:59] that is almost 6 times smaller
[06:22:40] the backup 5 times faster
[06:40:29] <3
[07:09:35] DBA, Infrastructure-Foundations, Recommendation-API, SRE, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (Marostegui) db1183 is now up and replicating from db1107
[07:32:03] jynus: wohooooo
[07:32:19] I think that'd be all for s4, this week is the last
[07:32:55] for image table I mean, s4 needs much more work in general (links table)
[07:40:37] 10% left, I assume it'll be done tomorrow or the day after
[07:57:47] Data-Persistence-Backup, database-backups, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['db1145.eqiad.wmnet'] ` The log c...
[08:21:33] Data-Persistence-Backup, database-backups, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1145.eqiad.wmnet'] ` and were **ALL** successful.
[08:36:07] Data-Persistence-Backup, database-backups, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['db1145.eqiad.wmnet'] ` The log c...
[08:39:36] DBA, serviceops, Patch-For-Review, User-fgiunchedi, cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (fgiunchedi) Thank you @Marostegui ! To recap here's my plan: # stop puppet on `C:haproxy` # merge https://gerrit...
[09:02:40] Data-Persistence-Backup, database-backups, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1145.eqiad.wmnet'] ` and were **ALL** successful.
[09:03:57] Data-Persistence-Backup, database-backups, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (jcrespo)
[09:05:22] DBA, ParserFunctions, MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), good first task: Image links from #ifexist:Media:... are not being registered properly on tawiktionary - https://phabricator.wikimedia.org/T245965 (Ladsgroup) imagelinks table (along rest of *links table) are basically a ticking bom...
[09:33:46] DBA, Infrastructure-Foundations, Recommendation-API, SRE, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (hnowlan) No objection for sockpuppet, thanks!
[09:35:25] so, time to drop this channel?
[09:37:11] jynus: what do you mean?
[09:37:21] finalize T283580
[09:37:22] T283580: Data Persistence IRC channels updates - https://phabricator.wikimedia.org/T283580
[09:37:44] which was to rename this to wikimedia-data-persistence and add a separate channel for bot traffic?
[09:38:05] kormat, not literally "now", more like "soon"
[09:38:26] (FWIW, splitting the bot traffic off would make life easier)
[09:38:56] Emperor: are you enjoying #wikimedia-operations? :)
[09:39:16] DBA, Infrastructure-Foundations, Recommendation-API, SRE, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (Marostegui) Thank you all for the fast replies!
[09:42:47] /o\
[09:43:35] Emperor: fwiw, i don't read most of the traffic in #-operations. it's a very indiscriminate firehose
[09:48:59] I do! It is enjoyable!
[09:50:03] ^ all the evidence you need that marostegui is broken
[09:56:22] jynus: good timing, I completely forgot about this and it came up yesterday. I'm happy for us to go ahead at any moment.
[09:56:43] DBA, Toolhub, serviceops, User-bd808: Discuss database needs with the DBA team - https://phabricator.wikimedia.org/T271480 (JMeybohm) @Marostegui please find the up to date Pod IP ranges at https://netbox.wikimedia.org/search/?q=kubernetes+pod&obj_type=#prefixes
[09:57:02] DBA, serviceops, Patch-For-Review, User-fgiunchedi, cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (Marostegui) Open→Resolved The proxies were failed over, and the old active ones got puppet enabled + run a...
[09:57:28] I'm considering spending a little while tweaking ERC to ignore wikibugs from the POV of "which channels have stuff to read in"
[09:58:10] DBA, Toolhub, serviceops, User-bd808: Discuss database needs with the DBA team - https://phabricator.wikimedia.org/T271480 (Marostegui) Thanks - 10.64.% and 10.192.% should work then
[09:58:39] sobanski I am not sure we could do easily all of the wishlist, but we should at the very least migrate the main human conversation channel there and announce it to other sres
[10:04:26] Sounds good to me. Everyone happy with -firehose for the Wikibugs channel?
[10:05:12] let me start writing a WIP patch for that, and people can comment there
[10:06:45] I can do the Wikibugs patch, I'd just rather do it once :)
[10:06:46] DBA, Toolhub, serviceops, User-bd808: Discuss database needs with the DBA team - https://phabricator.wikimedia.org/T271480 (Marostegui) Recap - @bd808 please let me know if this looks good: cluster: `m5` db name: `toolhub` entry point: `m5-master.eqiad.wmnet` db users: * `toolhub_admin` Grants:...
[10:06:53] sobanski: 👍
[10:06:55] ok, not touching it then
[10:07:30] I have no preference re channel names, so you people decide
[10:07:37] marostegui: Emperor are you OK with the channel name (#wikimedia-data-persistence-firehose)?
[10:07:49] sobanski: yep, no issues
[10:09:16] Emperor: Full disclosure: IRC bot config changes are the only patches I ever make. If you see me touching other code, run for the hills.
[10:09:24] ha ha
[10:10:23] not true, I think I "made" you deploy patches of something related to backups?
[10:11:35] evidence: https://gerrit.wikimedia.org/r/c/operations/puppet/+/663570
[10:11:48] sobanski: -firehose is fine with me, thanks
[10:11:54] https://usercontent.irccloud-cdn.com/file/tfpSznw3/image.png
[10:12:03] :D
[10:17:39] btw, some channels are named -feed (like cloud at least), so while I have nothing against -firehose that's an option if you want consistency
[10:17:54] -feed also seems fine
[10:18:02] +1 (naming things is hard)
[10:19:53] i see references to #mediawiki-feed and #wikidata-feed
[10:19:59] Feed works for me too (and it's shorter)
[10:20:02] 👍
[10:20:24] (i mainly just didn't want 'bot' in the name, as some stuff may not be from bots)
[10:21:38] sobanski: if you want to configure irc bots, look at https://meta.wikimedia.org/wiki/IRC/Bots/ircservserv :P
[10:29:03] DBA, Infrastructure-Foundations, Recommendation-API, SRE, and 3 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (Marostegui)
[10:36:44] jynus: following up on your question from yesterday, the host decommissioning runbook now has a section on Orchestrator removal (thanks to kormat for walking me through it): https://wikitech.wikimedia.org/w/index.php?title=MariaDB%2FDecommissioning_a_DB_Host&type=revision&diff=1920427&oldid=1914851
[10:37:40] it can also be done from the GUI too btw (only admins can)
[10:40:29] thanks, sobanski
[11:18:22] should we add it to the decom cookbook?
[11:40:34] volans: possibly, if the order of operations won't trigger automatic re-discovery at that point. Should I create a task to discuss this?
[11:45:46] sure, that would be helpful. I guess there will be a way to prevent re-discovery :)
[12:47:47] volans: do I tag it with #sre-tools?
[13:16:43] submitted my first gerrit review request
[13:20:06] \o/
[13:21:17] sobanski: that would do it, thx
[13:35:35] DBA, Commons, MediaWiki-File-management, MW-1.37-notes (1.37.0-wmf.14; 2021-07-12), and 4 others: Address "image" table capacity problems by storing pdf/djvu text outside file metadata - https://phabricator.wikimedia.org/T275268 (Ladsgroup) An update: This clean up will likely be finished (for pd...
[14:39:09] DBA, Toolhub, serviceops, User-bd808: Discuss database needs with the DBA team - https://phabricator.wikimedia.org/T271480 (bd808) >>! In T271480#7255015, @Marostegui wrote: > Recap - @bd808 please let me know if this looks good: > > cluster: `m5` > db name: `toolhub` > entry point: `m5-master.e...
[16:25:33] DBA, SRE, ops-eqiad: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (Cmjohnson) The DIMM has arrived, the server will need to be taken offline for a few minutes to swap the DIMM.
[16:26:16] DBA, DC-Ops, SRE, ops-eqiad: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (Cmjohnson) The DIMM arrived, is it safe to turn the server off and swap the DIMM?
[16:40:41] DBA, DC-Ops, SRE, ops-eqiad: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (Marostegui) @Cmjohnson I can do that now, let me know if that works. If not, just let me know when it would work for you and I will get the server offline for you.
[16:40:47] DBA, SRE, ops-eqiad: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (Marostegui) @Cmjohnson I can do that now, let me know if that works. If not, just let me know when it would work for you and I will get the server offline for you.
[16:45:48] marostegui: the drift reports are updated now https://drift-tracker.toolforge.org/report/core/ much better than they used to be, but still a long way to go
[16:49:38] DBA, ops-eqiad: Degraded RAID on db1175 - https://phabricator.wikimedia.org/T287137 (Cmjohnson) @marostegui the disk has been swapped but it appears to have been removed. You will need to add back to the raid configuration. Resolve this task after you restore the raid config.
[16:49:57] DBA, SRE, ops-eqiad: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (Cmjohnson) @marostegui yes please
[16:50:19] DBA, DC-Ops, SRE, ops-eqiad: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (Cmjohnson) @marostegui yes please
[16:52:11] DBA, DC-Ops, SRE, ops-eqiad: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (Marostegui) @Cmjohnson I just realised that this host is unreachable, so you can proceed with it anytime and power it back on when you are done. Thanks
[16:52:51] DBA, SRE, ops-eqiad: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (Marostegui) @Cmjohnson host off - you can proceed as needed
[17:04:30] DBA, SRE, ops-eqiad: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (Cmjohnson) Open→Resolved DIMM A3 was replaced and the log was cleared.
[17:12:06] DBA, SRE, ops-eqiad: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (Marostegui) Memory looks good now. This host needs to be recloned - I will do that tomorrow Thanks Chris
[17:16:28] DBA, DC-Ops, SRE, ops-eqiad: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (Cmjohnson) Open→Resolved @marostegui the DIMM was replaced, log cleared and powered on. This should resolve your issue
[17:22:06] DBA, DC-Ops, SRE, ops-eqiad: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (Marostegui) Resolved→Open @Cmjohnson it seems that the host isn't reachable - could you take a look to see if there's any error preventing it from booting up? Thanks!
[17:33:15] DBA, DC-Ops, SRE, ops-eqiad: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (Marostegui) The host is now up and the memory is ok - thanks! This host needs recloning - will do it tomorrow and then close the task Thanks for your help Chris
[19:23:15] PROBLEM - MariaDB sustained replica lag on db2132 is CRITICAL: 10.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[19:25:09] RECOVERY - MariaDB sustained replica lag on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[19:55:12] kormat: thoughts on https://phabricator.wikimedia.org/T286226#7256737?