[01:52:51] legoktm: there are plans to remove those groups
[01:52:58] these days they are more of hints
[01:53:02] hmm ok
[01:53:20] T263127
[01:53:20] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[01:53:26] so "send DPL query traffic to watchlist replicas" is not really a feasible solution
[01:53:50] what about vslow?
[01:54:04] vslow is built for sandboxing
[01:56:40] will that continue to exist?
[01:57:07] oh, I read the ticket now
[01:57:17] writing a patch in a moment
[04:12:21] 10DBA, 10SRE, 10ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (10Marostegui) Yes please
[04:29:53] legoktm: but it wouldn't matter, if we overload that host I assume MW will go for a different random one within the section?
[04:33:36] Ah sorry, I just saw the ticket, I will comment there
[05:01:52] marostegui: I assume overloading vslow first is a little better than overloading a normal replica first
[05:02:00] still bad, but a little less bad
[05:04:14] legoktm: yes, but my point is, once that host is unavailable, the traffic will shift to the next one, right?
[05:04:20] yes
[05:04:21] so it would just give us a few seconds or minutes
[05:04:26] pretty much
[05:04:53] Then I think we need to fix the two things I commented on the ticket first before enabling it back (the query, and the amount of allowed queries)
[05:05:16] right, vslow is not by any means a solution, just a very tiny band-aid
[05:05:42] and tbh I don't think re-enabling on ruwikinews is ever going to happen
[05:05:43] yep sure, but it would just give us a few seconds or a few minutes tops
[05:05:53] legoktm: ah ok, then I am fine XD
[05:06:15] it's still enabled on some 300+ other wikis though
[05:06:55] see https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/708374/
[05:08:04] ah I see, so far we've only had issues with ruwikinews but it is indeed a potential tickling bomb
[05:08:08] so +1 to that change
[05:18:29] marostegui: "potential tickling bomb" would be an understatement :D
[05:34:27] hahahaha
[05:34:28] indeed
[05:34:32] it has exploded twice
[05:35:58] 10DBA, 10Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (10Marostegui) db1129 came back clean
[05:36:10] 10DBA, 10Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (10Marostegui)
[07:50:00] I'm making some queries against the production pc replica. If it causes issues, let me know
[07:56:26] you can use pc2010 if you like, as that is a spare host for pc1
[07:56:37] or any of the eqiad hosts, which have no traffic
[08:00:59] good point
[08:01:02] I'll mess around
[08:01:19] around found a terrible issue and I haven't even started yet
[08:01:26] *already
[08:12:34] I will research "Last snapshot for s2 at eqiad (db1102.eqiad.wmnet:3312) taken on 2021-07-27 20:52:07 is 1048 GB, but previous one was 882 GB, a change of 18.8%"
[08:12:50] maybe some tables are uncompressed
[08:19:02] it could be the plwiki logging clean up, but that sounds too much no? Amir1 ?
[08:19:25] that would reduce it by 10GB ish
[08:19:26] note that it grew, it didn't shrink
[08:19:33] this is growth
[08:19:39] I am afraid the message may not be clear
[08:19:59] a positive/negative sign in the percentage would be awesome :D
[08:20:12] 10DBA, 10Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1122.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202107280820_mar...
[08:20:13] maybe I can change it to be +X% or -X%, or change "change" to "reduction/growth"
[08:20:13] jynus: I am reimaging s2 master now btw
[08:20:21] yeah, a +- would be great
[08:21:02] I have to use the absolute number to compare with the threshold, but maybe I can print the original number with sign
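(Editor's note: a minimal sketch of the signed-percentage idea discussed just above. This is not the actual backup-check code; the function name, the 15% threshold and the message wording are assumptions made for illustration — only the 882 GB / 1048 GB figures come from the quoted alert.)

```python
def describe_size_change(previous_gb: float, current_gb: float,
                         threshold_pct: float = 15.0) -> str:
    """Describe a snapshot size change with an explicit sign.

    threshold_pct is an assumed value, not the real monitoring threshold.
    """
    change_pct = (current_gb - previous_gb) / previous_gb * 100
    # The alarm decision still compares the absolute value against the threshold...
    alarming = abs(change_pct) >= threshold_pct
    # ...but the printed percentage keeps its sign, so growth vs. reduction is obvious.
    direction = "growth" if change_pct >= 0 else "reduction"
    msg = (f"snapshot is {current_gb} GB, but previous one was {previous_gb} GB, "
           f"a {direction} of {change_pct:+.1f}%")
    return ("ALERT: " if alarming else "") + msg

# Figures from the message quoted above:
print(describe_size_change(882, 1048))
# ALERT: snapshot is 1048 GB, but previous one was 882 GB, a growth of +18.8%
```

Keeping abs() for the alarm and the signed value for the message preserves the existing threshold semantics while making the direction of the change explicit.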
[08:39:51] marostegui: two questions: what are the second and the third pc clusters for? is it consistent hashing? it doesn't look like it
[08:40:12] Amir1: I am not sure I get your question
[08:40:25] the second: Is there a ticket for recent issues of parsercache so I can have a place to dump all of my questions
[08:40:30] *findings
[08:40:35] Amir1: yeah, let me look for it
[08:40:38] https://dbtree.wikimedia.org/
[08:40:45] "pc2" cluster
[08:40:56] 10DBA, 10Infrastructure-Foundations, 10SRE, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10MoritzMuehlenhoff)
[08:40:57] it has only one replica
[08:41:17] 10DBA, 10Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1122.eqiad.wmnet'] ` and were **ALL** successful.
[08:41:21] Yeah, there's only one replica per pcX (pc1 has 2 because one is a floating spare that can be placed on any of the other pcX)
[08:41:57] 10DBA, 10Infrastructure-Foundations, 10SRE, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10MoritzMuehlenhoff)
[08:42:04] okay, so it's consistent hashing
[08:42:08] yes
[08:42:19] the replicas just replicate from their masters
[08:42:41] weird that each has 256 tables, which is also consistent hashing
[08:42:54] maybe to avoid hitting limits of tables
[08:42:59] doesn't matter
[08:43:04] yeah it's not a ring replication
[08:43:21] Amir1: it is, but only within pc1, pc2 and pc3
[08:43:25] not amongst them
[08:43:47] 10DBA, 10Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (10Marostegui) db1122 (s2 eqiad master) has been reimaged, I am now checking its tables before starting replication
[08:44:44] that's what I meant. I don't know the meaning of the word "ring replication" properly :D
[08:45:02] haha yeah
[08:45:14] so pc1007 replicates from pc2007 and vice versa
[08:45:29] Amir1: can you see this? https://orchestrator.wikimedia.org/web/cluster/alias/pc1
[08:45:39] nope
[08:45:43] :(
[08:45:54] Ok, we need to work on opening that up soon
[08:46:05] At least for people with NDA
[08:47:06] Amir1: https://phabricator.wikimedia.org/P16920
[08:47:18] you can see pc1007 replicates from pc2007 and the other way around
[08:49:47] ahaaa, I see what you're referring to
[08:49:59] you mean ring in mysql topology
[08:50:39] I meant in the sense of cassandra. it's a bit different
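(Editor's note: a rough sketch of the key placement described above — a key is hashed to one of the three parsercache sections and to one of the 256 shard tables on that host. This is not MediaWiki's actual SqlBagOStuff code; the hash function, the naming and the example key are assumptions for illustration.)

```python
import hashlib

# Layout as described above: three parsercache sections, 256 shard tables each.
PC_SECTIONS = ["pc1", "pc2", "pc3"]
TABLES_PER_SERVER = 256

def locate_parsercache_key(key: str) -> tuple[str, str]:
    """Map a cache key to (section, shard table). Simplified sketch, not MediaWiki code."""
    digest = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)
    section = PC_SECTIONS[digest % len(PC_SECTIONS)]
    table = f"pc{digest % TABLES_PER_SERVER:03d}"
    return section, table

# Hypothetical example key:
print(locate_parsercache_key("ruwiki:pcache:idhash:12345-0!canonical"))
# Every client computes the same mapping, so no replication is needed across
# sections; within a section the two masters replicate from each other
# (pc1007 <-> pc2007), which is the "ring" mentioned above.
```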
[08:50:45] but yeah, I fully get it
[08:50:58] btw, it's a bit scary that it's statement based replication
[08:51:13] but not too much
[08:51:14] yeah, but we don't care about consistency
[08:51:23] in fact, it has saved us a lot that we use statement
[08:51:28] and it has helped the operational work a lot
[09:06:02] 10DBA, 10Infrastructure-Foundations, 10SRE, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10jcrespo)
[09:07:39] 10DBA, 10Commons, 10MediaWiki-File-management, 10MW-1.37-notes (1.37.0-wmf.14; 2021-07-12), and 4 others: Address "image" table capacity problems by storing pdf/djvu text outside file metadata - https://phabricator.wikimedia.org/T275268 (10Ladsgroup) ???
[12:22:05] 10DBA, 10serviceops, 10User-fgiunchedi, 10cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (10Marostegui) dbproxy2* can be done anytime dbproxy1018 and dbproxy1019 are owned by the cloud services team. The other dbproxies hosts ar...
[12:33:03] PROBLEM - MariaDB sustained replica lag on db2077 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2077&var-port=9104
[12:34:13] RECOVERY - MariaDB sustained replica lag on db2077 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2077&var-port=9104
[14:31:52] 10DBA, 10serviceops, 10User-fgiunchedi, 10cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (10fgiunchedi) Thank you @Marostegui for the info! Yes this can wait until next week or the week after, no problem.
[19:34:33] 10DBA, 10Infrastructure-Foundations, 10SRE, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Legoktm)
[20:21:27] 10DBA, 10SRE, 10ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (10Jclark-ctr) disk has been replaced @Marostegui
[20:21:34] 10DBA, 10SRE, 10ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (10Jclark-ctr) 05Open→03Resolved