[01:52:51] legoktm: there are plans to remove those groups
[01:52:58] these days they are more of hints
[01:53:02] hmm ok
[01:53:20] T263127
[01:53:20] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[01:53:26] so "send DPL query traffic to watchlist replicas" is not really a feasible solution
[01:53:50] what about vslow?
[01:54:04] vslow is built for sandboxing
[01:56:40] will that continue to exist?
[01:57:07] oh, I read the ticket now
[01:57:17] writing a patch in a moment
[04:12:21] 10DBA, 10SRE, 10ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (10Marostegui) Yes please
[04:29:53] legoktm: but it wouldn't matter, if we overload that host I assume MW will go for a different random one within the section?
[04:33:36] Ah sorry, I just saw the ticket, I will comment there
[05:01:52] marostegui: I assume overloading vslow first is a little better than overloading a normal replica first
[05:02:00] still bad, but a little less bad
[05:04:14] legoktm: yes, but my point is, once that host is unavailable, the traffic will shift to the next one, right?
[05:04:20] yes
[05:04:21] so it would just give us a few seconds or minutes
[05:04:26] pretty much
[05:04:53] Then I think we need to fix the two things I commented on the ticket first before enabling it back (the query, and the amount of allowed queries)
[05:05:16] right, vslow is not by any means a solution, just a very tiny band-aid
[05:05:42] and tbh I don't think re-enabling on ruwikinews is ever going to happen
[05:05:43] yep sure, but it would just give us a few seconds or a few minutes tops
[05:05:53] legoktm: ah ok, then I am fine XD
[05:06:15] it's still enabled on some 300+ other wikis though
[05:06:55] see https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/708374/
[05:08:04] ah I see, so far we've only had issues with ruwikinews but it is indeed a potential tickling bomb
[05:08:08] so +1 to that change
[05:18:29] marostegui: "potential tickling bomb" would be an understatement :D
[05:34:27] hahahaha
[05:34:28] indeed
[05:34:32] it has exploded twice
[05:35:58] 10DBA, 10Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (10Marostegui) db1129 came back clean
[05:36:10] 10DBA, 10Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (10Marostegui)
[07:50:00] I'm making some queries against the production pc replica. If it causes issues, let me know
[07:56:26] you can use pc2010 if you like, as that is a spare host for pc1
[07:56:37] or any of the eqiad hosts, which have no traffic
[08:00:59] good point
[08:01:02] I'll mess around
[08:01:19] around found a terrible issue and I haven't even started yet
[08:01:26] *already
[08:12:34] I will research "Last snapshot for s2 at eqiad (db1102.eqiad.wmnet:3312) taken on 2021-07-27 20:52:07 is 1048 GB, but previous one was 882 GB, a change of 18.8%"
[08:12:50] maybe some tables are uncompressed
[08:19:02] it could be the plwiki logging clean up, but that sounds too much no? Amir1 ?
[08:19:25] that would reduce it by 10GB ish
[08:19:26] note that it grew, it didn't shrink
[08:19:33] this is growth
[08:19:39] I am afraid the message may not be clear
[08:19:59] a positive/negative sign in the percentage would be awesome :D
[08:20:12] 10DBA, 10Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1122.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202107280820_mar...
[08:20:13] maybe I can change it to be +X% or -X%, or change "change" to "reduction/growth"
[08:20:13] jynus: I am reimaging s2 master now btw
[08:20:21] yeah, a +- would be great
[08:21:02] I have to use the absolute number to compare with the threshold, but maybe I can print the original number with sign
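(Editor's note: a minimal sketch of the signed-percentage idea discussed just above. This is not the actual backup-check code; the function name, the 15% threshold and the message wording are assumptions made for illustration — only the 882 GB / 1048 GB figures come from the quoted alert.)

```python
def describe_size_change(previous_gb: float, current_gb: float,
                         threshold_pct: float = 15.0) -> str:
    """Describe a snapshot size change with an explicit sign.

    threshold_pct is an assumed value, not the real monitoring threshold.
    """
    change_pct = (current_gb - previous_gb) / previous_gb * 100
    # The alarm decision still compares the absolute value against the threshold...
    alarming = abs(change_pct) >= threshold_pct
    # ...but the printed percentage keeps its sign, so growth vs. reduction is obvious.
    direction = "growth" if change_pct >= 0 else "reduction"
    msg = (f"snapshot is {current_gb} GB, but previous one was {previous_gb} GB, "
           f"a {direction} of {change_pct:+.1f}%")
    return ("ALERT: " if alarming else "") + msg

# Figures from the message quoted above:
print(describe_size_change(882, 1048))
# ALERT: snapshot is 1048 GB, but previous one was 882 GB, a growth of +18.8%
```

Keeping abs() for the alarm and the signed value for the message preserves the existing threshold semantics while making the direction of the change explicit.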
[08:39:51] marostegui: two questions: what are the second and the third pc clusters for? is it consistent hashing? it doesn't look like it
[08:40:12] Amir1: I am not sure I get your question
[08:40:25] the second: Is there a ticket for recent issues of parsercache so I can have a place to dump all of my questions
[08:40:30] *findings
[08:40:35] Amir1: yeah, let me look for it
[08:40:38] https://dbtree.wikimedia.org/
[08:40:45] "pc2" cluster
[08:40:56] 10DBA, 10Infrastructure-Foundations, 10SRE, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10MoritzMuehlenhoff)
[08:40:57] it has only one replica
[08:41:17] 10DBA, 10Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1122.eqiad.wmnet'] ` and were **ALL** successful.
[08:41:21] Yeah, there's only one replica per pcX (pc1 has 2 because one is a floating spare that can be placed on any of the other pcX)
[08:41:57] 10DBA, 10Infrastructure-Foundations, 10SRE, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10MoritzMuehlenhoff)
[08:42:04] okay, so it's consistent hashing
[08:42:08] yes
[08:42:19] the replicas just replicate from their masters
[08:42:41] weird that each has 256 tables, which is also consistent hashing
[08:42:54] maybe to avoid hitting limits of tables
[08:42:59] doesn't matter
[08:43:04] yeah it's not a ring replication
[08:43:21] Amir1: it is, but only within pc1, pc2 and pc3
[08:43:25] not amongst them
[08:43:47] 10DBA, 10Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (10Marostegui) db1122 (s2 eqiad master) has been reimaged, I am now checking its tables before starting replication
[08:44:44] that's what I meant. I don't know the meaning of the word "ring replication" properly :D
[08:45:02] haha yeah
[08:45:14] so pc1007 replicates from pc2007 and vice versa
[08:45:29] Amir1: can you see this? https://orchestrator.wikimedia.org/web/cluster/alias/pc1
[08:45:39] nope
[08:45:43] :(
[08:45:54] Ok, we need to work on opening that up soon
[08:46:05] At least for people with NDA
[08:47:06] Amir1: https://phabricator.wikimedia.org/P16920
[08:47:18] you can see pc1007 replicates from pc2007 and the other way around
[08:49:47] ahaaa, I see what you're referring to
[08:49:59] you mean ring in mysql topology
[08:50:39] I meant in the sense of cassandra. it's a bit different
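(Editor's note: a rough sketch of the key placement described above — a key is hashed to one of the three parsercache sections and to one of the 256 shard tables on that host. This is not MediaWiki's actual SqlBagOStuff code; the hash function, the naming and the example key are assumptions for illustration.)

```python
import hashlib

# Layout as described above: three parsercache sections, 256 shard tables each.
PC_SECTIONS = ["pc1", "pc2", "pc3"]
TABLES_PER_SERVER = 256

def locate_parsercache_key(key: str) -> tuple[str, str]:
    """Map a cache key to (section, shard table). Simplified sketch, not MediaWiki code."""
    digest = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)
    section = PC_SECTIONS[digest % len(PC_SECTIONS)]
    table = f"pc{digest % TABLES_PER_SERVER:03d}"
    return section, table

# Hypothetical example key:
print(locate_parsercache_key("ruwiki:pcache:idhash:12345-0!canonical"))
# Every client computes the same mapping, so no replication is needed across
# sections; within a section the two masters replicate from each other
# (pc1007 <-> pc2007), which is the "ring" mentioned above.
```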
[08:50:45] but yeah, I fully get it
[08:50:58] btw, it's a bit scary that it's statement based replication
[08:51:13] but not too much
[08:51:14] yeah, but we don't care about consistency
[08:51:23] in fact, it has saved us a lot that we use statement
[08:51:28] and it has helped the operational work a lot
[09:06:02] 10DBA, 10Infrastructure-Foundations, 10SRE, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10jcrespo)
[09:07:39] 10DBA, 10Commons, 10MediaWiki-File-management, 10MW-1.37-notes (1.37.0-wmf.14; 2021-07-12), and 4 others: Address "image" table capacity problems by storing pdf/djvu text outside file metadata - https://phabricator.wikimedia.org/T275268 (10Ladsgroup) ???
[12:22:05] 10DBA, 10serviceops, 10User-fgiunchedi, 10cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (10Marostegui) dbproxy2* can be done anytime dbproxy1018 and dbproxy1019 are owned by the cloud services team. The other dbproxies hosts ar...
[12:33:03] PROBLEM - MariaDB sustained replica lag on db2077 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2077&var-port=9104
[12:34:13] RECOVERY - MariaDB sustained replica lag on db2077 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2077&var-port=9104
[14:31:52] 10DBA, 10serviceops, 10User-fgiunchedi, 10cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (10fgiunchedi) Thank you @Marostegui for the info! Yes this can wait until next week or the week after, no problem.
[19:34:33] 10DBA, 10Infrastructure-Foundations, 10SRE, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Legoktm)
[20:21:27] 10DBA, 10SRE, 10ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (10Jclark-ctr) disk has been replaced @Marostegui
[20:21:34] 10DBA, 10SRE, 10ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (10Jclark-ctr) 05Open→03Resolved