[14:10:01] PROBLEM - MariaDB sustained replica lag on s2 on db2138 is CRITICAL: 13.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2138&var-port=13312
[14:10:45] PROBLEM - MariaDB sustained replica lag on s2 on db2125 is CRITICAL: 8.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2125&var-port=9104
[14:11:21] Emperor: making sure we're not waiting on one another, do you have time/bandwidth to take over T288458 ? thanks!
[14:11:22] T288458: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458
[14:11:25] PROBLEM - MariaDB sustained replica lag on s2 on db2095 is CRITICAL: 12.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2095&var-port=13312
[14:12:05] RECOVERY - MariaDB sustained replica lag on s2 on db2138 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2138&var-port=13312
[14:12:49] RECOVERY - MariaDB sustained replica lag on s2 on db2125 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2125&var-port=9104
[14:13:31] RECOVERY - MariaDB sustained replica lag on s2 on db2095 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2095&var-port=13312
[14:15:22] oh, that's me, sorry folks
[14:15:31] i should put in downtimes
[14:15:39] tut tut
[14:24:32] finally, i know how elukey feels all the time
[14:29:21] lulz, you're welcome
[15:21:00] godog Emperor I just got a bunch of emails from a Pontoon Thanos host. Expected?
[15:22:11] sobanski: not sure, can you forward them over ?
[15:22:51] Forwarded.
[15:22:59] marostegui: what have you done to poor s8 in codfw?
[15:23:25] reimaged the master
[15:23:32] you monster
[15:23:36] checking its tables now, will be finished tomorrow
[15:23:48] sobanski: thank you, not expected no, I'll fix it
[15:27:23] godog: thanks :)
[15:34:11] godog: apropos T288458, is codfw currently pooled? the sre.discovery.service-route cookbook doesn't work for status (I think someone was working on a fix, but I've lost the CR)
[15:34:11] T288458: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458
[15:35:07] ah https://gerrit.wikimedia.org/r/730692
[15:36:08] Emperor: yeah codfw is pooled atm, I was looking at https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&from=now-3h&to=now-1m&var-DC=codfw&var-prometheus=codfw%20prometheus%2Fops&refresh=1m
[15:37:13] Is the plan to gradually trickle more weight onto these nodes, then, or to depool codfw swift instead?
[15:38:35] yeah less impactful to depool codfw and then bump the weight and repool once rebalances have finished
[15:39:47] eqiad depooled last week was putting strain on the codfw/eqiad bandwidth when one of the links failed; the link has been fixed, and codfw depooled iirc pulls in less bandwidth anyway
[15:47:59] OK, so I can't use the cookbook to check what's pooled, is there a rune for confctl to check? confctl --tags --action get all wants dc, cluster, and service (and I'm not sure what should be in "cluster" here)
[15:49:03] [I can click my way to https://config-master.wikimedia.org/pybal/eqiad/swift-https et al, but that's less ideal]
[15:50:49] $ confctl --object-type discovery select 'dnsdisc=swift' get
[15:50:50] {"codfw": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=swift"}
[15:50:53] {"eqiad": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=swift"}
[15:51:31] Emperor: ^
[15:52:37] 'dnsdisc=swift.*' will also give you the -ro and -rw ones
[15:53:08] this is for the discovery part of it
[15:53:14] then if you want to check single hosts
[15:54:24] confctl select 'cluster=swift' get
[15:54:27] Ah, so codfw is already depooled for swift-rw
[15:54:50] and you can add to the selection dc= or service=, comma separated
[15:55:01] to fine-tune the selection
[15:55:28] but not for discovery? `confctl --object-type discovery select 'dc=codfw,dnsdisc=swift.*' get` returns me the eqiad ones too
[15:55:55] there the dc is used as key, so you have to use name=codfw
[15:56:11] 'dnsdisc=swift.*,name=codfw'
[15:56:24] ah, thanks. Is this all documented somewhere?
[15:56:58] probably in https://wikitech.wikimedia.org/wiki/Conftool I don't guarantee its up-to-dateness though ;)
[15:57:01] And so to depool codfw-swift, I'd do `confctl --object-type discover select 'name=codfw,dnsdisc=swift.*' depool` ?
[15:57:02] blame j.o..e :-P
[15:57:25] what's the issue with the cookbook? we should fix that IMHO
[15:57:48] Emperor: not exactly
[15:57:48] see
[15:57:49] https://wikitech.wikimedia.org/wiki/DNS/Discovery#How_to_manage_a_DNS_Discovery_service
[15:58:28] volans: see https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/730692 (TL;DR - it doesn't cope with CNAMEs, which the swift discovery records use AIUI)
[15:58:55] ah yes that one, I even commented :D
[15:58:56] sorry
[15:59:09] NP, thanks for your help
[15:59:40] I think, then: `confctl --object-type discovery select 'name=codfw,dnsdisc=swift.*' set/pooled=false` (but tomorrow)
[16:00:41] yes, if you want to depool them all
[16:03:31] godog: I presume we want to depool them...
[16:09:11] godog: (and that it's expected that codfw wasn't pooled for swift-rw)?
[16:10:15] Emperor: in the meeting now, but "yes" to both
[16:11:34] Emperor: in the meeting too, can elaborate later but look at the active_active key in:
[16:11:37] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/service.yaml#2365
[16:12:27] that is basically https://wikitech.wikimedia.org/wiki/DNS/Discovery#Active/passive_services
[16:24:10] Thanks :)
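For reference, a minimal consolidated sketch of the confctl workflow discussed above, assembled from the commands quoted in the conversation; the swift dnsdisc pattern and selectors are taken as shown there and not independently re-verified:

  # Check the pooled state of the swift discovery records ('.*' also matches the -ro/-rw records):
  $ confctl --object-type discovery select 'dnsdisc=swift.*' get

  # Restrict to one datacentre; for discovery objects the DC is the object name, so use name= rather than dc=:
  $ confctl --object-type discovery select 'name=codfw,dnsdisc=swift.*' get

  # Check the individual backend hosts instead (dc= or service= can be added, comma-separated):
  $ confctl select 'cluster=swift' get

  # Depool the codfw side of the swift discovery records (set/pooled=false rather than depool, per the DNS/Discovery docs linked above):
  $ confctl --object-type discovery select 'name=codfw,dnsdisc=swift.*' set/pooled=false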