[07:29:02] (I've not yet downtimed ms-be2045 'cos it does still have 2 disks in prod, though I think the plan is to take it entirely out of service this morning)
[07:52:40] greetings, indeed
[07:56:36] so basically I'm for: stop puppet, "wipefs -a" on all non-os partitions, power off
[07:59:14] Emperor: FYI this would mean that if it doesn't get fixed within 2 weeks it will disappear from PuppetDB and hence Icinga (due to exported resources) and start alerting on a Netbox report. See also the alert boxes in https://wikitech.wikimedia.org/wiki/Puppet#Maintenance
[07:59:42] and we should set it as failed in Netbox too btw
[08:02:27] makes sense to me, thanks volant
[08:02:29] volans
[08:02:41] I'll proceed
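
A minimal sketch of the decommission steps proposed at 07:56, assuming the usual wrapper scripts are available; the device list is hypothetical since the host's actual disk layout isn't in the log:

    # sketch only - adjust the device list to the host's real non-OS partitions
    sudo disable-puppet "ms-be2045 being taken out of service"   # or plain: puppet agent --disable
    for dev in /dev/sd{c..n}1; do        # hypothetical list of swift data partitions
        sudo wipefs -a "$dev"            # clear filesystem signatures so they can't be remounted
    done
    sudo poweroff
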
[08:10:03] godog: does some ring maintenance need doing too?
[08:10:57] (like https://wikitech.wikimedia.org/wiki/Swift/How_To#Remove_a_failed_storage_node_from_the_cluster )
[08:16:01] Emperor: good question, yeah given how long the host will likely stay offline I think we should remove it from the ring
[08:16:24] for single failed disks we usually don't bother since turnaround is ~1w
[08:17:48] [I don't have a feel for how long getting the h/w looked at is likely to take]
[08:18:09] FYI, there were two DSAs for haproxy, but they are of no concern for our DB setups, they are a) specific to HTTP and b) only affect haproxy 2.x (and we don't have any bullseye haproxy installs yet)
[08:18:18] uf, something is weird, backup speed is down to 7.5 backups/s in the last hour
[08:19:16] Emperor: normally papaul is pretty on it so the bottleneck is the vendor, though he's on parental leave ATM
[08:21:05] which is weird because network is at an all-time high, maybe we hit a block of long videos?
[08:21:11] godog: am I right that the swift ring changes can be done on any system (e.g. my laptop), and only go into effect once I run make deploy
[08:22:47] I think it is not videos, just very large TIFFs
[08:23:09] so all good
[08:23:18] Emperor: correct yeah, I do it from my bullseye laptop with the 'swift' package installed
[08:23:32] jynus: thank you, good to know
[08:23:48] godog: cool, shall I try the ring surgery for removing ms-be2045 then, and you can review the CR?
[08:24:00] Emperor: for sure! thank you
[08:24:34] I'll power off ms-be2045 in the meantime and set it as failed in netbox
[08:28:55] for additional context and the record, we're doing ring changes in big batches since it is a manual operation, ideally (and that's what "swiftstack" does) the weights are shifted gradually and automatically in the background to not move too many partitions around at the same time
[08:29:54] [one of the aims of automating ring management is making changes more gradually, also?]
[08:30:04] in practice it's been fine even though a big weight shift does cause increased latency when user traffic is also on the cluster, like codfw now
[08:30:46] Emperor: precisely yeah, we'd set the desired weight and automation would make sure to converge to that weight gradually (at least that's how I think about it in my mind)
[08:32:53] godog: https://gerrit.wikimedia.org/r/c/operations/software/swift-ring/+/720917
[08:33:12] cheers, taking a look
[08:35:47] LGTM!
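
A rough sketch of the ring surgery itself, following the "Remove a failed storage node" procedure linked at 08:10; the search value and the per-ring loop are illustrative assumptions, and the real change is the swift-ring CR linked at 08:32:

    # in a checkout of operations/software/swift-ring; the search value depends on how
    # the devices were registered (IP vs hostname), so treat "ms-be2045" as a placeholder
    for ring in account container object; do
        swift-ring-builder ${ring}.builder search ms-be2045      # confirm which devices it holds
        swift-ring-builder ${ring}.builder remove ms-be2045      # drop them all from the ring
        swift-ring-builder ${ring}.builder rebalance
    done

With the gradual approach discussed at 08:28, the remove step would instead be a series of set_weight calls stepping the host's weight down towards zero before the final removal.
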
[08:39:06] at deploy time I generally also specify TARGETS=codfw-prod (in this case) but it isn't compulsory
[08:39:21] Is there any CI that runs on that? [I don't have a +2 option in gerrit]
[08:40:20] no CI no, I'm looking into why you can't +2 though
[08:41:50] If I hit "reply", I get a +2 option, but normally I'd expect +2 at the top right next to "Rebase", IYSWIM
[08:42:49] ah yes I see what you mean, I'm not sure but I'd guess the +2 option here appears when CI votes too
[08:44:54] right, having CR+2 via Reply I now have UI for Verified +2, and then I will be able to merge :)
[08:45:27] * godog sighs in gerrit
[08:46:21] so yeah then make DEPLOY=puppet.eqiad.wmnet TARGETS=codfw-prod will DTRT
[08:46:22] So once submitted, I do "make deploy TARGETS=codfw-prod DESTHOST=puppet.eqiad.wmnet" from my laptop?
[08:46:32] exactly
[08:47:34] then we can either wait ~30 min for puppet to do its thing or force a run via cumin
[08:47:49] oh, bah, I have to go and faff around finding the hostkey for puppet.eqiad.wmnet
[08:48:10] there's a rune for this...
[08:48:18] hah, puppetmaster1001.eqiad.wmnet too will work
[08:48:19] Emperor: do you use the known host script?
[08:48:37] if run with the path to a local copy of the dns repo it will add those for you in the known hosts file
[08:48:39] volans: yes, but it doesn't get the service names by default, and there's some other step to update those
[08:48:51] *run it
[08:49:08] for all the CNAMEs, that is
[08:49:56] ah, yes, that seems to have DTRT
[08:51:34] I don't think this needs cumin-faff, we can just wait for puppet to do its thing?
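
Putting the deploy step together (the make invocation is quoted from the exchange above; the cumin selector and the run-puppet-agent wrapper are assumptions for illustration):

    # from the laptop, in the merged swift-ring checkout
    make deploy TARGETS=codfw-prod DESTHOST=puppet.eqiad.wmnet
    # then either wait ~30 min for the scheduled puppet runs, or force one from a cumin host:
    sudo cumin 'ms-be2*.codfw.wmnet' 'run-puppet-agent'
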
[08:51:51] +1, I'll take a break
[08:52:21] ⛾
[09:03:31] so there's a relatively small risk that the rebalance will cause quite high latency for clients, to the point of timeouts, if that happens we can shift traffic to eqiad
[09:07:23] Emperor: (I'm updating the docs) sth left to do is !log the rebalance operation and cc the ticket to keep records happy
[09:09:20] done :)
[09:09:27] cheers
[09:26:17] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&from=now-3h&to=now-1m&refresh=1m&var-DC=codfw&var-prometheus=codfw%20prometheus%2Fops looks unremarkable still
[09:28:39] indeed
[09:30:59] I saw some session*.scope unit failures for ms-be passing by, it is known but I can't find the task atm
[09:31:30] https://phabricator.wikimedia.org/T199911
[09:34:19] sigh, systemd
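
For reference, T199911 tracks those session-*.scope units ending up in a failed state; clearing them by hand looks something like the sketch below (the puppet-side workaround discussed just after this presumably automates the same idea, though that's an assumption):

    # show any failed session scopes, then clear the failed state so the unit alert stops
    systemctl --failed --all 'session-*.scope'
    sudo systemctl reset-failed
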
[09:45:17] hopefully these maps (20TB) finish uploading before mw switchover starts- it should take just 3 hours more
[09:45:42] speaking of which, I am going to run a quick warmup
[09:45:44] for around 1h
[09:46:33] godog: it seems profile::thanos::swift::backend simply adopted the toil::systemd_scope_cleanup class, but it's running a systemd release which is two years more recent. how about we drop it from the Thanos class to check whether the underlying systemd bug has been fixed?
[09:47:07] (given that Thanos bes are on Buster in contrast to the main mediastorage swift bes)
[09:49:02] moritzm: definitely +1, also I don't recall seeing that specific failure on thanos hosts, one more reason to ditch the bandaid
[09:49:30] feel free to open a task for tracking, and/or we can review patches
[10:18:00] moritzm: also FWIW we're pretty close to being able to reimage thanos-be on bullseye and the whole thanos* will be upgraded - T288937
[10:18:01] T288937: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937
[10:18:56] stupid power cuts :(
[10:27:39] oh nice!
[10:28:31] Emperor: unannounced/unexpected I take it?
[10:29:15] Yep, and while it was only out for ~10s, that knocks over the VDSL router, which takes a minute or so to restart
[10:30:10] yeah that's annoying alright
[10:31:14] moritzm: thanks! re: my last message on ferm and dhcp and network-pre, I haven't dug very deeply except for testing the bandaid/fix but if you have more insights/ideas (not urgent) I'm quite interested
[10:35:40] it's probably unrelated to the setup in pontoon/WMCS, we've seen this before with swift as well: https://phabricator.wikimedia.org/T254477
[10:36:51] is it reproducible or does it happen on every pontoon/thanos/bullseye node?
[10:39:11] yeah I could reproduce at will, I'm not sure it is the same issue though because of "query timeout" vs "network unreachable"
[10:40:16] I tested the fix on thanos-be-01.swift.eqiad1.wikimedia.cloud whereas thanos-fe-01.swift.eqiad1.wikimedia.cloud doesn't have the fix
[10:40:50] good question though re: other pontoon+bullseye nodes, I'm checking
[10:45:46] yeah can reproduce at will e.g. on ms-fe-01.swift.eqiad1.wikimedia.cloud or ms-be-01.swift.eqiad1.wikimedia.cloud
[10:46:14] to be clear: in this scenario it isn't a huge deal, puppet will start ferm on the next run and all is well
[10:56:12] so, for our ferm package we're patching in the additional Wants: on nss-lookup.target (which makes ferm start a little later, but which also allows us to rely on resolve() working fine)
[10:57:38] but possibly this isn't sufficient for how networking and name resolution is brought up in WMCS
[10:58:35] since we haven't seen this before, maybe this is caused by a change in bullseye (since I think your instance is actually one of the first (or maybe even the only one) in cloud VPS so far (IIRC the image is not yet offered by default to users))
[10:59:38] it is, since the official release date
[10:59:52] most users rely on neutron firewalling, not ferm though
[11:00:03] ah! thank you I didn't realize we're patching ferm, that explains why we're fine in production. Yeah I agree it must be due to networking ordering in wmcs
[11:01:22] majavah: ah, good to know wrt availability!
[11:01:45] godog: what's the version of the ferm package in your instance? it should use the same deb as in production
[11:02:12] moritzm: can confirm it does, 2.5.1-1+wmf1
[11:02:54] my current thinking is that if there's a simple/straightforward way to fix this we should do it, if not that's fine too since it "fixes" itself
[11:03:23] and it isn't an issue in production anyways
[11:06:13] yeah, that's the same one as in prod
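
To make the ordering fix concrete: the production ferm package carries a patch adding a Wants= on nss-lookup.target, so resolve() in the rules has working DNS by the time ferm starts. On an unpatched instance, a drop-in along these lines would be the moral equivalent (an illustration only, not the actual +wmf1 packaging change; whether an explicit After= ordering is also needed isn't stated in the log):

    # hypothetical systemd drop-in for ferm
    sudo install -d /etc/systemd/system/ferm.service.d
    printf '[Unit]\nWants=nss-lookup.target\nAfter=nss-lookup.target\n' |
        sudo tee /etc/systemd/system/ferm.service.d/ordering.conf
    sudo systemctl daemon-reload
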
[11:06:13] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 78 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[11:06:31] ok! gotta go to lunch
[11:06:45] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 42 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[11:07:05] m1 lag?
[11:07:43] seems only a glitch
[11:07:52] as in, a small spike
[11:07:57] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[11:08:17] lots of deletions
[11:08:29] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[14:22:37] DC switch is starting now on -operations
[14:28:57] * Emperor at least half-watching
[14:41:54] I have no idea what is going on, but maybe it is because I am not connected to any stream
[14:42:06] what?
[14:42:30] nothing is happening, I'm just not sure what the current status of the switchover is
[14:42:44] nothing is happening? what do you mean?
[14:42:52] just by reading -operations
[14:42:58] I am just confused, that's all
[14:43:07] Ah, the script is logging all the steps
[14:43:11] The next one is the read-only one
[14:43:48] ok, I think I "caught on"
[14:43:50] thanks
[14:44:25] this can help with the steps if you are not attached to the tmux: https://wikitech.wikimedia.org/wiki/Switch_Datacenter
[14:44:54] that I knew, it was just that I didn't know where we were, with so many comments
[14:45:16] the cookbook will log the # of the step
[14:45:18] I can see it now
[14:47:34] Anyone only seeing s1 and s8 here? https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=11&orgId=1&from=now-24h&to=now&var-site=eqiad&var-group=core&var-shard=All&var-role=All&refresh=1m
[14:47:37] Even if all is selected?
[14:47:57] marostegui: yes
[14:48:04] that's weird
[14:48:12] can someone double check?
[14:48:22] marostegui: me too, with "all" selected
[14:49:00] likewise if I check s1+s2+...s8 I only get 1 and 8
[14:49:21] Hm, if you set e.g. s2 only you get "no data"
[14:50:06] strange, the only 4 hosts failing to be scraped are the ones I mentioned to you on tickets "mine"
[14:51:02] pc and es, look good, though -which caused issues in the past
[14:58:56] marostegui: i can't see any reason why that dashboard issue is happening. thanos seems to have the data just fine
[15:03:13] can also confirm as the metrics used can be seen on the individual mysql host: https://grafana-rw.wikimedia.org/d/000000273/mysql?viewPanel=40&orgId=1&var-server=db1104&var-port=9104&from=1631620984479&to=1631631784479
[15:03:44] I just tested group: all and now it shows s1, s4, s7, s8 and tendril
[15:05:18] it could be the way aggregation happens
[15:07:22] the thing is- I wouldn't worry if it happened on all metrics- but it seems to affect only that graph :-/
[15:07:58] godog: are you available to have a look? ^
[15:08:02] yeah, which is useful especially during switchovers
[15:09:40] it is definitely something at monitoring layer, as old metrics disappeared too
[15:11:16] zarcillo-targets looks unchanged (and wouldn't explain the specific graph)
[15:12:01] kormat: in a meeting until :30
[15:13:18] wait, what does s1, s8 and tendril have in common?
[15:13:29] stretch/10.1? coincidence?
[15:14:44] marostegui: i've just manually set db1124 (test-s4) to r/w
[15:14:46] I know it is a crazy theory, but that is all I have for you
[15:14:51] kormat: <3
[15:15:28] jynus: But I think it was working fine before the switchover
[15:15:41] that is one of the reasons I say it is crazy :-)
[15:15:56] maybe something happened when we switched over services yesterday or something?
[15:16:45] I am going in the direction of "something broke on eqiad only but we didn't notice until now"
[15:18:23] yeah
[15:18:39] and I know I am not being super-useful here, sorry :-(
[15:19:14] I am going to create a test panel to experiment
[15:19:31] so you can focus on more important stuff for now
[15:20:00] PROBLEM - MariaDB sustained replica lag on es5 on es1023 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1023&var-port=9104
[15:20:40] oh
[15:21:09] seems in this case like a really bad spike
[15:21:22] RECOVERY - MariaDB sustained replica lag on es5 on es1023 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1023&var-port=9104
[15:24:08] the query, without the max only returns results from s1, s8 and tendril (14 objects in total)
[15:26:17] but the query runs fine on 10.4 (and everywhere) on the individual hosts, no?
[15:26:20] and I think all are MariaDB 10.1 ones
[15:26:59] it is not the exact same query- that's the next thing I am checking
[15:27:37] interesting
[15:27:49] if I remove quantile, it "works"
[15:27:58] so the format must have changed subtly
[15:28:00] between versions
[15:28:25] not the first time we see that :(
[15:28:43] you have a bad version at: https://grafana-rw.wikimedia.org/d/21pxVYS7z/jaimes-mysql-aggregated-copy?orgId=1
[15:29:04] good news it is now at good levels
[15:29:47] so the metric is the same, but the parametrization is probably different
[15:31:06] that's useful thanks :)
[15:31:49] mmm but that still only shows s1 and s8?
[15:35:09] kormat: back, is assistance required?
[15:35:19] marostegui, not final, try this: https://grafana-rw.wikimedia.org/d/21pxVYS7z/jaimes-mysql-aggregated-copy?viewPanel=11&orgId=1&from=1631612105065&to=1631633705065&var-site=All&var-group=All&var-shard=All&var-role=All
[15:35:56] godog, I think it is grafana, not prometheus related in the end
[15:36:14] godog: yes, please. https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=11&orgId=1&from=now-24h&to=now&var-site=eqiad&var-group=core&var-shard=All&var-role=All&refresh=1m
[15:36:30] this only shows s1+s8 in eqiad. it shows all sections if you switch to codfw. we haven't been able to figure out why
[15:36:36] afaict, the metrics are in thanos for both
[15:36:47] I have, hello, kormat :-)
[15:37:12] jynus: all i saw above is that you removed one of the key labels
[15:37:16] did i miss something else?
[15:37:28] it has a slightly different format
[15:37:43] on the mysql panel it seems to have a hack
[15:37:48] to work on 10.1 and 10.4
[15:38:00] so I copied it to the grouping one
[15:38:24] handler=~"(prometheus|metrics)",quantile="0.99"
[15:38:34] so it may not be quantile, but the handler one
[15:38:46] (checking as well)
[15:39:09] jynus: that doesn't explain why your updated graph only shows s1/s8/tendril though, no?
[15:39:20] it shows all to me
[15:39:39] but there may be some weirdness on saving
[15:39:52] try now, I saved again
[15:40:06] it may have saved it only for my session before
[15:40:28] This one shows everything: https://grafana.wikimedia.org/d/21pxVYS7z/jaimes-mysql-aggregated-copy?viewPanel=11&orgId=1&from=1631612105065&to=1631633705065&var-site=All&var-group=All&var-shard=All&var-role=All
[15:40:36] jynus: that still doesn't fully explain it, though
[15:40:51] https://thanos.wikimedia.org/graph?g0.expr=http_request_duration_microseconds%7Bjob%3D%22mysql-core%22%2C%20quantile%3D%220.99%22%2Cshard%3D%22s8%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[15:40:56] if you mean- why did it break now?
[15:41:07] I am thinking it didn't
[15:41:09] there's both eqiad+codfw metrics in there, some with handler=metrics, some with handler=prometheus
[15:42:00] yes, so the explanation to that- ofc I am not answering with authority
[15:42:10] mmhh does using handler=~"(prometheus|metrics)" do the right thing?
[15:39:20] it shows all to me [15:39:20] it shows all to me [15:39:39] but there may be some weirdness on saving [15:39:39] but there may be some weirdness on saving [15:39:52] try now, I saved again [15:39:52] try now, I saved again [15:40:06] it may have saved it only for my session before [15:40:06] it may have saved it only for my session before [15:40:28] This one shows everything: https://grafana.wikimedia.org/d/21pxVYS7z/jaimes-mysql-aggregated-copy?viewPanel=11&orgId=1&from=1631612105065&to=1631633705065&var-site=All&var-group=All&var-shard=All&var-role=All [15:40:28] This one shows everything: https://grafana.wikimedia.org/d/21pxVYS7z/jaimes-mysql-aggregated-copy?viewPanel=11&orgId=1&from=1631612105065&to=1631633705065&var-site=All&var-group=All&var-shard=All&var-role=All [15:40:36] jynus: that still doesn't fully explain it, though [15:40:36] jynus: that still doesn't fully explain it, though [15:40:51] https://thanos.wikimedia.org/graph?g0.expr=http_request_duration_microseconds%7Bjob%3D%22mysql-core%22%2C%20quantile%3D%220.99%22%2Cshard%3D%22s8%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D [15:40:51] https://thanos.wikimedia.org/graph?g0.expr=http_request_duration_microseconds%7Bjob%3D%22mysql-core%22%2C%20quantile%3D%220.99%22%2Cshard%3D%22s8%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D [15:40:56] if you mean- why it break now? [15:40:57] if you mean- why it break now? [15:41:07] I am thinking it didn't [15:41:07] I am thinking it didn't [15:41:09] there's both eqiad+codfw metrics in there, some with handler=metrics, some with handler=prometheus [15:41:09] there's both eqiad+codfw metrics in there, some with handler=metrics, some with handler=prometheus [15:42:00] yes, so the explanation to that- ofc I am not answering with authority [15:42:00] yes, so the explanation to that- ofc I am not answering with authority [15:42:10] mmhh does using handler=~"(prometheus|metrics)" do the right thing? [15:42:10] mmhh does using handler=~"(prometheus|metrics)" do the right thing? 
[15:42:15] would be the metrics changed with the prometheus version [15:42:15] would be the metrics changed with the prometheus version [15:42:28] godog: that sounds something we had issues with in the past indeed [15:42:28] godog: that sounds something we had issues with in the past indeed [15:42:29] after upgrade of exporter and/or server [15:42:29] after upgrade of exporter and/or server [15:42:43] godog: i'm dubious about it [15:42:43] godog: i'm dubious about it [15:42:44] now, that hack on mysql graphs [15:42:44] now, that hack on mysql graphs [15:42:58] I am not 100% sure is the right fix either [15:42:58] I am not 100% sure is the right fix either [15:43:30] so I am not claiming to know everthing, just I think it is in the right direction of the issue [15:43:30] so I am not claiming to know everthing, just I think it is in the right direction of the issue [15:43:56] mmhh ok [15:43:56] mmhh ok [15:44:45] I've just replicated that hack into the aggregated one [15:44:46] I've just replicated that hack into the aggregated one [15:45:05] on a separate panel of course, so not to touch the original [15:45:05] on a separate panel of course, so not to touch the original [15:53:45] dashboard fixed [15:53:45] dashboard fixed [15:54:03] we were specifying handler= when there was no need to [15:54:03] we were specifying handler= when there was no need to [15:54:58] yeah that's definitely better than handler=~... [15:54:58] yeah that's definitely better than handler=~... [15:55:03] \o/ [15:55:03] \o/ [15:55:05] Thank you guys [15:55:05] Thank you guys [15:55:06] s/better/future proof/ [15:55:06] s/better/future proof/ [15:55:13] How did that change by the way? [15:55:13] How did that change by the way? [15:55:34] marostegui: stretch->buster change, is all. [15:55:34] marostegui: stretch->buster change, is all. [15:55:49] marostegui: if you look at that graph for the last 30 days, you can see s4 disappear from it on the 26th when you reimaged the eqiad primary [15:55:49] marostegui: if you look at that graph for the last 30 days, you can see s4 disappear from it on the 26th when you reimaged the eqiad primary [15:55:55] yeah pretty sure it is the mysqld-exporter version [15:55:55] yeah pretty sure it is the mysqld-exporter version [15:56:04] Interesting [15:56:04] Interesting [15:56:16] i checked all the other graphs on the page, none of them have this issue. [15:56:16] i checked all the other graphs on the page, none of them have this issue. [15:58:36] So the issue won't arise once we migrate s1 and s8? [15:58:36] So the issue won't arise once we migrate s1 and s8? [15:58:42] correct. [15:58:42] correct. [15:58:45] we were being over-specific [15:58:45] we were being over-specific [15:58:47] sweet [15:58:47] sweet [15:58:55] thank you all for debugging this [15:58:55] thank you all for debugging this [15:59:30] jynus was right on the money about it being a stretch/buster difference [15:59:30] jynus was right on the money about it being a stretch/buster difference [15:59:55] thank you kormat, I just pointed in some guess, you killed it in the end :-) [15:59:55] thank you kormat, I just pointed in some guess, you killed it in the end :-)