[04:57:05] The most important thing is that i love you
[09:37:27] Upstream puppet stdlib now has the Stdlib::IP::Address::CIDR type (as the obvious Variant[Stdlib::IP::Address::V4::CIDR, Stdlib::IP::Address::V6::CIDR]). Our version doesn't have that type. If I wanted to use it, should I just declare it as a type in the module I want to use it in, patch our copy of vendor_modules/stdlib, put it in wmflib, Not Do That You Fool, ...?
[09:37:45] [our stdlib does have the V4 and V6-specific versions]
[09:40:19] i feel like doing it in wmflib for now would be the best, and then it's pretty much just find-and-replace when we update stdlib (once we're on puppet 7 aiui?)
[09:41:39] Not sure - upstream code is https://github.com/puppetlabs/puppetlabs-stdlib/blob/main/types/ip/address/cidr.pp
[09:42:00] it's not in 7.0.0
[09:42:33] looks like it comes in puppet 9
[09:43:40] you mean stdlib 9? Puppet itself is at 8.something now I think
[09:44:27] yes, sorry, stdlib 9
[09:44:57] could somebody pair with me to depool codfw? I'm not sure how to proceed and I think s2 being 30min behind in SQL replication could be a serious enough issue
[09:47:38] depool the whole DC?
[09:49:48] idk if it's doable at the section level, I'm following up on marostegui's recommendation for the issue I'm having currently, but he's unavailable atm so I can't get more details on what was meant
[09:50:16] the issue I'm having only impacts s2
[09:50:25] s2/codfw*
[09:51:30] I think you'll want to follow this procedure: https://wikitech.wikimedia.org/wiki/Global_traffic_routing#Disabling_a_Site
[09:52:17] that will drain all of the DC's traffic as soon as the DNS TTL expires
[09:52:38] ack, reading through and will keep posting here
[09:55:15] that's for edge traffic, but I think you want to depool mediawiki instead?
[09:55:21] should this be an incident?
[09:55:33] yeah, I feel like if we're depooling a DC we need a document
[09:55:44] heads-up godog vgutierrez ^
[09:56:02] ack thanks
[09:56:10] indeed taavi there should be an incident
[09:56:21] claime: not sure we need to depool the DC
[09:56:25] as only those wikis are impacted: https://noc.wikimedia.org/db.php#tabs-s2
[09:56:28] and I'd like _joe_'s opinion on this as well because it feels weird to depool our whole secondary for one section lag
[09:56:32] disabling codfw on those should be sufficient
[09:56:43] err yeah I'm not sure we can do that tbh
[09:56:49] ack
[09:57:20] <_joe_> claime: I think we need to depool codfw from specifically the mediawiki-ro discovery entries
[09:57:25] <_joe_> or am I missing something?
[09:57:41] downtiming db-related alerts for 5hrs
[09:57:48] <_joe_> and yes ofc it's an incident
[09:58:20] _joe_: yeah, sucks we can't be more granular with it though
[09:58:25] <_joe_> I do think we should depool
[09:58:49] ok I'm starting an incident doc
[09:58:56] I'm gonna go ahead and depool then
[09:59:14] <_joe_> claime: only mw-ro entries, we don't need to depool everything tbh
[09:59:20] yep
[09:59:28] https://gerrit.wikimedia.org/r/c/operations/dns/+/1041041
[09:59:50] taavi: feel free to also merge my change
[09:59:56] <_joe_> arnaudb: that is not the right thing to do
[09:59:57] if somebody has time for a quick sanity check and +1
[10:00:04] oh
[10:00:04] <_joe_> that depools external traffic to the edge in the site
[10:00:08] https://docs.google.com/document/d/1FItcuVBL5TiW-obHDSTIsnowYWZ1WO7cByKxW3okg6A/edit
[10:00:13] <_joe_> we want just not to send traffic to mediawiki
[10:00:15] that depools everything :)
[10:00:23] what should I use then?
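
[editor's note] The messages that follow settle on conftool against the discovery objects rather than the edge-traffic procedure. A minimal sketch of what such a depool typically looks like, assuming the documented confctl "discovery" object type and the record names (appservers-ro, api-ro) mentioned later in the log; the exact selectors are an assumption, not the commands that were actually run:

    # Sketch only: depool the MediaWiki read-only discovery records from codfw.
    # Selector fields (dnsdisc, name) follow the standard confctl discovery
    # object type; verify against current docs before running anything.
    confctl --object-type discovery select 'dnsdisc=appservers-ro,name=codfw' set/pooled=false
    confctl --object-type discovery select 'dnsdisc=api-ro,name=codfw' set/pooled=false

    # Repooling later is the same selectors with set/pooled=true.

The sre.discovery.service-route cookbook mentioned in the following messages is the more automated path when it covers the service in question.
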
[10:00:25] <_joe_> vgutierrez: actually no
[10:00:33] <_joe_> it doesn't depool internal traffic to mediawiki
[10:00:45] I'll catch up on the incident doc on the status
[10:00:58] <_joe_> arnaudb: this is kind of a special case, but you need to use conftool on the discovery object type
[10:01:05] arnaudb: confctl, but I'll do it
[10:01:12] thanks
[10:01:13] <_joe_> ack I was about to ask
[10:01:13] I'm already on it
[10:01:21] jelto: mine was already merging, you'll need to merge it after mine finishes
[10:01:40] yep, done. Thanks anyways!
[10:02:04] isn't sre.discovery.service-route there for this use case?
[10:02:18] _joe_: keeping -int-ro up in codfw or nah?
[10:02:31] I feel no, but I'd like a second opinion
[10:02:32] <_joe_> claime: specifically nah
[10:02:52] <_joe_> volans: I don't think we special-cased "all mediawiki active/active services" there
[10:03:55] what's the user impact atm?
[10:04:13] https://noc.wikimedia.org/db.php#tabs-s2 all s2 wikis are lagging
[10:04:17] (on codfw)
[10:04:23] because of a database replication issue
[10:05:04] someone just filed a phab task about their edits not showing up, which is probably caused by this: T367033
[10:05:06] T367033: page saves to English Wiktionary are getting lost - https://phabricator.wikimedia.org/T367033
[10:05:12] so... wiki inconsistencies for codfw/ulsfo/eqsin users?
[10:05:14] it occurred while switching masters → https://phabricator.wikimedia.org/T367019, the newly promoted source failed to assume its role
[10:05:25] I'd say every read on codfw is inconsistent
[10:05:28] done for mw-on-k8s
[10:05:38] need to remember the way for bare-metal, just a sec
[10:06:03] thank you
[10:07:23] <_joe_> claime: appservers-ro and api-ro
[10:07:40] ah, plural...
[10:08:06] * kamila_ will update statuspage unless someone tells me not to
[10:08:14] done
[10:08:35] kamila_: yes please, thank you
[10:08:54] thanks claime, so the user-facing impact of the incident is fixed?
[10:09:35] I'd say we're monitoring impact now, soon to be fixed
[10:09:39] should be, once cache updates
[10:10:11] ack thank you all for the help
[10:10:28] arnaudb: seeing a bunch of recoveries for the lag
[10:10:43] a bunch of false negatives
[10:10:49] I guess it's just the local replicas?
[10:10:53] yeah
[10:10:56] <_joe_> yep
[10:11:13] what is off (outside of the user impact): https://orchestrator.wikimedia.org/web/cluster/alias/s2
[10:11:42] yeah almost an hour
[10:11:45] eesh
[10:12:09] is that db2207, which was supposed to be the newly promoted master, stopped doing its job as soon as it sat in the leader chair → I'm still not certain that reverting the replicas would be doable with the current scripts
[10:12:21] <_joe_> claime: how is eqiad holding up?
[10:12:30] <_joe_> the k8s deployments of mw
[10:13:01] db2207 lag is recovering
[10:13:29] yep, i'm not sure wtf happened
[10:13:51] it was stuck on a semi_sync ACK for an hour
[10:13:52] * kamila_ updated status page to monitoring
[10:13:56] orchestrator still reports 59 minutes lag for db2207 though
[10:13:57] * arnaudb becomes crazy
[10:14:15] yep but selecting locally on the host seems to say otherwise vgutierrez
[10:14:30] ack
[10:14:32] _joe_: under pressure, latency is up a bit, but holding
[10:14:55] I'll keep on monitoring the situation
[10:15:34] It's mostly mw-web and mw-api-ext getting hit, as expected
[10:15:48] <_joe_> claime: do we have some room to grow?
[10:15:54] oh I know what's wrong, hold on
[10:15:55] arnaudb: how's orchestrator measuring the lag?
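
[editor's note] On the "how's orchestrator measuring the lag?" question just above: a hedged sketch of comparing the replica thread's own view with heartbeat-based lag directly on the host, assuming pt-heartbeat and its stock heartbeat.heartbeat table are in use (the log only implies this). Lag derived from a heartbeat timestamp will keep reporting an hour of delay if nothing is updating the row on the newly promoted master, even when the replica thread has fully caught up:

    # Run on the DB host (e.g. db2207). Sketch only; the table layout assumes
    # stock pt-heartbeat (heartbeat.heartbeat with ts/server_id columns).
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'
    sudo mysql -e "SELECT server_id, ts FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1"

If the replica thread shows zero lag but the newest heartbeat row is an hour old, the reported "lag" is a measurement artifact rather than real delay, which would fit the "sec, fixing" / "replication measuring should come back" exchange that follows.
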
[10:16:07] sec, fixing, will explain after
[10:16:27] _joe_: we do but that's not necessary imo, it's ~+20ms
[10:16:33] <_joe_> yeah
[10:16:37] <_joe_> it's ok
[10:16:56] <_joe_> it's better than what we got historically on bare metal
[10:17:37] replication measuring should come back
[10:17:44] yep
[10:18:05] _joe_: that's at 75% worker saturation btw
[10:18:06] finishing up my switchover as everything went back to green, please hold on for a moment before repooling codfw
[10:18:12] it should be quick
[10:18:17] <_joe_> wow that's good claime
[10:18:28] very encouraging
[10:19:40] happy it allowed you to benchmark :)
[10:19:48] taavi: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1041046 seems good to you?
[10:19:50] everything back to normal on my end
[10:20:09] masters have been swapped, lag is at 0
[10:20:21] Good to repool then?
[10:20:26] afaict yep
[10:20:38] Cool, on it
[10:20:40] <_joe_> taavi: FWIW I don't think that https://phabricator.wikimedia.org/T367033 could be caused by this
[10:20:44] thanks!
[10:21:19] <_joe_> or if it is, then it's worrisome
[10:21:44] _joe_: why not? to me that feels like exactly what would happen if the replicas are all behind
[10:21:45] <_joe_> in theory an editor gets assigned a short-lived cookie that sends them to the primary dc for minutes after an edit
[10:21:54] <_joe_> so they should stick to the primary
[10:22:07] <_joe_> unless this person was checking back some minutes later
[10:22:16] <_joe_> it would mean that mechanism is broken
[10:22:20] <_joe_> which would be worrisome
[10:23:25] codfw traffic starting to ramp back up slowly
[10:24:52] by slowly I mean very quickly actually but who's counting
[10:28:27] :D
[10:28:45] back up to normal
[10:28:55] sweet, thank you all
[10:29:06] kamila_: I think we're good to call the incident resolved
[10:29:24] godog: ack, will do
[10:29:42] cheers
[10:36:08] going to lunch, bbiab
[10:41:28] !oncall-now
[10:41:28] Oncall now for team SRE, rotation business_hours:
[10:41:28] g.odog, v.gutierrez
[10:42:03] going to depool text@drmrs to apply IPIP encapsulation modifications: https://phabricator.wikimedia.org/T366466
[10:54:05] disabling puppet on A:cp-text to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1039948
[11:49:59] ack fabfur
[11:52:02] drmrs is still depooled but the activity is completed. I'm waiting for some other tests and I'll repool it after lunch
[15:04:19] jelto: o/ when you have a moment https://phabricator.wikimedia.org/T356252#9871159
[15:13:53] yeah sure, I'll do it after the meetings
[15:15:48] <3
[21:48:31] fun fact of the day: we once had a POP called "lopar". lopar.wmnet was in Paris
[21:59:19] IIRC there was one in South Korea at one point as well
[21:59:54] oh yeah they're all still listed in: https://wikitech.wikimedia.org/wiki/Data_centers
[22:05:35] usage: "the server" lol
[22:08:55] can't get any more micro PoP than that!
[22:15:05] it's archeology day today... all triggered by a new ticket. "pretty" names, before we had to worry about wildcard certs: commonsprototype.tesla.usability.wikimedia.org. https://static-bugzilla.wikimedia.org/show_bug.cgi?id=41433#c8
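
[editor's note] One small addition on the codfw repool earlier in the log (10:20-10:28): a quick way to watch where the read-only discovery names point, and how much TTL remains, while traffic ramps back up. The record names follow the appservers-ro/api-ro entries above; the .discovery.wmnet zone is an assumption based on standard WMF service-discovery naming:

    # From an internal host: shows which datacenter's address each MediaWiki
    # read-only discovery record currently resolves to, plus the remaining TTL.
    dig +noall +answer appservers-ro.discovery.wmnet
    dig +noall +answer api-ro.discovery.wmnet
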