[07:44:15] XioNoX topranks we just got an email saying: https://ru.wikipedia.org not reachable from AS213048. [07:44:35] I can create a task if that helps [07:44:44] looking [07:45:11] thanks XioNoX [07:56:04] I replied, but if there is an issue it's not on our side [07:56:23] Thanks :) [09:34:20] Reminder we'll be repooling eqiad RO (so all active/active services, including mw ones) at 11:30UTC [09:43:42] Corrected reminder we'll be repooling eqiad RO (so all active/active services, including mw ones) at 10:30UTC [10:24:37] Heads up effie XioNoX, I'll repool eqiad RO in 5 minutes [10:25:08] * marostegui ready [10:25:25] ๐Ÿ‘ [10:26:41] I'll run the warmup script just in case, opinions? [10:26:50] yeah [10:26:56] better be safe than sorry [10:31:18] doesn't hurt [10:32:17] yep, it's running now [10:32:31] Down to 120ms average [10:32:37] Run complete, letล› go [10:42:47] api_appserver and appserver at ~6krps [10:43:09] I am monitoring the dbs [10:44:00] Is there a rule to prefer sending RO traffic to the passive DC ? I suppose so, I have a perfect crossover between eqiad and codfw for appserers [10:44:03] appservers* [10:44:36] you do? I have see eqiad getting all the traffic here: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&from=now-1h&to=now&viewPanel=65 [10:44:47] Yes that's what I mean [10:45:03] there is no rule afaik [10:45:13] and that thing there is weird [10:45:30] if you zoom out to say 30d, it's not like that [10:45:49] yeah [10:46:18] afaik we didn't change anything else than just repooling api-ro and appservers-ro in discovery [10:47:22] api is actually well split [10:47:25] almost half/half [10:49:43] latencies are also weird [10:50:07] while for api cluster they are converging back to the previous level, for appservers they aren't [10:50:29] but rather stabilizing at more than 50% higher numbers [10:50:41] ok, this is pretty concerning, what on earth is going on [10:51:16] I don't know what could be different between api and appservers tbh [10:51:56] pooled status of the actual servers looks normal [10:52:30] ah [10:52:39] cumin2002:~$ host appservers-ro.discovery.wmnet [10:52:39] appservers-ro.discovery.wmnet has address 10.2.2.1 [10:52:41] this is wrong [10:52:47] it point to eqiad? [10:53:19] huh [10:53:50] wait... this is ... weird... [10:54:24] cgoubert@cumin1001:~$ confctl --object-type discovery select 'dnsdisc=appservers-ro' get [10:54:25] {"codfw": {"pooled": true, "references": [], "ttl": 10}, "tags": "dnsdisc=appservers-ro"} [10:54:27] {"eqiad": {"pooled": true, "references": [], "ttl": 10}, "tags": "dnsdisc=appservers-ro"} [10:54:29] That looks normal [10:54:32] should it be that cumin2002 returns eqiad ? [10:54:34] e.g. [10:54:36] cumin2002:~$ host mathoid.discovery.wmnet [10:54:36] mathoid.discovery.wmnet has address 10.2.2.20 [10:54:40] this is also wrong [10:54:59] blubberoid (which was skipped back on March 1st cause it's exlucded) [10:55:04] cumin2002:~$ host blubberoid.discovery.wmnet [10:55:04] blubberoid.discovery.wmnet has address 10.2.1.31 [10:55:07] ok, this is weird [10:55:17] Ok, I can rerun the cookbook with a pool on codfw [10:55:30] It *should* restore the correct datacenter [10:55:31] why? is it depooled ? [10:55:33] No [10:55:46] But I suspect weirdness when running with only one DC [10:55:57] this looks to be dns specific ... [10:56:04] not sure if the cookbook would help ? [10:56:16] on the other hand, it's supposed to make sure those records are wiped, right ? 
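A minimal sketch of the cross-check being done by hand above: compare confctl's pooled state for a dnsdisc object with what the anycast recursor actually serves. The confctl and dig invocations are the ones quoted in this log; the 10.2.1.x = codfw / 10.2.2.x = eqiad prefix mapping and the record list are assumptions inferred from the addresses seen here, not an authoritative map.

#!/usr/bin/env python3
"""Rough consistency check: confctl pooled state vs. what the recursor answers.
Meant to run where confctl and dig are available (e.g. a cumin host); the
per-site VIP prefixes below are inferred from this log, not canonical."""
import json
import subprocess

SITE_BY_PREFIX = {"10.2.1.": "codfw", "10.2.2.": "eqiad"}  # assumed mapping

def pooled_sites(dnsdisc):
    """Sites currently pooled in confctl for a dnsdisc object."""
    out = subprocess.run(
        ["confctl", "--object-type", "discovery", "select", f"dnsdisc={dnsdisc}", "get"],
        capture_output=True, text=True, check=True,
    ).stdout
    sites = set()
    for line in out.splitlines():
        if not line.startswith("{"):
            continue  # skip any non-JSON chatter
        for key, value in json.loads(line).items():
            if isinstance(value, dict) and value.get("pooled"):
                sites.add(key)
    return sites

def resolved_site(name, resolver="10.3.0.1"):
    """Resolve a discovery name via the anycast recursor and map the answer to a site."""
    lines = subprocess.run(
        ["dig", f"@{resolver}", "+short", name],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()
    ip = lines[0] if lines else ""
    for prefix, site in SITE_BY_PREFIX.items():
        if ip.startswith(prefix):
            return site
    return f"unknown ({ip or 'no answer'})"

if __name__ == "__main__":
    for svc in ("appservers-ro", "api-ro"):  # records involved in this incident
        pooled = pooled_sites(svc)
        answer = resolved_site(f"{svc}.discovery.wmnet")
        flag = "ok" if answer in pooled else "MISMATCH"
        print(f"{svc}: pooled={sorted(pooled)} recursor={answer} -> {flag}")

Note that with both sites pooled this only catches answers pointing at a depooled site; the geo-preference problem seen in this incident needs the ECS-level probing shown further down.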
[10:56:19] yes [10:56:25] it even has checks, what on earth ? [10:56:42] And I suspect it changes to the datacenter you tell it to pool and then wipes [10:57:24] btw, api-ro.discovery.wmnet also points to eqiad, so this doesn't explain the discrepancy [10:57:30] This is extremely strange [11:01:22] fyi if i query ns0 directly i get the correct results [11:01:52] working theory the cache somehow got cleared and refreshed its cache before the change was made on ns0 [11:02:04] i'd recommend clearing the cache in eqiad again and think it should work [11:02:21] i can do that if you want claime [11:02:30] dns* hosts are a bit difficult to reason about [11:02:33] yes, please do [11:02:36] to me right now that is [11:02:38] jbond go ahead [11:02:45] akosiaris: Especially since there's been some changes recently [11:02:46] I assume you mean rec_control wipe-cache right ? [11:02:56] there is a cookbook [11:02:57] for it [11:03:09] claime: yeah, I see multiple dns daemons, each listening on 1 or more IPs [11:03:12] didn't help [11:03:14] gdnsd and powerdns colocated [11:03:28] * jbond used the cookbook [11:03:29] how are those 2 interacting with each other? [11:03:34] sudo cookbook sre.dns.wipe-cache 'api-ro.discovery.wmnet' [11:03:50] or sre.dns.wipe-cache 'discovery.wmnet$' if you want to refresh the whole thing [11:04:23] we should also make sure that both eqiad and codfw are announcing the anycast address 10.2.2.22 [11:04:27] *10.3.0.1 [11:05:01] ok, I'm proposing running sre.discovery.datacenter pool codfw [11:05:16] claime: what is that supposed to change? [11:05:23] Looking at the code, I suspect the cookbook only checks for correctness for the datacenter it is running for [11:05:39] Fyi, I'm stepping away for ~1h30 [11:06:39] volans: forget it, I'm checking something [11:06:50] root@deploy2002:~# dig @ns0.wikimedia.org api-ro.discovery.wmnet +short [11:06:52] 10.2.1.22 [11:07:11] root@deploy2002:~# dig @ns0.wikimedia.org appservers-ro.discovery.wmnet +short [11:07:13] 10.2.1.1 [11:07:15] So from codfw that looks ok [11:07:31] wait, what? [11:07:32] but if you do it @10.3.0.1 [11:07:34] cgoubert@cumin1001:~$ dig @ns0.wikimedia.org appservers-ro.discovery.wmnet +short [11:07:36] 10.2.2.1 [11:07:38] then you get the other one [11:07:38] cgoubert@cumin1001:~$ dig @ns0.wikimedia.org api-ro.discovery.wmnet +short [11:07:40] 10.2.2.22 [11:07:42] And that is from eqiad, and works [11:07:47] ah, yes, look at 10.3.0.1 [11:08:00] 10.3.0.1 is borked [11:08:03] yes ns0 is working fine but the anycast address is not [11:08:03] the authoritative ones appear to have the correct data ? [11:08:17] What's holding 10.3.0.1 ? [11:08:29] it's anycast. All DCs and PoPs do [11:08:39] and it's not TTL [11:08:45] as it's expired many times over by now [11:08:53] however ns0 ip address on dns2001 is giving the wrong address [11:08:57] even assuming the wipe cache didn't work [11:09:20] what's nsa.wikimedia.org ? [11:09:32] IP, 198.35.27.27/32 [11:09:54] that's for the other dns ;) ignore it [11:09:55] akosiaris: i think that's something related to DoH [11:10:03] sukhe: ^^ [11:10:15] those dns boxes are confusing to me right now.
I haven't had to reason about them in a long while and somehow things have changed enough to feel out of my comfort zone [11:10:25] cgoubert@cumin1001:~$ sudo cumin 'A:dns-auth' 'dig @10.3.0.1 +short appservers-ro.discovery.wmnet' [11:10:31] ===== NODE GROUP ===== [11:10:33] (14) dns[1001-1003,2001-2003,3001-3002,4003-4004,5003-5004,6001-6002].wikimedia.org [11:10:35] ----- OUTPUT of 'dig @10.3.0.1 +s....discovery.wmnet' ----- [11:10:37] 10.2.2.1 [11:10:39] ================ [11:10:41] wth [11:10:50] ok, so anycast is borked everywhere [11:10:53] at least it's consistent [11:10:58] I don't understand what's answering that anycast query [11:11:06] the recursors [11:11:07] lsof says powerdns [11:11:08] on the sam ehosts [11:11:23] so so i thinkt hat the issues is that dns2001 is getting the eqiad address when it queries ns{0,1,2} [11:11:59] so when we hit its anycast address it doses a dns query to ns{0,1,2} sourced from its self which gets the eqiad address [11:12:00] there has been some work on those in recent days, rif T330670 [11:12:00] T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 [11:12:13] which then gets added to the cache [11:12:31] I'll sal log that we're encountering unexpected dns issues, but not affecting production status, y/n ? [11:12:33] jbond: it should propagate the edns though [11:12:48] claime: y but note increased latencies [11:12:50] if its configuered correctly [11:12:53] akosiaris: ack [11:12:59] volans: moight be of for privacy [11:13:37] either way dns2001 gets the wrong answer and thast what you hit for the anycast lookup so i would say that is the most likley issue [11:15:01] further if i bind to the correct address (ns1) i get the correct answer [11:15:02] dig -b 208.80.153.231 api-ro.discovery.wmnet. @91.198.174.239 [11:15:40] so, source address selection is borked? [11:15:48] thats what id say [11:16:10] the powerdns config on dns2001 (and maybe oithers) needsd to bind to the ns1 address for outbound queries [11:17:11] why dns2001 has an esams IP? [11:17:30] or has all of them? [11:17:30] volans: look at all the IPs that it has [11:17:41] volans i think traffic have spoken about configuring all dns serveres so that they could take any ns{0,1,2} ip address if needed [11:17:44] it has all IPs of all nsX and the recursor [11:17:46] right has all of them [11:18:05] But it binds to 208.80.153.77 [11:18:11] Not 208.80.153.231 [11:18:24] yes its the outbound queries which are more linked to the linux routing table unless specified [11:18:42] it binds to multiple IP+Port pairs [11:18:53] e.g. powerdns to 10.3.0.1:53 [11:19:03] # local-address IP addresses to listen on, separated by spaces or commas [11:19:04] but gdnsd to the ns{0,1,2}:53 [11:19:05] local-address=10.3.0.1 208.80.153.77 2620:0:860:3:208:80:153:77 [11:19:21] I'm out of my depth right now [11:19:42] so, it binds to the anycast IP for recursor + it's own IP addresses [11:19:48] claime: I'd need to deploy something to staging, should I hold off until the DNS issue is fixed or do we have the green light? 
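A small sketch of the comparison being tried above: ask each authoritative server and the anycast recursor for the same discovery record and print the answers side by side. The ns0/ns1/ns2 names and the 10.3.0.1 anycast address come from this log; bear in mind the authoritative answer is geo/ECS-dependent, so a difference is a hint, not proof of breakage.

#!/usr/bin/env python3
"""Side-by-side view: authoritative servers vs. the anycast recursor."""
import subprocess

AUTH_SERVERS = ("ns0.wikimedia.org", "ns1.wikimedia.org", "ns2.wikimedia.org")
ANYCAST_RECURSOR = "10.3.0.1"

def short_answer(name, server):
    out = subprocess.run(
        ["dig", f"@{server}", "+short", name],
        capture_output=True, text=True, timeout=5,
    ).stdout.strip()
    return out or "(no answer)"

def compare(name):
    answers = {srv: short_answer(name, srv) for srv in AUTH_SERVERS}
    answers["10.3.0.1 (recursor)"] = short_answer(name, ANYCAST_RECURSOR)
    print(name)
    for server, answer in answers.items():
        print(f"  {server:<22} {answer}")
    auth_only = {answers[s] for s in AUTH_SERVERS}
    if answers["10.3.0.1 (recursor)"] not in auth_only:
        print("  -> recursor answer matches none of the authoritative ones")

if __name__ == "__main__":
    for record in ("appservers-ro.discovery.wmnet", "api-ro.discovery.wmnet"):
        compare(record)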
[11:19:48] that's powerdns specifically [11:20:08] elukey: helmfile deployments should not be affected, you can go ahead [11:20:08] elukey: it's probably unrelated, you can probably go ahead [11:20:21] I'm holding scap deployments for now as a precaution [11:20:33] super, just wanted to verify, thanks :) [11:20:37] bast2002:~$ dig @ns2.wikimedia.org mathoid.discovery.wmnet +short [11:20:37] 10.2.1.20 [11:20:42] help me out here [11:20:52] scratch that [11:20:56] that's correct [11:21:01] damn my brain is fried [11:21:22] bast2002:~$ dig @10.3.0.1 mathoid.discovery.wmnet +short [11:21:22] 10.2.1.20 [11:21:26] now that's correct too [11:21:55] bast2002:~$ dig @10.3.0.1 appservers-ro.discovery.wmnet +short [11:21:55] 10.2.1.1 [11:21:59] and now everything is correct? [11:22:09] i did a cache wip for testing things [11:22:28] Didn't you do one earlier that did nothing? [11:22:40] that could have fixed it. possible only oine sourc/destination combination is causing issues [11:23:16] i.e. if the cache did a query to ns0 then because ns0 is on loopback it (i think would source the address from the ns0 ip) if it queries ns1 it would source from ns1 [11:23:26] was just trying toi test that via tcpdump [11:23:40] ok a quick sudo cumin 'bast*' 'dig +short mathoid.discovery.wmnet' now returns what I would expect to see [11:23:45] For now it doesn't change anything to appserver query balance [11:23:47] 3 and 3 per ip returns [11:24:18] and yet, yes, nothing changed on the appserver query balance [11:24:39] however that might happen in a few when the TTL expires [11:24:46] yeah [11:25:27] I was checking edns-subnet-allow-list on the powerdns config [11:25:53] it has all the 3 nsX addresses, as expectd, but maybe internal queries are now going locally using anther IP [11:26:00] because before they were on different boxes [11:26:04] while now they are colocated [11:26:25] except esams that I think was already colocated (not sure) [11:26:59] akosiaris: Also most queries are served by ATS using reused connections, so that may linger too, couldn't it? [11:27:22] I am in a have a meeting, shall I start a status document ? [11:27:53] fyi confirmed the above query if dns2001 anycast node queries for something in wikimedia.org then it will pick an ns? server at random (or based on gdnsd selction) algorithem) and it will sourvce queries from the same ip address [11:27:57] https://phabricator.wikimedia.org/P45858 [11:28:07] effie: We've taken a +50% hit on latency, don't know if it's worth an incident [11:28:09] i.e. if it picks ns0 then it will source from ns0 and get the eaid address [11:28:27] claime: not sure about ATS right now tbh. 
If it was envoy I 'd be pretty [11:28:28] i suspect that the lkast time i cleared the cache we got lucky and it picked ns1 (sorecd from ns1) [11:28:30] vgutierrez ^ [11:28:51] ok I am standing by, if you feel this is getting bigger, please let me know [11:30:15] Checking envoy, ores error rate is pretty high [11:30:46] as a quick fix we could maybe update the following so it only gose to its local address [11:30:49] We're starting to get mediawiki errors https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus%2Fops&viewPanel=19&orgId=1 [11:30:49] forward-zones=wmnet=208.80.154.238;208.80.153.231;91.198.174.239, 10.in-addr.arpa=208.80.154.238;208.80.153.231;91.198.174.239, [11:31:04] but this likley reduced some other aspects of the resilient design [11:31:52] claime: the memcached fleet is under pressure, https://grafana.wikimedia.org/d/000000316/memcache?orgId=1, may be related [11:32:38] jbond: I'm wondering if the current setting for edns-subnet-allow-list is just missing the /32 netmasks, I would assume powerdns is smart enough to assume /32 if missing, but could it be just this simple mistake? [11:32:49] docs says: Comma separated list of domain names and netmasks, negation supported [11:32:52] https://doc.powerdns.com/recursor/settings.html?highlight=forward%20zones%20recurse#query-local-address [11:32:56] I am guessing we want this ? [11:33:11] volans: not sure of the top of my head pdns is the one dns daemon i have used least :) [11:33:39] akosiaris: i think its a bit more complicated then that because the addresses are local [11:34:05] fixing edns might be the best option. however not everythin is garunteade to send edns [11:34:39] but perhaps pdns just adds it opertunisticly ? [11:34:54] claime: it seems that we are using the gutter pool, https://grafana.wikimedia.org/d/5XF4XXyWz/memcache-gutter-pool?orgId=1&refresh=30s, is it expected? [11:34:59] jbond: could it hurt in anyway to try to add the /32 and exclude that? seems a simple thing to test [11:35:08] jbond: I may have misunderstood then. But the problem is probably that powerdns queries locally the auth dns instances that are also on that same machine via any random IP on the node ? [11:35:09] yes seems safe enough to me [11:35:27] akosiaris: yes, but if edns is propagated correctly it shoul dwork anyway [11:35:37] unless I'm missing something else [11:35:56] elukey: I'm guessing the warmup script didn't warm memcached enough/at all, but effie may have a better answer [11:36:05] akosiaris: i yes. so pdns looksup foo.wikimedia.org. it has no cache. dos an ns lookup which returns ns{0,1,2}. it picks one at randome then sends the query [11:36:16] because all ip addresses live on the same box and are all bound to LO [11:36:33] linux will source the address fromt hye ip that has the best route (not the one the damone is bound to) [11:36:36] claime: yeah I see timeouts in logstash from mw hosts, then mcrouter fails over to the gutter pool [11:36:42] so when going to ns-0 it picks the ip for ns0 to source [11:36:55] claime: looks better now, good [11:37:04] elukey: Yeah, alert recovered [11:37:15] elukey: I think it's a warmup issue. 
[11:37:27] I'll not it as something to be investigated [11:37:28] volans: i just checked and it cutrrently sends an edns with loopback [11:37:48] https://phabricator.wikimedia.org/P45859 [11:37:50] http://linux-ip.net/gl/ip-cref/ip-cref-node174.html [11:38:02] source address selection is a bit more complicated in fact [11:38:25] we are already on the 3rd otherwise in this case [11:38:38] and... returns a zero source address ? [11:38:40] yes and the bind it referes to in step one there is not the same (neccesaraly) as the bind for the listening port [11:38:42] jbond: interesting, so do you think is not sending edns and so maybe the /32 might fix it? [11:38:53] although those settings are old enough in puppet [11:38:55] jbond: yes, it's the bind of the outgoing ephemeral port [11:38:55] yes could be [11:38:58] so I wonder how they worked before [11:38:59] yes [11:39:13] and ofc edns complicates all of this even more [11:39:24] and still doesn't explain why on earth appservers in eqiad see all the traffic [11:39:29] whereas api servers don't [11:39:42] 50% chance? [11:39:51] actually 33% :D [11:40:06] we are talking an almost 100% cutover here [11:40:08] yes i think we never seen this before because we changed the dns serveres between the last switch and now [11:40:13] jokes apart I'm not sure how that is akosiaris [11:40:17] almost entirely from codfw -> eqiad [11:40:25] and just for 1 cluster, this is absurd [11:40:32] and it's lingering... [11:40:37] akosiaris: eache record will have a 33% chance of getting the right address which will then be cached in the anycast ionstance [11:41:04] so it could be that for TTL api gose to eqiad but api-ro to codfw etc [11:41:16] sudo cumin 'bast*' 'dig @10.3.0.1 +short appservers-ro.discovery.wmnet' gives the correct values btw [11:41:19] when the TTL expires we role the dice again [11:41:28] jbond: but we have a 10s TTL right now [11:41:39] so I would expect a lot of graphs moving around [11:41:42] and we don't see that [11:41:46] yeah it would have converged on that 33% you are talking about by now [11:41:54] yes true it could be powerdns is doing something a bit more clever [11:42:16] caches get crazy wwith ns selctions and cache refresh and i dont know the internals well enough to guess further [11:42:43] but i would expect to see a bit more issues here, i find it strange its so stable [11:43:15] do we want to try changing something or wait for the dns folks to come online? [11:43:18] I propose depooling eqiad from appservers-ro for now? [11:43:34] +1 for me [11:43:41] but why just that? [11:43:53] volans: Because it's the only one majorly affected [11:43:53] it's the only one that we 've identified misbehaving ? [11:43:54] if that happens for all records, we should just go back to codfw only? [11:44:09] have we identified anything else ? [11:44:14] what has that record special compared to others I don't get it [11:44:59] have we checked them all? [11:45:22] easy enough, gimme a sec [11:49:43] everything seems fine tbh. a sudo cumin 'bast*' 'dig .discovery.wmnet' for active/active services returns 3 hosts and 3 hosts per record [11:50:07] the exceptions being the various either excluded things (helm-charts, blubberoid) or some weird sutff like e.g. 
releases [11:50:14] so, dns records are kewl [11:50:27] claime: +1 on shifting appservers-ro to codfw [11:50:32] I wanna see if it changes anything [11:50:33] ack [11:50:40] if it does, I'll be very very puzzled [11:51:13] sudo cookbook sre.discovery.service-route --reason T331541 depool --wipe-cache eqiad appservers-ro [11:51:13] T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 [11:51:16] Ok? [11:51:27] 👍 [11:51:32] Lezzgo [11:51:34] ok i think i have worked out another bit of the puzzle [11:51:58] i.e. why doesn't the answer switch/change every 10 seconds [11:52:01] basically [11:52:02] dig -b 208.80.153.231 api-ro.discovery.wmnet. @208.80.153.231 [11:52:11] returns the correct answer (codfw) however [11:52:17] dig +subnet=127.0.0.1/32 -b 208.80.153.231 api-ro.discovery.wmnet. @208.80.153.231 [11:52:20] returns eqiad [11:52:28] Done [11:52:38] so i guess gdnsd always returns eqiad if the subnet is localhost [11:53:08] i only did any testing with api-ro.discovery.wmnet. so unfortunately don't have data for working ones [11:53:25] have dumped some notes into https://phabricator.wikimedia.org/P45858 [11:53:37] appservers-ro Active/Active pooled [11:53:43] Service Type eqiad codfw [11:54:46] well, I don't see a difference [11:54:52] at least that's consistent [11:54:57] yeah [11:55:08] the default discovery map is datacenters => [eqiad, ... [11:55:20] good, cause I'd be going crazy if that thing had any effect [11:55:26] and I don't see an entry for 127, so I guess it uses the default one [11:55:37] because it doesn't match any other geo map or nets map [11:56:02] ack thanks [11:56:48] That still means our discovery system doesn't do what we want it to do :/ [11:58:23] ehm, scratch that [11:58:24] I've proposed my invasive test, but happy to wait for the dns folks to show up before messing up with their systems :) [11:58:35] I might need to go crazy after all [11:58:38] it's already 7~8am over there [11:58:38] akosiaris: Yeah it's starting to converge [11:58:52] hi folks [11:58:59] And it crossed over. [11:59:04] I see a ping and a long scrollback [11:59:08] what's up :D [11:59:15] hey sukhe I'll try to summarize [11:59:18] I'll let people who know what they're talking about fill you in [11:59:27] claime: that's a biiiig delay [11:59:29] 5mins ? [11:59:35] akosiaris: yes. [11:59:39] I thought we had a 10s TTL [11:59:39] 1) traffic for a/a services was repooled in eqiad [11:59:41] sukhe: also see https://phabricator.wikimedia.org/P45858#186220 (very rough notes, probably best to check after volans summary) [12:00:03] 2) appservers-ro.discovery.wmnet moved the whole traffic to eqiad, not balancing between eqiad and codfw [12:00:11] akosiaris: For some reason, the cookbook sre.discovery.service-route depool **raises** TTL back to 300 before doing anything. [12:00:23] 3) on etcd/confctl and at the authdns level all is fine (querying directly nsX) [12:00:47] 4) the problem resides in the pdns recursors that seem to not propagate correctly the edns client information [12:01:05] claime: well, that at least explains it [12:01:13] 5) the pdns cache was cleared multiple times for specific records [12:01:45] claime: wait another 5 and repool eqiad ? [12:01:55] I wanna see if we will trigger it again [12:01:56] 6) we tried to explain it looking at various angles, j.bond paste is what he found trying to debug at the dns level [12:01:59] akosiaris: agreed.
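The ECS probe described just above can be wrapped into a quick sketch like the one below, reusing addresses already quoted in this log (ns0's eqiad address, ns1's codfw address, and the loopback value the recursors turned out to be sending) as stand-in client subnets; they are illustrative, not a complete map.

#!/usr/bin/env python3
"""Ask one authoritative server for a discovery record under different EDNS
client subnets, to see which datacenter gdnsd picks for each."""
import subprocess

AUTH = "ns0.wikimedia.org"
RECORD = "appservers-ro.discovery.wmnet"
CLIENT_SUBNETS = {
    "eqiad-ish client (ns0 addr)": "208.80.154.238/32",
    "codfw-ish client (ns1 addr)": "208.80.153.231/32",
    "loopback (what pdns was sending)": "127.0.0.1/32",
}

for label, subnet in CLIENT_SUBNETS.items():
    answer = subprocess.run(
        ["dig", f"@{AUTH}", f"+subnet={subnet}", "+short", RECORD],
        capture_output=True, text=True, timeout=5,
    ).stdout.strip()
    print(f"{label:<34} -> {answer or '(no answer)'}")

With the default discovery map mentioned above (datacenters => [eqiad, ...), the loopback case is expected to come back as the eqiad VIP regardless of where the real client sits.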
[12:02:35] volans: thanks looking [12:02:51] I did look at the config and I have a doubt for the edns-subnet-allow-list setting, that has IPs and not netmasks, and I wonder if adding /32 might solve it (easy to try, could do nothing) [12:02:53] when did we last try this and it worked? [12:03:01] trying to figure out what changed [12:03:17] sukhe: AFAIK what changed is the co-location of pdns and gdnsd on eqiad/codfw [12:03:28] sukhe: dns box bump to bullseye? [12:04:18] apparently psdns is sending 127.0.0.1 as edns subnet [12:04:27] probably that, unlikely that the authdns move had any change [12:04:31] looking [12:04:32] and our guess is that in that case gdnsd returns the default mapping, that has eqiad first [12:04:41] did the pdns version change? [12:04:53] yes [12:04:59] sukhe: we not have the pdns cache and the authdns server on the same box (i dont think we had that before did we) [12:05:05] *now [12:05:06] but we had been running the new one in other places [12:05:27] yes but most records are eqiad and codfw [12:05:54] akosiaris: I think I can unlock scap, it's not like we're protecting anything by blocking deployments are we? [12:06:08] claime: yeah, go ahead [12:06:12] the only place we have an auth dns server is esams ann that i suspect always responds with eqiad/codfw (i.e. never esams) regardless so we probably didn't spot anything [12:08:01] claime: 5m mark :-) [12:08:07] akosiaris: yep, doing [12:08:42] The sre.discovery.service-route always tries to set TTL to 300, and I can't understand why. elukey ? [12:08:56] I see it in the code, I just don't understand the rationale [12:08:58] 08:04:59 < jbond> sukhe: we not have the pdns cache and the authdns server on the same box (i dont think we had that before did we) [12:09:46] the dnsrec role is a superset of the dnsauth role and has been for a while [12:10:11] akosiaris: mark. [12:10:39] sukhe: do you happen to know if IPv4s are automatically converted to /32 netmasks in settings that expect netmpasks? [12:10:44] see https://doc.powerdns.com/recursor/settings.html#setting-edns-subnet-allow-list [12:11:49] volans: /32 wouldn't work right? [12:11:53] per " The initial requestor address will be truncated to 24 bits for IPv4 (see ecs-ipv4-bits) and to 56 bits for IPv6 (see ecs-ipv6-bits), as recommended in the privacy section of RFC 7871." [12:12:02] that's the truncation part though [12:12:11] that's not what the setting expects [12:12:14] it's fine to specify just the IP and it seems like that it appends /32 [12:12:15] sukhe: ack [12:12:26] because that's what we are doing in our config and it's clearly working in other places :) [12:12:38] maybe other places expects IPS [12:12:45] or they don't convert them in all places :D [12:13:00] akosiaris: here we go [12:13:04] vgutierrez: that's the truncation, not the subnets of the authoritative servers to contact with edns [12:13:06] akosiaris: converging again [12:13:12] ok, at least it was DNS [12:13:14] will it blend [12:13:33] And it's crossing over [12:13:35] So it's DNS. [12:13:51] * claime it's always DNS [12:13:57] I am not following this last discussion but I guess this drop in mysql reads is because of https://phabricator.wikimedia.org/T331981 https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-3h&to=now&viewPanel=8 and when memcached got warmed up? [12:13:57] lol [12:14:00] ๐Ÿ‘ [12:14:11] so... 
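Related to the point above about the cookbook resetting the TTL to 300: the TTL actually being served is visible in a plain dig answer, so a tiny helper like this (dig parsing only, nothing WMF-specific) is enough to watch it decay after a change.

#!/usr/bin/env python3
"""Print the remaining TTL the recursor is serving for a record."""
import subprocess

def served_ttl(name, resolver="10.3.0.1"):
    """Parse the TTL column out of 'dig +noall +answer' (name TTL IN A addr)."""
    fields = subprocess.run(
        ["dig", f"@{resolver}", "+noall", "+answer", name],
        capture_output=True, text=True, timeout=5,
    ).stdout.split()
    return int(fields[1]) if len(fields) >= 5 else -1

if __name__ == "__main__":
    for record in ("appservers-ro.discovery.wmnet", "api-ro.discovery.wmnet"):
        print(record, served_ttl(record))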
[12:14:44] marostegui: It would seem the warmup scrip does not warm up correctly [12:15:10] claime: Yeah, that's my point, if memcached wasn't warmed up, that big big spike on mysql makes sense, and once memcached got warmed, the reads from the DBs decreased [12:15:15] So yeah, it corroborates what we've seen with memcached [12:15:21] Oki [12:15:34] marostegui: I'm adding your graph to the task [12:15:39] k [12:15:40] sukhe: https://github.com/PowerDNS/pdns/blob/55b30d284ba25c350b1746f036e2d3167959ef65/pdns/recursordist/syncres.cc#L5759 [12:16:05] ok, I need to clear up my head for a sec. That was weird [12:16:44] akosiaris: same [12:16:48] back in 10 [12:16:58] so to be clear: is this blocking anything now? [12:17:15] not that I can spot anything so far but trying to understand the urgency of it for debugging [12:17:18] sukhe: It's blocking us repooling active active appservers-ro correctly [12:17:22] or have we put in a temp fix? [12:17:23] I see [12:17:31] So not ugrent but schedule derailing a bit [12:17:33] sorry, still waking up and trying to make sense [12:17:34] urgent* [12:17:38] I am seeing nothing so far [12:18:18] sukhe: basically look at https://grafana.wikimedia.org/goto/p64nwbaVk?orgId=1 [12:18:24] volans: what about the link? as in that it should add the subnet mask? [12:18:32] We canยดt go back to a rough 50/50 split between eqiad and codfw [12:18:43] I [12:18:46] I [12:18:48] sukhe: not sure yet, checking what Netmask(a) does [12:18:53] DAMN fingers. [12:19:03] I'm taking a break, I'm reachable through Signal [12:20:00] if we strictly talk about edns-subnet-allow-list for a second [12:20:09] then at least nothing seems to have changed between the versions, from the notes at least [12:20:19] edns-subnet-allow-list=208.80.154.238, 208.80.153.231, 91.198.174.239 [12:20:56] hmmm that number of requests between eqiad/codfw is quite weird [12:21:14] eqsin should hit codfw anyways [12:22:18] I have a test too to check for if the EDNS client subnet option is propagated properly and that also seems to be working fine, https://github.com/wikimedia/operations-software-knead-wikidough/blob/master/tests/test_dns.py#L190. if it was not anyway, we would have seen failures in other places too [12:22:33] so if going by that, that leaves us with: [12:22:43] 07:52:38 < jbond> so i guess gnds always returns eqiad iof the subnet is localhost [12:22:49] that I am not sure about actually [12:23:18] sukhe: vgutierrez: this is a strangness i see which https://phabricator.wikimedia.org/P45859 [12:23:25] I've only begun to catch up, but a few points I see that are causing some misleading "tests" in various scrollback: [12:23:42] sukhe: that EDNS subnet statment is not 100% confirmed only one or two qeuries so i could have missed soemthing elses [12:24:01] 1) the recursors aren't using localhost to reach recdns on the same machine. They use the real nsX IPs, even if it happens to be local traffic. 
[12:24:03] sukhe: I think it adds it automatically, it should be: https://github.com/PowerDNS/pdns/blob/rec-4.6.1/pdns/iputils.hh#L496 [12:24:26] I am starting a document, sorry for the delay I was in meeting [12:24:33] bblack: yes the real ips but they are on the loopback interface [12:24:42] *thats what i was meaning [12:24:44] volans: thanks, that makes sense anyway because we have relied on it for a while on that assumption [12:24:57] the interface name shouldn't be a factor [12:25:41] well it impacks the source address selection algorithem as it affects the routing table [12:26:29] 2) setting explicit source addresses (e.g. dig @ns0 from ns0) and/or testing 10.3.0.1 from the recursors themselves... those can maybe return useful results, but it's at the very least going to get confusing. Even the recdns's resolv.conf don't use themselves, etc. It'd probably be more-reliable to use the site-level bastions as tests, using their own normal source IPs. [12:26:40] but anyway the main point is that queries destined to ns0 will be sourced from ns0 [12:26:58] https://docs.google.com/document/d/1-e4y3UXtim6MW1Y0p3JqRwPip-4esbbHD7sEiVpB4cc/edit#heading=h.95p2g5d67t9q [12:27:03] 3) "dig +nsid" will help confirm which authserver is returning the result you see, might be helpful [12:27:33] jbond: you mean queries destined to a recdns machine in the dns100x set, right? [12:27:39] sukhe: fyi for the subnet test i set up the foolowing on dns2001 [12:27:40] sudo tcpdump port -vvvnnn 53 -i lo | grep jbond.wikimedia.org [12:27:53] and then sent a query from cumin2001 like dig foojbond.wikimedia.org [12:28:30] bblack: yes if dns2001 recusor is looking up something in w.o then if it hist ns0 it will be sourced from ns0 (that is at least what it looked like as well) [12:28:50] that makes sense as a theory, but why hasn't it been an issue before? [12:29:24] bblack: it possibly isn't i thought this changed during the recent authdns migrations so i could have been going sdown the wrong rabit hole [12:29:24] maybe something changed with the bullseye upgrade? [12:29:49] and it seemed to tie with the codfw dns recusor getting the wrong goip answer [12:30:07] the timelines overlap somewhat, and I'm still on coffee #1. [12:30:31] sukhe: did we have bullseye dns100x before the depooling of a couple weeks ago? [12:30:37] depooling of eqiad I mean [12:30:52] * volans has to step away for lunch, bbiab [12:31:58] checking the timeline [12:31:59] dns1001 uptime is 11 days, that probably is when it was reimaged [12:32:16] so yeah, we may have been on buster for the initial failover, and bullseye now for the failback [12:32:51] March 2 for the dns1001 bullseye upgrade [12:33:12] so yeah, after [12:33:47] either way, aside from whatever's implied by the bullseye upgrade (possibly new pdns upstream changes, possibly kernel changes in source selection, etc), fundamentally nothing else has changed lately. The recdns have all had authdns-over-loopback for a very long time now. [12:36:39] so as one check on the "ns0 source" theory: [12:36:49] bblack@cumin1001:~$ sudo cumin 'A:bastion' 'dig +short @10.3.0.1 en.wikipedia.org A' [12:37:25] ^ this returns a dc-local public IP from each bastion. 
If that were broken, the contagion of general "ns0 source" issues would have impacted global geographic routing for the public, too [12:37:55] slightly different than the real issue we're seeing here, but confirms some level of basic sanity [12:38:41] bblack@cumin1001:~$ sudo cumin 'A:bastion' 'dig +short @10.3.0.1 appservers-ro.discovery.wmnet A' [12:38:56] (3) bast[2002,4004,5003].wikimedia.org -> 10.2.1.1 [12:39:08] (3) bast[1003,3006,6002].wikimedia.org -> 10.2.2.1 [12:39:28] ^ that also seems correct, assuming we turned a/a back on earlier today, right? [12:39:46] claime: akosiaris: can you confirm, i thought that was ment to get switched back? [12:39:51] codfw+ulsfo+eqsin -> codfw [12:39:58] eqiad+esams+drmrs -> eqiad [12:41:10] i see its still pooled in both so yes that would be correct [12:41:37] so where do we see the issue at specifically, that I can repro right now? [12:42:41] bblack: im not sure we can see it now. however (and wait for claime akosiaris to confirm) but when we ran the cookbook to make theses active active the recursore in codfw was still returning the address for eqiad [12:42:49] we cleared the cache but this difn;t fix things [12:42:54] cleared the cache again and it did [12:43:07] ok [12:43:09] *cleared the dns_rec cache on dns2001 [12:43:20] so clearing the cache twice, eventually everything was ok? that narrows it down a bit! [12:43:47] maybe it's just something with the process/cookbook itself, order of operations, method of wiping, TTLs, etc [12:44:40] maybe even the sequence of "checking" things from some process, might have been refreshes caches at bad times. (a dns query to confirm something actually reloads a full TTL, just before a slightly-async auth change?) [12:45:03] yes i possed that firs. perhaps we clear the cache on the recursorses, and re-query before the auth is updated [12:45:12] * claime backlogging [12:45:14] however the ttl is 10 seconds so it should have in theory recovered it self [12:47:51] TTL is reset to 300 when using sre.discovery.service-route [12:48:11] And yes, we repooled both eqiad and codfw to see if we had the same crossover [12:48:15] And we did [12:48:19] yeah reading more backlog too [12:48:48] https://grafana.wikimedia.org/goto/71nClba4k?orgId=1 [12:48:50] even if the 300 TTL were the problem, it would've self-resolved in 5 minutes [12:48:55] (or the only problem, anyways) [12:49:22] Basically we can't return to the "normal" state of around 50/50 split when both are pooled [12:50:04] fwiw I see nothing in the pdnsrec thing [12:50:18] claime: so even now, we do still have a testable problem? [12:50:23] stepping out for daycare dropoff, be back soon [12:50:24] sorry maybe I misunderstood above [12:51:17] yeah clearly we do: all traffic's still inexplicably all on one side [12:51:42] bblack: we do [12:51:48] bblack: i have given https://docs.google.com/document/d/1-e4y3UXtim6MW1Y0p3JqRwPip-4esbbHD7sEiVpB4cc/edit# a quick timeline which shows when we saw the error, when caches where clearedn etc [12:54:02] ok, I /can/ confirm we still get wrong recursor results from codfw appservers themselves, there's something [12:54:17] sudo cumin 'A:mw' 'dig +short @10.3.0.1 appservers-ro.discovery.wmnet A' [12:54:31] ^ this gave the eqiad IP for all mw* it hits, in both DCs [12:55:23] and locally confirming a full dig output with +nsid from a codfw mw server, it is in fact using dns2002 to get the result [12:55:44] err 2003, whatever. 
The point is, the 10.3.0.1 traffic isn't being misrouted to another DC or whatever [12:55:48] volans, elukey, it seems to me discovery.service-route should reduce ttl then put it back to what it was, I sent https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/898731 for it [12:59:00] bblack: i see a lot of queries with [ECS 127.0.0.1/32/0] which i belive will always get an eqiad response [12:59:32] yeah, it would [13:00:01] https://phabricator.wikimedia.org/P45865 [13:00:09] dns2001 ~ % sudo tcpdump -vvvnnn -i lo | grep ECS [13:02:36] So right now we're basically serving all RO traffic from eqiad, no matter what DC the call originates from [13:02:53] jbond: but where this 127 should come from? I guess if edns is not propagated it should just not be there, not be 127.0.0.1 [13:03:42] api-ro gives an eqiad IP too [13:04:02] volans: i would assume that if the cache dosn;t get one then it shuldn't set anything and then gdnsd would uise the source address of it sees the qurey, however i think that would end up with the issue i thught we where having before [13:04:15] (using the sudo cumin 'A:mw' 'dig +short @10.3.0.1 .discovery.wmnet A' command as a test) [13:04:22] the ECS needs to be there. just removing it would cause problems too. [13:04:40] so the loopback ECS is definitely part of this, still digging [13:04:43] is the page related to the work being done here? [13:04:53] exactly but i think we should set it to $facts['networking']['ip'] instead of localhost [13:05:51] XioNoX: parsoid is only pooled in codfw [13:05:56] https://doc.powerdns.com/recursor/settings.html#ecs-add-for [13:06:14] ^ this is missing in our config. maybe something to do with which pdns-rec versions we jumped between on upgrade, and/or the defaults changed [13:06:24] but by current documentation, we should need to set that option appropriately [13:07:02] jbond: it shouldn't be explicitly set to anything: it should be the real client IP for all of this setup to work as intended [13:07:44] If no suitable address is found, the recursor fallbacks to sending 127.0.0.1. [13:07:46] (for example, it's supposed to be acceptable in the current design that e.g. all recursors in codfw are dead, and mw1234 in eqiad queries via anycast to reach dns1001 recursor, which should still return a codfw-associated result [13:08:22] there's an expicit "ecs-add-for" that allows/blocks ECS based on the client IP network, and it defaults to blocking networks like 10/8 from being used in ECS, and we're not setting it [13:08:40] yes I think that's it [13:08:41] clearly wasn't a problem before, and the version history is unclear, but I think that's the problem we're having now [13:08:49] ack [13:09:17] I think I got my example backwards above, sorry for the confusion [13:09:37] which pdns version we had prior to bullseye upgrade? [13:09:43] but I meant to say, basically: global failover of /recdns/ traffic between hosts in one site to recursors in another should still give correct results for discovery/geoip, which requires "real" ECS [13:10:13] volans: will check shortly once I am on the computer but 4.3 something [13:10:15] volans 4.1.11-1+deb10u1 [13:10:32] jbond: that seems really old? [13:10:34] we might have had some components with more reccent version that's why I'm asking [13:10:36] at least thats the default in git could have had a specific component [13:10:42] default in deb [13:10:51] ill check apt [13:11:01] we are using 4.6 currently. 
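A stdlib-only illustration of the ecs-add-for point: under the default ACL (the full value is quoted a bit further down in this log), any RFC1918 client is excluded, so the recursor substitutes 127.0.0.1 as the ECS source and gdnsd falls back to its default, eqiad-first map. The sample addresses reuse IPs already mentioned here; only the IPv4 negations are modelled.

#!/usr/bin/env python3
"""Check whether a client address would be allowed by pdns-recursor's default
ecs-add-for ACL (IPv4 negations only); excluded clients get 127.0.0.1 as ECS."""
import ipaddress

EXCLUDED = [ipaddress.ip_network(n) for n in (
    "127.0.0.0/8", "10.0.0.0/8", "100.64.0.0/10",
    "169.254.0.0/16", "192.168.0.0/16", "172.16.0.0/12",
)]

def gets_real_ecs(client_ip):
    addr = ipaddress.ip_address(client_ip)
    return not any(addr in net for net in EXCLUDED)

# A 10/8 internal address, a public recursor-host address, and loopback.
for ip in ("10.2.1.1", "208.80.153.77", "127.0.0.1"):
    verdict = "real client subnet forwarded" if gets_real_ecs(ip) else "ECS replaced with 127.0.0.1"
    print(f"{ip:<15} {verdict}")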
[13:11:03] what's confusing me about the history, is that we do have a long-standing setting for "edns-subnet-allow-list" [13:11:06] which says it was added in 4.5 [13:11:14] but this ecs-add-for says it was added in 4.2 :P [13:11:14] bblack: [13:11:15] yeah I was confused by it too [13:11:27] that was subnet whitelist before [13:11:28] name change [13:11:32] oh! [13:11:33] ok [13:11:44] so the 4.5 is just when the option name changed, got it [13:11:50] and if we were <4.2 before, then this all makes sense [13:12:02] bblack: https://doc.powerdns.com/recursor/upgrade.html#x-to-4-5-1 [13:14:55] patch incoming [13:15:13] we can make it betterer in terms of version selection or exact settings later, but this should bandaid for now [13:15:14] on the doh hosts we have 4.6.0-1wm1 [13:15:54] yeah doh has had different packaging, it had some requirements beyond the internal recdns needs [13:16:01] ack [13:16:06] then I don't know :) [13:16:08] https://gerrit.wikimedia.org/r/c/operations/puppet/+/898736/ [13:16:31] can someone confirm this seems sane, given current theory? [13:16:46] * jbond looking [13:16:49] +1ed [13:17:59] I guess we can clear it later [13:18:05] there is some old cruft there too [13:19:06] you mean the caches? [13:19:18] no, the puppet template, sorry [13:19:22] ok [13:19:42] I think the puppet agent runs will restart pdns for us (for better or worse!), so they should come up fresh on caches [13:19:48] I have them running in cumin at -b1 now [13:20:00] back [13:20:07] k [13:20:25] ok so it looks like we have always (or for some time) installed from component but only the most recent is still on apt so cant see what we use to have [13:20:51] kinda weird though, because it says ecs-add-for defaults to 0.0.0.0 already? [13:20:54] anyway let's see [13:20:58] yeah, pdns-rec is one of those packages, I donno how far back we'd have to look in history, when we last used plain old debian upstream, if ever [13:21:17] sukhe: it has a !10.0.0.0/8 in the default, too (among similar others) [13:21:25] aaah that was the dealbreaker [13:21:29] which is like, all the clients that matter for this case :) [13:21:47] I Fhere there are some uploads to reprepro [13:21:47] https://sal.toolforge.org/production?p=0&q=pdns-rec&d= [13:22:07] look for buster [13:22:13] well then this only further makes sense if we were running 4.1 before [13:22:17] becuse this changed in 4.2 [13:22:20] yup [13:22:23] bonus points to sukhe to have ! log-ged them consistently :D [13:22:28] ack so 4.5.7-1wm1 from 2021-11-08 [13:22:28] volans: thanks :P [13:23:00] there is also a 4.6.0-1wm1 but was for doh [13:23:02] on buster? [13:23:09] that's what it says [13:23:10] yeah the doh machines have a different one [13:23:16] 4.5.7-1wm1 specificaly says buster [13:23:23] maybe we never installed that before moving on? [13:23:47] we do have some pinning-type stuff in puppet too right? 
[13:23:50] my memory says we were definitely not on 4.1 but then that's kinda moot anyway now as long as this works :) [13:23:51] ahh its doh that specificaly adds the install_from_component [13:23:57] yes [13:24:14] otherwise it defaults to false and will use whats in debian [13:25:45] wait, we changed config [13:25:48] if debian::codename::ge('bullseye') or (debian::codename::ge('buster') and $install_from_component) [13:25:52] sukhe: [13:25:53] we're noe entering here [13:25:54] Mar 14 13:25:40 dns1001 pdns-recursor[2876856]: Mar 14 13:25:40 Exception: Trying to set unknown setting 'ecs-add-for: 0.0.0.0/0, ::/0' [13:25:57] where before wewere not [13:25:59] wtf? [13:26:08] so we're setting $pdns_43 = true [13:26:19] I assume the prod config was with $pdns_43 = false [13:26:24] see modules/dnsrecursor/manifests/init.pp [13:26:41] starting on line 83 [13:26:43] bblack: [13:27:01] ecs-add-for: 0.0.0.0/0, ::/0 [13:27:06] ecs-add-for = 0.0.0.0/0, ::/0 [13:27:34] :facepalm: [13:27:48] DNS :P [13:27:57] * claime shakes fist [13:28:10] either way, reverting for now with a quicker agent run on all in parallel [13:28:13] ok [13:28:20] I am working on the updated patch [13:28:30] that's not DNS, that's just stupidity on my human part :) [13:28:36] and ours too [13:28:38] sorry about that [13:29:35] btw the puppet manifest too needs a lot of updates, the logic is currently quite changed [13:29:44] from buster to bullseye for prod [13:29:59] volans: yes, we were in between the transition to bullseye [13:30:06] and some dnsrec hosts still are on buster (cloud stuff) [13:30:09] and hence [13:30:25] https://gerrit.wikimedia.org/r/c/operations/puppet/+/898738/ [13:31:14] yes, what I'm saying is that will endup in doh too [13:31:25] that's OK [13:31:35] ok [13:31:36] but thanks for checking [13:32:15] cool, looks OK [13:32:30] if anything, probably needs the fix too, if we want consistent behavior for any healthcheck/monitor inside our infra. [13:32:35] (doh, that is) [13:33:02] 2023-03-14 13:32:58,979 anycast-healthchecker[3758841] INFO hc-vip-nsa.wikimedia.org status UP [13:33:08] 2023-03-14 13:32:57,977 anycast-healthchecker[3758841] INFO hc-vip-recdns.anycast.wmnet status UP [13:33:12] So should I expect traffic to A/A to rebalance itself between eqiad and codfw? [13:33:42] ok bblack is running it everywhere [13:33:42] shortly, yes [13:33:45] ack [13:33:54] it will take a few minutes, running agent serially on 14 hosts [13:34:03] +TTL [13:34:03] but we'll see effects after the first few [13:34:10] of course, I was just checking if I needed to do something on my end [13:34:12] they get restarted for the config change, so no TTL [13:34:17] ah right [13:34:21] yep [13:35:30] sukhe: don't show them the NSA sniffer IP, now everyone knows our secrets :) [13:36:04] :( [13:36:19] :P [13:36:57] (for the record since it's obscure, "nsa.wikimedia.org" is just a fun-poke at the NSA. 
it's just our anycasted authdns public IP) [13:37:15] (heh) [13:37:21] ns0, ns1, ns2, nsa [13:37:27] a-for-anycast [13:37:42] this better make it to the next round of Snowden+1 leaks [13:38:08] config now present on dns2001 [13:38:22] I'm seeing convergence in appserver graph [13:38:44] only the future will tell if it's convergence or inversion :-P [13:38:50] Exactly :D [13:38:54] oh ye of little faith [13:39:02] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&viewPanel=65&from=now-30m&to=now&refresh=1m [13:39:04] s/little/no/ [13:39:05] for the curious [13:39:31] the ramps are already moving in the direction of split I think, in the last data point [13:40:37] claime: whaat levels do we expect them to level to? [13:41:28] it has hit all the dns2 boxes now I think [13:41:43] so shortly it should stabilize. I'm sure there's effectively some tiny caching at other layers, too [13:41:54] im seeing sane ECS values now [13:42:13] it has to reach all the edge recursors to see full impact, though [13:42:37] (because e.g. ATS in ulsfo/eqsin is affected by the discovery ECS as well for public traffic) [13:43:05] volans: around 2/3 split, looking at 30 days data [13:43:32] probably varies by time of day, too [13:43:38] Between 2/4 and 2/3 [13:43:42] bblack: Yep [13:44:08] With active datacenter having more variability than passive [13:45:08] so: root cause is basically: traffic failed to notice a new config setting in our latest pdns-rec package, whose default value breaks stuff we rely on. [13:45:40] adding some test dns queries to test could help for future upgrades [13:45:50] that and, arguably, as much as it's hard for everyone to schedule around everyone in the general case at current paces, it maybe wasn't the wisest move to schedule dns server upgrades in the midst of a dc-switchover period :) [13:45:57] volans: yeah, good idea! [13:46:12] honestly, our icinga/alerts stuff should be checking that from multiple POVs or something [13:46:17] if we have that capability [13:46:35] I guess we could invent it indirectly, via nrpe checks from a different host or something [13:47:35] can an AM alert run a cumin-based test? [13:47:54] because we could build a pretty solid one based on cumin+dig across some sample hosts in all DCs [13:48:08] we need to do the other way around [13:48:27] systemd timer on cumin to run the command, exporting the data in some way (prometheus, file, etc...) [13:48:36] and then an alert on top of that one [13:48:39] ok [13:48:51] but which IP are you expecting to get depends on the pooled status in confctl [13:48:58] https://phabricator.wikimedia.org/T311618 [13:48:59] at least for discovery [13:49:11] well, similar for admin_state, which is even thornier [13:49:22] (if we check for public) [13:49:41] but technically you can affect discovery with admin_state commits too, we just don't commonly do so [13:50:01] sorry about the trouble folks! 
this one is on me as part of the bullseye upgrade :P [13:50:17] I usually go through the changelogs and upgrade guide and I did but I still don't see the change for ecs-add-for anywhere [13:50:25] sukhe: it was mostly extremely confusing lol [13:50:41] the other weird part being the pdns-rec setting for as a default [13:50:42] Default: 0.0.0.0/0, ::/0, !127.0.0.0/8, !10.0.0.0/8, !100.64.0.0/10, !169.254.0.0/16, !192.168.0.0/16, !172.16.0.0/12, !::1/128, !fc00::/7, !fe80::/10 [13:50:43] Like we really were scratching our heads on why repooling both datacenters caused a complete inversion of traffic [13:50:50] btw we do also have the capability to resolve with client IP in spicerack (and we can move it to wmflib easily) [13:50:52] I am not sure why it's like that [13:50:58] anyway [13:51:18] sukhe: I assume it's because they don't want common installs to accidentally leak specific "internal" client IPs on queries to external authservers [13:51:30] which is already prevented by the other ecs setting in our case, but whatever [13:51:43] yeah, "Regardless of the value of this setting, ECS values are only sent for outgoing queries matching the conditions in the edns-subnet-allow-list setting" [13:51:46] anyway [13:52:03] Split is now 2/4 for appservers, which is a lot more aligned with expectations [13:52:19] I suppose a lot of users probably had edns-subnet-allow-list=0/0 or whatever to allow ECS-based geoip for remote/public results [13:52:33] I am kind of wondering why codfw being the primary datacenter rn is getting the smallest part of the traffic [13:52:45] and so this was the fix to only leak relevant public client IPs, not private space, which should be meaningless for public authdns anyways, in the general/public case [13:52:49] But it may be becasue primary/secondary has no real bearing on RO traffic spread [13:53:18] they should have mentioned a bit more explicitly in the upgrade guide IMHO [13:53:18] right, RO dwarfs RW [13:53:56] volans: we could uses the data in https://config-master.wikimedia.org/discovery/discovery-basic.yaml (this is created via a confd:file which we can add to cumin). [13:53:58] claime: esams and drmrs go to eqiad [13:54:08] volans: yeah, that's what I figured [13:54:10] technically, lack of edns-subnet-allow-list=0/0 on our recursors can have some negative impact on our outbound resolution for some corner case scenarios as well, it's just not important to us. [13:54:49] jbond: in that case I'd write a cookbook to be honest [13:54:53] bblack: in what way though? as in, lookups not resolving properly because the ECS information is not sent? [13:54:53] all the bits are already there [13:54:54] (for cases where an internal service daemon makes calls to an external public service, which uses some kind of GeoIP resolution, and even then it only impacts latency, and only when all the recursors in one site have failed and we've switched to using non-local recursors) [13:55:09] yeah [13:55:09] volans: yes that may be better [13:55:22] api_appservers is a better view for "complete" traffic repartition since it's much more balanced between RO and RW [13:55:26] fyi https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/discovery/client.pp is the resource that wrotes that file [13:57:51] sukhe or bblack, could you add conclusions + fix to https://docs.google.com/document/d/1-e4y3UXtim6MW1Y0p3JqRwPip-4esbbHD7sEiVpB4cc/edit#heading=h.95p2g5d67t9q ? 
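A very rough sketch of the timer-plus-exported-metric idea floated above: run the dig checks and drop a Prometheus textfile metric that an alert can be built on. The output path and metric name are assumptions for illustration, and the expected answers would really come from confctl or the discovery-basic.yaml file mentioned just above rather than being hardcoded.

#!/usr/bin/env python3
"""Periodic discovery-DNS sanity check exporting a Prometheus textfile metric
(for node_exporter's textfile collector). Path, metric name and the
record->expected-IP map are illustrative assumptions."""
import os
import subprocess
import tempfile

TEXTFILE = "/var/lib/prometheus/node.d/discovery_dns_check.prom"  # assumed path
RESOLVER = "10.3.0.1"
# VIPs quoted earlier in this log; in practice pull these from confctl.
EXPECTED = {
    "appservers-ro.discovery.wmnet": {"10.2.1.1", "10.2.2.1"},
    "api-ro.discovery.wmnet": {"10.2.1.22", "10.2.2.22"},
}

def check(name, expected):
    answer = subprocess.run(
        ["dig", f"@{RESOLVER}", "+short", name],
        capture_output=True, text=True, timeout=5,
    ).stdout.strip()
    return 1 if answer in expected else 0

def main():
    lines = ["# TYPE discovery_dns_record_ok gauge"]
    for name, expected in EXPECTED.items():
        lines.append(f'discovery_dns_record_ok{{record="{name}"}} {check(name, expected)}')
    # Write atomically so the collector never reads a partial file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(TEXTFILE))
    with os.fdopen(fd, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    os.replace(tmp, TEXTFILE)

if __name__ == "__main__":
    main()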
[13:58:33] claime: heading to a meeting but yes, happy to do it after [13:58:39] FYI it finished up (the agent rollout) around :53 [13:59:00] claime: I am still seeing lots of timeouts to the eqiad DBs, I am not sure if all this is related, but they are still quite high [14:01:32] marostegui: https://grafana.wikimedia.org/goto/l3vm6baVk?orgId=1 ? [14:02:01] No not that one, that's stable [14:02:08] claime: https://logstash.wikimedia.org/goto/2cee9922d6d1a5b856194f56b14cbdb7 [14:02:31] Although they seem to have almost dropped now [14:02:40] claime: another aspect is the cookbook side [14:02:59] IMHO it should have failed when checking that the correct IPs were returned [14:03:06] marostegui: the memcached part is mostly stable as well [14:04:48] volans: It apparently got the right IPs at some point, since it checked 3 times, then wiped the caches [14:04:56] So either the check is not the right check to do [14:05:10] claime: it checks only the authdns directly [14:05:15] that weren't at fault here [14:05:18] Yeah [14:05:22] doesn't check via recdns [14:05:38] we could/should add that too I guess? not sure [14:05:47] So it should probably wipe, then check recdns ? [14:06:05] i think it should wipe then check recdns from every site [14:06:09] ideally, not sure if overkill [14:06:36] volans: it would probably be the same cook book used for the nrpe/prometheeus check you proposed above [14:07:09] we can always make a task to discuss it, and decide if it's overkill or not [14:09:22] +1 for checking on a disc change via: wipe, check-recdns [14:09:24] seems more robust [14:11:14] bblack: for that check hitting 10.3.0.1 would suffice? of I need to hit every host on their rec listening instance? [14:13:13] IMO it's best to hit the rec host directly given that 10.3.0.1 is anycasted [14:13:18] but curious what bblack thinks [14:13:34] my worry is that with 10.3.0.1 I might not find the issue [14:13:45] depending on the combination of pooled service + cumin host where it's run [14:14:58] https://phabricator.wikimedia.org/T332009 [14:15:28] thx [14:15:35] sorry we're in a meeting, so kinda slow on the convo [14:15:45] we can follow up on task [14:15:46] no hurry [14:16:06] yeah we're in a stable state now [14:16:18] both are valuable: checking 10.3.0.1 from test hosts in $various_dcs is valuable. You could go deeper and also explicit check against dns[123456]00x directly from any one host. [14:16:33] they both have some value, but the first one is probably the most important [14:20:19] effie, XioNoX, urandom, we're now in a stable state with eqiad pooled read-only [14:20:29] ๐Ÿ‘ [14:24:09] awesome, good job! [14:25:10] great, thank you everyone, sorry from my part [14:26:38] sorry about the DNS part folks! [17:54:39] jbond: is the cert-manager removal for pki2001.codfw.wmnet safe to apply on admin_ng? [17:55:14] hnowlan: yes please [17:55:21] cool :) [17:55:29] its been decomissions so should be safe to apply everywhre [19:53:36] marostegui: since Amir1 is out, can you have a look at this patch? I'd like to deploy it some time this week. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/898795/ [22:03:03] Hi, sorry, I made the mistake of reading my email late in the evening and T331820 got bumped to UBN! And i think correctly. 
Something is going awry with thumbs (users are getting an error page instead of a thumbnail) [22:03:04] T331820: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 [22:03:47] ...but they do then sometimes work the following time [22:11:45] I've had that intermittently from about 17:00 UTC today [22:13:09] e.g. https://upload.wikimedia.org/wikipedia/commons/thumb/f/f9/Octicons-gift.svg/12px-Octicons-gift.svg.png will consistently fail for me, but https://upload.wikimedia.org/wikipedia/commons/thumb/f/f9/Octicons-gift.svg/30px-Octicons-gift.svg.png works โ€” Thumbor having a normal one it seems [22:13:43] thumbors error rate doesn't look unusual - https://grafana.wikimedia.org/goto/GB-IuaaVk?orgId=1 [22:14:10] TheresNoTime: the first of those works fine [22:14:12] (for me) [22:14:37] I get a 404 Emperor [22:15:02] (UK here, so DC issue?) [22:15:04] On the first [22:15:08] Both UK [22:15:17] But then I thought Emperor was [22:16:05] * Emperor also UK [22:17:23] Iโ€™m confused [22:18:08] log searching for that first of TheresNoTime's links [22:19:01] The `12px-` has now started working for me [22:20:41] P45869 [NDA] [22:23:05] So I see ms-fe2012 saying 404 at 22:13:09 then 201 at 22:13:10 and ms-fe1011 saying 404 at 22:07:15 and 22:07:50 then 201 at 22:07:51 and then some other servers saying 499 (client got bored) or 200 [22:23:14] I'm not sure what I'm looking at there, but 404 -> 201, now it loads? [22:23:51] Swift has a custom 404 handler which hands the request off to thumbor then returns thumbor's answer to the client (and stores the resulting image) [22:24:20] And 201 from a quick Google is "created", which is expected (?) [22:24:45] So I read that as thumbor in both DCs saying 404 initially, and then saying 201 on the second attempt [22:25:10] fwiw the 12px one has been 404ing for me for a few hours [22:25:31] those are the only swift hits today [22:25:44] Does the CDN cache 404s? [22:25:51] only for a short time [22:26:03] that's what my brain thought [22:26:03] IIRC up to 10 minutes (or less, if the origin had a shorter TTL on it) [22:26:43] [yes, 10 minutes] [22:27:18] I was trying to edit Tech News today with the 12px icon, at around 17:00 UTC-ish, that's when I first noticed it (and after a few tries of the icon not displaying in the preview, I picked a different size and it worked..) [22:27:45] some of these thumb symptoms around the DC switch, I suspect are indirect impacts of the DC switch [22:28:08] when asking about a different thumb problem yesterday, I was told that thumbs are only cached into the dc-local swift on generation these days. [22:28:21] s/cached/stored/, whatever [22:28:29] the point is, they're not synced across eqiad+codfw like originals [22:28:31] s/17:00/18:00 [22:28:49] yeah, the cross-clsuter replication was turned off for the thumbs pool quite a while ago I think [22:29:02] e.g. 
[22:29:02] e.g. https://commons.wikimedia.org/wiki/File:The_Sky_is_Not_the_Limit;_There_Are_Footprints_on_the_Moon_(5932386).jpeg is now producing an error for some thumbs, but images for others (and it was serving more errors earlier) [22:29:17] thumb generation+storage is a little bit slow and complex (cf. the 404-handler thing above, and thumbor's general scaling woes and limits) [22:29:42] in the "normal" pattern, you'd end up with some significant differences in which sets of thumbs have been generated+stored over time in the two different swift clusters [22:30:02] dc switchover (and switchback) is going to perturb those patterns, causing the need to generate more fresh thumbs than normal [22:30:28] Yeah, but I don't grok why some people are seeing the error pages for hours, rather than the <10m we'd expect. [22:30:36] yeah, that part I dunno [22:31:13] the edge caches shouldn't cache a 404 over 10 minutes. But for thumb purposes swift is also effectively like a cache layer, where misses go to thumbor. [22:31:40] replication> each thumb generation used to be chucked at the other DC at generation time, but we turned that off because it was overloading the proxies; similarly, the huge thumb buckets were an issue for rclone (because it tries to load the entire index into RAM), so we stopped the swiftrepl replacement from worrying about thumbs [22:32:35] the two core DCs are mostly equivalent in terms of function, but the traffic they see has different patterns based on geography [22:32:36] I'm surprised that thumbor is saying 404 (rather than 5xx if it failed) [22:33:24] the "left side" is eqsin+ulsfo+codfw, the "right side" is esams+drmrs+eqiad. Some wikis are much more popular on the left or right, and this leads to differing patterns in both edge cache contents and swift thumbs contents. [22:34:06] (because e.g. frwiki is more popular on the right than it is on the left, due to regions served by edges and language/culture stuff) [22:34:30] The response code panels in https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1 seem to show a nearly 50/50 split between 200 and 404 responses [22:34:46] why would thumbor ever generate a 404? [22:34:49] Ah, log-grepping for TheresNoTime's image (Octicons-gift.svg) in /srv/log/thumbor/thumbor.404.log finds 4 2023-03-14 18:04:40,566 8801 thumbor:ERROR [SWIFT_LOADER] get_object failed: https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.f9/f/f9/Octicons-gift.svg ClientException('Auth GET failed',) [22:34:57] I mean, I'm sure there's an answer, I just don't know it :) [22:35:16] bblack: I think only if it can't find the original... [22:35:23] yeah, that makes some sense! [22:35:56] looks like Emperor is maybe on a trail... [22:35:58] the "Auth failed" thing was part of the mysteries we were examining yesterday as well. The fallout looks a little bit different [22:36:00] those thumbor logs https://phabricator.wikimedia.org/P45869#186278 [22:37:07] I'm not sure thumbor should say 404 when it's had an AUTH failure. [22:37:27] well, if auth fails, then it can't load the original, which is as good as not finding it by some logic?
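To separate "the original is really missing" from the "Auth GET failed" case in the SWIFT_LOADER log line above, one can ask Swift directly for the original object. A sketch with python-swiftclient follows; the auth URL and credentials are placeholders, while the container and object names are the ones from the log line.

```python
"""Manual check sketch: ask Swift directly for the original Thumbor failed to load."""
from swiftclient.client import Connection
from swiftclient.exceptions import ClientException

conn = Connection(
    authurl="https://ms-fe.svc.eqiad.wmnet/auth/v1.0",  # assumed tempauth endpoint
    user="mw:thumbor",                                   # placeholder credentials
    key="REDACTED",
    retries=1,
)

try:
    # Container and object names are the ones from the SWIFT_LOADER error above.
    headers = conn.head_object("wikipedia-commons-local-public.f9",
                               "f/f9/Octicons-gift.svg")
    print("original exists:", headers.get("content-length"), "bytes")
except ClientException as exc:
    # http_status == 404 would mean the original is genuinely missing; anything
    # else (including the 'Auth GET failed' case) is an infrastructure problem.
    print("swift error:", getattr(exc, "http_status", None), exc)
```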
[22:37:29] But swift auth is a pile of sadness - we use tempauth which is not for production use, and AFAICT it logs nothing anywhere under any circumstances [22:38:04] bblack: I see that, though I think I'd rather 5xx [22:38:09] the thumbor HTTP status codes only make sense from Thumbor's perspective, not the client's [22:38:29] a fine theory, but its HTTP status codes get sent to the client [22:38:36] but yes, it's a 404 if there's any exception while loading the file from Swift [22:38:47] the reason I bring up the pattern split (more than once!) is that I suspect these problems may not be truly novel, just existing problems that become much more visible under more thumb-generating load, and the dc-switch + patterns may cause more of that [22:39:18] (+ lack of thumb replication) [22:39:30] I also suspect thumbor isn't caching its auth data as well as it should, so it's inclined to hammer the swift auth service more than it should [22:40:03] bblack: Mmm [22:41:30] irrelevant to the immediate issues at hand, but one way we could mitigate those kinds of cross-side pattern effects at all kinds of internal layers of "caching" would be to have our edges send a small percentage of requests to the "wrong" core DC all the time. [22:41:38] * bd808 was whining to _j.oe_ just yesterday that we should kill Thumbor with fire and use MediaWiki + shellbox instead [22:41:47] (when both are pooled) [22:43:37] I would really like thumb generation out of the insides of swift; but that's not a tonight issue. [22:43:44] yeah, that too [22:44:27] FWIW, I've seen some variation of "let's run a service that auto-creates thumbnails/transcodings for multimedia based on whatever arbitrary stuff we see in the request URL" at multiple $employers before Wikimedia [22:44:38] and every one of them has had problems at least as bad as ours :) [22:45:11] it's a fundamentally difficult problem to solve without throwing massive resources at it or accepting some severe limitations [22:45:12] https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/The_Sky_is_Not_the_Limit%3B_There_Are_Footprints_on_the_Moon_%285932386%29.jpeg/320px-The_Sky_is_Not_the_Limit%3B_There_Are_Footprints_on_the_Moon_%285932386%29.jpeg <-- do I read this right that it's a cached 404? [22:45:24] right, which is why we stole one from Facebook instead of using the one hacked into MW :) [22:45:35] Emperor: image loaded for me [22:45:51] Also, does T331820 need to remain UBN! or can it be downgraded (and relatedly, can I go to bed? :) [22:45:51] T331820: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 [22:46:45] Feels "quite broken" but not "entirely broken", guess that doesn't count as UBN [22:47:11] Emperor: I would say you can always go to bed, but Andre set it as UBN because of the number of duplicate reports [22:47:29] duesen: I'll check it tomorrow, yep [22:47:32] it's highly visible, but fixes itself for individual cases eventually [22:47:59] Emperor: you can of course make an executive decision that it is not really UBN! as the service owner
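On the 404-vs-5xx point above ("it's a 404 if there's any exception while loading the file from Swift"): an illustrative sketch, not the real Thumbor plugin, of how a loader could map Swift failures to a status so that auth and transport errors surface as an uncacheable 5xx instead of a cacheable 404.

```python
"""Illustrative sketch only, not the actual Thumbor loader code."""
from swiftclient.exceptions import ClientException


def status_for_swift_failure(exc: ClientException) -> int:
    """Map a failed original fetch to the status the scaler should return."""
    # A real 404 from Swift means the original object does not exist,
    # so a (cacheable) 404 to the client is honest.
    if getattr(exc, "http_status", None) == 404:
        return 404
    # Auth failures ('Auth GET failed') and other transport errors say nothing
    # about the object itself, so surface them as a server-side error that the
    # CDN will not cache, instead of a 404 that sticks around for 10 minutes.
    return 503
```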
[22:48:07] we can easily reduce the 404 cap for just the upload cluster to attempt to reduce impact (or wait time) [22:48:26] but that may have unintended consequences and exacerbate whatever's going on underneath by causing more swift->thumbor reqs, too [22:49:04] in any case, it's just a hieradata one-liner if we want to do that [22:49:50] I see `x-cache: cp1078 miss, cp1076 hit/1` with the 304 response I'm getting for the last thumb url that Emperor shared (320px-The_Sky_is_Not_the_Limit%3B_There_Are_Footprints_on_the_Moon_(5932386).jpeg) [22:50:14] well, a hieradata one-liner for varnish, and a very trivial lua code edit for ATS, I guess. Either way, easy enough if it's worth it. [22:50:47] I see Request from [me] via cp3055 cp3055, Varnish XID 2665847 [22:50:47] Upstream caches: cp3055 int Error: 404, Not Found at Tue, 14 Mar 2023 22:50:07 GMT - that's a cached response, yes? [22:51:19] bblack: might be worth reducing it a bit, but probably at a time when we have more SREs online [22:51:42] bd808: I'm definitely not the thumbor service owner ;p [22:51:43] note that thumbor can also reduce it [22:51:51] that might be even better, if it's easy enough [22:52:31] (the cache layers generally obey the TTLs from the origins on things like 404. We only explicitly cap it to 10m to guard against a silly origin issuing a 404 with TTL=1d on something that might start to exist shortly after) [22:52:35] bblack: among the "joy is unconfined in the lower bound" things here is that h.nowlan has been trying to deploy thumbor-on-k8s, but whenever he does we see a huge spike in 5xxs, so we're still running the venerable thumbor-on-metal [22:53:04] but we're not raising it either [22:53:07] I can recreate a 404 for The_Sky_is_Not_the_Limit by changing the thumb size to "321", and now multiple reloads are returning the cached 404. [22:53:35] if thumbor returned a 404 with TTL=60 or whatever, we'd honor it. [22:53:43] bd808: so we expect if you wait 10 minutes it'll then work [22:53:58] bblack: Mmm, I'm just wary of how easy it is to make such changes to thumbor [22:54:01] or, probably, it should be returning a 5xx, in which case we wouldn't cache it at all, and probably some things would be melting harder too :) [22:54:17] yes, I would rather thumbor returned 5xx on AUTH failures. [22:54:20] I first hit -321 at 22:48 [22:54:40] I would also rather it cached its sessions and that swift's auth server logged anything [22:54:59] does it just make a new session for every original fetch or something? [22:55:46] bblack: I don't think it's that bad (we did a key rollover without everything catching fire), but the rate at which it logs Auth failures makes me think it's not caching sessions as hard as it could/should [22:56:11] I really don't want to get nerd-sniped into actually reading thumbor code :P [22:56:24] But from the swift side I have no visibility of what it does auth-wise [22:56:36] bblack: I have been avoiding that for like 20 minutes now :) [22:56:59] and cursing Gilles under my breath [22:57:13] Thumbor reuses a swiftclient.client.Connection, I have no idea what swiftclient does [22:57:50] Anyhow, I'm going to bow out now since it's late enough that I'm unlikely to add anything productive, and I do need to sleep [22:58:05] Thanks for the assistance, folks [22:58:15] this would be a good hackathon project: find a better way to do thumbs :P [22:58:33] we have it. MediaWiki.
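A tiny illustration of the negative-caching rule described above (the edge honours the origin's TTL on a 404 but caps it at roughly 10 minutes). The cap value and the Cache-Control parsing here are simplified assumptions, not the real Varnish VCL or ATS Lua.

```python
"""Toy illustration of the negative-caching behaviour described above."""
import re

NEGATIVE_CAP_S = 600  # the ~10 minute cap on cached 404s mentioned above


def effective_404_ttl(cache_control: str) -> int:
    """How long the edge would keep a 404, given the origin's Cache-Control."""
    match = re.search(r"max-age=(\d+)", cache_control or "")
    origin_ttl = int(match.group(1)) if match else NEGATIVE_CAP_S
    return min(origin_ttl, NEGATIVE_CAP_S)


# A Thumbor 404 sent with "max-age=60" would only be cached for a minute,
# while "max-age=86400" would still be capped at 10 minutes.
print(effective_404_ttl("max-age=60"), effective_404_ttl("max-age=86400"))
```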
[22:58:49] There is a ticket open about thumbs ATM, T331138 [22:58:49] T331138: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 [22:59:09] (I need to have another go at arguing for getting thumb generation out of the swift middleware) [22:59:55] all the theories about Thumbor being a better path died when Gilles had to implement custom handlers for every media type to maintain an exact match to MediaWiki's thumb generation quirks. [23:00:26] :) [23:02:06] we could also ask why we're even contemplating these questions to this depth in an SRE channel, but this gets back to the ownership/responsibility question at the bottom of every such well. [23:02:48] The 404 handler logic is needed somewhere in the CDN stack. If it moves out of Swift it would have to go into ATS or be turned into an intermediate between ATS and Swift. [23:02:49] it was thrown over the wall, and thus we must catch it [23:03:26] we could probably implement the 404 stuff in ATS Lua code without too much fuss, but the whole architecture of the 404->thumb thing seems fundamentally unideal [23:04:00] * bd808 looks for a time machine [23:04:17] as long as it's not a hot tub :P [23:04:42] Wikimedia wikis have had thumb-on-404 generation since... only Tim would know, but a damn long time. [23:05:11] I mean, I get it (don't find thumb, need to make one), I just don't see why the process rises up to HTTP semantics. [23:05:41] (and crosses boundaries between different services) [23:06:39] y'all with your caches and services caused that part. It's generally implemented in Apache rules right next to the PHP module. Same service all the way down [23:08:11] is that because apache was serving thumbnails from the filesystem, so it had to rise up through apache logic? [23:09:48] bblack: yeah. generally it's a rewrite on !-f -- https://www.mediawiki.org/wiki/Manual:Thumb_handler.php#Apache [23:10:01] but anyway, even if we theorize an ideal thumbs service (or a subset of mediawiki as one big service)... let's say there's one HTTP endpoint the edge caches reach out to to fetch thumbs. [23:10:13] and internally, it handles all the details: it owns its own storage/cache, and it generates missing ones. [23:10:49] the more fundamental problem is that for scalability, we don't want user requests to ever take "a long time" anywhere in the stack, as it can tie up resources at all the other layers on the way. [23:11:07] but sometimes the thumb doesn't exist yet, and sometimes it can take a long time to generate (or become a halting problem or whatever). [23:11:39] that's not really different from any cache miss though [23:11:40] I would argue that we shouldn't even be offering a service that auto-generates random thumb sizes on the whim of the URL contents of a random GET request in the first place [23:12:23] let editors who want new sizes use some other interface to get them generated+stored at /that/ time. read requests should only be looking for extant thumbnails or 404-ing because they were never explicitly created. [23:12:50] that is how it was originally done; turns out that works less well [23:13:07] maybe for the users, but this doesn't work well operationally on a fundamental level [23:13:17] wikis are flexible. if you could only ever have a fixed set of image sizes for any media thumbnail, you'd be taking away a lot of content creation freedom [23:13:33] throw in srcset these days and it's much worse [23:14:02] I'm not arguing for fixed sizes. You want a 321px variant for your article, go for it. But someone has to decide that explicitly and submit a form more like an editing interface. Not generated on the whim of a random anonymous GET
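For reference, the thumb-on-404 hand-off described earlier in this exchange (Swift's 404 handler asking Thumbor and storing the result) reduces to something like the sketch below. The three callables are hypothetical stand-ins; in production this logic lives inside Swift middleware, not in a free function like this.

```python
"""Architecture sketch of the thumb-on-404 hand-off, with hypothetical helpers."""
from typing import Callable, Optional, Tuple


def serve_thumb(
    thumb_path: str,
    fetch_thumb: Callable[[str], Optional[bytes]],     # look in thumb storage
    render_thumb: Callable[[str], Tuple[int, bytes]],  # ask the scaler on a miss
    store_thumb: Callable[[str, bytes], None],         # persist the rendered thumb
) -> Tuple[int, bytes]:
    body = fetch_thumb(thumb_path)
    if body is not None:
        return 200, body                   # plain hit, no scaler involved

    status, body = render_thumb(thumb_path)
    if status == 200:
        store_thumb(thumb_path, body)      # the next reader gets a normal hit
        return 201, body                   # the 404 -> 201 pattern seen in the logs

    # Whatever the scaler answered (404, 5xx, ...) goes back to the client,
    # and the edge will cache a 404 for up to ~10 minutes.
    return status, body
```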
[23:14:39] anyone can edit, so taking anons out of the loop is not trivial either [23:14:44] because a fully-flexible, public, anonymous interface that can ask to do any transform on any multimedia is about as scalable and operable as .... [23:14:47] I dunno, WDQS [23:15:11] it's not fundamentally suitable for high-volume low-latency anonymous reader GET traffic [23:15:37] any transform is a frequent feature request. Thumbor only does resizing and format changes [23:15:44] bblack: I don't disagree, and yet we make it work with a lot of nines. :) [23:16:14] ooh, I forgot about webp too. That increases the number of thumbnails that would need to be pregenerated too [23:16:16] yeah, editors would love crops, flops, rotations to also be possible [23:16:20] AntiComposite: yeah, but that includes some odd things, like jpegs of pdfs, and jpegs of mpegs, etc [23:16:28] and the whole djvu thing /shudder [23:17:18] a general-purpose image transform that works reliably and quickly is like... you might as well let them put random Lua code in the URL and execute it on command, in terms of how good an idea it is [23:17:26] (again, on anonymous/random GET) [23:17:46] * bd808 looks over at the abstractwiki roadmap and giggles [23:17:59] yeah... [23:18:31] there's a fundamental conflict between the needs of the editor community (all this full flexibility and might, which is costly and terribly difficult, but ok at normal volumes) [23:18:47] and the needs of the reader interface (high-volume reading by random anons and bots and whatever else) [23:19:11] a lot of conflict comes from trying to scale and operate the latter while still supporting the former, all through the same "interfaces" at many levels [23:20:03] and "GET /please/dos/my/server/3210000px-abcdefg.mpeg.jpeg.tiff.png" or whatever is a prime example [23:20:09] oh, I forgot that the mobile apps use their own thumbnail sizes, you'd have to pregenerate those too [23:20:25] The answer to the swift auth caching question is that it happens as a global in the thumbor process -- https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/thumbor-plugins/+/refs/heads/master/wikimedia_thumbor/loader/swift/__init__.py#44 [23:24:33] I guess what I'm saying above is that I wish we had different interfaces for the two crowds, which separate all the way out to the URL layer or maybe even the hostname layer publicly. [23:24:58] I think one argument against it in the past was that this would increase friction for readers to become editors [23:25:14] but surely that can be papered over with a little UI magic independent of the technical separation somehow
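The "global in the thumbor process" answer above boils down to caching one authenticated Swift connection per worker. A sketch of that idiom follows; it is not a copy of the linked loader, and the credentials are placeholders.

```python
"""Sketch of the 'auth state as a process-level global' idiom referenced above."""
from swiftclient.client import Connection

_swift_connection = None  # shared by every request handled by this worker process


def get_swift_connection(authurl: str, user: str, key: str) -> Connection:
    """Create the Swift connection once and keep reusing its auth token."""
    global _swift_connection
    if _swift_connection is None:
        _swift_connection = Connection(authurl=authurl, user=user, key=key, retries=3)
    return _swift_connection
```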