[09:36:15] Hi folks!
[09:36:27] I have opened https://gerrit.wikimedia.org/r/c/operations/puppet/+/933866 to add the config for ores-legacy.wikimedia.org
[09:36:35] never done it before, lemme know if it works
[09:36:50] the plan is to create the DNS change after this one is reviewed/merged etc..
[09:42:49] elukey: why would you create the DNS after and not before?
[09:43:09] and by DNS are you referring to the wikimedia.org one or the discovery.wmnet one?
[09:44:22] vgutierrez: o/ the .wikimedia.org one, I thought it was going to be needed only after ats/varnish knew about it
[09:44:33] yep, that makes sense
[09:45:03] https://gerrit.wikimedia.org/r/c/operations/dns/+/933869 :)
[09:46:41] ahhh snap I didn't add the SAN!
[09:46:47] stupid me
[09:46:59] I'll also add the ores.wikimedia.org one so that in the future the CNAME works
[09:47:02] consider adding ores.wm.o as well if you plan to unite them at some point
[09:47:02] :)
[09:47:20] yep yep you are totally right, didn't think about it
[10:03:46] fixed :)
[10:10:25] https://www.irccloud.com/pastebin/ofMcozk9/
[10:10:35] you might need to fix the istio-envoy config
[10:10:45] with Host: ores-legacy.wm.o is returning a 404
[10:10:49] a 200 without the header
[10:12:26] nice, this is new for me, I think something is missing indeed
[10:12:38] (first service like this on lift wing)
[10:12:53] look at you with all your fancy services ;P
[10:13:18] * vgutierrez needs a "back in my day" meme
[10:18:38] :D
[10:18:57] I'd argue that those services are everything but fancy
[10:19:23] hnowlan: same as with the other one.. let's disable puppet on A:cp-text and test on a single node :)
[10:20:35] vgutierrez: great! I'll use cp2037 again
[10:23:40] puppet is disabled, just merged
[10:23:54] ack
[10:24:08] enabling puppet on cp2037
[10:27:08] hnowlan: ok... your script is working and request ends targeting rest-gateway.discovery.wmnet
[10:27:17] I'm getting a 404 though
[10:27:35] Yeah, most likely a config error on rest-gateway
[10:30:50] hnowlan: are you able to confirm that rb-mw-mangling is still working as expected?
[10:31:17] aka it's hitting the right URL on the backend service
[10:35:04] vgutierrez: trying to confirm that now - requests from ats are arriving as "/en.wikipedia.org/v1/page/pdf/Mountain" but the gateway is configured for /v1/pdf. Probably an error on the gateway side.
[10:35:17] are we okay to hold in our current state for a few minutes while I debug?
[10:38:25] nope
[10:38:28] let's rollback
[10:38:38] and please set a beta environment for this :)
[10:38:48] Date:2023-06-28 Time:10:37:26 ConnAttempts:3 ConnReuse:0 TTFetchHeaders:-1 ClientTTFB:6 CacheReadTime:0 CacheWriteTime:0 TotalSMTime:6 TotalPluginTime:0 ActivePluginTime:0 TotalTime:6 OriginServer:appservers-ro.discovery.wmnet OriginServerTime:-1 CacheResultCode:ERR_CONNECT_FAIL CacheWriteResult:- ReqMethod:GET RespStatus:502 OriginStatus:000 ReqURL:http://127.0.0.1:3128/api/rest_v1/page/summary/Ade_Laoye?vgutierrez=1
[10:38:48] ReqHeader:User-Agent:curl/7.74.0 ReqHeader:Host:127.0.0.1:3128 ReqHeader:X-Client-IP:- ReqHeader:Cookie:- BerespHeader:Set-Cookie:- BerespHeader:Cache-Control:- BerespHeader:Connection:- RespHeader:X-Cache-Int:cp2037 miss RespHeader:Backend-Timing:-
[10:38:56] btw I'm running the haproxy upgrade cookbook on codfw upload cluster
[10:39:03] fabfur: :_)
[10:39:14] fabfur: you wanna stop that
[10:39:47] at the moment I had just an error on cp2027 for puppet is disabled
[10:40:10] that is on the text cluster
[10:40:19] okay, revert here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/933633
[10:40:19] hnowlan: not a big deal sorry
[10:40:28] so the text cluster will be upgraded another time
[10:40:30] hnowlan: the 502 was a problem between the keyboard and the chir
[10:40:35] *chair
[10:40:35] vgutierrez: phew!
[10:41:01] (malformed host header)
[10:41:03] waiting on verification from restbase devs that proton url patterns are correct
[10:41:12] they should be ok
[10:41:22] cause rb-mw-mangling is working as expected for other restbase services
[10:41:43] (validating with curl -H 'Host: en.wikipedia.org' -H 'X-Forwarded-Proto: https' http://127.0.0.1:3128/api/rest_v1/page/summary/Ade_Laoye?vgutierrez=1 -v -o /dev/null)
[10:44:31] fabfur: where did it fail on text? which host?
[10:44:37] and what's the current state of that host?
[10:44:46] cp2027
[10:44:49] (fixed ores-legacy's envoy config)
[10:45:15] from the puppet log :
[10:45:17] https://www.irccloud.com/pastebin/SxrhjmJd/
[10:45:37] fabfur: so the node is currently depooled and with HAProxy misconfigured
[10:45:50] so we're unable to repool it till hnowlan finishes :)
[10:46:06] no prob, I'll work on the upload cluster in the meantime
[10:53:20] elukey: +1ed, please hold till hnowlan finishes :)
[10:57:36] be done in a minute - problems with the mapping from rb-url to service url on the gateway side
[10:57:44] doesn't seem like there's any issues with the ATS/lua side of things
[11:00:36] still getting 404s after your last deploy :)
[11:07:03] making one last attempt and then we'll roll back.
[11:07:47] hnowlan: ack
[11:08:00] curious about why curl-ing directly works but via ATS doesn't
[11:08:18] of course you'll have some extra headers like XFF and X-Client-IP
[11:11:31] which curl are you seeing that works directly?
[11:12:00] The problem I'm seeing is that restbase adds "page" to the URL, which proton doesn't like. Even when adding a rewrite for that though I'm not seeing a correct response
[11:12:43] oh right
[11:12:55] the /page/ part isn't there in your examples that we tested the other day
[11:13:26] yeah, restbase is adding that :( surprise
[11:15:59] okay, I think I need to go back to the team to figure this out. I'm pleased the lua stuff works though
[11:16:18] good
[11:16:23] let's revert it now then
[11:16:25] revert here https://gerrit.wikimedia.org/r/933633
[11:18:58] Puppet reenabled
[11:19:14] thanks for the time again!
[11:19:15] fabfur, elukey ^^
[11:21:09] hnowlan: rememeber that the traffic team is proudly fueled by ☕ and 🍺
[11:21:24] *remember
[11:22:08] * hnowlan makes a note
[11:23:14] But not mixed together
[11:24:04] not even coffee stouts?
[11:56:44] imperial coffee stout here please :)
[11:57:13] fabfur: please take care of cp2027, it's still depooled
[12:40:36] vgutierrez: I'll run the cookbook against cp2027
[12:47:27] ok cp2027 repooled
[12:47:39] I'll restart the cookbook against text@codfw
[12:47:58] hnowlan: is that ok for you?
[12:48:14] (the cookbook is for haproxy upgrade)
[12:53:28] yeah.. no blockers right now
[12:53:44] ack, proceeding
[12:58:49] vg: next dc after codfw?
[13:11:04] fabfur: esams & eqiad tomorrow if nothing burns first
[13:11:08] and drmrs of course
[13:12:14] ok
[13:23:24] Traffic, SRE, envoy, serviceops: Refactor envoy.filters.http.router and envoy.filters.listener.tls_inspector - https://phabricator.wikimedia.org/T337405 (JMeybohm) Open→Resolved
[13:23:33] Traffic, SRE, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm)
[13:33:04] fabfur: o/ are you doing any work on cp nodes? Otherwise I'll merge my change :)
[13:33:43] text@codfw was getting an HAProxy upgrade
[13:33:49] I finished w/ cp nodes on codfw, thanks!
[13:41:46] nice! ok to go then?
[13:42:26] yep
[13:42:34] for me is a go
[13:43:04] <3
[13:52:27] tried on a couple of nodes and it seems to work fine, I'll let puppet to do its work during the next half an hour
[13:52:34] then is https://gerrit.wikimedia.org/r/c/operations/dns/+/933869 ok to merge?
[13:53:00] elukey: ok by us :)
[13:53:21] sukhe: thank youuu
[14:04:06] fabfur: no problem for me
[15:43:59] vgutierrez: would we have time to try again? I believe we've fixed the URL pathing issues in the gateway. I know it's late in the day for you though so happy to try tomorrow.
[15:44:18] hnowlan: let's do that tomorrow morning please
[15:44:25] sounds good, thanks
[15:44:25] easier to handle potential issues that way
[15:57:47] netops, Infrastructure-Foundations, SRE: Connection errors from users on Vodafone DE (AS3209) [28.06.2023] - https://phabricator.wikimedia.org/T340670 (cmooney) p:Triage→Low