[06:13:42] Traffic, MW-on-K8s, SRE, Wikidata, and 2 others: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 (Joe) @Jdforrester-WMF no, this task is actually about that patch not having the effect we expected.
[06:15:32] Traffic, MW-on-K8s, SRE, Wikidata, and 2 others: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 (Joe) Interestingly, I do get correct results for m.wikidata.org, but somehow not for www.wikidata.org (also, please grep for `mw-web` as we've repooled eqiad in the...
[06:27:29] <_joe_> vgutierrez: can I restart trafficserver on a cp host in eqiad?
[06:28:18] <_joe_> I'm trying to investigate T347493
[06:28:19] T347493: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493
[06:34:35] Traffic, Wikifunctions, serviceops: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (Joe)
[06:34:55] Traffic, Wikifunctions, serviceops: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (Joe) p:Triage→High
[06:44:35] sure _joe_
[06:45:03] <_joe_> ack thanks
[06:45:27] <_joe_> will do in a few :)
[07:23:35] Traffic, Abstract Wikipedia team, SRE, Wikifunctions, serviceops: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (JMeybohm)
[07:33:04] Traffic, MW-on-K8s, SRE, Wikidata, and 2 others: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 (Joe) I tried restarting ATS on a backend, cp1081, then made requests for wikidata's special:random to trafficserver directly: still all going to appservers on bare m...
[07:43:21] Traffic, MW-on-K8s, SRE, Wikidata, and 3 others: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 (Joe) Well turns out the issue was simpler: we even had a TODO in the code: ` # TODO: add mw-on-k8s once we think of moving wikidata or partial traffic. ` Sigh. Tha...
[07:50:42] Traffic, Abstract Wikipedia team, SRE, Wikifunctions, serviceops: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (JMeybohm)
[09:13:14] Traffic, Abstract Wikipedia team, SRE, Wikifunctions, serviceops: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (JMeybohm) a:JMeybohm
[09:54:17] Traffic, Content-Transform-Team-WIP, RESTBase, RESTBase Sunsetting, and 6 others: PCS caching and pregeneration when restbase is decommissioned - https://phabricator.wikimedia.org/T319365 (jijiki)
[09:59:37] <_joe_> fabfur: I just ran puppet on cp-text, so your change to purging is live
[10:00:11] if you didn't restart varnish it's fine
[10:00:33] (or if you didn't repool cp4037 for some reason)
[10:00:53] <_joe_> fabfur: I didn't, but assumed puppet would
[10:01:07] the change is spread to all cp hosts but not really applied until varnish restart
[10:01:12] <_joe_> refresh of Exec[load-new-vcl-file-frontend]
[10:01:31] no, puppet doesn't restart varnish on VCL (or even systemd unit) change
[10:01:38] <_joe_> right
[10:01:44] <_joe_> but the vcl change is going live
[10:01:52] otherwise I would be sweating a lot more
[10:01:53] <_joe_> with puppet
[10:02:02] <_joe_> fabfur: I think you're missing my point
[10:02:09] <_joe_> your change included a vcl change
[10:02:11] <_joe_> which is live
[10:02:15] <_joe_> and a cli arg change
[10:02:17] <_joe_> which isn't
[10:02:28] <_joe_> vgutierrez: am I missing something?
[10:02:32] mmmm are you sure ?
[10:02:39] <_joe_> I think right now varnish is refusing all purges, probably
[10:02:56] uh?
[10:03:19] fabfur change still allows PURGEs from 127.0.0.1
[10:03:54] <_joe_> ah yeah sorry, the puppet output doesn't help
[10:03:59] <_joe_> via cumin, I'd add
[10:04:08] quick check with varnishlog says that PURGEs are still being allowed
[10:04:18] <_joe_> yeah I also checked :D
[10:04:24] :)
[10:04:31] * vgutierrez back to his 1:1 with kwakuofori
[10:04:41] <_joe_> I'll blame cumin :P
[10:05:20] yeah, was checking the same (only a bit slower than you) and yep confirm that purge requests are answered correctly
[10:10:20] :-P
[10:15:56] Traffic, MW-on-K8s, SRE, Wikidata, and 2 others: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 (Lucas_Werkmeister_WMDE) Seems to be working now, thanks a lot for fixing it! `lang=shell-session $ for i in {1..100}; do curl -sIH 'User-Agent: test-Iebdc15b19b (lu...
[10:17:19] Traffic, MW-on-K8s, SRE, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[10:17:39] Traffic, MW-on-K8s, SRE, Wikidata, and 2 others: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 (Clement_Goubert) Open→Resolved a:Clement_Goubert Yes, confirmed now working. Resolving.
[11:18:59] hey vgutierrez - I'm afraid I have yet another service to gateway-ify. https://gerrit.wikimedia.org/r/c/operations/puppet/+/956909
[11:19:05] external URLs look like https://wikimedia.org/api/rest_v1/metrics/mediarequests/aggregate/all-referers/all-media-types/all-agents/daily/20200101/20200121
[11:19:25] internal tests look like `curl -H "Host: wikimedia.org" https://rest-gateway.discovery.wmnet:4113/wikimedia.org/v1/metrics/mediarequests/aggregate/all-referers/all-media-types/all-agents/daily/20200101/20200121`
[11:19:29] lemme know if today would suit :)
[11:20:04] fabfur: ^^ that one is for you:)
[11:20:41] hnowlan: yeah..
we will take care of it during this afternoon (dentist appointment soon)
[11:21:03] check
[11:22:33] sounds good :)
[13:28:03] \o I am starting the process of turning down LVS/pybal for ORES. Who can I coordinate the steps from https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service with?
[13:30:45] klausman: happy to help
[13:30:55] Thank you!
[13:30:57] do you have some particular questions or just a heads-up?
[13:31:07] DNS change here: https://gerrit.wikimedia.org/r/c/operations/dns/+/961802
[13:31:29] assuming you have silenced the probes?
[13:31:38] looking at the patch!
[13:31:43] The Silence is in (I think. If I got it wrong, Luca already disabled paging for ORES, so at worst, we pollute the AM webui)
[13:32:06] Silence ID is c6d0a4d7-11ce-495f-b394-c21399027bba
[13:32:49] I am unsure about the instance: field, since I couldn't find an example
[13:34:51] Merging DNS change and updating authdns
[13:34:56] ok!
[13:37:31] https://gerrit.wikimedia.org/r/c/operations/puppet/+/961799 Change for state: lvs_turndown
[13:37:39] er lvs_setup :D
[13:39:57] merging that and running rpa on authdns servers
[13:40:16] thanks!
A:dns-auth or A:dns-rec, both fine
[13:40:46] the docs say -auth, so I c&p'd that :)
[13:40:52] yep all good
[13:50:33] https://gerrit.wikimedia.org/r/c/operations/puppet/+/961805 is the switch to service_setup and deletion of the services stanza
[13:51:44] looking
[13:53:20] merging and running rpa on O:lvs::balancer
[13:54:51] klausman: LVS low-traffic backup is lvs1020 in eqiad and lvs2014 in codfw
[13:54:54] for the next steps
[13:55:10] thank you, will ping here before proceeding
[13:57:12] restarting pybal on lvs2014 in a few seconds unless you stop me :)
[13:58:26] Restarted, waiting for Klaxons
[13:59:15] all good, go for it :)
[13:59:46] No klaxons, now doing lvs1020
[13:59:57] remember to log :)
[14:00:03] oh, right
[14:01:27] > Run a test
[14:01:29] > Example http://eventgate-analytics.svc.eqiad.wmnet:31192/_info
[14:01:43] what is this test meant for? Checking if there is no answer/other error?
[14:08:12] I would have imagined this would make sense for when you are adding a service instead
[14:09:27] Yeah, may be a c&p leftover
[14:10:15] Ok to restart pybal on active servers?
[14:10:35] (and if so: which are those? :D)
[14:11:34] lvs1019 in eqiad
[14:11:41] and lvs2013 in codfw
[14:11:42] and go ahead
[14:13:27] Restarting on 1019
[14:14:43] and restarting on 2013
[14:16:14] Ok to run `sudo ipvsadm --delete-service --tcp-service ores.svc.eqiad.wmnet:https` on 1019?
[14:17:53] :443 instead?
[14:18:08] ok, wasn't sure if service names also worked :)
[14:18:31] Service deleted on 1019, now doing 2013
[14:18:45] ok :)
[14:19:15] and done
[14:19:52] The last step says to run rpa on "the backends". What backends? The ORES hosts? something else?
[14:20:03] yep
[14:20:34] doing that now
[14:22:03] https://phabricator.wikimedia.org/P52724 Got this error, investigating
[14:22:15] hmm
[14:24:31] maybe let's complete the final removal?
[14:26:03] +1
[14:26:14] we (ML) can follow up on the ores nodes later
[14:26:44] Aye.
[14:26:52] preparing change for final step
[14:30:26] https://gerrit.wikimedia.org/r/c/operations/puppet/+/961791 is ready for review
[14:32:21] klausman: left a comment about removal from conftool-data as well
[14:32:26] merci!
[14:32:40] also added another LVS role leftover (which is likely what breaks puppet runs)
[14:32:55] yeah good catch, that should take care of it
[14:34:14] it was all Luca who did the catching :)
[14:34:25] Ok, merging change and retrying the rpa run
[14:37:20] rpa run successful! And we are done. Thanks for the help, Sukhbir, appreciate it!
[14:37:55] nice, and np, hth!
[14:38:01] you did the hard work anyway :)
[14:38:23] It's easier when you don't have to second-guess yourself a million times along the way :)
[14:39:59] https://alerts.wikimedia.org/?q=alertname%3DPyBal%20IPVS%20diff%20check&q=team%3Dsre&q=%40receiver%3Dirc-spam <- this is still open, anything I need to do or will it converge by itself?
[14:41:26] I will check
[14:44:24] Oh, I may have forgotten to delete the services on the slaves %-)
[14:44:48] Doing that now
[14:45:17] and done
[14:47:04] klausman: forced a recheck, all green now
[14:50:10] merci!
[15:03:59] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[15:05:35] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[15:17:37] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[15:36:56] Traffic, WMF-Legal, Patch-For-Review, Privacy: Add no-transform to Cache-Control header - https://phabricator.wikimedia.org/T218618 (BCornwall) The scope of this ticket's decision-making goes well beyond the Traffic team. As such, we need more cross-functional input before we can merge this patch...
[15:56:18] thanks to fabfur https://gerrit.wikimedia.org/r/c/operations/puppet/+/956909 is ready to roll out - I'll do the usual puppet/depooling cp2037 dance while I do it and keep things updated here
[16:02:49] puppet stopped, change merged
[16:08:38] change is live on cp2037, looking okay to me
[16:12:31] hnowlan: I'm about to restart all varnish instances on codfw
[16:12:47] they will be depooled, restarted (varnish service), and repooled
[16:12:58] fabfur: I was just about to reenable puppet, should I wait or would it be easier to finish up before you do that?
[16:13:04] this is for another task and shouldn't impact but wanted to share
[16:13:08] no prob, reenable puppet
[16:13:27] I'll launch the command later (mine is a loooong task)
[16:14:08] ack :)
[16:14:13] reenabling in that case!
[16:16:00] Traffic, SRE: Varnish should allow PURGE requests only from socket (purged) - https://phabricator.wikimedia.org/T347192 (Fabfur) All cp hosts restarted in ulsfo, change is actually applied, no issues so far. Proceeding with other DCs
[16:16:44] hnowlan: let me know when you're ok
[16:20:18] fabfur: done!
[16:20:32] great!
[16:20:36] proceed with the restart
[16:31:44] (VarnishHighThreadCount) firing: (8) Varnish's thread count on cp2027:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[16:33:18] ^^ that's one of the varnish instances I've just restarted
[16:34:22] Traffic, Abstract Wikipedia team, SRE, Wikifunctions, serviceops: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (Jdforrester-WMF) Open→In progress
[16:36:41] (LVSHighCPU) firing: (8) The host lvs3008:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs3008 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[16:36:44] (VarnishHighThreadCount) firing: (32) Varnish's thread count on cp2027:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[16:41:41] (LVSHighCPU) resolved: (8) The host lvs3008:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs3008 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[16:46:44] (VarnishHighThreadCount) firing: (34) Varnish's thread count on cp2027:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[16:51:44] (VarnishHighThreadCount) firing: (35) Varnish's thread count on cp2027:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[16:56:45] (VarnishHighThreadCount) firing: (34) Varnish's thread count on cp2027:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:06:44] (VarnishHighThreadCount) firing: (31) Varnish's thread count on cp2027:0 is high -
https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:11:44] (VarnishHighThreadCount) firing: (29) Varnish's thread count on cp2027:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:21:45] (VarnishHighThreadCount) firing: (29) Varnish's thread count on cp2027:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:26:45] (VarnishHighThreadCount) firing: (49) Varnish's thread count on cp2027:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:31:45] (VarnishHighThreadCount) firing: (48) Varnish's thread count on cp2027:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:41:45] (VarnishHighThreadCount) firing: (46) Varnish's thread count on cp2027:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:46:44] (VarnishHighThreadCount) resolved: (24) Varnish's thread count on cp2027:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[19:00:19] Traffic: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (BCornwall)
[19:04:58] Traffic: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (BCornwall) Open→In progress
[19:06:19] Traffic, GitLab (Project Migration): Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (Aklapper)
[19:07:09] Traffic, GitLab (Project Migration): Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (Aklapper)
[19:07:35] Traffic, GitLab (Project
Migration): Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (Aklapper) Note that there is some potential overlap with {T341504}
[19:09:42] Traffic, GitLab (Project Migration): Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (BCornwall)
[19:11:10] Traffic, GitLab (Project Migration): Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (BCornwall) Ugh, not fond of mega-tickets. I can move it over to there if that's clearer/more useful though.
[19:13:56] Traffic, GitLab (Project Migration): Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (BCornwall) Oh, I see, it's broken down by namespace. Hm, I think I'll keep this one around since this is an effort specifically for traffic and it mixes a whole bunch of th...
[21:16:56] Acme-chief, cloud-services-team, IPv6, Patch-For-Review: tools-acme-chief-01 is attempting to validate DNS challenge against cloud authdns IPv6 addresses - https://phabricator.wikimedia.org/T245937 (CodeReviewBot) brett opened https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/1...
[21:31:33] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (cmooney)
[21:31:41] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[21:32:38] netops, Infrastructure-Foundations, SRE, ops-codfw: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (cmooney)
[21:32:46] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[21:36:30] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (CodeReviewBot) brett opened https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/2 Draft: BCornwall's avatar Release 0.36-2 for Bookworm
[21:37:23] Acme-chief, Patch-For-Review: acme-chief calls unnecessarily to ACMEChief._push_live_certificates() on daemon start - https://phabricator.wikimedia.org/T218543 (CodeReviewBot) brett opened https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/2 Draft: BCornwall's avatar Release 0.36-2 for B...
[21:37:37] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (CodeReviewBot) brett opened https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/3 Draft: Update dependencies to match Bookworm versions
[21:37:50] Acme-chief, Traffic, SRE, Patch-For-Review: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (CodeReviewBot) brett opened https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/3 Draft: Update dependencies to match Bookworm versions