[06:53:14] hello folks [06:53:38] so I re-tested a "test" vk instance on cp4037, and pushing to jumbo works fine [06:53:59] so I am going to retry the upgrade of the node with the new patch, that uses the chained cert [06:54:15] ok for you? [06:54:24] fabfur: --^ o/ [06:54:50] I think yes, I'm working on a different host set! [06:55:00] for me is a go [06:55:27] super, fingers crossed [06:59:59] it seems to work now [07:00:23] 👍 [07:02:27] Valentin will probably kill me [07:02:45] why? [07:02:58] we spent hours yesterday diving deep into ssl etc.. [07:03:20] the only thing that I changed was to use the "chained" cert, but I thought we manually tested it [07:03:45] anyway, on stat1004 with kafkacat I see health check for varnish flowing to the webrequest_text topic [07:03:52] for cp4037, so varnishkafka works :) [07:05:35] also restarted the instances, all good [07:06:27] well, good news, isn't it? [07:07:54] elukey: nice :) [07:08:14] chain validation issues after all [07:08:18] it is yes, but I can't explain why it didn't work yesterday [07:08:25] sorry folks and thanks again for the help [07:08:29] np [07:08:34] repooling cp4037 now [07:11:58] 10Traffic, 10Data-Engineering, 10Data-Platform-SRE, 10SRE: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (10elukey) All vk instances running on cp4037, next steps: 1) Monitor cp4037 to verify that nothing explodes. 2) Extend the change to ulsfo and monitor. 3) Extend the change to a... [08:03:47] 10Traffic, 10SRE, 10Patch-For-Review: Confusing error message when making plaintext HTTP POST requests to Wikimedia sites - https://phabricator.wikimedia.org/T338481 (10Vgutierrez) 05Open→03Resolved a:03taavi [08:45:58] hello - currently platform are planning the rollout of the new device-analytics service. This service will replace the AQS service's /device-analytics/ path (and other dedicated services will eventually do the same for other paths). 
Would ye have any good ideas about how to switch traffic over? [08:46:41] the blunt approach is doing something like adding a config for that path specifically and routing all traffic to the new service, but I was wondering if there is a way to route a percentage of requests or similar [08:47:02] the APIs are identical between the new device-analytics service and the old AQS service btw [08:59:21] hnowlan: AQS is behind the CDN? [09:00:17] hmm that's /api/rest_v1/metrics/pageviews and friends, right? [09:01:39] * vgutierrez checking https://wikitech.wikimedia.org/wiki/Analytics/AQS/Devices_Analytics [09:02:07] vgutierrez: yep! [09:03:31] so for example the team is working on a geo-analytics service, a pageviews service etc also [09:05:02] hnowlan: it can be done in VCL by matching the requests and leveraging std.random [09:07:53] 10netops, 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10ayounsi) `name=IPv4,lang=json { "event_type": "purge", "tag2": 1, "as_src": 48551, "as_dst": 0, "comms": "", "as_path": ""... [09:08:49] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 2 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by aborrero@cumin2002 for hosts: `cloudservices2004-dev` - cloudservic... [09:09:06] vgutierrez: ah cool - so do something like match on the device-analytics path and also checking std.random(0, 100) < whatever_my_request_percentage_is? [09:10:09] yep.. just wondering how do you wanna redirect the traffic [09:10:42] 302 and let the UA hit the new endpoint? [09:12:05] Or should varnish rewrite the URL and hit the new endpoint? 
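[editor's note] The percentage-based selection discussed above (match the device-analytics path, then compare a random draw against the desired percentage) can be sketched outside VCL. This is a minimal Python illustration, not the actual Varnish configuration: the function name, the backend labels and the `/device-analytics/` prefix check are assumptions for the example, and the public URL prefix in production may differ.

```python
import random

# Hypothetical path prefix, taken from the discussion; the real public URL may differ.
DEVICE_ANALYTICS_PREFIX = "/device-analytics/"


def pick_backend(path, new_service_percentage, rng=random.random):
    """Return the backend label that should serve this request.

    Mirrors the idea of `std.random(0, 100) < percentage` from the chat:
    only requests on the device-analytics path are candidates, and of those
    roughly `new_service_percentage` percent go to the new service.
    """
    draw = rng() * 100  # uniform value in [0, 100)
    if path.startswith(DEVICE_ANALYTICS_PREFIX) and draw < new_service_percentage:
        return "device-analytics"  # hypothetical label for the new service
    return "aqs"  # hypothetical label for the existing service


if __name__ == "__main__":
    sample = [pick_backend("/device-analytics/unique-devices", 10) for _ in range(10000)]
    print(sample.count("device-analytics") / len(sample))  # close to 0.10
```

Because the draw is made per request, the same URL can be served by either service, which is why the cache needs to be split between the two backends while the rollout percentage is between 0 and 100.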
[09:12:24] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 2 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) [09:12:53] ideally varnish would rewrite - the external URLs should remain the same as far as the UA is concerned [09:20:57] hnowlan: we should split the cache somehow then [09:21:34] you don't want a [potential]bug in the new code polluting the cache [09:21:58] vgutierrez: oof, yep, good call [09:25:09] Do we have much prior art on doing that in puppet I could look up? I'm a little out of my depth :D [09:26:02] 10HTTPS, 10Traffic, 10Beta-Cluster-Infrastructure, 10SRE: upload.wikimedia.beta.wmflabs.org certificate expired (May 2023) - https://phabricator.wikimedia.org/T337642 (10AlexisJazz) 05Open→03Resolved a:03AlexisJazz Dunno how but it works again. [09:26:08] 10HTTPS, 10Traffic, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10AlexisJazz) [09:26:43] 10HTTPS, 10Traffic, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10AlexisJazz) 05Open→03Resolved [09:26:45] 10Acme-chief, 10User-bd808, 10User-dcaro, 10cloud-services-team (Kanban): acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP - https://phabricator.wikimedia.org/T273956 (10AlexisJazz) [09:41:48] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [09:56:08] While I'm bothering you, would it be okay if I move rest-gateway to production in LVS? 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/920667 [10:18:52] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) a:05aborrero→03Papaul hey @Papaul or @Jhancock.wm would you please do the following: * disconnect server eno1 from asw... [10:19:45] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp4046:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown [10:23:54] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10cmooney) @aborrero yep exactly, when this is done let me know I'll change the netbox side. [10:30:52] hnowlan: sorry.. contractor @ home :) [10:33:22] hnowlan: looks good, just mind the deployment windows.. one finishing in 28 minutes and another one starting in 2h27m [10:35:07] vgutierrez: ack, thank you! 
[10:46:23] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [11:20:15] going ahead with the LVS restarts for rest-gateway [11:25:30] using lvs1020 and lvs2014 as secondaries, will hit go if it's okay :) [11:40:37] done, doing primaries (lvs1019,lvs2013) [11:46:29] done :) [13:11:19] hnowlan: lovely [13:45:24] varnishkafkas on cp4037 look ok [13:45:24] https://grafana-rw.wikimedia.org/d/000000253/varnishkafka?forceLogin&from=now-12h&orgId=1&to=now&var-cp_cluster=All&var-datasource=ulsfo%20prometheus%2Fops&var-instance=cp4037&var-source=webrequest [13:52:27] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10Papaul) @aborrero can we move this to ge-0/0/11 and not ge-0/0/36? [14:03:46] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10Papaul) @cmooney is it possible when moving servers from asw to cloudsw try to connect that server on the interface that matches th... [14:10:10] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) >>! In T338778#8927696, @Papaul wrote: > @aborrero can we move this to ge-0/0/11 and not ge-0/0/36? yes, I'm fine with wh... [14:13:42] 10Traffic, 10SRE, 10serviceops, 10Datacenter-Switchover: Figure out what changes are needed in the traffic layer for having codfw be the r/w DC for half a year - https://phabricator.wikimedia.org/T337535 (10akosiaris) 05Open→03Resolved a:03akosiaris Makes sense. Added in [Phase 9 of the Switchover](h... 
[14:33:32] another query if anyone has time :) We want to start routing traffic away from restbase, going service by service. The rest-gateway is gonna handle the basics of what restbase does but for most services we're reducing the amount of moving parts greatly. If I wanted to route traffic on a URL-by-URL basis initially as we take stuff out of restbase what would be the best path? [14:33:52] I have written https://gerrit.wikimedia.org/r/c/operations/puppet/+/929674 with full awareness that it's probably not quite the right way to do things but I figured reusing the same approach might be a good start [14:36:30] hnowlan: looks good. the new service expects the very same mangling as restbase and the same path normalization? [14:37:29] hnowlan: see https://github.com/wikimedia/operations-puppet/blob/f926ab7e3effdc52518a743a6b49e6c1a143ad63/modules/profile/files/trafficserver/rb-mw-mangling.lua#L8-L13 [14:39:56] yeah it does - in time I plan to move the mangling to the gateway as best as is possible but for now we want to maintain parity [14:40:13] proton is very low traffic so we're not at all concerned about the cut-over [14:41:45] hnowlan: ack [14:53:13] 10netops, 10Infrastructure-Foundations, 10SRE, 10netbox, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) 05Open→03Resolved a:03cmooney Above patch implements the logic from the "Re-image cookbook changes" in... [15:03:36] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10cmooney) @Papaul why do we need to change? Easiest thing here is just move the cable from eno2 to eno1 on the server side, then rem...
[15:06:02] vgutierrez: from an ATS perspective, will having the two pattern matches for `target: 'http://(.*)/api/rest_v1/(.+)/pdf/(.*)'` and `target: 'http://(.*)/api/rest_v1'` work somewhat as expected? That is to say that we route proton to the rest gateway and the rest just goes to restbase [15:06:11] Now that I review that I actually see there is a problem with my regex to begin with [15:06:41] hnowlan: yep, as long as the most specific one comes first it should be OK [15:09:30] sweet, thank you [15:12:14] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10cmooney) >>! In T338778#8927735, @Papaul wrote: > @cmooney is it possible when moving servers from asw to cloudsw try to connect th... [15:31:19] hnowlan: also, if you have a deployment-prep instance of the new service we can test it there first :) [15:31:35] aka the beta cluster [15:36:57] 10netops, 10Infrastructure-Foundations, 10SRE: Peering: prefer primary IXP for direcly connected networks - https://phabricator.wikimedia.org/T338201 (10ayounsi) 05Open→03Resolved a:03ayounsi Tested in eqsin, traffic is now balanced more equally between all 3 IXPs. Same for ulsfo. [16:24:02] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10taavi) The current implementation on codfw1dev seems to have forgotten that the recursors need outbound access to the pu... 
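[editor's note] The point about rule ordering in the ATS exchange above ("as long as the most specific one comes first") can be illustrated with a small first-match router. This is a Python sketch under the assumption that rules are tried in order and the first matching pattern wins; it is not ATS configuration. The two patterns are quoted as written in the chat (including whatever regex problem hnowlan noticed), and the backend labels and example URLs are hypothetical.

```python
import re

# Ordered rules: the more specific proton (PDF) pattern is listed before the
# general rest_v1 pattern. Backend labels are illustrative only.
RULES = [
    (re.compile(r"http://(.*)/api/rest_v1/(.+)/pdf/(.*)"), "rest-gateway"),
    (re.compile(r"http://(.*)/api/rest_v1"), "restbase"),
]


def route(url, rules=RULES):
    """Return the backend of the first rule whose pattern matches the URL."""
    for pattern, backend in rules:
        if pattern.match(url):
            return backend
    return None  # no rule applies


if __name__ == "__main__":
    pdf_url = "http://en.wikipedia.org/api/rest_v1/page/pdf/Foo"
    html_url = "http://en.wikipedia.org/api/rest_v1/page/html/Foo"
    print(route(pdf_url))   # rest-gateway
    print(route(html_url))  # restbase
    # With the order reversed, the general rule shadows the specific one.
    print(route(pdf_url, list(reversed(RULES))))  # restbase
```

The reversed case shows why ordering matters: every PDF URL also matches the general `/api/rest_v1` pattern, so the specific rule is only reachable when it is evaluated first.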
[16:32:46] 10netops, 10Infrastructure-Foundations, 10SRE, 10Epic: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (10aborrero) [16:32:54] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) 05Stalled→03Open [16:43:01] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10cmooney) >>! In T307357#8928378, @taavi wrote: > The current implementation on codfw1dev seems to have forgotten that th... [16:56:15] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10cmooney) @aborrero I discussed the idea of a [[ https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanc... 
[18:22:45] (VarnishHighThreadCount) firing: (8) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [18:27:44] (VarnishHighThreadCount) firing: (9) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [18:32:44] (VarnishHighThreadCount) firing: (10) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [18:37:44] (VarnishHighThreadCount) firing: (16) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [18:47:44] (VarnishHighThreadCount) firing: (15) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [18:52:44] (VarnishHighThreadCount) firing: (14) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [18:57:45] (VarnishHighThreadCount) resolved: (8) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [21:36:52] 10Domains, 10Traffic, 10DNS, 10SRE: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10NMariano-WMF) Hi, @Dzahn my apologies for the delay. It took me a while to get in touch with comms and I was OOO all last week. Here is a general overview document of the project... 
[21:38:10] 10Domains, 10Traffic, 10DNS, 10SRE: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10NMariano-WMF) Comms also let me know that they are ok with using social.wikimedia.org as requested by @Legoktm [21:39:57] 10Domains, 10Traffic, 10DNS, 10SRE: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10BCornwall) a:05BCornwall→03None [21:47:20] 10Domains, 10Traffic, 10DNS, 10SRE: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10Dzahn) Thank you for the link, @NMariano-WMF . It's appreciated. Though I am not sure I see technical things on where and how to host it in there? I would like to suggest what L... [21:49:56] 10Domains, 10Traffic, 10DNS, 10SRE: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10BCornwall) Yeah, the document is pretty barren. It sounds like there needs to be a little bit more planning! [22:12:59] 10Domains, 10Traffic, 10DNS, 10SRE: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10Legoktm) >>! In T337586#8929576, @NMariano-WMF wrote: > Hi, @Dzahn my apologies for the delay. It took me a while to get in touch with comms and I was OOO all last week. Here is...