[07:39:29] serviceops, conftool: Move conftool to gitlab, turn on deb package auto-generation - https://phabricator.wikimedia.org/T369594 (Joe) NEW
[07:51:59] serviceops, Data-Platform-SRE, Wikidata, Wikidata-Query-Service, wmde-wikidata-tech: Use Envoy instead of LVS to route internal federation traffic for WDQS - https://phabricator.wikimedia.org/T368972#9964174 (Gehel) Note that the throttling question is orthogonal to the LVS vs Envoy question....
[07:57:19] serviceops, Data-Platform-SRE, Wikidata, Wikidata-Query-Service, wmde-wikidata-tech: Use Envoy instead of LVS to route internal federation traffic for WDQS - https://phabricator.wikimedia.org/T368972#9964183 (Gehel) It is not clear to me what solutions we have in place to route / load balance...
[07:58:02] <_joe_> gehel: I'll get to look at that task today, I'm sorry but I've been out for personal reasons on Friday and Monday
[07:58:17] _joe_: thanks!
[07:58:38] ping me if anything is unclear
[07:58:55] (it's fairly unclear to me, so that's probably reflected in the task)
[08:05:23] serviceops, Data-Platform-SRE, Wikidata, Wikidata-Query-Service, wmde-wikidata-tech: Use Envoy instead of LVS to route internal federation traffic for WDQS - https://phabricator.wikimedia.org/T368972#9964216 (Gehel) p:Triage→High
[08:06:51] <_joe_> gehel: I mean the request per se seems clear-ish, what isn't clear to me rn is why it's made
[08:06:57] serviceops, Wikidata, Wikidata-Query-Service, wmde-wikidata-tech, Data-Platform-SRE (2024.07.08 - 2024.07.28): Use Envoy instead of LVS to route internal federation traffic for WDQS - https://phabricator.wikimedia.org/T368972#9964227 (Gehel)
[08:07:14] <_joe_> I'll ask you if I can't figure that out
[08:07:50] vgutierrez was asking us to move away from LVS for internal traffic routing
[08:08:37] I was asking to avoid loops on internal traffic.. aka realservers of the low-traffic LVS sending traffic to the low-traffic LVS :)
[08:20:41] so actually, we should not have loops strictly speaking. I'll add a comment on the task trying to describe this better
[08:23:39] serviceops, Wikidata, Wikidata-Query-Service, wmde-wikidata-tech, Data-Platform-SRE (2024.07.08 - 2024.07.28): Use Envoy instead of LVS to route internal federation traffic for WDQS - https://phabricator.wikimedia.org/T368972#9964289 (Gehel) Note that we should not have traffic loops, at leas...
[08:25:17] vgutierrez: we have multiple blazegraph pools, with different datasets. They will federate with each other, but not with themselves. Does this address your concerns?
[08:26:44] gehel: from what I understood, an initial request that comes via the LVS to a WDQS realserver will trigger additional requests via the same LVS
[08:31:01] yes, but to a different pool of real servers
[08:31:13] does this count as a loop? And is it still problematic?
[08:34:28] from our point of view applayer shouldn't depend on its own load balancing layer for internal operation
[08:37:58] and a separate pool dedicated to internal federation? cdn -> lvs(wdqs-main) -> WDQS-main realserver -> lvs(wdqs-scholarly-internal) -> wdqs-scholarly realserver
[08:38:50] it's still the same LVS instance serving traffic in both ways... under pressure from an external client you keep adding load to the same LVS
[08:40:16] are there other lvs instances we could use? :)
[08:45:01] <_joe_> I don't think traffic flows to LVS at all in this situation, unless I'm missing something
[08:45:15] <_joe_> a host calling its own LVS will always send traffic locally
[08:45:36] <_joe_> but I'll look at the tasks today and come back hopefully with more questions
[08:46:05] <_joe_> gehel: no if it's another pool I don't see the problem tbh
[08:46:30] <_joe_> if the realserver is not in both pools
[08:46:46] <_joe_> but if it is, then traffic will go locally on the machine without touching LVS
[08:47:18] <_joe_> since, given we do LVS-DR, realservers have the IPs of all their pools on the loopback interface
[08:48:44] yes... that's true for any kind of LVS, but you won't balance traffic to other instances
[08:49:03] <_joe_> that's the point I was making
[08:49:29] <_joe_> so if the pools have separated backends, lvs is perfectly fine
[08:50:08] and a different VIP
[08:50:27] <_joe_> we already have that
[08:50:35] <_joe_> wdqs-internal IIRC
[08:50:44] cool
[08:51:38] <_joe_> yep
[08:51:41] We will add 2 other pools in this context, wdqs internal is still another story. I probably need to write a graph of all that.
[08:51:54] <_joe_> gehel: I'll read the tasks
[08:52:01] <_joe_> but my point above stands
[08:52:27] <_joe_> if the pools don't share backends, LVS is fine. Otherwise we'll need another solution
[08:52:57] then we should be good! pools don't share backends and don't send traffic to themselves
[09:18:55] serviceops: kafka-main replacement nodes don't fit kafka-main (storage wise) - https://phabricator.wikimedia.org/T368714#9964544 (akosiaris)
[09:32:29] good morning, we got an automated email that some cumin aliases are not valid anymore, all the wikikube-etcd*. This is because O:etcd::v3::kubernetes doesn't match any host. Is that temporary due to maintenance, or has the puppetization changed and the aliases need to be adapted?
[09:33:28] we don't have those nodes anymore IIRC since like a few days ago
[09:33:36] we can probably remove the role and alias
[09:34:12] serviceops, Observability-Metrics, Grafana, Kubernetes: High cardinality metrics break queries/dashboards (envoy, istio, ...) - https://phabricator.wikimedia.org/T369607 (JMeybohm) NEW
[09:34:37] ack, I'll leave it to you for the cleanup :)
[09:43:01] serviceops, docker-pkg: Rationalize and update the use of base images in our docker-pkg repositories - https://phabricator.wikimedia.org/T341115#9964622 (Joe) Open→Resolved
[09:48:58] serviceops, conftool: Move conftool to gitlab, turn on deb package auto-generation - https://phabricator.wikimedia.org/T369594#9964647 (Volans) Have you considered the alternative approach of adapting the repo like the other python projects that we have and release it with the [[ https://gitlab.wikimedia...
[09:56:09] serviceops, Observability-Metrics, Grafana, Kubernetes: High cardinality metrics break queries/dashboards (envoy, istio, ...) - https://phabricator.wikimedia.org/T369607#9964720 (JMeybohm)
[09:57:39] serviceops, Observability-Metrics, Grafana, Kubernetes: High cardinality metrics break queries/dashboards (envoy, istio, ...) - https://phabricator.wikimedia.org/T369607#9964751 (fgiunchedi) Definitely +1 to use recording rules for histograms, since they are broken anyways already. I think also s...
[09:59:45] serviceops, conftool: Move conftool to gitlab, turn on deb package auto-generation - https://phabricator.wikimedia.org/T369594#9964776 (Joe) I don't see the two as conflicting options, although in the case of conftool, I'd prefer to keep distributing it in production as deb packages, which makes a lo...
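Editor's sketch of the rule the LVS discussion above converged on: with LVS-DR every realserver carries the VIPs of its own pools on loopback, so routing federation traffic through LVS only balances properly when the calling pool and the target pool share no backends. The snippet below is a minimal illustration of that check, not anything from the WMF tooling; the pool map, the host names, and the function names `shared_backends` and `lvs_ok_for_federation` are all hypothetical.

```python
from itertools import combinations

# Hypothetical pool membership. Pool and host names are illustrative only
# and are not taken from the real conftool/LVS configuration.
POOLS = {
    "wdqs-main": {"wdqs1001", "wdqs1002"},
    "wdqs-scholarly": {"wdqs1003", "wdqs1004"},
}


def shared_backends(pools):
    """Return {(pool_a, pool_b): hosts present in both} for overlapping pairs."""
    overlaps = {}
    for (a, hosts_a), (b, hosts_b) in combinations(sorted(pools.items()), 2):
        common = hosts_a & hosts_b
        if common:
            overlaps[(a, b)] = common
    return overlaps


def lvs_ok_for_federation(pools):
    """True when no realserver sits in two pools.

    A realserver in both pools would answer the other pool's VIP locally
    (the VIP is on its loopback under LVS-DR), so that traffic would never
    be balanced across the target pool.
    """
    return not shared_backends(pools)


if __name__ == "__main__":
    print("LVS fine for cross-pool federation:", lvs_ok_for_federation(POOLS))
    for pair, hosts in shared_backends(POOLS).items():
        print("shared backends", pair, sorted(hosts))
```

With disjoint pools, as described in the conversation, the check passes; moving one host into both pools makes `shared_backends` report it, which is the case where "another solution" would be needed.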
[10:04:44] serviceops, Wikidata, Wikidata-Query-Service, wmde-wikidata-tech, Data-Platform-SRE (2024.07.08 - 2024.07.28): Use Envoy instead of LVS to route internal federation traffic for WDQS - https://phabricator.wikimedia.org/T368972#9964782 (Gehel) And a graph to hopefully make all this more cle...
[10:05:22] _joe_, vgutierrez: I've closed that task per our conversation. Please re-open if I misunderstood.
[10:07:23] serviceops, Wikidata, Wikidata-Query-Service, wmde-wikidata-tech, Data-Platform-SRE (2024.07.08 - 2024.07.28): Use Envoy instead of LVS to route internal federation traffic for WDQS - https://phabricator.wikimedia.org/T368972#9964789 (Gehel) Open→Invalid a:Gehel After discussi...
[10:14:54] serviceops, SRE, Data Products (Data Products Sprint 16), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9964826 (SGupta-WMF) @mforns Thank you for verifying and raising the MR. I...
[10:20:46] hey folks in a few hours we'll be upgrading lsw1-e3-eqiad, which will affect these hosts:
[10:20:50] kubernetes1047
[10:20:50] kubernetes1048
[10:20:50] kubernetes1049
[10:20:50] kubernetes1050
[10:20:50] kubernetes1051
[10:20:51] kubernetes1061
[10:20:51] mw1491
[10:20:52] mw1492
[10:20:52] mw1493
[10:21:10] scheduled for 15:00 UTC / 17:00 CEST, if there is someone who can assist with depooling?
[10:21:22] https://phabricator.wikimedia.org/T365995
[10:33:08] sure, I can handle that
[10:33:54] serviceops, conftool: Move conftool to gitlab, turn on deb package auto-generation - https://phabricator.wikimedia.org/T369594#9964891 (elukey) >>! In T369594#9964776, @Joe wrote: > I don't see the two as conflicting options, although in the case of conftool, I'd prefer to keep distributing it in pro...
[10:46:47] hnowlan: thanks! I'll be starting the work at 4pm Irish time, will check in with you prior
[10:47:30] topranks: grand, sgtm
[13:56:07] serviceops, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate charts to Calico Network Policies - https://phabricator.wikimedia.org/T359423#9965448 (akosiaris)
[14:46:10] serviceops, MW-on-K8s, MediaWiki-Platform-Team (Radar), Patch-For-Review: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690#9965648 (jijiki)
[14:46:48] serviceops, MW-on-K8s, MediaWiki-Platform-Team (Radar), Patch-For-Review: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690#9965661 (jijiki) Moving forward, mw-web was deployed in eqiad, with nothing standing out. If we don't discover anything odd, we can finish this up...
[14:57:26] serviceops, SRE, Data Products (Data Products Sprint 16), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9965710 (Scott_French) Thanks, @SGupta-WMF! @mforns - The v1.0.2 image is n...
[14:58:30] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, SRE: kubernetes1051.eqiad.wmnet failed to pull mediawiki images - https://phabricator.wikimedia.org/T369011#9965709 (cmooney) >>! In T369011#9948452, @JMeybohm wrote: > I've deleted the node from the k8s API as a required istio update would not...
[14:58:54] serviceops, SRE, Data Products (Data Products Sprint 16), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9965715 (mforns) Thank you a lot @Scott_French, testing now.
[15:09:58] serviceops, SRE, Data Products (Data Products Sprint 16), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9965752 (mforns) @Scott_French, v1.0.2 in staging looks good now! I think we...
[15:49:28] serviceops, Content-Transform-Team, Page Content Service, RESTBase Sunsetting, Patch-For-Review: Update mobileapps k8s deployment chart for Cassandra credentials - https://phabricator.wikimedia.org/T350507#9965953 (Jgiannelos) From our last meeting with @hnowlan and @daniel: * Our main conce...
[16:30:21] serviceops, SRE, Data Products (Data Products Sprint 16), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9966147 (Scott_French)
[17:14:17] serviceops, SRE, Data Products (Data Products Sprint 16), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9966315 (Scott_French)
[17:22:04] serviceops, SRE, Data Products (Data Products Sprint 16), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9966331 (Scott_French) Alright, good news: `/api/rest_v1/metrics/commons-imp...
[17:28:51] serviceops, SRE, Data Products (Data Products Sprint 16), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9966359 (Scott_French)
[17:33:54] serviceops, SRE, Data Products (Data Products Sprint 16), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9966404 (Scott_French) @SGupta-WMF - thanks for documenting the API at [0]....
[20:55:35] serviceops, MW-on-K8s: Migrate mwmaint server functionality to mw-on-k8s - https://phabricator.wikimedia.org/T341560#9967242 (Krinkle) How can MW developers access the output of currently-running (or recently-completed) scheduled maintenance scripts that execute in Kubernetes? Today this works via journ...
[21:11:55] serviceops, MW-on-K8s: Migrate mwmaint server functionality to mw-on-k8s - https://phabricator.wikimedia.org/T341560#9967304 (RLazarus) Script output is visible through `kubectl logs`, and mwscript-k8s can be invoked with `-f` to immediately start tailing the script output (under the hood, it just invoke...
[21:59:09] serviceops, DNS, fundraising-tech-ops, SRE, Traffic: redirect benefactors.wikimedia.org (was: Cleanup unused DNS subdomains) - https://phabricator.wikimedia.org/T367012#9967449 (Dzahn)
[22:01:59] serviceops, DNS, fundraising-tech-ops, SRE, Traffic: redirect benefactors.wikimedia.org (was: Cleanup unused DNS subdomains) - https://phabricator.wikimedia.org/T367012#9967466 (Pppery) T367012#9874025 - the original title of this ticket was to redirect benefactors before I expanded it so we'...
[22:03:06] serviceops, DNS, fundraising-tech-ops, SRE, Traffic: redirect benefactors.wikimedia.org (was: Cleanup unused DNS subdomains) - https://phabricator.wikimedia.org/T367012#9967470 (Dzahn) hah! ;) not wrong though. The other stuff is done and want to clarify what's left :)
[22:32:39] serviceops, SRE, Data Products (Data Products Sprint 16), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9967550 (mforns) Thank you @Scott_French for deploying! I checked all endpoi...
[23:30:44] serviceops, MW-on-K8s, Patch-For-Review: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553#9967694 (bd808)
[23:48:23] serviceops, SRE, Data Products (Data Products Sprint 16), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9967724 (Scott_French) Ah, thanks for surfacing that, @mforns. If serving a...