[08:44:06] ottomata: I think you need to provide a fixture in helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment/.fixtures.yaml
[08:44:12] see helmfile.d/services/tegola-vector-tiles/.fixtures.yaml for example
[09:10:47] serviceops, SRE: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Lucas_Werkmeister_WMDE) Good to know, thanks!
[09:11:49] serviceops, SRE, Wikidata, wdwb-tech: Migrate wikibase/termbox to newer Node.js version - https://phabricator.wikimedia.org/T328295 (Lucas_Werkmeister_WMDE)
[09:11:56] serviceops, SRE, Wikidata, wdwb-tech: Migrate wikibase/termbox to newer Node.js version - https://phabricator.wikimedia.org/T328295 (Lucas_Werkmeister_WMDE)
[09:46:42] o/ we discussed that the MW api-ro might not be well suited for internal async processing use-cases, is there a task to discuss this? Should I create one if not?
[10:00:23] dcausse: I am missing context (and quite possibly multiple pieces of my hard disk memory)... uhm, when and where was that discussion held?
[10:01:14] <_joe_> dcausse: I think your best course of action is to point your service to "mwapi-async" in the service proxy, and we'll control where that points over time
[10:01:31] <_joe_> dcausse: if you need a ro-only endpoint, that indeed is lacking
[10:01:46] akosiaris: it was related to the rework of the search update pipeline
[10:02:33] _joe_: ok, does mwapi-async have drawbacks when being hit from the non-master DC?
[10:02:53] I can't remember at all... sorry :-(
[10:02:57] <_joe_> dcausse: right now yes, it goes cross-dc, so you get an additional 25/30 ms of latency
[10:04:09] <_joe_> dcausse: basically I'm suggesting you add the service-proxy sidecar to your chart, if you don't already, then send requests for mediawiki's api to 127.0.0.1:6500 or the port of the new -ro endpoint I'm creating now :)
[10:04:30] <_joe_> that will eventually be pointed to a dedicated cluster of mw on k8s
[10:05:07] <_joe_> possibly the jobrunner cluster, even, where I hope to be able to provide the full mw api too
[10:13:44] <_joe_> I *really* need to write some guidance on wikitech on this stuff :/ If I find the time I'll do it this week
[10:31:01] _joe_: I'd need a puppet run on deploy hosts, would that be okay?
[10:35:07] <_joe_> jayme: oh sorry, yes ofc
[10:35:16] <_joe_> that was me debugging something yesterday
[10:36:02] ok, cool. enabling puppet then
[10:47:31] _joe_: thanks! (sorry was in a meeting)
[10:48:54] yes guidance on what endpoint to use for what would be great
[10:52:57] _joe_: re: "the new -ro endpoint I'm creating now" this would be like "mw-async-ro"?
[10:53:14] <_joe_> dcausse: yes possibly
[10:53:33] makes sense, thanks!
[10:56:37] ottomata: ^ (might be useful for the enrichment pipeline you're working on).
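To make _joe_'s suggestion concrete, here is a rough sketch of what enabling the service-proxy sidecar and the mwapi-async upstream might look like in a chart's values. The key names (mesh.enabled, discovery.listeners) follow the common WMF deployment-charts scaffolding as best as can be recalled and should be treated as assumptions; the local port 6500 comes from _joe_'s message above.

    mesh:
      enabled: true          # run the Envoy service-proxy sidecar next to the app container
    discovery:
      listeners:
        - mwapi-async        # exposes the upstream locally, e.g. http://localhost:6500

The application then talks to MediaWiki via the local sidecar port instead of resolving the API hostname itself, so serviceops can repoint mwapi-async (or a future read-only variant such as the proposed "mw-async-ro") without any chart change on the caller's side.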
[10:59:45] serviceops, Prod-Kubernetes, Kubernetes: Update staging-eqiad to k8s 1.23 - https://phabricator.wikimedia.org/T327664 (JMeybohm)
[11:42:49] <_joe_> yes, basically the idea here is that given these are processing stuff without a human waiting for a response, they can have more relaxed timeouts and availability requirements, and we can turn them down if we need more computing resources to serve live users
[11:43:11] <_joe_> this is all in the future; if we still did Design Docs I would probably write one
[11:44:14] <_joe_> I'll write something on-wiki instead and request feedback there :)
[11:51:19] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (JMeybohm)
[12:04:27] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (JMeybohm)
[12:16:15] serviceops, SRE, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Elitre)
[13:08:37] _joe_: dcausse thanks, following, we were planning to use api-ro
[13:09:05] <_joe_> ok, I promise I'll try to get to write something this week
[13:09:17] <_joe_> I'm literally flooded with work though, so no promises
[13:42:55] serviceops, Commons, MediaWiki-File-management, SRE, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (PatchDemoBot) Test wiki on [[ https://patchdemo.wmflabs.org | Patch demo ]] by TheDJ using patch(es)...
[14:06:10] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update staging-eqiad to k8s 1.23 - https://phabricator.wikimedia.org/T327664 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=52889e40-9b66-42b2-8c86-a2c1b5fabe68) set by jayme@cumin1001 for 1 day, 0:00:00 on 6 host(s) an...
[15:00:33] jayme: if you have a sec, i think the flink app can't talk to kafka jumbo now, but as far as I can tell the netpol is correct?
[15:00:40] kube_env stream-enrichment-poc dse-k8s-eqiad
[15:01:01] e.g.
[15:01:03] To Port: 9092/TCP
[15:01:03] To Port: 9093/TCP
[15:01:03] To:
[15:01:03] IPBlock:
[15:01:03] CIDR: 10.64.0.175/32
[15:01:14] 10.64.0.175 is kafka-jumbo1001.eqiad.wmnet
[15:01:38] we had errors about getting metadata for kafka topics (usually means can't talk to kafka)
[15:01:40] and
[15:01:44] root@dse-k8s-worker1007:~# nsenter -t 2743379 -n telnet 10.64.0.175 9092
[15:01:47] just hangs
[15:01:55] ...although i am probably not doing nsenter correctly
[15:06:08] oh, maybe I am, ^ works for eventgate container pid in wikikube.
[15:06:20] so, something is def wrong with networking rules here.
[15:09:33] <_joe_> ottomata: ipv6?
[15:09:56] i think i found it. ferm on brokers doesn't include dse k8s
[15:10:21] <_joe_> hah
[15:10:26] <_joe_> so it was on the other side
[15:10:28] <_joe_> :D
[15:12:18] where is network::constants::services_* defined? don't see it in hiera.
[15:12:27] i think i need to define a network constant for dse k8s eqiad
[15:15:46] hm i see in network constants.pp some slicing going on
[15:15:54] just gotta figure out the right slice...
[15:16:56] btullis: elukey .... got any network constants for dse kubepods?
[15:16:59] i don't see any defined...
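For reference, the kubectl describe output quoted above corresponds to an egress rule like the following in the generated NetworkPolicy (a sketch reconstructed from the ports and CIDR shown; the real object also needs a podSelector and policyTypes: [Egress], and would carry one ipBlock per kafka-jumbo broker, of which only kafka-jumbo1001 is quoted here):

    egress:
      - to:
          - ipBlock:
              cidr: 10.64.0.175/32   # kafka-jumbo1001.eqiad.wmnet
        ports:
          - protocol: TCP
            port: 9092               # Kafka broker port (plaintext)
          - protocol: TCP
            port: 9093               # Kafka broker port (TLS)

The point of the debugging that follows is that this only opens the Kubernetes side of the path; the brokers' ferm rules also have to accept the dse-k8s pod range as a source, which is where the conversation heads next.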
[15:17:03] do you know the subnets?
[15:17:21] Yeah, hang on a sec...
[15:18:31] 10.67.24.0/21 and 2620:0:861:302::/64 (https://netbox.wikimedia.org/ipam/prefixes/538/ and https://netbox.wikimedia.org/ipam/prefixes/588/)
[15:20:38] https://github.com/wikimedia/puppet/blob/production/hieradata/role/eqiad/dse_k8s/master.yaml#L31-L33
[15:21:55] Oh right, you need them as constants, not just the values?
[15:22:40] yes, i don't totally understand how this all works but
[15:22:52] ultimately i want to add an entry to profile::kafka::broker::custom_ferm_srange_components
[15:22:53] like
[15:23:04] '$DSE_K8S_KUBEPODS_NETWORKS'
[15:23:05] or something
[15:23:19] btullis: aren't they defined in profile::kubernetes::cluster_cidr ?
[15:23:22] hieradata/role/eqiad/dse_k8s/master.yaml
[15:23:47] they can be aliased in any other hiera namespace as needed ofc
[15:24:34] volans: is that the cluster IP cidr net?
[15:24:42] volans: Yes, that's what I linked a few lines above. I think that ottomata needs to be able to reference them easily for ferm .
[15:24:48] i think the ferm rules reference the kubepod ips?
[15:25:17] I haven't added a nice variable like this, which is the k8s_aux one: https://github.com/wikimedia/puppet/blob/production/modules/network/manifests/constants.pp#L133-L138
[15:25:24] i think we need this change but for dse-k8s
[15:25:24] https://gerrit.wikimedia.org/r/c/operations/puppet/+/724933
[15:25:38] ack sorry I missed the reference above
[15:25:45] right
[15:26:09] but, btullis are the dse net constants even defined and available for use with slice_network_constants?
[15:26:27] do those need to go into network/data/data.yaml?
[15:27:05] also, what is the diff between kubepods and kubesvc in https://gerrit.wikimedia.org/r/c/operations/puppet/+/724933/4/modules/network/data/data.yaml#136
[15:27:38] ottomata: Yes, I think you're right. I forgot to add it here too. Here's the corresponding one for the k8s-aux again: https://github.com/wikimedia/puppet/blob/production/modules/network/data/data.yaml#L204-L206
[15:28:46] ottomata: Pods vs services ranges are explained a little bit here: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Prerequisites
[15:30:25] Normally you'd want a fairly small range of service IPs because you'd run more pods than you would expose services. However, when building the ml-serve cluster and starting to work with kserve/knative e.lukey ran into an issue of exhaustion of the service IP pool.
[15:31:06] interesting. okay, so for our ferm purposes, we'd want the podips, since that is what ferm will see as trying to connect?
[15:31:07] ref: https://phabricator.wikimedia.org/T302701
[15:31:09] the service is for ingress, ya?
[15:32:30] "for our ferm purposes, we'd want the podips," <- Yes I believe that this is correct. This is what you want to add to the kafka broker firewall to permit workloads on dse-k8s to access kafka.
[15:32:37] yup
[15:33:27] btullis: okay, for good measure i'll add both nets in data.yaml then
[15:33:33] what is the service cidr ?
[15:33:46] for dse
[15:35:06] Service CIDR is here: https://github.com/wikimedia/puppet/blob/production/hieradata/role/eqiad/dse_k8s/master.yaml#L1-L3
[15:35:18] gr8
[15:35:44] "i'll add both nets in data.yaml" <- Excellent, many thanks and apologies for forgetting to do it.
[15:35:53] np
[15:36:22] btw, these nets probably should be globally accessible / defined, you might want to grab them via network constants rather than defining them in your role hiera, ya?
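Pulling the thread above together, the information that has to land in modules/network/data/data.yaml looks roughly like the sketch below. The CIDRs are the ones btullis pasted; every key name is an assumption modeled on the existing k8s-aux entries linked above, so the real layout should follow those.

    # modules/network/data/data.yaml (sketch only -- mirror the existing k8s-aux entries)
    dse-k8s-kubepods-networks:       # pod IPs: what ferm on the brokers sees as the connecting source
      - 10.67.24.0/21
      - "2620:0:861:302::/64"
    dse-k8s-kubesvc-networks:        # service/cluster IPs, used for ingress
      - ...                          # value lives in hieradata/role/eqiad/dse_k8s/master.yaml#L1-L3

With a constant like that available via network::constants, the broker side is the hiera entry ottomata mentions above; the constant name here is his working name ("or something"), not an established one.

    # hieradata for the kafka-jumbo brokers (sketch)
    profile::kafka::broker::custom_ferm_srange_components:
      - '$DSE_K8S_KUBEPODS_NETWORKS'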
[15:36:25] kind of like we do for kafka broker lists?
[15:42:39] Agreed, it would be good to have it defined in a single place but I'd be a bit worried about poking other people's stuff. All of the other clusters do it this way, so if we're going to change that we should probably all do it. https://github.com/wikimedia/puppet/search?q=profile%3A%3Akubernetes%3A%3Acluster_cidr&type=code
[15:43:01] aye, ya
[15:43:05] should do it for all
[15:47:54] btullis:
[15:47:55] https://gerrit.wikimedia.org/r/c/operations/puppet/+/885366
[15:47:56] and
[15:48:01] https://gerrit.wikimedia.org/r/c/operations/puppet/+/885366
[15:48:10] and pcc:
[15:48:11] https://puppet-compiler.wmflabs.org/output/885367/39347/kafka-jumbo1001.eqiad.wmnet/index.html
[15:58:46] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (JMeybohm)
[15:58:55] I was just checking why the MW_APPSERVER_NETWORKS was updated with 885367 but it makes sense.
[16:10:53] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (EChetty)
[16:11:18] yeehaw thanks for your help btullis ! it's working!
[16:11:38] A pleasure.
[16:14:12] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (EChetty)
[16:25:38] jayme: akosiaris: are there any written instructions (or even past tasks I can reference) about upgrading k8s cluster versions?
[16:29:16] cdanis: best one is this probably https://phabricator.wikimedia.org/T326340
[16:29:46] perfect thank you
[16:47:12] hm, btullis "pods "flink-app-main-6957b44d8d-nm8df" is forbidden: maximum memory usage per Container is 3Gi, but limit is 4Gi"
[16:55:09] jayme: gonna spend the day reading up and I'll hopefully have some patches and whatnot for you to take a look at tomorrow for the aux cluster <3
[16:59:19] ottomata: OK, let's lift that limit, shall we?
[17:00:40] serviceops, Prod-Kubernetes, Kubernetes: Update staging-eqiad to k8s 1.23 - https://phabricator.wikimedia.org/T327664 (JMeybohm)
[17:01:15] cdanis: cool. I just did the process again with staging-eqiad and there were no surprises
[17:03:46] awesome
[17:04:23] serviceops, Prod-Kubernetes, Kubernetes: Decide on new Pod and Sevice IPv4 ranges for wikikube clusters - https://phabricator.wikimedia.org/T326617 (akosiaris) === Pods == So, a `/21` contains 32 `/26` subnets. /26 is the current calico allocation size and it is pretty ok as an allocation size. Assum...
[17:07:10] btullis: sounds good, i just bumped taskmanagers to 3GiB, tbd. i noticed that rdf-streaming-updater, which i'd assume is doing more mem-intensive work, has lower mem settings
[17:07:27] so maybe we don't need to bump hard limits, but just needed a bit more than we initially gave it
[17:07:28] tbd...
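The "forbidden" error ottomata pasted is a namespace LimitRange rejecting the pod: the container asked for a 4Gi limit while the namespace caps containers at 3Gi. A minimal sketch of the kind of object that produces it follows; the object name is an assumption, the namespace is taken from the kube_env mentioned earlier (assuming it matches the namespace name), and 3Gi is the cap from the error message. In practice these limits are managed through the cluster admin config rather than edited by hand.

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: limitrange                # name is an assumption
      namespace: stream-enrichment-poc
    spec:
      limits:
        - type: Container
          max:
            memory: 3Gi               # raising this is what "lift that limit" would mean

So either the namespace cap gets raised, or the workload's requested limit is brought back under 3Gi, which is the route ottomata takes by bumping the taskmanagers to 3GiB.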
[17:07:51] but btullis , i betcha if you want to run spark batch jobs on DSE, u gonna need a bigger limit :)
[17:20:11] Yeah, I haven't quite got that far yet, but I agree :-)
[17:33:04] serviceops, Prod-Kubernetes, Kubernetes: Update staging-eqiad to k8s 1.23 - https://phabricator.wikimedia.org/T327664 (JMeybohm)
[17:56:21] ottomata: testing in yarn with flink 1.16 I had to bump to 5g
[17:56:57] so something seems to have changed between flink 1.12 currently in prod for us and 1.16
[18:19:32] interesting
[19:15:17] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (colewhite)
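For context on the memory numbers traded above: the knob being discussed is the TaskManager's total process memory, which in a plain Flink deployment lives in flink-conf.yaml as sketched below. The 3g value is what ottomata bumped the taskmanagers to on dse-k8s, and 5g is what dcausse needed testing under Flink 1.16 on YARN; how the flink-app chart maps this setting onto the container requests and limits is an assumption to verify against the chart itself.

    # flink-conf.yaml (sketch): total memory budget for each TaskManager process
    taskmanager.memory.process.size: 3g
    # taskmanager.memory.process.size: 5g   # value dcausse needed with Flink 1.16 on YARN

Whichever value is chosen, it has to stay at or below the per-container maximum enforced by the namespace LimitRange discussed earlier, otherwise the pod is rejected before it ever starts.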