[03:56:48] serviceops, Beta-Cluster-Infrastructure, cloud-services-team, Cloud-VPS (Debian Buster Deprecation): Replace deployment-memc[08-10] with Bullseye or Bookworm - https://phabricator.wikimedia.org/T361384#9947872 (Andrew) The old memc hosts run redis, and are referred to as such in deployment-prep p...
[05:23:58] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9947951 (SGupta-WMF) @xcollazo The column renaming is done to match api outp...
[08:08:09] hey serviceops, search / DPE sre teams are working on architecting & deploying the wdqs graph split, which splits the previous single knowledge graph into 2 subgraphs. in order to support the same breadth of use cases, certain types of queries must use federation to talk between the two subgraphs; however, the current result is that our approach to throttling ends up throttling the internal federation requests
[08:08:20] I've tried to simplify the basic overview and problem statement in https://phabricator.wikimedia.org/T368972#9948102, although it still leaves much to be desired. if any of y'all have time to look it over and help us out, that would be great
[08:08:39] we've talked to traffic team a bit and currently they've proposed we use envoy for internal federation requests, but there are some thorny issues (mainly: if using envoy to load balance, how do we route only to pooled hosts and not depooled ones) that your team's experience might help with. thus the question(s) in the linked comment
[08:57:59] serviceops, MW-on-K8s: Show more useful information when mwscript-k8s fails to launch - https://phabricator.wikimedia.org/T369142 (Lucas_Werkmeister_WMDE) NEW
[08:58:05] serviceops, MW-on-K8s: Allow cleaning up specific mwscript-k8s runs - https://phabricator.wikimedia.org/T369143 (Lucas_Werkmeister_WMDE) NEW
[09:04:06] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, SRE: kubernetes1051.eqiad.wmnet failed to pull mediawiki images - https://phabricator.wikimedia.org/T369011#9948452 (JMeybohm) I've deleted the node from the k8s API as a required istio update would not finish successfully because it was waiting...
[09:08:35] serviceops, MW-on-K8s: Allow cleaning up specific mwscript-k8s runs - https://phabricator.wikimedia.org/T369143#9948463 (Lucas_Werkmeister_WMDE)
[09:09:04] serviceops, Infrastructure-Foundations: Upgrade thumbor Docker images to Bookworm - https://phabricator.wikimedia.org/T369144 (elukey) NEW
[09:11:23] o/ opened --^ to see what we can do to upgrade Thumbor
[09:11:31] I know that you all like me so much
[09:22:09] claime akosiaris Regarding https://phabricator.wikimedia.org/T366819#9939478 could it again be the same problem with nodejs + ipv4/v6 dns resolution?
[09:22:21] * akosiaris taking a look
[09:22:22] I have this patch to test with the ipv4 directly: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1051688
[09:22:36] nemo-yiannis: Yeah, there's actually a dns resolution issue
[09:22:45] but it's not v4/v6, I don't think
[09:22:53] of course dig isn't installed in the container
[09:23:12] but openssl s_client -connect staging.svc.eqiad.wmnet:4492 doesn't return from inside the container (it does from the namespace)
[09:23:24] openssl s_client -connect 10.64.16.55:4492 does
[09:24:14] nemo-yiannis: fwiw, that data.txt file in the task returns "event 68a24380-37a9-11ef-b9d3-87400a6a8e37 of schema at /resource_change/1.0.0 destined to stream resource_purge is not allowed in stream; resource_purge is not configured."
[09:24:20] I guess this is known, right?
[09:24:26] yeah that's the expected output
[09:24:30] ok
[09:24:33] (commented what I've debugged in the task)
[09:26:52] serviceops, Content-Transform-Team-WIP, RESTBase, RESTBase Sunsetting, and 2 others: Enable PCS to send resource change events to handle URL purges - https://phabricator.wikimedia.org/T366819#9948594 (Clement_Goubert) ` root@kubestage1003:/home/cgoubert# curl https://staging.svc.eqiad.wmnet:4492/...
[09:31:47] oh yeah, node returns the ipv6
[09:32:32] serviceops, Content-Transform-Team-WIP, RESTBase, RESTBase Sunsetting, and 2 others: Enable PCS to send resource change events to handle URL purges - https://phabricator.wikimedia.org/T366819#9948628 (Clement_Goubert) From inside the pod, nodejs DNS lookup returns the ipv6 for `staging.svc.eqiad....
[09:35:24] and sure enough, that's not working
[09:37:07] That's because unlike production, where eventgate-main.svc.eqiad.wmnet (for instance) is an A record, staging.svc.eqiad.wmnet is a CNAME to a kubestage server, and node defaults to ipv6
[09:38:07] and it's not using the mesh, so akosiaris' patch from May doesn't work
[09:40:05] runuser@mobileapps-staging-6794bbd5c6-cp6gd:/srv/service$ echo sure enough, this doesn't work
[09:40:13] while the IPv4 one does
[09:40:18] FWIW this is only to test things on staging
[09:40:23] production is using service mesh
[09:42:18] yeah, netpol for eventgate-main only has the IPv4 address of production, even on the staging environment
[09:44:33] omg, eventgate-main's release in staging env is called production... sigh
[09:44:47] yes.
[09:45:13] echo open
[09:45:14] works
[09:45:35] so, the test can get away with using the in-staging clusterIP
[09:46:17] DNS resolution doesn't work though, looking into it
[09:48:25] cho open
[09:48:27] there we go
[09:49:26] nemo-yiannis: how do you configure which endpoint to talk to? is it easy for staging specifically to use "eventgate-production-tls-service.eventgate-main.svc.cluster.local." instead of "staging.svc.eqiad.wmnet"?
[09:49:37] Sure
[09:49:49] it's in deployment charts, defined on staging values
[09:50:11] that should keep your test traffic in the cluster and avoid these weird shenanigans that we have to do in the staging cluster for intra service traffic.
[09:50:17] inter service*
[09:51:43] ok
[10:05:57] serviceops, Infrastructure-Foundations: Upgrade thumbor Docker images to Bookworm - https://phabricator.wikimedia.org/T369144#9948730 (Volans)
[10:30:40] serviceops, Infrastructure-Foundations: Upgrade thumbor Docker images to Bookworm - https://phabricator.wikimedia.org/T369144#9948794 (akosiaris) Hasn't this already been done in T355020 ? At least judging from the last image ` $ docker run --rm -it --entrypoint /usr/bin/cat docker-registry.wikimedia....
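The staging symptom discussed above (the hostname resolving to an IPv6 address that did not answer, while the IPv4 address did) can be checked per address family with nothing but a standard library resolver call. The following is only a rough sketch in Python, not what was run in the pod: `addresses` is a hypothetical helper name, and it assumes a Python interpreter is available wherever the check is made, which may well not be the case inside the container.

```python
import socket


def addresses(host: str, port: int, family: int) -> list[str]:
    """Return the unique addresses `host` resolves to for one address family.

    `family` is socket.AF_INET (IPv4) or socket.AF_INET6 (IPv6).
    An empty list means the resolver returned nothing for that family.
    """
    try:
        infos = socket.getaddrinfo(host, port, family=family, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Resolution failed for this family (no record, or no resolver reachable).
        return []
    unique: list[str] = []
    for *_unused, sockaddr in infos:
        # sockaddr[0] is the address string for both IPv4 and IPv6 results.
        if sockaddr[0] not in unique:
            unique.append(sockaddr[0])
    return unique
```

Calling `addresses("staging.svc.eqiad.wmnet", 4492, socket.AF_INET)` and then the same with `socket.AF_INET6` would show which families the name resolves to from a given vantage point; a client that prefers the IPv6 result would then be the one to fail if only the IPv4 address is allowed through.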
[10:30:53] nemo-yiannis: did it work?
[10:31:00] deploying now
[10:33:47] ha, it worked
[10:33:53] thanks!
[10:36:45] serviceops, Content-Transform-Team-WIP, RESTBase, RESTBase Sunsetting, and 2 others: Enable PCS to send resource change events to handle URL purges - https://phabricator.wikimedia.org/T366819#9948820 (Jgiannelos) Verified on staging: ` curl https://staging.svc.eqiad.wmnet:4102/en.wikipedia.org/v1...
[10:46:51] serviceops, Content-Transform-Team-WIP, RESTBase, RESTBase Sunsetting, and 2 others: Enable PCS to send resource change events to handle URL purges - https://phabricator.wikimedia.org/T366819#9948843 (Jgiannelos) Open→Resolved
[10:50:07] yw
[10:51:23] serviceops, Content-Transform-Team-WIP, RESTBase, RESTBase Sunsetting, and 2 others: Enable PCS to send resource change events to handle URL purges - https://phabricator.wikimedia.org/T366819#9948893 (akosiaris) For posterity's sake, a summary follows: * wikikube staging doesn't have a very...
[11:28:37] serviceops, MW-on-K8s, Observability-Metrics, Patch-For-Review, SRE Observability (FY2024/2025-Q1): Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265#9949013 (Clement_Goubert) Tagging in @RLazarus for `mw-script`, I don't know how you...
[11:29:14] serviceops, MW-on-K8s, Observability-Metrics, Patch-For-Review, SRE Observability (FY2024/2025-Q1): Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265#9949016 (Clement_Goubert) p:High→Low Lowering priority now that the producti...
[11:42:21] serviceops, Infrastructure-Foundations: Upgrade thumbor Docker images - https://phabricator.wikimedia.org/T369144#9949043 (elukey)
[11:43:21] serviceops, Infrastructure-Foundations: Upgrade thumbor Docker images - https://phabricator.wikimedia.org/T369144#9949047 (elukey) >>! In T369144#9948794, @akosiaris wrote: > Hasn't this already been done in T355020 ? > > At least judging from the last image > > ` > $ docker run --rm -it --entrypoint...
[12:30:24] akosiaris: o/
[12:30:58] is it ok if I deploy the new thumbor image?
[12:31:14] (modulo what I wrote in the task, namely my pebcak)
[12:31:53] <_joe_> elukey: wait what
[12:32:00] <_joe_> what's the new thumbor image?
[12:32:15] it contains the libvpx package updated, nothing more
[12:32:21] to fix a debian DSA
[12:32:26] <_joe_> oh right
[12:32:32] <_joe_> d'oh I even read the task
[12:32:51] my bad, I confused the haproxy upgrade to bookworm with the libvpx upgrade
[12:32:54] <_joe_> I think nothing prevents you from doing it
[12:33:03] I am clearly on top of my new job as security fixer
[12:33:19]
[12:33:24] okok will do thanks!
[12:37:42] SGTM
[12:42:38] (deployed to codfw, will wait a bit, check the thumbor dashboard and then proceed with eqiad)
[12:42:53] the haproxy upgrade question remains, if anybody wants to chime in feel free (in the task)
[12:48:48] serviceops, Infrastructure-Foundations: Upgrade thumbor Docker images - https://phabricator.wikimedia.org/T369144#9949308 (elukey) The thumbor plugin image has been deployed, next step is to figure out what to do with haproxy.
[12:56:09] done!
[13:07:22] serviceops, Data Products, Epic: SDS 2.1.1 Evaluations of 3rd part Experimentation Platform by SRE Service Ops - https://phabricator.wikimedia.org/T369174 (WDoranWMF) NEW p:Triage→High
[13:08:35] serviceops, Data Products, Epic: SDS 2.1.1 Evaluations of 3rd part Experimentation Platform by SRE Service Ops - https://phabricator.wikimedia.org/T369174#9949517 (WDoranWMF)
[13:18:20] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9949542 (xcollazo) >>! In T361835#9947951, @SGupta-WMF wrote: > @xcollazo Th...
[13:18:44] serviceops, MW-on-K8s: mwscript-k8s --attach error: TypeError: 'NoneType' object is not iterable - https://phabricator.wikimedia.org/T369175 (Lucas_Werkmeister_WMDE) NEW
[13:22:20] hello service ops posse
[13:22:43] We have kubernetes1060, mw1489 & mw1490 in rack E2 which will take a hit at 15:00 UTC / 16:00 CEST
[13:22:53] serviceops, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9949563 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye
[13:23:24] serviceops, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9949564 (Jclark-ctr)
[13:24:11] serviceops, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9949567 (Jclark-ctr) a:Jclark-ctr
[13:38:08] serviceops, collaboration-services, Infrastructure-Foundations, GitLab (Integrations), and 2 others: Container image reports in debmonitor are broken - https://phabricator.wikimedia.org/T348876#9949610 (brennen)
[13:39:33] topranks: I can drain and cordon them in a bit. Do you have the task handy by chance?
[13:39:53] jayme: thanks
[13:39:55] yep it's
[13:39:55] https://phabricator.wikimedia.org/T365994
[13:41:39] topranks: ah, list is a bit oldish - fixing
[13:41:52] oh shit sorry..... that's on me
[13:41:55] I can update
[13:41:58] np
[13:42:06] you linked the rack in netbox, all good
[13:46:31] I think the difference is mw1489 and mw1490 are renamed wikikube-worker1007 and wikikube-worker1021?
[13:49:17] yes, correct
[13:52:47] topranks: oh...is it 15:00 UTC or 16:00 CEST? :D
[13:52:59] task says 14:00 UTC
[13:53:10] timezones god damn
[13:53:39] 14:00 UTC is correct, which is 16:00 CEST ?
[13:53:45] yes :)
[13:54:03] I am in neither and always get my mental time-maths wrong :(
[13:54:22] but no rush we can delay until you are ready
[13:56:18] all good, I just got confused and thought it's an hour until start
[13:57:31] cool, thanks :)
[13:59:01] topranks: gtg from our end
[13:59:16] thanks for the ping!
[13:59:20] awesome, thank you <3
[14:07:52] serviceops, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9949685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye executed...
[14:11:19] serviceops, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9949691 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye
[14:15:26] serviceops, Data Products, Epic: SDS 2.1.1 Evaluations of 3rd part Experimentation Platform by SRE Service Ops - https://phabricator.wikimedia.org/T369174#9949712 (WDoranWMF)
[14:23:41] jayme: switch upgrade is done if you want to check anything / repool
[14:23:51] topranks: cool, thanks
[14:36:09] serviceops, MW-on-K8s: mwscript-k8s --attach error: TypeError: 'NoneType' object is not iterable - https://phabricator.wikimedia.org/T369175#9949840 (Lucas_Werkmeister_WMDE) > As there were a lot of changes to deploy, I didn’t investigate yet, but just ran the script on mwmaint1002 instead. And by now i...
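The maintenance-window mix-up above comes down to CEST being UTC+2: 14:00 UTC is 16:00 CEST, whereas 15:00 UTC would be 17:00 CEST. A small sketch of that arithmetic in Python follows; `utc_to_cest` is a hypothetical helper, the fixed +2h offset is hard-coded rather than looked up from a timezone database, and the calendar date is only illustrative.

```python
from datetime import datetime, timedelta, timezone

# CEST (Central European Summer Time) is a fixed two hours ahead of UTC.
CEST = timezone(timedelta(hours=2), "CEST")


def utc_to_cest(hour: int, minute: int = 0) -> str:
    """Convert a UTC wall-clock time to CEST, formatted as 'HH:MM CEST'."""
    # The date is an assumption for illustration; with a fixed offset it does not
    # affect the result unless the conversion crosses midnight.
    start = datetime(2024, 7, 3, hour, minute, tzinfo=timezone.utc)
    return start.astimezone(CEST).strftime("%H:%M %Z")
```

With this, `utc_to_cest(14)` gives the 16:00 CEST start that the task specifies, and `utc_to_cest(15)` shows why "15:00 UTC / 16:00 CEST" could not both be right.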
[14:45:53] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9949885 (Clement_Goubert) Only hosts left are: [] The 5 nodes with an incorrect RAID config from {T358489} that haven't yet been reimaged [] The codfw nodes to be decommissioned []...
[14:48:15] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9949876 (Clement_Goubert) Open→In progress p:Triage→Medium
[14:50:27] serviceops, MW-on-K8s, Observability-Logging, SRE, Patch-For-Review: benthos mw-accesslog-metrics kafka lag and interpolation errors - https://phabricator.wikimedia.org/T367076#9949907 (kamila) Open→Resolved a:kamila Increasing batch size slightly improved the situation, very...
[14:55:04] serviceops, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9949929 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye executed...
[16:14:59] 👋 We have this patch for changeprop https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1051361 that is required in order to completely remove parsoid from restbase. I am not really familiar with changeprop (other than understanding the rules). Is there anyone who can help us with reviewing?
[16:54:28] serviceops, SRE, Data Products (Data Products Sprint 16), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9950626 (WDoranWMF)
[17:41:46] serviceops, SRE: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9950976 (CDanis) In progress→Resolved a:CDanis Boldly closing this because we've resolved all of {T353464} and the two tasks for 10G NICs T366204 T366205
[21:57:04] serviceops, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9952054 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye
[22:36:25] serviceops, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9952219 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye executed...
[22:38:44] serviceops, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9952225 (Jclark-ctr) @Papaul if you get a chance can you look at this one?
[22:46:26] serviceops, Beta-Cluster-Infrastructure, cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Patch-For-Review: Replace deployment-memc[08-10] with Bullseye or Bookworm - https://phabricator.wikimedia.org/T361384#9952292 (Andrew) The three old VMs have been replaced by: deployment-mem...
[22:46:28] serviceops, MW-on-K8s: mwscript-k8s --attach error: TypeError: 'NoneType' object is not iterable - https://phabricator.wikimedia.org/T369175#9952290 (RLazarus) a:RLazarus Thanks for the report! It's actually not because of the successful exit; the script handles that. Rather, it turns out the pod wa...
[23:08:40] serviceops, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9952366 (Dzahn) We can see in reimage-extended.log that the reimage fails but it's not immediately clear why. ` 2024-07-03 22:36:13,115 jclark 2636322 [ERRO...
[23:46:11] serviceops, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9952433 (Papaul) @Jclark-ctr @Dzahn this is what i have on the console [ (1*installer) 2 shell 3 shell 4- log ][ Jul 03 23:44 ]...