[09:50:56] serviceops, MW-on-K8s: Monitor all mw-on-k8s deployments with httpbb - https://phabricator.wikimedia.org/T334456 (Clement_Goubert)
[09:51:18] serviceops, MW-on-K8s: Monitor all mw-on-k8s deployments with httpbb - https://phabricator.wikimedia.org/T334456 (Clement_Goubert) Open→In progress p:Triage→High
[09:51:28] serviceops, MW-on-K8s, SRE, Traffic, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert)
[10:33:18] Today I'm planning on pooling thumbor-k8s in both DCs 50/50 metal/k8s and leaving it pooled for the foreseeable future. Any objections?
[10:34:06] not from me
[10:34:51] nice, go ahead!
[10:43:19] claime: are new services allowed to use mw-api-(async-)int directly or do you wish some control over it?
[10:43:37] They would be the first to use it
[10:43:47] Also, depends on the volume of calls
[10:47:05] Ah, I thought you had switched some already. I'll make a note on the task then and cc you just to be sure
[10:47:26] still some time until deployment, though :)
[10:47:31] Yep, thanks
[10:48:25] I haven't switched anybody yet for various reasons, number 1 being that I was waiting on testing procedures for various services, number 2 being that I realised we only httpbb test mw-web, but not mw-api-ext or mw-api-int
[10:49:02] I'm trying to do something not too disgusting to set up httpbb tests for every mw-* service
[11:06:25] elukey: all kafka-main brokers have transitioned away from the puppet ca cert right ?
[11:34:43] claime: being very naive about the httpbb puppet code: Could it make sense to add a field to the service::catalog type where one can specify the httpbb tests to run (`httpbb: appserver` in this case for example)?
[11:35:36] oh..I see there is probably only mediawiki being regularly checked, right?
[11:36:23] jayme: There's more than that, but we can't apply the way it's currently done to mw-on-k8s
[11:36:38] But you're right, mw-jobrunner shouldn't be checked with the appserver test suite
[11:37:20] So yes, it makes sense to add it for mw-on-k8s at least
[11:38:32] * jayme happy to provide more work ;)
[11:39:04] Even if mw-jobrunner isn't in the catalog yet, might as well set it up rn
[12:29:46] serviceops, DC-Ops, ops-codfw: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (Clement_Goubert) p:Triage→Medium a:Jhancock.wm
[12:31:13] serviceops, DC-Ops, ops-codfw: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (Clement_Goubert) ` cgoubert@mw2448:~$ sudo ipmi-sel ID | Date | Time | Name | Type | Event 1 | Jan-20-2023 | 16:14:47 | S...
[13:28:40] claime: correct yes!
[13:28:48] (sorry just got back from afk)
[13:29:24] claime: there is still some clean up to do in puppet private etc..
[13:29:32] elukey: no worries, it was nothing urgent. Do you think we can now remove the puppet CA certs ? They are warning for expiry. Disregard if it was already the plan
[13:31:50] claime: ah interesting didn't see it.. yes definitely
[13:32:22] I can stage the change on puppet private and if you have time to proof read it we could merge it now
[13:35:09] elukey: let's do that :)
[13:38:13] claime: puppet private change ready
[13:38:34] elukey: puppetmaster1001 ?
[13:38:40] yep
[13:39:28] and puppet disabled on the nodes
[13:42:10] elukey: lgtm
[13:48:57] claime: ran puppet on kafka-main2005, all good
[13:49:06] Nice.
[13:49:46] going to manually clean up the certs on the node as well
[13:50:33] ah and I think that we need to also revoke them as well
[13:50:51] Yep, puppet ca cert clean whatever
[13:51:54] Sorry, puppet ca clean --certname
[13:52:40] IIRC I used the "whatever" version the last time
[13:53:36] Right, puppet ca is deprecated, it's puppet cert clean whatever
[13:55:54] serviceops, API Platform, RESTbase Sunsetting, Epic, Platform Engineering Roadmap: Survey RESTBase services and find which ones accesses Parsoid via RESTBase - https://phabricator.wikimedia.org/T333536 (DAlangi_WMF)
[13:57:23] serviceops, API Platform, RESTbase Sunsetting, Epic, Platform Engineering Roadmap: Survey RESTBase services and find which ones accesses Parsoid via RESTBase - https://phabricator.wikimedia.org/T333536 (DAlangi_WMF)
[13:57:48] I am running puppet on all nodes, later on I'll clean the old certs from the CA (please do it if you have time, happy to skip it :)
[13:58:30] Yeah, I can do it, no problem, just tell me when the puppet run's done
[13:59:31] all cleaned up
[13:59:44] ack, cleaning up the certs
[14:01:06] All done
[14:01:07] serviceops, DC-Ops, SRE, ops-codfw: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (Papaul) @Jhancock.wm swap CPU2 with CPU1 and see if the error will report on CPU1 if it does then we will have to replace the CPU. if the error still shows on CPU...
[14:06:02] Should we be worried about https://alerts.wikimedia.org/?q=team%3Dsre&q=%40state%3Dactive&q=alertname%3DKafka%20MirrorMaker%20main-codfw_to_main-eqiad%20max%20lag%20in%20last%2010%20minutes ?
[14:06:46] I don't think so according to the doc https://wikitech.wikimedia.org/wiki/Kafka/Administration#main_mirroring
[14:09:28] Hmm looks like it can't connect to the codfw brokers
[14:10:21] Nevermind these are old logs
[14:10:26] (Apr 6)
[14:13:48] And the lag is going down
[14:16:29] serviceops, DC-Ops, SRE, ops-codfw: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b7d176fc-f1f5-4faf-a380-3ea6e306f06c) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 1 host(s)...
[14:23:54] jayme: I implemented your httpbb suggestion https://gerrit.wikimedia.org/r/c/operations/puppet/+/907814/
[14:25:09] cool, will check in a bit
[14:25:14] thx
[14:52:06] serviceops, DC-Ops, SRE, ops-codfw: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (herron) jftr I accepted this diff which came up during an unrelated sre.dns.netbox run ` diff --git a/hosts/mw2448.yaml b/hosts/mw2448.yaml index a58c536..120b45...
[14:58:07] serviceops, Machine-Learning-Team, SRE: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (elukey) a:elukey
[15:07:24] serviceops, DC-Ops, SRE, ops-codfw: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (Clement_Goubert) @herron Thanks, my bad, I forgot to run `sre.dns.netbox` after setting the node to failed. Adding to the documentation.
[15:14:03] the thumbor dashboard is pretty noisy atm (although there's lots of good info) - I made this one to drill down to the main failure signals: https://grafana-rw.wikimedia.org/d/gG5owlLVz/hnowlan-thumbor-failures
[15:14:18] probably will consolidate the above into the main one once we get rid of the metal instances
[15:34:56] serviceops, SRE, Thumbor, Thumbor Migration, Platform Team Workboards (Platform Engineering Reliability): Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (hnowlan)
[15:36:46] serviceops, SRE, Thumbor, Thumbor Migration, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (hnowlan) Thumbor-k8s is now pooled in both datacentres and, some kind of major issue notwithstanding, will remain pooled. Given the sheer age/size of this ticket,...
[15:36:49] serviceops, SRE, Thumbor, Thumbor Migration, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (hnowlan) Open→Resolved
[15:48:32] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 11), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (Ottomata) > I would suggest to experiment with proper values in DSE first (the charts values.yaml suggests 512Mi fo...
[15:49:13] jayme: o/ shall we rollout istio during these days, or is there still a problem with the daemonset?
[15:49:34] elukey: nono, all happy face
[15:49:41] ah nice!
[15:49:47] wanted to do tomorrow morning
[15:49:54] super, will do the same
[16:16:50] serviceops, SRE: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 (elukey) Final step - check if we have to migrate deployment-prep or not. See https://gerrit.wikimedia.org/r/c/operations/puppet/+/905954, some hiera settings may need to be added if we want to kee...
[18:03:49] serviceops, Platform Team Workboards (Platform Engineering Reliability): Replace Nutcracker - https://phabricator.wikimedia.org/T333019 (kamila) I am inclined to go with Envoy: it supports our use cases, has good performance (esp. with TLS), seems to have the best documentation, and most importantly is w...
[21:52:58] serviceops, DC-Ops, SRE, ops-codfw: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (Jhancock.wm) @Papaul swapped CPU1 and CPU2. all of the DIMM have been reseated. Powered back on and log has been cleared.
[22:19:28] serviceops, DC-Ops, SRE, ops-codfw: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (Papaul) @Jhancock.wm thank you. We will leave the task open until the end of the week to see if we do have any errors on CPU1