[08:30:02] 10serviceops, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[08:30:37] Morning folks, we (Data Persistence) are getting alerts about k8s pods running in the wrong place again - SessionStoreOnNonDedicatedHost on mw2297.codfw.wmnet I think?
[09:13:08] * akosiaris looking
[09:13:39] indeed
[09:14:01] looking into why
[09:16:28] apparently during a deployment 12h ago?
[09:17:33] yup, https://sal.toolforge.org/log/gGQCn40BxE1_1c7sN3q5
[09:22:26] Emperor: pod got rescheduled after a deletion, the alert should be resolved in a bit
[09:22:46] what's interesting is why the scheduler needed that
[09:23:02] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10Volans) FYI the host is up and running with the old OS but new puppet role and puppet disabled since 26 days, it has disappeared from puppetdb (because of the puppet disabled...
[09:26:12] akosiaris: thanks. Possibly-stupid question - should the alerts go to -serviceops rather than data-persistence?
[09:27:21] I think we are the ones taking care of the kask part of sessionstore (the cassandra part is with data persistence) anyway, so yeah.
[09:31:27] 10serviceops, 10iPoid-Service (iPoid 1.0): Determine cause of HTTP 503 errors for ~8% of MediaWiki requests to ipoid service - https://phabricator.wikimedia.org/T356766 (10kostajh) In case it helps, @STran spotted these "connection timeout" messages https://logstash.wikimedia.org/app/discover#/doc/0fade920-671...
[09:38:16] akosiaris: I've opened https://gerrit.wikimedia.org/r/c/operations/alerts/+/1002925 to move that alert to serviceops, if you wouldn't mind having a look please? :)
[09:39:06] +1ed
[09:39:31] <3
[09:52:18] Successfully assigned sessionstore/kask-production-6f8b7cf67b-sttws to mw2297.codfw.wmnet
[09:52:19] hmmm
[09:52:44] why though, the replicaset has a clear affinity rule to prefer the kask-dedicated nodes
[10:10:37] IIRC we changed that from a nodeSelector to a preferred-during-scheduling affinity at some point, to make sure the pods get scheduled even if the kask nodes are in trouble
[10:10:58] but ofc they should still be preferred
[10:17:29] yes, I have the same memory, so that matches
[10:18:59] the nodes btw are, requests-wise (which is what informs the scheduler), at 53% memory and 39% CPU. A single kask pod has a 2500m CPU request (18% of a node) and a 400Mi memory request (15% of a node)
[10:19:24] the strategy is the standard one of 25% max surge, 25% max unavailable
[10:19:57] so unless I am bad at math, something doesn't add up
[10:20:53] hmm... do the events from the pod say anything about the scheduling decision?
[10:24:04] the events have been purged from etcd, so looking into logstash
[10:24:11] up to now, nothing particular
[10:24:53] logs on kube-controller-manager and kube-scheduler have a single entry
[10:25:05] Feb 12 20:29:10 kubemaster2001 kube-controller-manager[854092]: I0212 20:29:10.577892 854092 event.go:294] "Event occurred" object="sessionstore/kask-production-6f8b7cf67b" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: kask-production-6f8b7cf67b-sttws"
[10:25:24] interestingly, nothing in kube-scheduler's output
[10:26:12] in fact the entire control-plane cluster (I've checked both nodes via cumin) has that single entry in journald for the sttws string
[10:28:55] yeah. Quite possible that only issues are logged (like the "not enough resources" things)
[10:51:20] yeah, that's what I was looking for. Failure on resources, but nope. No sign of it in logs
[10:57:31] 10serviceops, 10MW-on-K8s, 10SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[11:00:57] strange
[11:10:26] 10serviceops, 10iPoid-Service (iPoid 1.0): Determine cause of HTTP 503 errors for ~8% of MediaWiki requests to ipoid service - https://phabricator.wikimedia.org/T356766 (10jijiki) >>! In T356766#9534550, @kostajh wrote: >>>! In T356766#9534518, @jijiki wrote: >> so far: >> >> * dumped traffic at the pod leve...
[11:12:46] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, 10Release-Engineering-Team (Seen): Create parsoid mediawiki deployment - https://phabricator.wikimedia.org/T357392 (10akosiaris)
[11:13:08] 10serviceops, 10MW-on-K8s, 10SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup...
[11:14:15] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[11:27:20] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10hnowlan) >>! In T355333#9529467, @Jhancock.wm wrote: > I reseated the NIC and it connected. when I rebooted it went down again and didn't come up. swapped it out and rebooted...
[11:28:02] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**)...
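For reference, the scheduling behaviour discussed above comes down to the difference between a hard nodeSelector and a soft preferredDuringSchedulingIgnoredDuringExecution node affinity. A minimal sketch of the two stanzas follows, written as Python dicts mirroring the pod-spec structure; the label key/value and weight are placeholders, not the actual values from the kask chart in operations/deployment-charts.

```python
# Minimal sketch only: the real values live in the kask Helm chart in
# operations/deployment-charts; the "dedicated=kask" label below is a placeholder.

# Old approach (hard constraint): the pod is unschedulable if no labelled node fits it.
node_selector = {"dedicated": "kask"}

# Current approach (soft constraint): the scheduler prefers labelled nodes but will
# fall back to any other node (e.g. mw2297) if the preferred ones cannot fit the pod.
affinity = {
    "nodeAffinity": {
        "preferredDuringSchedulingIgnoredDuringExecution": [
            {
                "weight": 100,
                "preference": {
                    "matchExpressions": [
                        {"key": "dedicated", "operator": "In", "values": ["kask"]}
                    ]
                },
            }
        ]
    }
}

if __name__ == "__main__":
    import json
    print(json.dumps({"nodeSelector": node_selector, "affinity": affinity}, indent=2))
```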
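A quick back-of-the-envelope check of the figures quoted at 10:18, assuming the percentages are per-node allocatable requests and ignoring taints and any other constraints: even with an extra surge pod during the rollout, a dedicated node should still have had headroom, which matches the "something doesn't add up" conclusion.

```python
# Rough capacity check using the per-node request figures quoted above (assumptions:
# percentages are of allocatable resources on a kask-dedicated node; nothing else
# about the scheduling decision is modelled).
node_cpu_requested = 0.39   # 39% of allocatable CPU already requested
node_mem_requested = 0.53   # 53% of allocatable memory already requested
kask_pod_cpu = 0.18         # one kask pod: 2500m CPU request ~= 18% of a node
kask_pod_mem = 0.15         # one kask pod: 400Mi memory request ~= 15% of a node

# With maxSurge: 25% the rollout may place one extra pod before an old one terminates.
cpu_after = node_cpu_requested + kask_pod_cpu
mem_after = node_mem_requested + kask_pod_mem
print(f"requests after a surge pod: CPU {cpu_after:.0%}, memory {mem_after:.0%}")

# Both stay well under 100%, so resource pressure alone should not have pushed the
# pod off the dedicated nodes.
assert cpu_after < 1.0 and mem_after < 1.0
```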
[12:29:17] 10serviceops, 10MW-on-K8s, 10Release-Engineering-Team, 10SRE, 10Scap: Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402 (10Clement_Goubert)
[12:47:56] 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1431.eqiad.wmnet with OS bullseye
[12:48:21] 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1430.eqiad.wmnet with OS bullseye
[12:48:31] 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1434.eqiad.wmnet with OS bullseye
[12:48:34] 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1453.eqiad.wmnet with OS bullseye
[12:48:37] 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1385.eqiad.wmnet with OS bullseye
[13:21:18] 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1431.eqiad.wmnet with OS bullseye completed: - mw1431 (**PAS...
[13:24:25] 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1385.eqiad.wmnet with OS bullseye completed: - mw1385 (**PAS...
[13:26:36] 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1434.eqiad.wmnet with OS bullseye completed: - mw1434 (**PAS...
[13:29:01] 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1453.eqiad.wmnet with OS bullseye completed: - mw1453 (**PAS...
[13:31:57] 10serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1430.eqiad.wmnet with OS bullseye completed: - mw1430 (**PASS**) - Downtimed on...
[14:19:56] 10serviceops, 10iPoid-Service (iPoid 1.0): Determine cause of HTTP 503 errors for ~8% of MediaWiki requests to ipoid service - https://phabricator.wikimedia.org/T356766 (10jijiki) * tcpdump shows that upstream sends an RST, but nothing else useful {F41881301} {F41881318} * meanwhile, we restarted the envo...
[14:26:34] 10serviceops, 10iPoid-Service (iPoid 1.0): Determine cause of HTTP 503 errors for ~8% of MediaWiki requests to ipoid service - https://phabricator.wikimedia.org/T356766 (10kostajh)
[14:27:49] 10serviceops, 10iPoid-Service (iPoid 1.0): Determine cause of HTTP 503 errors for ~8% of MediaWiki requests to ipoid service - https://phabricator.wikimedia.org/T356766 (10kostajh) >>! In T356766#9537874, @jijiki wrote: > * meanwhile, we restarted the envoyproxies, which seems to have significantly improved...
[14:39:32] in operations/docker-images/production-images is there any reason for the weekly build to be committed to the git repository? :)
[14:41:43] having an auditable history that it happened?
[14:42:21] possibly yes
[14:42:22] ;)
[14:42:39] I'd have to remember to `git rebase` :)
[15:10:49] 10serviceops, 10Patch-For-Review, 10iPoid-Service (iPoid 1.0): Determine cause of HTTP 503 errors for ~8% of MediaWiki requests to ipoid service - https://phabricator.wikimedia.org/T356766 (10jijiki)
[15:13:52] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10Jhancock.wm) idk if this would help, but can we run the provisioning script with the --no-dhcp and --no-user tags. to catch any bios settings that might have changed?
[15:35:03] 10serviceops, 10Content-Transform-Team, 10MW-on-K8s, 10SRE, and 2 others: Create parsoid mediawiki deployment - https://phabricator.wikimedia.org/T357392 (10Jdforrester-WMF)
[15:58:48] 10serviceops: Cross fleet runc upgrades - https://phabricator.wikimedia.org/T356661 (10klausman)
[16:00:02] 10serviceops, 10Data-Engineering, 10Wikidata, 10Wikidata-Termbox, and 3 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (10akosiaris) Patches have been deployed, simple curl tests as well as `service-checker-swagger` checks have passed. I double checked the diff,...
[16:48:40] 10serviceops, 10SRE: Container Image policy for non-k8s uses - https://phabricator.wikimedia.org/T357441 (10MatthewVernon)
[17:15:44] 10serviceops, 10SRE: Container Image policy for non-k8s uses - https://phabricator.wikimedia.org/T357441 (10akosiaris) I'd argue that the policy already covers this, even if it isn't scoped (on purpose) outside of kubernetes production realms. The biggest issue isn't the non-Debian base but rather the fact t...
[17:22:09] 10serviceops, 10SRE: Container Image policy for non-k8s uses - https://phabricator.wikimedia.org/T357441 (10MatthewVernon) Thanks for your comment. >>! In T357441#9539023, @akosiaris wrote: > The process to build images out of those isn't trivial, but it isn't difficult either. I was obviously unclear in wh...
[19:38:05] 10serviceops, 10Domains, 10Fundraising-Backlog, 10SRE: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (10RLazarus) Hi from Service Ops SRE! @AKanji-WMF How long would you like the redirect to stay active? Adding @Jgreen and @Dwisehaupt from FR-Tech SRE, as I'm not sure where we'v...