[00:26:47] serviceops, MediaWiki-Platform-Team: Testing and verification of MediaWiki on PHP 8.1 in mwdebug-next - https://phabricator.wikimedia.org/T379986 (Scott_French) NEW
[00:26:58] serviceops, MediaWiki-Platform-Team: Testing and verification of MediaWiki on PHP 8.1 in mwdebug-next - https://phabricator.wikimedia.org/T379986#10324969 (Scott_French)
[00:27:07] serviceops, Dumps-Generation, MediaWiki-Platform-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432#10324970 (Scott_French)
[00:27:59] serviceops, Dumps-Generation, MediaWiki-Platform-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432#10324972 (Scott_French)
[00:30:30] serviceops, Patch-For-Review: Extend x-wikimedia-debug-routing.lua to support PHP 8.1 mw-debug deployment - https://phabricator.wikimedia.org/T372605#10324976 (Scott_French) 8.1 is live as of yesterday and passes basic httpbb checks (T372604#10318811). I think it makes sense to include it in `debug.json`...
[02:33:54] serviceops, Patch-For-Review: Monitoring to surface "low-traffic" jobs isolation failure - https://phabricator.wikimedia.org/T378609#10325210 (Scott_French) There are really two parts to this: 1. Identifying a sufficiently accurate alert signal and implementing the alert. 2. Defining an appropriate respo...
[09:49:03] serviceops, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984#10325811 (JMeybohm)
[10:11:30] serviceops, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984#10325866 (JMeybohm)
[10:12:32] serviceops, MediaWiki-Platform-Team: Testing and verification of MediaWiki on PHP 8.1 in mwdebug-next - https://phabricator.wikimedia.org/T379986#10325867 (akosiaris)
[10:41:18] serviceops, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984#10325961 (JMeybohm)
[11:09:59] serviceops, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10326076 (Clement_Goubert) Thanks a bunch!
[11:11:53] serviceops: wikikube-worker21[36-55] implementation tracking - https://phabricator.wikimedia.org/T377028#10326090 (Clement_Goubert) p:Triage→Medium
[11:12:05] serviceops: Decommission kubernetes20[07-14].codfw.wmnet - https://phabricator.wikimedia.org/T379788#10326088 (Clement_Goubert) p:Triage→Low
[11:24:15] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10326113 (ops-monitoring-bot) pool host wikikube-worker[1305-1312].eqiad.wmnet by cgoubert@cumin1002 with reason: None
[11:24:15] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10326114 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 pool for host wikikube-worker[1305-1312].eqiad.wmnet completed: - wikikube-worker[1305-1312].eqiad.w...
[11:27:41] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10326118 (Clement_Goubert) In progress→Resolved
[11:27:51] serviceops: Decommission kubernetes10[07-14] - https://phabricator.wikimedia.org/T380027 (Clement_Goubert) NEW
[11:28:58] serviceops: Decommission kubernetes10[07-14] - https://phabricator.wikimedia.org/T380027#10326132 (Clement_Goubert) p:Triage→Low
[11:31:11] serviceops, Prod-Kubernetes, Kubernetes: Update kubeconform schema and CI checks to new target Kubernetes version - https://phabricator.wikimedia.org/T379919#10326138 (Jelto) p:Triage→Medium
[11:45:49] serviceops: Decommission kubernetes20[07-14].codfw.wmnet - https://phabricator.wikimedia.org/T379788#10326187 (Clement_Goubert)
[12:24:47] serviceops, Kubernetes: Add 2 more nodes per DC to wikikube-staging - https://phabricator.wikimedia.org/T380043 (JMeybohm) NEW
[12:24:49] serviceops, Kubernetes: Add 2 more nodes per DC to wikikube-staging - https://phabricator.wikimedia.org/T380043#10326396 (JMeybohm)
[12:24:58] serviceops, Kubernetes: Add 2 more nodes per DC to wikikube-staging - https://phabricator.wikimedia.org/T380043#10326401 (JMeybohm)
[15:49:41] serviceops, Dumps 2.0 (Kanban Board): noc.wikimedia.org is slow and it times out sporadically - https://phabricator.wikimedia.org/T379968#10327209 (xcollazo) As of today, I cannot repro this behavior anymore. I will let #serviceops decide if we should pursue a postmortem analysis further than what @Scot...
[15:59:26] serviceops, Citoid, VisualEditor, VisualEditor-MediaWiki-References, and 2 others: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare - https://phabricator.wikimedia.org/T370118#10327284 (joanna_borun) Update: I've resubmitted the bot and written to Cloudflare, as...
[19:43:51] serviceops, Kubernetes: Create tool to monitor and automatically delete misbehaving pods - https://phabricator.wikimedia.org/T379901#10328147 (CDanis) >>! In T379901#10322298, @JMeybohm wrote: > I would like to see some more details on how this compares to "proper" readiness/liveness probes and in which...
[21:37:18] serviceops, Kubernetes: Create tool to monitor and automatically delete misbehaving pods - https://phabricator.wikimedia.org/T379901#10328441 (CDanis) And now some notes pseudo-inlined with the task description. ==== select failing pods using prometheus `lang=yaml - name: thumbor_failing_pod namespace...
[22:55:09] serviceops, Dumps 2.0 (Kanban Board): noc.wikimedia.org is slow and it times out sporadically - https://phabricator.wikimedia.org/T379968#10328618 (Scott_French) Thanks, @xcollazo. Agreed, yeah - I can't seem to reproduce this either today. However, I realized this morning that something didn't add up...