[06:29:45] serviceops, Dumps 2.0 (Kanban Board): noc.wikimedia.org is slow and it times out sporadically - https://phabricator.wikimedia.org/T379968#10329910 (JMeybohm) Thanks for the very thorough analysis @Scott_French! I think I found the missing piece by looking at logs from istiod: https://logstash.wikimedia.o...
[06:50:23] serviceops, Prod-Kubernetes, Kubernetes: Reimaging a kubernetes control-plane invalidates service-account tokens issued by it - https://phabricator.wikimedia.org/T380142 (JMeybohm) NEW
[06:51:00] serviceops, Prod-Kubernetes, Kubernetes: Reimaging a kubernetes control-plane invalidates service-account tokens issued by it - https://phabricator.wikimedia.org/T380142#10329929 (JMeybohm)
[06:51:02] serviceops, Dumps 2.0 (Kanban Board): noc.wikimedia.org is slow and it times out sporadically - https://phabricator.wikimedia.org/T379968#10329930 (JMeybohm)
[06:51:04] serviceops, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984#10329931 (JMeybohm)
[06:51:07] serviceops, Foundational Technology Requests, Prod-Kubernetes, Kubernetes: Kubernetes v1.23 use PKI for service-account signing (instead of cergen) - https://phabricator.wikimedia.org/T329826#10329932 (JMeybohm)
[09:28:04] serviceops, MediaWiki-Engineering, MediaWiki-Platform-Team: Testing and verification of MediaWiki on PHP 8.1 in mwdebug-next - https://phabricator.wikimedia.org/T379986#10330207 (MSantos)
[09:28:59] serviceops, MediaWiki-Engineering, MediaWiki-Platform-Team, OKR-Work: Testing and verification of MediaWiki on PHP 8.1 in mwdebug-next - https://phabricator.wikimedia.org/T379986#10330208 (MSantos)
[09:58:32] serviceops, Prod-Kubernetes, Kubernetes: Reimaging a kubernetes control-plane invalidates service-account tokens issued by it - https://phabricator.wikimedia.org/T380142#10330292 (JMeybohm)
[10:14:59] serviceops, MW-on-K8s, TimedMediaHandler, Patch-For-Review, Video: shellbox-video pods being restarted prematurely - https://phabricator.wikimedia.org/T373517#10330319 (hnowlan) Open→Resolved We've migrated to shellbox-video and the pod failures are no longer an issue thanks to th...
[10:27:50] serviceops: wikikube-worker21[56-70] implementation tracking - https://phabricator.wikimedia.org/T376966#10330386 (jijiki) p:Triage→Medium
[10:30:53] serviceops, Prod-Kubernetes, Kubernetes: Reimaging a kubernetes control-plane invalidates service-account tokens issued by it - https://phabricator.wikimedia.org/T380142#10330407 (jijiki) p:Triage→High
[10:31:05] serviceops, MediaWiki-Engineering, MediaWiki-Platform-Team, OKR-Work: Testing and verification of MediaWiki on PHP 8.1 in mwdebug-next - https://phabricator.wikimedia.org/T379986#10330409 (jijiki) p:Triage→Medium
[10:32:22] serviceops, Dumps 2.0 (Kanban Board): noc.wikimedia.org is slow and it times out sporadically - https://phabricator.wikimedia.org/T379968#10330414 (jijiki) p:Triage→Medium
[10:34:23] serviceops, MediaWiki-extensions-OAuth: Allow a user to disable an OAuth client - https://phabricator.wikimedia.org/T254190#10330418 (jijiki) p:Triage→Low
[11:06:28] serviceops, Infrastructure-Foundations, netops, Kubernetes: Reimage one of the wikikube-worker1240 to wikikube-worker1304 node in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10330555 (JMeybohm) Beware of {T380142}
[11:41:22] serviceops, Infrastructure-Foundations, netops, Kubernetes: Reimage one of the wikikube-worker1240 to wikikube-worker1304 node in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10330660 (cmooney) >>! In T379790#10322697, @akosiaris wrote: > Cool, thanks....
[12:36:19] serviceops, DC-Ops, ops-eqiad: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10330969 (Clement_Goubert) p:Triage→Medium a:Jclark-ctr
[12:40:03] serviceops: mc-gp200[4-6] implementation tracking - https://phabricator.wikimedia.org/T376969#10330992 (Clement_Goubert) a:Clement_Goubert→None
[12:40:49] serviceops: mc-gp200[4-6] implementation tracking - https://phabricator.wikimedia.org/T376969#10330993 (Clement_Goubert) p:Triage→Medium
[12:43:03] serviceops: steady increase in 503s from mw-api-ext-ro.discovery.wmnet since 5 UTC - https://phabricator.wikimedia.org/T367401#10331001 (Clement_Goubert) Open→Resolved a:Clement_Goubert
[13:18:14] serviceops, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984#10331101 (JMeybohm)
[13:22:54] serviceops, MW-on-K8s: Convert captchaloop to kubernetes CronJob - https://phabricator.wikimedia.org/T380167 (Clement_Goubert) NEW
[13:29:15] serviceops, MW-on-K8s: Convert characterEditStatsTranslate to kubernetes CronJob - https://phabricator.wikimedia.org/T380170 (Clement_Goubert) NEW
[13:36:12] serviceops, MW-on-K8s: Convert cirrus_build_completion_indices.sh to kubernetes CronJob - https://phabricator.wikimedia.org/T380171 (Clement_Goubert) NEW
[13:38:48] serviceops, MW-on-K8s: Convert cirrus_build_completion_indices.sh to kubernetes CronJob - https://phabricator.wikimedia.org/T380171#10331224 (Clement_Goubert) p:Triage→Medium
[13:38:50] serviceops, MW-on-K8s: Convert characterEditStatsTranslate to kubernetes CronJob - https://phabricator.wikimedia.org/T380170#10331225 (Clement_Goubert) p:Triage→Medium
[13:38:53] serviceops, MW-on-K8s: Convert captchaloop to kubernetes CronJob - https://phabricator.wikimedia.org/T380167#10331226 (Clement_Goubert) p:Triage→Medium
[13:39:05] serviceops, MW-on-K8s, Patch-For-Review: Add helper script functionality to our php images - https://phabricator.wikimedia.org/T377958#10331214 (Clement_Goubert)
[13:40:59] serviceops: mc-gp200[4-6] implementation tracking - https://phabricator.wikimedia.org/T376969#10331240 (jijiki) a:jijiki
[13:41:18] serviceops, MW-on-K8s, Patch-For-Review: Add helper script functionality to our php images - https://phabricator.wikimedia.org/T377958#10331219 (Clement_Goubert) In progress→Resolved Resolving as the general purpose helper scripts are now inside the image. Subtasks will track special cases.
[13:44:45] serviceops: docker-reporter-base-images.service failed on build2001 - https://phabricator.wikimedia.org/T364931#10331256 (Clement_Goubert) Open→Resolved a:Clement_Goubert Resolving since image upgrades fixed this issue.
[14:27:36] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10331441 (Clement_Goubert) Resolved→Open I messed up and reimaged to bullseye instead of bookworm. Reopening for reimage.
[14:28:39] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10331450 (ops-monitoring-bot) depool host wikikube-worker[1305-1312].eqiad.wmnet by cgoubert@cumin1002 with reason: reimage to bookwork
[14:32:30] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10331473 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 depool for host wikikube-worker[1305-1312].eqiad.wmnet completed: - wikikube-worker[1305-1312].eqiad...
[14:36:08] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10331492 (Clement_Goubert)
[14:51:09] serviceops, MW-on-K8s, TimedMediaHandler, Patch-For-Review, Video: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241#10331530 (hnowlan) In progress→Resolved As of the 13th of November, all video transcoding has been moved to shellbox-video. The ser...
[15:26:57] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10331663 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1305.eqiad.wmnet with OS bookworm
[15:28:44] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10331673 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1306.eqiad.wmnet with OS bookworm
[15:29:46] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10331676 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1307.eqiad.wmnet with OS bookworm
[15:30:49] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10331678 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1308.eqiad.wmnet with OS bookworm
[15:31:29] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10331685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1309.eqiad.wmnet with OS bookworm
[15:32:00] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10331690 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1310.eqiad.wmnet with OS bookworm
[15:35:01] serviceops, Dumps-Generation, MediaWiki-Platform-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432#10331717 (Reedy)
[15:36:25] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10331728 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1311.eqiad.wmnet with OS bookworm
[15:37:02] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10331729 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1312.eqiad.wmnet with OS bookworm
[15:45:28] serviceops, Content-Transform-Team-WIP, Push-Notification-Service, Wikipedia-Android-App-Backlog, Essential-Work: Timeout errors when making requests to Firebase for push notifications - https://phabricator.wikimedia.org/T379647#10331767 (MSantos)
[15:51:50] serviceops, Infrastructure-Foundations, netops, Kubernetes: Reimage one of the wikikube-worker1240 to wikikube-worker1304 node in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10331792 (cmooney) p:Triage→Medium
[16:08:44] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10331942 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1305.eqiad.wmnet with OS bookworm completed: - wikikube-worker1305 (**PASS**) - D...
[16:10:52] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10331964 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1307.eqiad.wmnet with OS bookworm completed: - wikikube-worker1307 (**PASS**) - D...
[16:14:05] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10331971 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1309.eqiad.wmnet with OS bookworm completed: - wikikube-worker1309 (**PASS**) - D...
[16:17:02] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10331996 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1308.eqiad.wmnet with OS bookworm completed: - wikikube-worker1308 (**PASS**) - D...
[16:19:40] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10332007 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1306.eqiad.wmnet with OS bookworm completed: - wikikube-worker1306 (**PASS**) - D...
[16:22:49] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10332034 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1312.eqiad.wmnet with OS bookworm completed: - wikikube-worker1312 (**PASS**) - D...
[16:28:22] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10332093 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1310.eqiad.wmnet with OS bookworm completed: - wikikube-worker1310 (**PASS**) - D...
[16:30:23] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10332112 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1311.eqiad.wmnet with OS bookworm completed: - wikikube-worker1311 (**PASS**) - D...
[16:34:43] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10332135 (ops-monitoring-bot) pool host wikikube-worker[1305-1312].eqiad.wmnet by cgoubert@cumin1002 with reason: None
[16:34:47] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10332136 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 pool for host wikikube-worker[1305-1312].eqiad.wmnet completed: - wikikube-worker[1305-1312].eqiad.w...
[16:55:36] serviceops: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022#10332282 (Clement_Goubert) Open→Resolved a:Clement_Goubert
[16:57:24] serviceops, DC-Ops, ops-codfw, SRE: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10332303 (Jhancock.wm)
[16:58:33] serviceops, DC-Ops, ops-codfw, SRE: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10332308 (Jhancock.wm) 2163 is being a pain. gonna take a closer look today. failed during imaging but didn't catch the error.
[17:07:13] serviceops, Traffic: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10332388 (JMeybohm) I see that I did not put this here, sorry. In the IPIP mail thread we suggested to set a fixed, smaller MTU for all Pod traffic in order to not have...
[17:11:28] serviceops, Thumbor: Alert on high Thumbor per-pod error rate - https://phabricator.wikimedia.org/T379559#10332409 (hnowlan) Open→Resolved a:hnowlan
[17:22:33] serviceops, Prod-Kubernetes, Kubernetes: Reimaging a kubernetes control-plane invalidates service-account tokens issued by it - https://phabricator.wikimedia.org/T380142#10332605 (CDanis) +1 to fingerprint as the key of keys +1 to blocking k8s control plane reimages until we re-key by fingerprint
[17:23:58] serviceops, Traffic: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10332637 (Vgutierrez) as mentioned on the email thread that sounds like viable option for us
[17:28:41] serviceops, Prod-Kubernetes, Traffic, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10332691 (JMeybohm)
[17:45:34] serviceops, Dumps 2.0 (Kanban Board): noc.wikimedia.org is slow and it times out sporadically - https://phabricator.wikimedia.org/T379968#10332792 (Scott_French) Thank you very much, @JMeybohm! That explains what we saw quite well.
[17:49:15] serviceops, Dumps 2.0 (Kanban Board): noc.wikimedia.org is slow and it times out sporadically - https://phabricator.wikimedia.org/T379968#10332794 (Scott_French) →Duplicate dup:T380142
[17:50:28] serviceops, envoy, SRE, Traffic: Upgrade Envoy to >= 1.24 - https://phabricator.wikimedia.org/T380211 (JMeybohm) NEW
[17:50:41] serviceops, Prod-Kubernetes, Kubernetes: Reimaging a kubernetes control-plane invalidates service-account tokens issued by it - https://phabricator.wikimedia.org/T380142#10332796 (Scott_French)
[17:50:42] serviceops, envoy, SRE, Traffic: Upgrade Envoy to >= 1.24 - https://phabricator.wikimedia.org/T380211#10332849 (JMeybohm)
[17:50:49] serviceops, envoy, SRE, Traffic, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324#10332850 (JMeybohm)
[17:50:54] does istio ingressgateway support generating http access logs?
[17:51:27] serviceops, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984#10332857 (JMeybohm)
[17:51:35] serviceops, envoy, SRE, Kubernetes, Service-Architecture: Upgrade envoy configuration to use the v3 API - https://phabricator.wikimedia.org/T265880#10332798 (JMeybohm) Open→Resolved a:JMeybohm I believe this is done https://gerrit.wikimedia.org/r/c/operations/puppet/+/754460
[17:55:25] serviceops: Package prometheus-mcrouter-exporter v0.4.0 - https://phabricator.wikimedia.org/T380212 (jijiki) NEW
[17:56:20] serviceops: Package prometheus-mcrouter-exporter v0.4.0 - https://phabricator.wikimedia.org/T380212#10332886 (jijiki) Open→In progress p:Triage→Medium
[19:00:05] serviceops, DC-Ops, ops-eqiad: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10333210 (Jclark-ctr) Opened ticket with Dell Advised of i/o errors on sda and uploaded tsr report ` [Sat Nov 9 08:53:19 2024] blk_update_request: I/O error, dev sda, sector 0 op 0x1:(...
[19:01:03] serviceops, DC-Ops, ops-eqiad: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10333216 (Jclark-ctr) Confirmed: Service Request 201149035
[19:17:48] serviceops, DC-Ops, ops-codfw, SRE: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10333264 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2163.codfw.wmnet with OS bookworm
[19:20:18] serviceops: Package prometheus-mcrouter-exporter v0.4.0 - https://phabricator.wikimedia.org/T380212#10333272 (jijiki)
[19:36:07] serviceops: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604#10333371 (Scott_French) In progress→Resolved Alright, I believe that's everything tracked here. The next and pinkunicorn deployments should be pretty much identical at this point, aside from t...
[19:36:48] serviceops, Structured-Data-Backlog, Thumbor: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error - https://phabricator.wikimedia.org/T374350#10333391 (Don-vip) If it helps, I still face problems, last one three minutes ago with https://commons.wikimed...
[19:56:56] serviceops, DC-Ops, ops-codfw, SRE: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10333502 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2163.codfw.wmnet with OS bookworm completed: - wi...
[19:58:33] serviceops, DC-Ops, ops-codfw, SRE: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10333508 (Jhancock.wm)
[19:58:42] serviceops, DC-Ops, ops-codfw, SRE: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10333509 (Jhancock.wm) @Clement_Goubert last batch done!
[20:25:39] serviceops, DC-Ops, ops-codfw, SRE: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10333608 (Jhancock.wm) Open→Resolved
[22:01:15] serviceops: Package prometheus-mcrouter-exporter v0.4.0 - https://phabricator.wikimedia.org/T380212#10333949 (jijiki)
[22:30:44] serviceops, Patch-For-Review: Monitoring to surface "low-traffic" jobs isolation failure - https://phabricator.wikimedia.org/T378609#10334008 (Scott_French) To spell out step #2 of the procedure described in T378609#10325210 more explicitly: Suppose the antagonist is `AntagonistJob` and the current prim...