[06:12:30] serviceops, Cassandra: Cassandra instance with corrupted commit log after powercycle of restbase1027 - https://phabricator.wikimedia.org/T345058 (elukey) No no for the moment it is fine, what I wanted to do is to avoid single persons on-call for events like Cassandra being in trouble (namely, you :). We...
[08:41:55] serviceops, envoy: Puppet doesn't self-recover when build-envoy-config leaves behind a zero-byte envoy.yaml - https://phabricator.wikimedia.org/T346129 (fgiunchedi)
[09:17:25] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (thiemowmde) Is this possibly related to toolforge.org being entirely unavailable right now? No matter what I try, everything fails with 502 or 503. Prominent examp...
[09:18:21] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (RhinosF1) >>! In T341780#9159438, @thiemowmde wrote: > Is this possibly related to toolforge.org being entirely unavailable right now? No matter what I try, everyt...
[09:20:35] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (Vgutierrez) >>! In T341780#9159438, @thiemowmde wrote: > Is this possibly related to toolforge.org being entirely unavailable right now? No matter what I try, ever...
[09:52:19] serviceops, MW-on-K8s, Observability-Logging, Patch-For-Review: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (CodeReviewBot) oblivian opened https://gitlab.wikimedia.org/repos/sre/glogger/-/merge_requests/2 Introduce glogger
[13:29:17] serviceops, observability: Hardcode the SLO time windows in Grafana dashboards generated via Grizzly - https://phabricator.wikimedia.org/T346144 (elukey)
[13:45:53] serviceops, Parsoid (Tracking): Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (MoritzMuehlenhoff) @ssastry It seems like you're maybe missing the proxy setting? Can you please retry with --proxy http://url-downloader.wikimedia.org:8080? I just tried a npm --proxy ht...
[13:58:27] serviceops, SRE, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (Trizek-WMF)
[14:02:56] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jclark-ctr)
[14:04:11] serviceops, SRE, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (Trizek-WMF)
[14:06:06] serviceops, observability, Patch-For-Review: Hardcode the SLO time windows in Grafana dashboards generated via Grizzly - https://phabricator.wikimedia.org/T346144 (herron) +1 for trying this. Thinking out loud: 1) With something like this in place should we worry about an alternate workflow to insp...
[14:07:46] serviceops, SRE, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (Trizek-WMF) I worked on some of the items for this week. I let @UOzurumba check on the done elements. Distribut...
[14:15:00] serviceops, SRE, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (UOzurumba)
[14:22:22] serviceops, SRE, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (UOzurumba)
[14:37:49] serviceops, DC-Ops, SRE, ops-codfw: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (MoritzMuehlenhoff) I've set the Netbox status back to Active.
[16:51:29] serviceops, SRE, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (kamila)
[16:53:12] serviceops, SRE, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (kamila) Thank you for the reminder @Trizek-WMF! I sent a Slack announcement a while ago and now sent a reminder...
[18:25:03] serviceops, MW-on-K8s, Release-Engineering-Team (Seen): Create mw-videoscaler helmfile deployment - https://phabricator.wikimedia.org/T321899 (BCornwall)
[20:57:48] serviceops, SRE, ops-codfw: mw2444 down - https://phabricator.wikimedia.org/T345884 (Jhancock.wm) I've replaced the CPU. We should know by Friday if it has issues. I will leave the ticket open until then. return tracking for the bad part (which I will also hold until Friday): 783629071254
[22:10:54] serviceops, Parsoid (Tracking): Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (ssastry) Aha .. okay! That did the trick. Thanks for getting us this far @MoritzMuehlenhoff. Now, the only thing left is getting the mysql grants ... `Sep 12 22:07:09 testreduce1002 nodejs[2595...