[07:47:12] serviceops, Performance-Team, Wikimedia Enterprise, affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (Kelson) Would that https://github.com/openzim/mwoffliner/issues/1664 fix the issue, so far we are really not...
[07:51:33] serviceops, Performance-Team, Wikimedia Enterprise, affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (Ladsgroup) I don't think it would fix the issue. The issue is that you shouldn't hit our API for every page e...
[07:59:27] serviceops, Performance-Team, Wikimedia Enterprise, affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (Kelson) @Ladsgroup The MWoffliner scraper has already been quite optimised over years. I have no obvious impr...
[08:06:25] serviceops, Performance-Team, Wikimedia Enterprise, affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (Ladsgroup) I can think of several (I don't know the details of your system and might have missed something):...
[09:30:26] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (Clement_Goubert) Resolved→Open Hi, Checking up on this on the server, it would seem it started failing again immediately: ` 32 | Dec-09-2022 | 13...
[09:40:37] parse1002 just rebooted on its own
[09:40:41] 58 | Dec-12-2022 | 08:27:38 | CPU Machine Chk | Processor | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 20h
[09:42:33] That doesn't match the reboot time tho
[09:49:10] serviceops, MW-on-K8s, SRE, observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (Clement_Goubert) I've dug into it a bit, and we have 3 brokers per datacenter for kafka-logging, so for balance's sake I'll create...
[09:50:06] claime: it's probably just misconfigured for DST. given that this is a new host under warranty and that it flagged multiple CPU alerts over the past months, we should get the CPU replaced, best to open a DC ops task
[09:50:39] moritzm: Yeah, will do
[09:52:26] serviceops, MW-on-K8s, SRE, observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (Clement_Goubert) ` cgoubert@kafka-logging1001:~$ kafka topics --create --topic mediawiki.http.accesslog --partitions 6 --replicatio...
[09:55:20] serviceops, Performance-Team, Wikimedia Enterprise, affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (Kelson) >>! In T324866#8459496, @Ladsgroup wrote: > I can think of several (I don't know the details of your...
[09:57:43] It's a canary... _joe_ should I change conftool/scap config to swap it with another canary, or can we live with one of the 4 canaries depooled without breaking deployments?
[09:58:18] <_joe_> claime: set it to pooled=inactive first
[09:58:25] <_joe_> then we can check scap's lists
[09:58:34] ack
[10:00:41] done
[10:00:52] I'm doing the DCops phab at the same time
[10:04:12] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (Clement_Goubert)
[10:04:37] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (elukey) Question about the scope of the cookbook - do we want to aggregate functionalities already present in other co...
[10:10:05] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (Clement_Goubert) Host rebooted spontaneously: ` 09:30 <+icinga-wm> PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% 09:31 ^ checking 09:31...
[10:10:26] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (Clement_Goubert) ` racadm>>racadm getsel Record: 1 Date/Time: 01/24/2022 17:43:06 Source: system Severity: Ok Description: Log cleared. -----------------...
[10:10:44] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (Clement_Goubert) ` cgoubert@parse1002:~$ sudo ipmi-sel ID | Date | Time | Name | Type | Event 1 | Jan-24-2022 | 17:43:0...
[10:11:48] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (Clement_Goubert) Host depooled: ` cgoubert@cumin1001:~$ sudo confctl select 'name=parse1002.eqiad.wmnet' set/pooled=inactive The selector you chose has selected the fo...
[10:14:44] _joe_: host depooled, wdym by checking scap lists?
[10:16:30] <_joe_> claime: about 10 minutes after setting the host to inactive, grep -nr parse1002 /etc/dsh/group/ on deploy1002 should confirm you haven't missed anything
[10:50:29] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (Jclark-ctr) Will take another look at server when I get in today.
[10:53:30] _joe_: cgoubert@deploy1002:~$ grep -nr parse1002 /etc/dsh/group/
[10:53:32] /etc/dsh/group/scap_targets:426:parse1002.eqiad.wmnet
[10:53:51] Should be good, we don't want to remove it completely from the targets right?
[11:01:00] <_joe_> right
[11:02:04] Cool.
[11:34:39] serviceops, MW-on-K8s, SRE, observability, Patch-For-Review: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (Clement_Goubert) >>! In T324439#8455654, @colewhite wrote: > At the beginning, we should configure logstash t...
[11:37:15] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[11:40:28] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cafc663b-25d8-4e28-8aea-f704dec7742e) set by cgoubert@cumin1001 for 14 days, 0:00:00 on 1 host...
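Editorial aside on the timestamp mismatch raised at 09:42-09:50: the gap can be checked with a little arithmetic. The sketch below is only an illustration. It assumes the SEL entry (08:27:38) and the Icinga DOWN alert (09:30, as quoted in T324949) fall on the same day and that the Icinga time is accurate to the minute; the function name and the five-minute tolerance are invented for this example.

```python
from datetime import datetime, timedelta


def clock_skew_hours(sel_time, observed_time, tolerance_min=5):
    """Return the whole-hour offset between two same-day HH:MM:SS times,
    or None if the gap is not within tolerance of a whole number of hours."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(observed_time, fmt) - datetime.strptime(sel_time, fmt)
    # Nearest whole number of hours, then how far the real gap is from it.
    hours = round(delta / timedelta(hours=1))
    remainder = abs(delta - timedelta(hours=hours))
    if remainder <= timedelta(minutes=tolerance_min):
        return hours
    return None


# SEL machine-check entry vs. the Icinga "Host parse1002 is DOWN" alert.
print(clock_skew_hours("08:27:38", "09:30:00"))  # prints 1
```

A one-hour offset is consistent with the DST misconfiguration suggested in the channel, but it does not prove it; the remaining couple of minutes could simply be check latency.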
[11:40:52] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (Clement_Goubert) All yours DC-Ops :)
[11:43:03] serviceops, Prod-Kubernetes, Kubernetes: Scrape controller-manager and scheduler metrics - https://phabricator.wikimedia.org/T324959 (JMeybohm)
[11:43:30] serviceops, Prod-Kubernetes, Kubernetes: Scrape controller-manager and scheduler metrics - https://phabricator.wikimedia.org/T324959 (JMeybohm) p:Triage→Low
[11:44:15] serviceops, Prod-Kubernetes, Kubernetes: Scrape controller-manager and scheduler metrics - https://phabricator.wikimedia.org/T324959 (JMeybohm)
[11:45:58] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[11:46:19] serviceops: Revisit PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (Clement_Goubert)
[11:46:29] serviceops, observability: "PHP opcache hit ratio" alert shouldn't bother on mwdebug*/scandium/etc - https://phabricator.wikimedia.org/T254025 (Clement_Goubert)
[11:52:49] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (JMeybohm) >>! In T277677#8459708, @elukey wrote: > Question about the scope of the cookbook - do we want to aggregate...
[12:13:58] serviceops, MW-on-K8s, SRE, SRE Observability, Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (Clement_Goubert) >>! In T288851#7742391, @Krinkle wrote: >>>! In T288164#7742387, @Krinkle wrote: >> For the record, the logs from k8s-mwdebug p...
[12:33:11] ahoyhoy - could someone validate my approach on this one? https://gerrit.wikimedia.org/r/c/operations/puppet/+/866445
[12:33:24] At this rate I see us testing this maybe once as this is my last week before the break
[12:47:37] <_joe_> hnowlan: the change is correct AFAICT; remember they're added with weight 0 and pooled status "inactive"
[12:47:44] <_joe_> so they won't appear in pybal immediately
[12:48:10] <_joe_> you probably want to just pool a couple of them at very low weight at first
[12:52:21] yeah absolutely
[12:53:03] there won't be any issues with duplicate definitions of nodes I assume given that they're grouped under a different service - couldn't see other hosts with more than one definition like that
[13:16:41] serviceops, Performance-Team, Wikimedia Enterprise, affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (daniel) > In a way or the other, you need a cache to store the last version. The current approach is that usi...
[13:22:30] serviceops, Performance-Team, Wikimedia Enterprise, affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (daniel) >>! In T324866#8459446, @Kelson wrote: > Would that https://github.com/openzim/mwoffliner/issues/1664...
[13:33:24] <_joe_> hnowlan: yeah no issues there, it's a different object under a different path
[13:34:40] _joe_: ah, cool. Okay to merge? Seems safe enough with the default pooled/weight
[13:34:58] <_joe_> +1
[13:38:41] thanks!
[14:13:47] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (Jclark-ctr) Open→Resolved @Clement_Goubert Swapped power supply out of recently decom Server looks to have resolved issue
[14:24:05] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (Jclark-ctr) a:Cmjohnson→Jclark-ctr Opened Dell support ticket Confirmed: Service Request 158148016 was successfully submitted
[14:38:12] serviceops, Maps, Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), and 2 others: Disable unused services on maps nodes - https://phabricator.wikimedia.org/T298246 (hnowlan)
[14:46:34] serviceops, SRE, Thumbor, Thumbor Migration, and 2 others: tinyrgb is distributed via puppet - https://phabricator.wikimedia.org/T323775 (hnowlan) In progress→Resolved
[14:46:38] serviceops, SRE, Thumbor, Thumbor Migration, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (hnowlan)
[14:59:58] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (Clement_Goubert) Last ipmi-sel log line is: `51 | Dec-12-2022 | 12:59:47 | PS Redundancy | Power Supply | Fully Redundant` Icinga all gree...
[15:09:23] Scrounging another review if anyone has a minute - pretty simple fix for broken thumbor metrics in k8s https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/867186
[15:11:24] hnowlan: +1'd
[15:14:18] thanks!
[15:46:14] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Add kubernetes 1.17+ topology annotations - https://phabricator.wikimedia.org/T270191 (JMeybohm) a:JMeybohm
[15:56:45] is there an easy way to specify a custom strategy in our default chart scaffolding? Might need one for thumbor given the resource requirements/limits
[15:59:09] <_joe_> wdym a default strategy?
[16:00:42] adjusting maxSurge and maxUnavailable specifically
[16:01:11] <_joe_> ah ofc you'd have to add that to your own chart
[16:01:18] <_joe_> we don't have anything for that
[16:01:37] If you want to make a module for that, do :D
[16:06:05] <_joe_> claime: "pull requests welcome"
[16:15:15] _joe_: x)
[16:36:21] serviceops, Data-Engineering, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Gehel)
[16:38:50] 16:24:55 +icinga-wm │ RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds.
[16:39:06] ^These were the pre-tremors for the page friday
[16:39:21] And it's still flapping so we're on the limit.
[16:42:30] serviceops, Data-Engineering, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) > Ah, but the upstream helm chart does not work with this feature because of its use...
[16:51:07] claime: that's for codfw though? which i think flaps all the time
[16:51:08] https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1&from=now-7d&to=now
[16:51:11] doesn't look too bad
[16:51:49] cdanis: was in a meeting and just saw it out of the corner of my eye
[16:51:57] I need to do an IR for Friday
[16:55:14] serviceops: Incident: 2022-12-12 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Clement_Goubert)
[17:16:50] serviceops: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Clement_Goubert)
[17:42:41] <_joe_> claime: we get like 3-4 requests per second for POSTs in codfw
[17:43:02] Right, I actually confused it with another alert
[17:43:09] That's my bad
[17:43:48] <_joe_> so yeah that should not alert there; I was waiting to move it all to prometheus
[19:44:31] serviceops, Wikimedia Enterprise, Performance-Team (Radar), affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (Krinkle)
[21:53:44] serviceops, Data-Engineering, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata)
[23:16:03] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (Jclark-ctr) @Clement_Goubert dell has requested firmware updates Updated BIOS and iDRAC firmware to latest versions as BIOS firmware contains updated proce...
[23:17:33] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (Jclark-ctr) Open→Resolved
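Editorial aside on the 16:00 exchange about adjusting maxSurge and maxUnavailable for thumbor: the sketch below shows how those two settings bound the pod count during a Kubernetes Deployment rolling update, which is what matters when each pod has large resource requests. It follows the documented rounding rules (a percentage maxSurge rounds up, a percentage maxUnavailable rounds down, both default to 25%). The function names and the replica count of 10 are hypothetical and are not taken from the thumbor chart.

```python
def resolve(value, replicas, round_up):
    """Turn an absolute count or a percentage string such as "25%" into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        pct = int(value[:-1])
        if round_up:
            return -(-replicas * pct // 100)  # integer ceiling
        return replicas * pct // 100          # integer floor
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return (max total pods, min available pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # Kubernetes rejects this combination: the rollout could never progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return replicas + surge, replicas - unavailable


# Defaults on a hypothetical 10-replica deployment: up to 13 pods, at least 8 available.
print(rollout_bounds(10))  # prints (13, 8)
# Surge one pod at a time and never drop below the desired count.
print(rollout_bounds(10, max_surge=1, max_unavailable=0))  # prints (11, 10)
```

With resource-heavy pods, a lower maxSurge caps how much extra capacity the cluster must find mid-rollout, at the cost of a slower deploy.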