[07:47:12] <wikibugs>	 10serviceops, 10Performance-Team, 10Wikimedia Enterprise, 10affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (10Kelson) Would that https://github.com/openzim/mwoffliner/issues/1664 fix the issue, so far we are really not...
[07:51:33] <wikibugs>	 10serviceops, 10Performance-Team, 10Wikimedia Enterprise, 10affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (10Ladsgroup) I don't think it would fix the issue. The issue is that you shouldn't hit our API for every page e...
[07:59:27] <wikibugs>	 10serviceops, 10Performance-Team, 10Wikimedia Enterprise, 10affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (10Kelson) @Ladsgroup The MWoffliner scraper has already been quite optimised over years. I have no obvious impr...
[08:06:25] <wikibugs>	 10serviceops, 10Performance-Team, 10Wikimedia Enterprise, 10affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (10Ladsgroup) I can think of several (I don't know the details of your system and might have missed something):...
[09:30:26] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (10Clement_Goubert) 05Resolved→03Open Hi,  Checking up on this on the server, it would seem it started failing again immediately: ` 32  | Dec-09-2022 | 13...
[09:40:37] <claime>	 parse1002 just rebooted on its own
[09:40:41] <claime>	 58  | Dec-12-2022 | 08:27:38 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 20h
[09:42:33] <claime>	 That doesn't match the reboot time tho
[09:49:10] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE, 10observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) I've dug into it a bit, and we have 3 brokers per datacenter for kafka-logging, so for balance's sake I'll create...
[09:50:06] <moritzm>	 claime: it's probably just misconfigured for DST. given that this is a new host under warranty and that it flagged multiple CPU alerts over the past months, we should get the CPU replaced, best to open a DC ops task
[09:50:39] <claime>	 moritzm: Yeah, will do
[09:52:26] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE, 10observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) ` cgoubert@kafka-logging1001:~$ kafka topics --create --topic mediawiki.http.accesslog --partitions 6 --replicatio...
[09:55:20] <wikibugs>	 10serviceops, 10Performance-Team, 10Wikimedia Enterprise, 10affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (10Kelson) >>! In T324866#8459496, @Ladsgroup wrote: > I can think of several (I don't know the details of your...
[09:57:43] <claime>	 It's a canary... _joe_ should I change conftool/scap config to swap it with another canary, or can we live with one of the 4 canaries depooled without breaking deployments?
[09:58:18] <_joe_>	 claime: set it to pooled=inactive first
[09:58:25] <_joe_>	 then we can check scap's lists
[09:58:34] <claime>	 ack
[10:00:41] <claime>	 done
[10:00:52] <claime>	 I'm doing the DCops phab at the same time
[10:04:12] <wikibugs>	 10serviceops, 10DC-Ops, 10ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert)
[10:04:37] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10elukey) Question about the scope of the cookbook - do we want to aggregate functionalities already present in other co...
[10:10:05] <wikibugs>	 10serviceops, 10DC-Ops, 10ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert) Host rebooted spontaneously: ` 09:30 <+icinga-wm> PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% 09:31 <claime> ^ checking 09:31...
[10:10:26] <wikibugs>	 10serviceops, 10DC-Ops, 10ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert) ` racadm>>racadm getsel Record:      1 Date/Time:   01/24/2022 17:43:06 Source:      system Severity:    Ok Description: Log cleared. -----------------...
[10:10:44] <wikibugs>	 10serviceops, 10DC-Ops, 10ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert) ` cgoubert@parse1002:~$ sudo ipmi-sel ID  | Date        | Time     | Name             | Type                        | Event 1   | Jan-24-2022 | 17:43:0...
[10:11:48] <wikibugs>	 10serviceops, 10DC-Ops, 10ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert) Host depooled: ` cgoubert@cumin1001:~$ sudo confctl  select 'name=parse1002.eqiad.wmnet' set/pooled=inactive The selector you chose has selected the fo...
[10:14:44] <claime>	 _joe_: host depooled, wdym by checking scap lists?
[10:16:30] <_joe_>	 claime: about 10 minutes after setting the host to inactive, grep -nr parse1002 /etc/dsh/group/ on deploy1002 should confirm you haven't missed anything
[10:50:29] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (10Jclark-ctr) Will take another look at server when I get in today.
[10:53:30] <claime>	 _joe_: cgoubert@deploy1002:~$ grep -nr parse1002 /etc/dsh/group/
[10:53:32] <claime>	 /etc/dsh/group/scap_targets:426:parse1002.eqiad.wmnet
[10:53:51] <claime>	 Should be good, we don't want to remove it completely from the targets right?
[11:01:00] <_joe_>	 right
[11:02:04] <claime>	 Cool.
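The check _joe_ describes at 10:16:30 can be wrapped in a small shell function. This is only a sketch: the `grep -nr <host> /etc/dsh/group/` call comes from the log, while the function name `in_dsh_groups`, the optional directory argument, and the added `-F` flag (literal match, so the dots in the host name are not treated as wildcards) are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helper: list the dsh group files that still mention a host.
# Only the grep invocation itself is taken from the log; the function name
# and the optional directory argument are assumptions for illustration.
in_dsh_groups() {
    host="$1"
    dir="${2:-/etc/dsh/group/}"
    # -n prints line numbers, -r recurses into the directory,
    # -F treats the host name as a fixed string rather than a regex.
    grep -nrF -- "$host" "$dir"
}

# Example use on a deploy host (grep exits with status 1 if nothing matches):
# in_dsh_groups parse1002.eqiad.wmnet
```

As in the exchange above, a remaining match in `scap_targets` is expected: the host stays a deployment target and is only taken out of the pooled set.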
[11:34:39] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE, 10observability, 10Patch-For-Review: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) >>! In T324439#8455654, @colewhite wrote: > At the beginning, we should configure logstash t...
[11:37:15] <wikibugs>	 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[11:40:28] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cafc663b-25d8-4e28-8aea-f704dec7742e) set by cgoubert@cumin1001 for 14 days, 0:00:00 on 1 host...
[11:40:52] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert) All yours DC-Ops :)
[11:43:03] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Scrape controller-manager and scheduler metrics - https://phabricator.wikimedia.org/T324959 (10JMeybohm)
[11:43:30] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Scrape controller-manager and scheduler metrics - https://phabricator.wikimedia.org/T324959 (10JMeybohm) p:05Triage→03Low
[11:44:15] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Scrape controller-manager and scheduler metrics - https://phabricator.wikimedia.org/T324959 (10JMeybohm)
[11:45:58] <wikibugs>	 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[11:46:19] <wikibugs>	 10serviceops: Revisit PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10Clement_Goubert)
[11:46:29] <wikibugs>	 10serviceops, 10observability: "PHP opcache hit ratio" alert shouldn't bother on mwdebug*/scandium/etc - https://phabricator.wikimedia.org/T254025 (10Clement_Goubert)
[11:52:49] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10JMeybohm) >>! In T277677#8459708, @elukey wrote: > Question about the scope of the cookbook - do we want to aggregate...
[12:13:58] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE, 10SRE Observability, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Clement_Goubert) >>! In T288851#7742391, @Krinkle wrote: >>>! In T288164#7742387, @Krinkle wrote: >> For the record, the logs from k8s-mwdebug p...
[12:33:11] <hnowlan>	 ahoyhoy - could someone validate my approach on this one? https://gerrit.wikimedia.org/r/c/operations/puppet/+/866445 
[12:33:24] <hnowlan>	 At this rate I see us testing this maybe once as this is my last week before the break
[12:47:37] <_joe_>	 hnowlan: the change is correct AFAICT; remember they're added with weight 0 and pooled status "inactive"
[12:47:44] <_joe_>	 so they won't appear in pybal immediately
[12:48:10] <_joe_>	 you probably want to just pool a couple of them at very low weight at first
[12:52:21] <hnowlan>	 yeah absolutely 
[12:53:03] <hnowlan>	 there won't be any issues with duplicate definitions of nodes I assume given that they're grouped under a different service - couldn't see other hosts with more than one definition like that 
[13:16:41] <wikibugs>	 10serviceops, 10Performance-Team, 10Wikimedia Enterprise, 10affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (10daniel) > In a way or the other, you need a cache to store the last version. The current approach is that usi...
[13:22:30] <wikibugs>	 10serviceops, 10Performance-Team, 10Wikimedia Enterprise, 10affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (10daniel) >>! In T324866#8459446, @Kelson wrote: > Would that https://github.com/openzim/mwoffliner/issues/1664...
[13:33:24] <_joe_>	 hnowlan: yeah no issues there, it's a different object under a different path 
[13:34:40] <hnowlan>	 _joe_: ah, cool. Okay to merge? Seems safe enough with the default pooled/weight 
[13:34:58] <_joe_>	 +1
[13:38:41] <hnowlan>	 thanks!
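The gradual pooling _joe_ suggests (new hosts arrive with weight 0 and pooled=inactive, then a couple are pooled at very low weight) might be prepared with a dry-run helper like the one below. It only prints commands and runs nothing. The host names and the service name are placeholders, and the `name=...,service=...` selector and the `set/pooled=yes:weight=N` form are assumptions modelled on the `set/pooled=inactive` call quoted earlier in the log, so check them against the conftool documentation before use.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) confctl commands that would pool
# the given hosts at a low weight under one service.
# Assumptions: the selector keys, the "pooled=yes:weight=N" syntax, and all
# host and service names below are illustrative, not taken from the log.
pool_low_weight() {
    service="$1"
    weight="$2"
    shift 2
    for host in "$@"; do
        printf "sudo confctl select 'name=%s,service=%s' set/pooled=yes:weight=%s\n" \
            "$host" "$service" "$weight"
    done
}

# Print the commands for two placeholder hosts at weight 1.
pool_low_weight example-service 1 host1001.example host1002.example
```

Including the service in the selector reflects hnowlan's point that the same nodes are defined under more than one service, so selecting by name alone could touch more than one object.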
[14:13:47] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (10Jclark-ctr) 05Open→03Resolved @Clement_Goubert  Swapped power supply out of recently decom Server  looks to have resolved issue
[14:24:05] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr Opened Dell support ticket Confirmed: Service Request 158148016 was successfully submitted
[14:38:12] <wikibugs>	 10serviceops, 10Maps, 10Patch-For-Review, 10Platform Team Workboards (Platform Engineering Reliability), and 2 others: Disable unused services on maps nodes - https://phabricator.wikimedia.org/T298246 (10hnowlan)
[14:46:34] <wikibugs>	 10serviceops, 10SRE, 10Thumbor, 10Thumbor Migration, and 2 others: tinyrgb is distributed via puppet - https://phabricator.wikimedia.org/T323775 (10hnowlan) 05In progress→03Resolved
[14:46:38] <wikibugs>	 10serviceops, 10SRE, 10Thumbor, 10Thumbor Migration, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan)
[14:59:58] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (10Clement_Goubert) Last ipmi-sel log line is: `51  | Dec-12-2022 | 12:59:47 | PS Redundancy    | Power Supply             | Fully Redundant`  Icinga all gree...
[15:09:23] <hnowlan>	 Scrounging another review if anyone has a minute - pretty simple fix for broken thumbor metrics in k8s https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/867186 
[15:11:24] <claime>	 hnowlan: +1'd
[15:14:18] <hnowlan>	 thanks! 
[15:46:14] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Add kubernetes 1.17+ topology annotations - https://phabricator.wikimedia.org/T270191 (10JMeybohm) a:03JMeybohm
[15:56:45] <hnowlan>	 is there an easy way to specify a custom strategy in our default chart scaffolding? Might need one for thumbor given the resource requirements/limits 
[15:59:09] <_joe_>	 wdym a default strategy?
[16:00:42] <hnowlan>	 adjusting maxSurge and maxUnavailable specifically 
[16:01:11] <_joe_>	 ah ofc you'd have to add that to your own chart
[16:01:18] <_joe_>	 we don't have anything for that
[16:01:37] <claime>	 If you want to make a module for that, do :D
[16:06:05] <_joe_>	 claime: "pull requests welcome"
[16:15:15] <claime>	 _joe_: x)
[16:36:21] <wikibugs>	 10serviceops, 10Data-Engineering, 10Discovery-Search (Current work), 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Gehel)
[16:38:50] <claime>	 16:24:55     +icinga-wm │ RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
[16:39:06] <claime>	 ^These were the pre-tremors for the page friday
[16:39:21] <claime>	 And it's still flapping so we're on the limit.
[16:42:30] <wikibugs>	 10serviceops, 10Data-Engineering, 10Discovery-Search (Current work), 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) > Ah, but the upstream helm chart does not work with this feature because of its use...
[16:51:07] <cdanis>	 claime: that's for codfw though? which i think flaps all the time
[16:51:08] <cdanis>	 https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1&from=now-7d&to=now
[16:51:11] <cdanis>	 doesn't look too bad
[16:51:49] <claime>	 cdanis: was in a meeting and just saw it out of the corner of my eye
[16:51:57] <claime>	 I need to do an IR for Friday
[16:55:14] <wikibugs>	 10serviceops: Incident: 2022-12-12 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10Clement_Goubert)
[17:16:50] <wikibugs>	 10serviceops: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10Clement_Goubert)
[17:42:41] <_joe_>	 claime: we get like 3-4 requests per second for POSTs in codfw
[17:43:02] <claime>	 Right, I actually confused it with another alert
[17:43:09] <claime>	 That's my bad
[17:43:48] <_joe_>	 so yeah that should not alert there; I was waiting to move it all to prometheus
[19:44:31] <wikibugs>	 10serviceops, 10Wikimedia Enterprise, 10Performance-Team (Radar), 10affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (10Krinkle)
[21:53:44] <wikibugs>	 10serviceops, 10Data-Engineering, 10Discovery-Search (Current work), 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata)
[23:16:03] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Jclark-ctr) @Clement_Goubert  dell has requested firmware updates    Updated BIOS and iDRAC firmware to latest versions as BIOS firmware contains updated proce...
[23:17:33] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Jclark-ctr) 05Open→03Resolved