[09:03:25] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Clement_Goubert) >>! In T326425#8505438, @Dzahn wrote: > 18:16 <+icinga-wm> PROBLEM - Host mw1486 is DOWN: PING CRITICAL - Packet loss = 100...
[09:44:56] <wikibugs>	 serviceops, Data-Engineering-Planning, Event-Platform Value Stream, SRE-OnFire, and 2 others: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (JMeybohm) >>! In T324994#8463619, @Clement_Goubert wrote: > We have the resources to keep it at 30 replic...
[09:48:08] <wikibugs>	 serviceops, Data-Engineering-Planning, Event-Platform Value Stream, SRE-OnFire, and 2 others: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Clement_Goubert) No worries, I took a look at the resources and it seemed fine to leave it like that. We...
[10:45:41] <wikibugs>	 serviceops, MW-on-K8s: mw-debug uses production images instead of debug - https://phabricator.wikimedia.org/T326542 (Clement_Goubert)
[10:46:22] <wikibugs>	 serviceops, MW-on-K8s: mw-debug uses production images instead of debug - https://phabricator.wikimedia.org/T326542 (Clement_Goubert) Open→In progress
[10:50:36] <claime>	 No objection to me starting rolling reboots for appserver cluster today?
[10:55:42] <jayme>	 not from me, but thanks for doing that! <3
[10:55:51] <claime>	 np
[10:55:54] <jayme>	 do you always run those in the SRE deployment windows?
[10:56:05] <claime>	 Not really
[10:56:12] <claime>	 That's more for mw-on-k8s stuff
[10:56:38] <jayme>	 yeah, I know. I just thought because of the time correlation :)
[10:56:41] <claime>	 It'll probably take most of the day, with pauses to avoid scap deploy failures
[10:56:54] <claime>	 Well today I'm taking advantage to start yeah
[10:57:33] <jayme>	 do you manually select batches of a specific size to realize the pause during deployment windows?
[11:04:05] <claime>	 I wait until the current 3 machine reboot is done, ctrl-c and cleanup manually
[11:04:28] <claime>	 Then I get the list of machines up to date with cumin and use them in the --exclude of the cookbook
[11:04:41] <claime>	 (so no, I don't :))
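The manual resume step described above (collect the hosts that are already up to date, then pass them to the cookbook's --exclude) can be sketched roughly as follows. This is only an illustration under stated assumptions: the uptime values would come from a cumin query that is not shown in the log, "already rebooted" is approximated here as "booted after the reboot run started", and the comma-separated output format, the host names, and the dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def already_rebooted(uptimes, campaign_start, now):
    """Return the hosts whose last boot happened after the reboot run started.

    uptimes maps host name -> uptime in seconds (assumed to be collected
    separately, e.g. via cumin; that query is not part of this sketch).
    """
    done = []
    for host, uptime_seconds in sorted(uptimes.items()):
        boot_time = now - timedelta(seconds=uptime_seconds)
        if boot_time >= campaign_start:
            done.append(host)
    return done


def exclude_argument(hosts):
    """Join hosts into one string; the exact --exclude syntax is an assumption."""
    return ",".join(hosts)


# Hypothetical hosts and arbitrary timestamps, for illustration only.
now = datetime(2023, 1, 9, 11, 0, tzinfo=timezone.utc)
start = datetime(2023, 1, 9, 10, 0, tzinfo=timezone.utc)
uptimes = {
    "mw1001.example": 600,      # booted 10:50, after the run started
    "mw1002.example": 864000,   # up for 10 days, still needs a reboot
    "mw1003.example": 1800,     # booted 10:30, after the run started
}
print(exclude_argument(already_rebooted(uptimes, start, now)))
# prints "mw1001.example,mw1003.example"
```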
[11:08:43] <wikibugs>	 serviceops, Icinga, SRE, SRE Observability: High average POST latency for mw requests on api_appserver in codfw on alert1001 - https://phabricator.wikimedia.org/T326544 (jcrespo)
[11:11:37] <wikibugs>	 serviceops, Icinga, SRE, SRE Observability: High average POST latency for mw requests on api_appserver in codfw on alert1001 - https://phabricator.wikimedia.org/T326544 (jcrespo)
[11:13:30] <volans>	 you could improve the cookbook to either allow to be stopped/resumed and/or skip hosts already rebooted :)
[11:16:58] <wikibugs>	 serviceops, Icinga, SRE, SRE Observability: High average POST latency for mw requests on api_appserver in codfw on alert1001 - https://phabricator.wikimedia.org/T326544 (jcrespo)
[11:20:49] <volans>	 claime: what manual cleanup? could the cookbook handle that? (there is a hook called on failure, if you're interested I can give you more pointers)
[11:21:32] <claime>	 volans: Removing downtimes, repooling.
[11:22:00] <claime>	 Which we don't want the cookbook to do on its own on failure :p
[11:23:36] <volans>	 got it, lmk if there is anything I could help with :)
[11:24:09] <claime>	 thanks <3
[11:49:24] <_joe_>	 volans: I find that for large clusters having the ability to pick up a job where it was interrupted would be great
[11:50:03] <_joe_>	 but even if we could just have a way to schedule a pause in execution.
[11:51:12] <volans>	 indeed
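A minimal sketch of the stop/resume and scheduled-pause idea discussed above, assuming a small JSON state file that records finished hosts. Every name here (run_batches, the state file, the reboot and should_pause callbacks) is hypothetical and is not part of the existing cookbook; cleanup such as removing downtimes and repooling is deliberately left out, since the channel prefers that to stay manual on failure.

```python
import json
import tempfile
from pathlib import Path


def load_done(state_path):
    """Read the set of hosts recorded as finished by an earlier run."""
    if state_path.exists():
        return set(json.loads(state_path.read_text()))
    return set()


def run_batches(hosts, state_path, batch_size, reboot, should_pause):
    """Reboot hosts in batches, skipping hosts recorded in state_path.

    should_pause() is checked before each batch (for example, "is a
    deployment window about to start?"). Returns True when every host is
    done, False when the run stopped early and can be resumed later.
    """
    done = load_done(state_path)
    pending = [h for h in hosts if h not in done]
    for i in range(0, len(pending), batch_size):
        if should_pause():
            return False
        for host in pending[i:i + batch_size]:
            reboot(host)
            done.add(host)
        # Persist progress only after a full batch has completed.
        state_path.write_text(json.dumps(sorted(done)))
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        state = Path(tmp) / "reboot-state.json"
        hosts = [f"host{n}.example" for n in range(1, 8)]
        finished = run_batches(hosts, state, 3, print, lambda: False)
        print("finished:", finished)
```

Writing the state file after each completed batch matches the workflow of waiting for the current batch to finish before stopping; an interrupt in the middle of a batch would leave those hosts unrecorded, so they would be retried on resume.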
[12:40:18] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (LSobanski)
[12:56:20] <wikibugs>	 serviceops, MW-on-K8s, Patch-For-Review: mw-debug uses production images instead of debug - https://phabricator.wikimedia.org/T326542 (Clement_Goubert)
[13:00:24] <wikibugs>	 serviceops, SRE, Wikimedia-Portals, Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (LSobanski)
[14:16:10] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Jclark-ctr) Preformed  Flea Power Drain As requested by Dell
[14:20:10] <wikibugs>	 serviceops, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 06): k8s deployment-charts mesh module should allow use of mesh without public_port Service - https://phabricator.wikimedia.org/T326252 (JArguello-WMF)
[14:20:44] <wikibugs>	 serviceops, Data-Engineering-Planning, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 06), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (JArguello-WMF)
[14:21:53] <ottomata>	 thanks jayme, responded to recent comments
[14:46:43] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (ssingh) >>! In T326425#8508075, @Clement_Goubert wrote: >>>! In T326425#8505438, @Dzahn wrote: >> 18:16 <+icinga-wm> PROBLEM - Host mw1486 i...
[14:52:00] <wikibugs>	 serviceops, DC-Ops, ops-eqiad: hw troubleshooting:  CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (Jclark-ctr) Cleared CEL  Dell requested set the system profile to performance
[14:58:05] <ottomata>	 jayme:  i'm getting the feeling you'd prefer if we made the image name be flink1 ?
[15:00:22] <jayme>	 ottomata: Sorry if I came across strong, I was actually just trying to lay out my reasoning. I don't feel super strongly about this, I just assume we will end up with different versions at some point :)
[15:01:04] <ottomata>	 no sorries!  :)
[15:01:17] <jayme>	 and, as said, I'm not sure docker-pkg supports having the same image (name) declared in more than one directory - have not thought about that before
[15:01:23] <ottomata>	 ahhh i see
[15:01:40] <ottomata>	 hm, so if we have to do that in the future, we'd just rename all the dirs right?
[15:02:17] <jayme>	 if we would need to go from flink to flinkX.Y you mean?
[15:03:20] <jayme>	 I think that would require renaming the directory + renaming the image ("Package" in the control file) from flink to flinkX.Y
[15:03:39] <jayme>	 but I'm not super sure about the latter
[15:26:20] <ottomata>	 right, that seems okay to me to do later?  
[15:26:24] <ottomata>	 if we need to?
[15:28:16] <wikibugs>	 serviceops, Maps, Product-Infrastructure-Team-Backlog, Patch-For-Review, User-jijiki: Maps 2.0 roll-out plan - https://phabricator.wikimedia.org/T280767 (hnowlan)
[15:28:52] <jayme>	 ottomata: yes, sure!
[15:28:53] <wikibugs>	 serviceops, Maps, Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), and 2 others: Disable unused services on maps nodes - https://phabricator.wikimedia.org/T298246 (hnowlan) Open→In progress p:Triage→Medium a:hnowlan
[15:29:36] <jayme>	 let me actually +1 your change
[15:35:58] <ottomata>	 awesome, thank you.  okay image change +1 too
[15:36:03] <ottomata>	 so far...i've only built the images locally
[15:36:21] <ottomata>	 jayme: after I merge, i can proceed to do so on the image build server...which will result in uploaded images?
[15:36:49] <ottomata>	 also, only remaining change that needs +1 is flink-app: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/866510
[15:37:26] <jayme>	 ottomata: sure, you can absolutely build the images after merge (https://wikitech.wikimedia.org/wiki/Kubernetes/Images#Production_images)
[15:38:15] <ottomata>	 okay!
[15:38:44] <jayme>	 re: flink-app yea, I wanted to give it another CI run after a rebase on top of the new operator version (changed CRDs)
[15:38:50] <ottomata>	 k cool
[15:39:36] <jayme>	 also ofc I'm not keen to lose my last leverage against you :-p
[15:40:39] <ottomata>	 haha
[15:44:42] <effie>	 !log enable puppet on all mw hosts
[15:45:22] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Clement_Goubert) >>! In T326425#8509196, @Jclark-ctr wrote: > Preformed  Flea Power Drain As requested by Dell   Can we pool it back, or do...
[15:47:41] <ottomata>	 jayme:  got: No mapping found for user 'flink'
[15:47:47] <ottomata>	 while trying to build images
[15:48:01] <jayme>	 ah, yeah...I remember
[15:48:16] <wikibugs>	 serviceops, DC-Ops, ops-eqiad: hw troubleshooting:  CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (Clement_Goubert) >>! In T326119#8509402, @Jclark-ctr wrote: > Cleared SEL  Dell requested set the system profile to performance  The cpu governor is alrea...
[15:49:38] <jayme>	 ottomata: that file is puppet managed in prod, see profile::docker::builder::known_uid_mappings
[15:50:17] <jayme>	 it might make very much sense to make this more clear in the comment above, which would be very kind of you to do :)
[15:50:34] <jayme>	 the comment in https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/858356/17/config.yaml I mean
[15:50:50] <ottomata>	 i gotchya...
[15:50:51] <ottomata>	 :)
[15:59:12] <ottomata>	 done.
[16:11:06] <wikibugs>	 serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2019.codfw.wmnet` - mc2019.codfw.wmnet (**WARN**)   - Downtimed host on Icinga/Alertmanager   - Found physical host   - //Managemen...
[16:27:11] <ottomata>	 jayme:  i'm having issues retrieving a key from hkp://keyserver.ubuntu.com, i think gpg is not using the $http_proxy env var(?).  
[16:27:24] <ottomata>	 trying some weird dockerfile ENV workarounds, but let me know if you have seen this before
[16:29:59] <jayme>	 ottomata: have not seen that specifically. Maybe it's easier to be lazy and commit the key to git?
[16:30:10] <ottomata>	 oh
[16:30:19] <ottomata>	 meh hm.
[16:30:32] <ottomata>	 kinda nice it is automated at build time?  easy to just change the versions atm
[16:30:39] <ottomata>	 lemme see if this works...
[16:35:21] <ottomata>	 i think this should work: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/877201
[16:35:26] <ottomata>	 jayme:  okay if i merge that and try?
[16:39:01] <ottomata>	 being bold
[16:40:49] <ottomata>	 i think it's working!
[16:41:28] <jayme>	 sorry, meeting
[16:46:54] <wikibugs>	 serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2020.codfw.wmnet` - mc2020.codfw.wmnet (**FAIL**)   - Downtimed host on Icinga/Alertmanager   - Found physical host   - //Managemen...
[17:46:47] <wikibugs>	 serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2021.codfw.wmnet` - mc2021.codfw.wmnet (**FAIL**)   - Downtimed host on Icinga/Alertmanager   - Found physical host   - //Managemen...
[17:59:50] <wikibugs>	 serviceops, DC-Ops, ops-eqiad: hw troubleshooting:  CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (Jclark-ctr) Open→Resolved @clement_goubert. I updated this morning.  Dell has said this will resolve our issue I am closing this ticket and hope i...
[18:02:06] <wikibugs>	 serviceops, Performance-Team (Radar), User-jijiki: Roll out remote-DC gutter pool for /*/mw-wan/ - https://phabricator.wikimedia.org/T258779 (jijiki) Open→Resolved
[18:02:13] <wikibugs>	 serviceops, SRE, Patch-For-Review, Performance-Team (Radar), Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (jijiki)
[18:07:52] <wikibugs>	 serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2022.codfw.wmnet` - mc2022.codfw.wmnet (**WARN**)   - Downtimed host on Icinga/Alertmanager   - Found physical host   - //Managemen...
[18:09:35] <wikibugs>	 serviceops, DC-Ops, ops-eqiad, Patch-For-Review: hw troubleshooting:  CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (Clement_Goubert) @Jclark-ctr Thank you :) I'll repool the machine and remove the downtimes tomorrow.
[18:19:21] <ottomata>	 ty jayme, flink image builds fine now.  having trouble with mvn and proxies for build from source of flink-kubernetes-operator... 
[18:22:06] <ottomata>	 i think i got it, phewf
[18:38:22] <ottomata>	 woohoo it built!
[18:43:28] <wikibugs>	 serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2023.codfw.wmnet` - mc2023.codfw.wmnet (**WARN**)   - Downtimed host on Icinga/Alertmanager   - Found physical host   - //Managemen...
[19:04:25] <wikibugs>	 serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2024.codfw.wmnet` - mc2024.codfw.wmnet (**WARN**)   - Downtimed host on Icinga/Alertmanager   - Found physical host   - //Managemen...
[19:22:25] <wikibugs>	 serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2025.codfw.wmnet` - mc2025.codfw.wmnet (**WARN**)   - Downtimed host on Icinga/Alertmanager   - Found physical host   - //Managemen...
[20:30:32] <wikibugs>	 serviceops, SRE, Thumbor: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (BCornwall)
[21:18:43] <wikibugs>	 serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2026.codfw.wmnet` - mc2026.codfw.wmnet (**WARN**)   - Downtimed host on Icinga/Alertmanager   - Found physical host   - //Managemen...
[21:37:24] <wikibugs>	 serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2027.codfw.wmnet` - mc2027.codfw.wmnet (**WARN**)   - Downtimed host on Icinga/Alertmanager   - Found physical host   - //Managemen...
[22:03:44] <wikibugs>	 serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2029.codfw.wmnet` - mc2029.codfw.wmnet (**WARN**)   - Downtimed host on Icinga/Alertmanager   - Found physical host   - //Managemen...
[22:25:14] <wikibugs>	 serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2030.codfw.wmnet` - mc2030.codfw.wmnet (**WARN**)   - Downtimed host on Icinga/Alertmanager   - Found physical host   - //Managemen...