[09:03:25] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Clement_Goubert) >>! In T326425#8505438, @Dzahn wrote: > 18:16 <+icinga-wm> PROBLEM - Host mw1486 is DOWN: PING CRITICAL - Packet loss = 100...
[09:44:56] serviceops, Data-Engineering-Planning, Event-Platform Value Stream, SRE-OnFire, and 2 others: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (JMeybohm) >>! In T324994#8463619, @Clement_Goubert wrote: > We have the resources to keep it at 30 replic...
[09:48:08] serviceops, Data-Engineering-Planning, Event-Platform Value Stream, SRE-OnFire, and 2 others: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Clement_Goubert) No worries, I took a look at the resources and it seemed fine to leave it like that. We...
[10:45:41] serviceops, MW-on-K8s: mw-debug uses production images instead of debug - https://phabricator.wikimedia.org/T326542 (Clement_Goubert)
[10:46:22] serviceops, MW-on-K8s: mw-debug uses production images instead of debug - https://phabricator.wikimedia.org/T326542 (Clement_Goubert) Open→In progress
[10:50:36] No objection to me starting rolling reboots for appserver cluster today?
[10:55:42] not from me, but thanks for doing that! <3
[10:55:51] np
[10:55:54] do you always run those in the SRE deployment windows?
[10:56:05] Not really
[10:56:12] That's more for mw-on-k8s stuff
[10:56:38] yeah, I know. I just thought because of the time correlation :)
[10:56:41] It'll probably take most of the day, with pauses to avoid scap deploy failures
[10:56:54] Well today I'm taking advantage to start yeah
[10:57:33] do you manually select batches of a specific size to realize the pause during deployment windows?
[11:04:05] I wait until the current 3-machine reboot is done, ctrl-c and clean up manually
[11:04:28] Then I get the list of machines up to date with cumin and use them in the --exclude of the cookbook
[11:04:41] (so no, I don't :))
[11:08:43] serviceops, Icinga, SRE, SRE Observability: High average POST latency for mw requests on api_appserver in codfw on alert1001 - https://phabricator.wikimedia.org/T326544 (jcrespo)
[11:11:37] serviceops, Icinga, SRE, SRE Observability: High average POST latency for mw requests on api_appserver in codfw on alert1001 - https://phabricator.wikimedia.org/T326544 (jcrespo)
[11:13:30] you could improve the cookbook to either allow it to be stopped/resumed and/or skip hosts already rebooted :)
[11:16:58] serviceops, Icinga, SRE, SRE Observability: High average POST latency for mw requests on api_appserver in codfw on alert1001 - https://phabricator.wikimedia.org/T326544 (jcrespo)
[11:20:49] claime: what manual cleanup? could the cookbook handle that? (there is a hook called on failure, if you're interested I can give you more pointers)
[11:21:32] volans: Removing downtimes, repooling.
[11:22:00] Which we don't want the cookbook to do on its own on failure :p
[11:23:36] got it, lmk if there is anything I could help with :)
[11:24:09] thanks <3
[11:49:24] <_joe_> volans: I find that for large clusters having the ability to pick up a job where it was interrupted would be great
[11:50:03] <_joe_> but even if we could just have a way to schedule a pause in execution.
[11:51:12] indeed
[12:40:18] serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (LSobanski)
[12:56:20] serviceops, MW-on-K8s, Patch-For-Review: mw-debug uses production images instead of debug - https://phabricator.wikimedia.org/T326542 (Clement_Goubert)
[13:00:24] serviceops, SRE, Wikimedia-Portals, Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (LSobanski)
[14:16:10] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Jclark-ctr) Performed Flea Power Drain As requested by Dell
[14:20:10] serviceops, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 06): k8s deployment-charts mesh module should allow use of mesh without public_port Service - https://phabricator.wikimedia.org/T326252 (JArguello-WMF)
[14:20:44] serviceops, Data-Engineering-Planning, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 06), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (JArguello-WMF)
[14:21:53] thanks jayme, responded to recent comments
[14:46:43] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (ssingh) >>! In T326425#8508075, @Clement_Goubert wrote: >>>! In T326425#8505438, @Dzahn wrote: >> 18:16 <+icinga-wm> PROBLEM - Host mw1486 i...
[14:52:00] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (Jclark-ctr) Cleared SEL Dell requested set the system profile to performance
[14:58:05] jayme: i'm getting the feeling you'd prefer if we made the image name be flink1 ?
[15:00:22] ottomata: Sorry if I came across strong, I was actually just trying to lay out my reasoning. I don't feel super strong about this, I just assume we will end up with different versions at some point :)
[15:01:04] no sorries! :)
[15:01:17] and, as said. I'm not sure docker-pkg supports having the same image (name) declared in more than one directory - have not thought about that before
[15:01:23] ahhh i see
[15:01:40] hm, so if we have to do that in the future, we'd just rename all the dirs right?
[15:02:17] if we would need to go from flink to flinkX.Y you mean?
[15:03:20] I think that would require renaming the directory + renaming the image ("Package" in the control file) from flink to flinkX.Y
[15:03:39] but I'm not super sure about the latter
[15:26:20] right, that seems okay to me to do later?
[15:26:24] if we need to?
[15:28:16] serviceops, Maps, Product-Infrastructure-Team-Backlog, Patch-For-Review, User-jijiki: Maps 2.0 roll-out plan - https://phabricator.wikimedia.org/T280767 (hnowlan)
[15:28:52] ottomata: yes, sure!
[15:28:53] serviceops, Maps, Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), and 2 others: Disable unused services on maps nodes - https://phabricator.wikimedia.org/T298246 (hnowlan) Open→In progress p:Triage→Medium a:hnowlan
[15:29:36] let me actually +1 your change
[15:35:58] awesome, thank you. okay image change +1 too
[15:36:03] so far...i've only built the images locally
[15:36:21] jayme: after I merge, i can proceed to do so on the image build server...which will result in uploaded images?
[15:36:49] also, only remaining change that needs +1 is flink-app: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/866510
[15:37:26] ottomata: sure, you can absolutely build the images after merge (https://wikitech.wikimedia.org/wiki/Kubernetes/Images#Production_images)
[15:38:15] okay!
[15:38:44] re: flink-app yea, I wanted to give it another CI run after a rebase on top of the new operator version (changed CRDs)
[15:38:50] k cool
[15:39:36] also ofc I'm not keen to lose my last leverage against you :-p
[15:40:39] haha
[15:44:42] !log enable puppet on all mw hosts
[15:45:22] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Clement_Goubert) >>! In T326425#8509196, @Jclark-ctr wrote: > Performed Flea Power Drain As requested by Dell Can we pool it back, or do...
[15:47:41] jayme: got: No mapping found for user 'flink'
[15:47:47] while trying to build images
[15:48:01] ah, yeah...I remember
[15:48:16] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (Clement_Goubert) >>! In T326119#8509402, @Jclark-ctr wrote: > Cleared SEL Dell requested set the system profile to performance The cpu governor is alrea...
[15:49:38] ottomata: that file is puppet managed in prod, see profile::docker::builder::known_uid_mappings
[15:50:17] it might make very much sense to make this more clear in the comment above, which would be very kind of you to do :)
[15:50:34] the comment in https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/858356/17/config.yaml I mean
[15:50:50] i gotchya...
[15:50:51] :)
[15:59:12] done.
[16:11:06] serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2019.codfw.wmnet` - mc2019.codfw.wmnet (**WARN**) - Downtimed host on Icinga/Alertmanager - Found physical host - //Managemen...
[16:27:11] jayme: i'm having issues retrieving a key from hkp://keyserver.ubuntu.com, i think gpg is not using the $http_proxy env var(?).
[16:27:24] trying some weird dockerfile ENV workarounds, but let me know if you have seen this before
[16:29:59] ottomata: have not seen that specifically. Maybe it's easier to be lazy and commit the key to git?
[16:30:10] oh
[16:30:19] meh hm.
[16:30:32] kinda nice it is automated at build time? easy to just change the versions atm
[16:30:39] lemme see if this works...
[16:35:21] i think this should work: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/877201
[16:35:26] jayme: okay if i merge that and try?
[16:39:01] being bold
[16:40:49] i think it's working!
[16:41:28] sorry, meeting
[16:46:54] serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2020.codfw.wmnet` - mc2020.codfw.wmnet (**FAIL**) - Downtimed host on Icinga/Alertmanager - Found physical host - //Managemen...
[17:46:47] serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2021.codfw.wmnet` - mc2021.codfw.wmnet (**FAIL**) - Downtimed host on Icinga/Alertmanager - Found physical host - //Managemen...
[17:59:50] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (Jclark-ctr) Open→Resolved @clement_goubert. I updated this morning. Dell has said this will resolve our issue I am closing this ticket and hope i...
[18:02:06] serviceops, Performance-Team (Radar), User-jijiki: Roll out remote-DC gutter pool for /*/mw-wan/ - https://phabricator.wikimedia.org/T258779 (jijiki) Open→Resolved
[18:02:13] serviceops, SRE, Patch-For-Review, Performance-Team (Radar), Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (jijiki)
[18:07:52] serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2022.codfw.wmnet` - mc2022.codfw.wmnet (**WARN**) - Downtimed host on Icinga/Alertmanager - Found physical host - //Managemen...
[18:09:35] serviceops, DC-Ops, ops-eqiad, Patch-For-Review: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (Clement_Goubert) @Jclark-ctr Thank you :) I'll repool the machine and remove the downtimes tomorrow.
[18:19:21] ty jayme, flink image builds fine now. having trouble with mvn and proxies for build from source of flink-kubernetes-operator...
[18:22:06] i think i got it, phewf
[18:38:22] woohoo it built!
[18:43:28] serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2023.codfw.wmnet` - mc2023.codfw.wmnet (**WARN**) - Downtimed host on Icinga/Alertmanager - Found physical host - //Managemen...
[19:04:25] serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2024.codfw.wmnet` - mc2024.codfw.wmnet (**WARN**) - Downtimed host on Icinga/Alertmanager - Found physical host - //Managemen...
[19:22:25] serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2025.codfw.wmnet` - mc2025.codfw.wmnet (**WARN**) - Downtimed host on Icinga/Alertmanager - Found physical host - //Managemen...
[20:30:32] serviceops, SRE, Thumbor: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (BCornwall)
[21:18:43] serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2026.codfw.wmnet` - mc2026.codfw.wmnet (**WARN**) - Downtimed host on Icinga/Alertmanager - Found physical host - //Managemen...
[21:37:24] serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2027.codfw.wmnet` - mc2027.codfw.wmnet (**WARN**) - Downtimed host on Icinga/Alertmanager - Found physical host - //Managemen...
[22:03:44] serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2029.codfw.wmnet` - mc2029.codfw.wmnet (**WARN**) - Downtimed host on Icinga/Alertmanager - Found physical host - //Managemen...
[22:25:14] serviceops: Decommission mc2019-mc2037 - https://phabricator.wikimedia.org/T313733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc2030.codfw.wmnet` - mc2030.codfw.wmnet (**WARN**) - Downtimed host on Icinga/Alertmanager - Found physical host - //Managemen...