[06:59:08] serviceops, Patch-For-Review: Test running php7.2 and php7.4 in parallel on the beta cluster - https://phabricator.wikimedia.org/T295578 (tstarling) I inspected the logs from the last 5 days of PHP 7.4 job runner execution on beta. The only relevant log entries I could find were three exceptions with the...
[07:57:41] serviceops, SRE, Patch-For-Review: Update conf1* servers - https://phabricator.wikimedia.org/T310062 (SLyngshede-WMF) p:Triage→Medium
[08:35:13] serviceops, SRE, Patch-For-Review: Update conf1* servers - https://phabricator.wikimedia.org/T310062 (SLyngshede-WMF) p:Medium→High
[08:38:16] serviceops, SRE, Traffic: fawiki user reports getting 503 errors with message "upstream connect error or disconnect before headers" - https://phabricator.wikimedia.org/T310450 (SLyngshede-WMF) p:Triage→Medium
[08:41:47] serviceops, CirrusSearch, Discovery-Search, Wikimedia-production-error: fawiki user reports getting 503 errors with message "upstream connect error or disconnect before headers" - https://phabricator.wikimedia.org/T310450 (SLyngshede-WMF)
[08:42:09] serviceops, CirrusSearch, Discovery-Search, Wikimedia-production-error: fawiki user reports getting 503 errors with message "upstream connect error or disconnect before headers" - https://phabricator.wikimedia.org/T310450 (SLyngshede-WMF) Merged with an unrelated bug, and the relevant tags was dr...
[08:50:51] serviceops, SRE: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (SLyngshede-WMF) p:Triage→Medium
[08:51:41] serviceops, SRE: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (SLyngshede-WMF) This doesn't appear to be an SRE-Access-Request. Adding the ServiceOps tags, as they are involved in the Kubernetes migration and it makes sense to loop t...
[10:49:30] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (joanna_borun)
[10:49:50] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (joanna_borun)
[13:45:53] o/ i'm getting a strange failure when trying to deploy eventstreams to staging k8s
[13:46:04] command "/usr/bin/helm3" exited with non-zero status
[13:46:11] STDERR:
[13:46:12] WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/eventstreams-deploy-staging.config
[13:46:12] Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
[13:52:17] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[13:53:15] ottomata: uh, "nice" (I wanted to capture a picture of that state :D)
[13:54:08] I'll quickly collect some stuff before fixing the situation if that's oka
[13:54:12] *okay
[14:38:48] ya no rush!
[15:00:50] ottomata: I'm done, you should be all good to deploy again. Happy Helming!
[15:06:51] ty!
[15:06:52] trying
[15:12:20] jayme: ... i
[15:12:29] i'm not sure. it seems to be hanging?
[15:12:34] been at
[15:12:34] Upgrading release=production, chart=wmf-stable/eventstreams
[15:12:40] for maybe 5+ minutes?
[15:12:47] hanging == doing something I guess :)
[15:12:53] i guess....?
[15:13:05] hm
[15:13:11] let me check
[15:13:41] the new pod is not starting/crashing
[15:13:49] there will be a rollback shortly
[15:14:15] okay, probably a config problem or something then.
[15:14:17] let's see
[15:15:55] hmm curl 10.64.75.198:8092/_info works fine
[15:16:09] logs look good too
[15:16:10] hm
[15:16:26] 10.64.75.198:8092/_info is the readinessProbe
[15:17:05] oh, eventstreams container is good?
[15:17:05] hm
[15:17:31] it is eventstreams-tls-proxy that is failing
[15:17:45] oh, looks ..yeah, that
[15:18:04] Configuration does not parse cleanly as v3. v2 configuration is deprecated
[15:18:04] ?
[15:19:01] I guess there is custom templating in the chart regarding envoy
[15:19:04] hmmm
[15:19:11] and that does not work with the latest update
[15:19:14] aye
[15:19:19] (of envoy to 1.18 IIRC)
[15:19:19] when did that latest update happen?
[15:20:12] I have to check. But we tend to not roll out the update if there is a diff in the deployment (other than the new envoy version)
[15:20:51] https://gerrit.wikimedia.org/r/c/operations/puppet/+/771053
[15:20:57] May 17
[15:21:06] okay
[15:22:27] quick fix is to pin the version in your deployment back to 1.15.5-1
[15:24:19] hm, the image version has not changed
[15:24:20] image_version: 1.13.1-2
[15:25:08] wait
[15:25:12] something is overriding it
[15:25:15] - image: docker-registry.wikimedia.org/envoy:1.15.5-1
[15:25:15] + image: docker-registry.wikimedia.org/envoy:1.18.3-1
[15:25:18] the chart has 1.13.1-2
[15:26:38] yes, yes. We override that with a default. You may override our default by specifying it in helmfile.d/services/eventstreams/values.yaml
[15:26:56] that will take precedence
[15:27:05] shouldn't the one in the chart values.yaml take precedence anyway?
[15:27:27] how is it overridden?
[15:28:18] the values.yaml in the chart itself is the one with the lowest priority, one second
[15:29:42] here is a proper description https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/services/README.MD#conventions
[15:31:28] nice okay, good to know
[15:31:32] pinning to 1.15.5-1
[15:35:35] The v2 xDS major version is deprecated and disabled by default.
Support for v2 will be removed from Envoy at the start of Q1 2021. You may make use of v2 in Q4 2020 by following the advice in https://www.envoyproxy.io/docs/envoy/latest/faq/api/transition.
[15:35:36] oops
[15:35:45] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/805843
[15:37:44] just FYI - RunJobs.php is being removed from mediawiki-config. It was disabled in the apache config on jobrunners over a year ago though so nbd https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/805775/
[15:39:43] hehe jayme
[15:39:45] WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/eventstreams-deploy-staging.config
[15:39:45] Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
[15:39:51] is puppet touching something it shouldn't!?
[15:40:23] did you ^C at some point maybe?
[15:40:38] i may have!
[15:40:50] maybe when it was hanging
[15:41:17] how can I fix so I don't have to bother you if I do it again?
[15:41:17] :)
[15:42:57] I'd say https://phabricator.wikimedia.org/T310714 but ....
[15:43:36] https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency
[15:44:51] basically: rollback to the last state: deployed version
[15:50:26] gr8 ty
[15:50:29] about the "hanging" part: helmfile will wait 600 seconds for the new deployment to become ready and will roll back in case that does not happen. That process is client side, so if you ^C it does not happen
[15:50:39] leaving your latest release in the pending stage
[15:50:41] *state
[15:51:33] aye so avoid ctrl c
[15:51:35] :)
[15:51:37] you may lower the timeout (top of your helmfile.yaml) if you wish. The 600 sec are just something we defined as default
[15:52:02] grr codfw also in weird state...maybe some change was attempted to be rolled out by someone else in the past that also failed?
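(Editor's sketch, not part of the log.) The stuck state described here, a release left in a pending status after the client was interrupted, could be spotted with a small script along the following lines. It is only an illustration under stated assumptions: it expects the JSON text printed by `helm list --all --all-namespaces -o json`, the function name `find_pending_releases` is made up for this example, and the sample data is hand-written rather than real cluster output. It is not the tooling tracked in T310714.

```python
import json

# Helm marks an in-flight operation with one of these statuses. A release that
# stays in one of them after the client has exited is stuck.
PENDING_STATUSES = {"pending-install", "pending-upgrade", "pending-rollback"}


def find_pending_releases(helm_list_json):
    """Return (namespace, name, status) for every release in a pending state.

    `helm_list_json` is assumed to be the text printed by
    `helm list --all --all-namespaces -o json` (hypothetical input here).
    """
    releases = json.loads(helm_list_json)
    return [
        (r.get("namespace", ""), r["name"], r["status"])
        for r in releases
        if r.get("status") in PENDING_STATUSES
    ]


if __name__ == "__main__":
    # Hand-written sample data, not real cluster output.
    sample = json.dumps([
        {"name": "production", "namespace": "eventstreams",
         "status": "pending-upgrade", "chart": "eventstreams-0.4.1"},
        {"name": "canary", "namespace": "eventstreams",
         "status": "deployed", "chart": "eventstreams-0.4.1"},
    ])
    for namespace, name, status in find_pending_releases(sample):
        print(f"{namespace}/{name}: {status}")
```

Anything the script reports would still need the manual fix from the wikitech page linked above, that is, rolling back to the last revision in state deployed.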
[15:52:33] Mon Mar 21 16:02:23 2022 pending-upgrade eventstreams-0.4.1 Preparing upgrade
[15:52:56] same in eqiad
[15:52:57] fixing
[15:53:20] yeah, possible :-(
[15:53:49] https://phabricator.wikimedia.org/T310714 is about detecting and alerting for releases in that state
[15:55:42] aye
[15:55:51] oo, also need to remember to rollback canary release
[15:56:18] ah, indeed
[15:57:38] alright! deploy successful. thank you!
[15:57:41] and, obviously: Please update the chart to the latest common_templates so it supports more recent envoy versions :)
[15:57:49] yeahHhhhHhhhhhh....
[15:57:50] ya
[15:57:57] :D
[15:58:25] i'll make a task for eventstreams too.
[15:58:29] there is one for eventgate already
[15:58:42] ack
[19:01:12] serviceops, DNS, SRE, WMF-Legal, wikimediafoundation.org: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (Varnent) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expecta...
[19:20:49] serviceops, DNS, SRE, WMF-Legal, wikimediafoundation.org: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (Dzahn) just a note for serviceops: policy.wikimedia.org is not currently under the control of SRE/prod servers...
[19:28:54] serviceops, DNS, SRE, WMF-Legal, wikimediafoundation.org: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (Dzahn) There are incoming redirects into policy.wikimedia.org: https://wikimedia.org/stopsurveillance -> http...
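(Editor's sketch, not part of the log.) The values precedence explained between 15:26 and 15:29 can be illustrated with a simplified layered merge: the chart's own values.yaml has the lowest priority, the shared default overrides it, and the per-service values.yaml overrides both. The nesting under `envoy` and the function names are assumptions made up for this example; only the three version strings come from the conversation. Helm's real merge has further rules (for example around null values) that are not modelled here.

```python
def deep_merge(base, override):
    """Return a copy of `base` with `override` layered on top.

    Nested dicts are merged key by key; any other value (including lists)
    from `override` replaces the value in `base`.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def effective_values(*layers):
    """Merge value layers given from lowest to highest priority."""
    result = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


if __name__ == "__main__":
    # Hypothetical key layout; version strings are the ones quoted in the log.
    chart_values = {"envoy": {"image_version": "1.13.1-2", "enabled": True}}
    shared_default = {"envoy": {"image_version": "1.18.3-1"}}
    service_values = {"envoy": {"image_version": "1.15.5-1"}}

    final = effective_values(chart_values, shared_default, service_values)
    print(final["envoy"]["image_version"])
```

With the service layer left out, the shared default wins over the chart value, which matches the surprise in the log that the chart's 1.13.1-2 was not the version being deployed.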