[00:38:42] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/935129 [00:38:48] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/935129 (owner: 10TrainBranchBot) [00:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [00:57:49] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/935129 (owner: 10TrainBranchBot) [01:03:45] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T341019 (10phaultfinder) [01:03:51] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T341020 (10phaultfinder) [01:03:56] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T341021 (10phaultfinder) [01:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230704T0200) [02:07:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.16 [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935130 (https://phabricator.wikimedia.org/T340244) [02:07:36] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.16 [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935130 (https://phabricator.wikimedia.org/T340244) (owner: 10TrainBranchBot) [02:07:36] (JobUnavailable) firing: (4) 
Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:47] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:10:17] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:26:43] (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.16 [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935130 (https://phabricator.wikimedia.org/T340244) (owner: 10TrainBranchBot) [02:27:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:32:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230704T0300) [03:00:49] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:05:25] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units 
failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:37:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:42:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:28:03] RECOVERY - Check systemd state on puppetboard1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:32:43] PROBLEM - Check systemd state on puppetboard1003 is CRITICAL: CRITICAL - degraded: The following units failed: uwsgi-puppetboard.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:55:47] (03CR) 10Giuseppe Lavagetto: "The change seems overall correct; I am wondering if it wouldn't make more sense to do as follows:" [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [05:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - 
https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:55:21] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 131 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230704T0600) [06:00:05] kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230704T0600). [06:32:07] 10SRE, 10SRE-Access-Requests: Requesting access to ops group for nskaggs - https://phabricator.wikimedia.org/T337571 (10SLyngshede-WMF) 05Open→03Resolved I think we can close this. [06:32:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:00:05] Amir1, Urbanecm, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230704T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:17:09] (03PS1) 10Marostegui: dbproxy1012: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/935329 [07:17:37] (03CR) 10Marostegui: [C: 03+2] dbproxy1012: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/935329 (owner: 10Marostegui) [07:23:25] (03CR) 10Elukey: [C: 03+1] "Thanks for all the answers Ben!" 
[puppet] - 10https://gerrit.wikimedia.org/r/935071 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [07:24:07] (03CR) 10Elukey: [V: 03+1 C: 04-1] C:bigtop::hadoop move net-topology.py to files. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [07:25:18] (03CR) 10Kosta Harlan: gitlab runner: Allow mariadb:* images (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932328 (https://phabricator.wikimedia.org/T339352) (owner: 10Kosta Harlan) [07:37:29] (03CR) 10Vgutierrez: haproxy: support different actions for tls and http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [07:45:26] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T341033 (10phaultfinder) [07:45:32] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T341034 (10phaultfinder) [07:45:37] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T341035 (10phaultfinder) [07:50:25] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T341035 (10phaultfinder) [07:52:40] (03CR) 10Filippo Giunchedi: "<3 <3 <3 <3 LGTM (waiting for July 17th)" [puppet] - 10https://gerrit.wikimedia.org/r/935103 (https://phabricator.wikimedia.org/T317032) (owner: 10Majavah) [08:00:05] hashar and brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0+Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230704T0800). [08:00:32] tchou thchou [08:00:48] I got fix / rebase a bunch of security patches [08:09:44] (03PS1) 10Filippo Giunchedi: wmcs: deploy openstack_apis_response in eqiad only [alerts] - 10https://gerrit.wikimedia.org/r/935375 [08:09:49] (03PS1) 10Slyngshede: Allow users to be created in MediaWiki. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/935376 [08:10:44] (03CR) 10Btullis: [C: 03+2] Add an apt mirror for the confluent-kafka 7.4 release [puppet] - 10https://gerrit.wikimedia.org/r/935071 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [08:11:09] hashar: I was already looking into the failed patches [08:12:33] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:33] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] kubernetes::deployment_server: Globally enable envoy telemetry [puppet] - 10https://gerrit.wikimedia.org/r/935097 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [08:19:27] (03PS1) 10Marostegui: dbproxy1013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/935377 [08:19:36] (03CR) 10Alexandros Kosiaris: [C: 03+1] changeprop: Change normal_rule_processing_delay to histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/935089 (owner: 10Clément Goubert) [08:20:22] (03CR) 10Marostegui: [C: 03+2] dbproxy1013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/935377 (owner: 10Marostegui) [08:25:10] jnuche: ah my bad sorry :D [08:25:26] looks like they were all straightforward, there is one left for T339016 but I think the patch on the deployment server simply got split into smaller bits [08:28:16] I just saw the message on https://phabricator.wikimedia.org/T339016, thanks for looking into it [08:29:27] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I would rather we did not create alerts on metrics that we know are flawed. 
Let's fix it first and re-evaluate after we've done that" [alerts] - 10https://gerrit.wikimedia.org/r/935078 (https://phabricator.wikimedia.org/T336627) (owner: 10Clément Goubert) [08:31:02] (03CR) 10David Caro: [C: 03+1] "LGTM, would be nice to have it eventually in codfw too (but as you say it's filtering by cloudcontrol1*, so would not work as is)." [alerts] - 10https://gerrit.wikimedia.org/r/935375 (owner: 10Filippo Giunchedi) [08:40:50] (03PS1) 10Elukey: services: raise anoymous traffic limit for liftwing endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/935379 (https://phabricator.wikimedia.org/T340982) [08:42:43] (03PS2) 10Majavah: add fake novaproxy passwords [labs/private] - 10https://gerrit.wikimedia.org/r/928477 [08:44:19] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops (or wmcs-roots) for TheresNoTime - https://phabricator.wikimedia.org/T337829 (10SLyngshede-WMF) I'll just push to get a review today so we can merge and close this. [08:45:03] (03PS13) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [08:47:16] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42194/console" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:47:24] (03PS1) 10Majavah: Add fake WMCS DKIM keys [labs/private] - 10https://gerrit.wikimedia.org/r/935380 (https://phabricator.wikimedia.org/T249237) [08:50:06] (03PS1) 10Btullis: Add the GPG key for the Confluent Platform 7 repository [puppet] - 10https://gerrit.wikimedia.org/r/935381 (https://phabricator.wikimedia.org/T329514) [08:53:46] (03PS14) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [08:56:28] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. [08:59:06] (03CR) 10Filippo Giunchedi: [C: 03+2] "Indeed, thank you for the quick review!" [alerts] - 10https://gerrit.wikimedia.org/r/935375 (owner: 10Filippo Giunchedi) [09:00:49] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935382 (https://phabricator.wikimedia.org/T340244) [09:00:51] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935382 (https://phabricator.wikimedia.org/T340244) (owner: 10TrainBranchBot) [09:01:36] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935382 (https://phabricator.wikimedia.org/T340244) (owner: 10TrainBranchBot) [09:02:03] !log jnuche@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.16 refs T340244 [09:02:06] T340244: 1.41.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T340244 [09:03:53] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::client: make mutual_tls_add_puppet_ca the default behaviour [puppet] - 10https://gerrit.wikimedia.org/r/935070 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond) [09:04:30] (03CR) 10Alexandros Kosiaris: [C: 04-1] "5 minutes sounds indeed low for what is a failsafe alert. 
The >3 part is also pretty opaque to me" [alerts] - 10https://gerrit.wikimedia.org/r/884039 (owner: 10Giuseppe Lavagetto) [09:05:22] (03CR) 10JMeybohm: [C: 03+2] kubernetes::deployment_server: Globally enable envoy telemetry [puppet] - 10https://gerrit.wikimedia.org/r/935097 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [09:05:26] (03CR) 10JMeybohm: [C: 03+2] deployment_server::general: bump default envoy version to 1.23.10 [puppet] - 10https://gerrit.wikimedia.org/r/935074 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [09:07:53] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-workers (exit_code=99) restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. [09:11:57] (03PS1) 10Btullis: Fail back hive services to the primary server [dns] - 10https://gerrit.wikimedia.org/r/935383 [09:13:56] (03CR) 10Btullis: [C: 03+2] Fail back hive services to the primary server [dns] - 10https://gerrit.wikimedia.org/r/935383 (owner: 10Btullis) [09:16:52] (03PS1) 10Jbond: pki-root: fix lookup argument [puppet] - 10https://gerrit.wikimedia.org/r/935384 [09:18:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42195/console" [puppet] - 10https://gerrit.wikimedia.org/r/935384 (owner: 10Jbond) [09:21:01] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump-s1.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:21:57] (03CR) 10Jgiannelos: "On a side note, automated service checks are gonna fail because of the openapi spec definitions." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/935051 (https://phabricator.wikimedia.org/T340769) (owner: 10Jgiannelos) [09:25:06] (03PS3) 10Effie Mouzeli: service::catalog: Switch kubestagemaster service to production (#6) [puppet] - 10https://gerrit.wikimedia.org/r/935090 (https://phabricator.wikimedia.org/T329827) [09:25:09] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/935132 [09:25:23] (03PS5) 10Effie Mouzeli: Convert kubestagemaster from CNAME to A record (#7) [dns] - 10https://gerrit.wikimedia.org/r/935035 (https://phabricator.wikimedia.org/T329827) [09:26:52] (03CR) 10Effie Mouzeli: [C: 03+2] service::catalog: Switch kubestagemaster service to production (#6) [puppet] - 10https://gerrit.wikimedia.org/r/935090 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [09:32:36] (03CR) 10Clément Goubert: "> Patch Set 1: Code-Review-1" [alerts] - 10https://gerrit.wikimedia.org/r/884039 (owner: 10Giuseppe Lavagetto) [09:33:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:35:04] (03CR) 10Clément Goubert: [C: 03+2] changeprop: Change normal_rule_processing_delay to histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/935089 (owner: 10Clément Goubert) [09:35:17] (03PS1) 10Majavah: P:toolforge: mail: DKIM sign outgoing mail [puppet] - 10https://gerrit.wikimedia.org/r/935385 (https://phabricator.wikimedia.org/T249237) [09:36:10] (03Merged) 10jenkins-bot: changeprop: Change normal_rule_processing_delay to histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/935089 (owner: 10Clément Goubert) [09:36:14] (03CR) 10Elukey: [C: 03+1] "This is the current script's output:" [puppet] - 10https://gerrit.wikimedia.org/r/929643 
(https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:36:22] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [09:36:22] (03CR) 10Elukey: [C: 03+1] C:bigtop::hadoop move net-topology.py to files. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:36:24] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [09:36:34] (03PS1) 10Urbanecm: DeleteAction: Avoid displaying the form unconditionally [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935121 (https://phabricator.wikimedia.org/T341002) [09:36:46] (03PS1) 10Urbanecm: DeleteAction: Avoid displaying the form unconditionally [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/935122 (https://phabricator.wikimedia.org/T341002) [09:37:21] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [09:37:24] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [09:37:27] (03CR) 10Jbond: [V: 03+1 C: 03+2] pki-root: fix lookup argument [puppet] - 10https://gerrit.wikimedia.org/r/935384 (owner: 10Jbond) [09:38:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:38:35] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] Add fake WMCS DKIM keys [labs/private] - 10https://gerrit.wikimedia.org/r/935380 (https://phabricator.wikimedia.org/T249237) (owner: 10Majavah) [09:38:44] !log updated envoyproxy to 1.23.10 on all nodes - T300324 [09:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:47] T300324: Upgrade Envoy to supported version - 
https://phabricator.wikimedia.org/T300324 [09:39:34] (03PS2) 10Elukey: services: raise anoymous traffic limit for liftwing endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/935379 (https://phabricator.wikimedia.org/T340982) [09:39:43] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42197/console" [puppet] - 10https://gerrit.wikimedia.org/r/935385 (https://phabricator.wikimedia.org/T249237) (owner: 10Majavah) [09:39:43] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:39:43] (SystemdUnitFailed) firing: (2) envoyproxy.service Failed on wdqs1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:39:52] this is me [09:40:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:toolforge: mail: DKIM sign outgoing mail [puppet] - 10https://gerrit.wikimedia.org/r/935385 (https://phabricator.wikimedia.org/T249237) (owner: 10Majavah) [09:41:03] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:41:32] (03CR) 10Jbond: [C: 03+2] ssh: support listening on multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/928797 (https://phabricator.wikimedia.org/T337241) (owner: 10Majavah) [09:41:58] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [09:42:11] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [09:42:12] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [09:42:44] !log cgoubert@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/changeprop-jobqueue: apply [09:42:46] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [09:43:21] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [09:45:47] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:11] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [09:46:35] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [09:46:36] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [09:47:09] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:25] 10ops-codfw, 10Traffic: ManagementSSHDown - https://phabricator.wikimedia.org/T340960 (10fgiunchedi) Opened a dedicated task for the issue: {T341039} [09:47:36] (03CR) 10Fabfur: [V: 03+1] haproxy: support different actions for tls and http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [09:47:48] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [09:47:49] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [09:47:51] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T341021 (10fgiunchedi) [09:47:53] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T341020 (10fgiunchedi) [09:47:55] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T341019 (10fgiunchedi) [09:48:10] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [09:48:16] (03CR) 10Elukey: "Forgot to check the puppet part, I added a couple of question but 
again I'll defer to Ben and Steve to decide what's best :)" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:48:23] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T341035 (10fgiunchedi) [09:48:25] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T341033 (10fgiunchedi) [09:48:27] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T341034 (10fgiunchedi) [09:49:37] (03CR) 10Btullis: [C: 03+2] Add the GPG key for the Confluent Platform 7 repository [puppet] - 10https://gerrit.wikimedia.org/r/935381 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [09:49:43] (SystemdUnitFailed) resolved: (2) envoyproxy.service Failed on wdqs1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:50:30] (03PS4) 10Jbond: P:systemd::timesyncd: automate generation of ntp_servers list [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [09:50:32] (03PS1) 10Jbond: profile::base: add boolean to manage timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/935387 [09:51:57] (03CR) 10David Caro: maintain-dbusers: ensure get_global_wiki_user is only called when needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [09:52:55] !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.16 refs T340244 (duration: 50m 51s) [09:52:58] T340244: 1.41.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T340244 [09:53:35] (03CR) 10Jbond: [C: 04-1] "-1: see comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [09:53:43] (03PS2) 10Jbond: profile::base: add boolean to manage timesyncd [puppet] - 
10https://gerrit.wikimedia.org/r/935387 [09:53:45] (03PS5) 10Jbond: P:systemd::timesyncd: automate generation of ntp_servers list [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [09:54:03] (03CR) 10Hnowlan: [C: 03+1] services: raise anoymous traffic limit for liftwing endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/935379 (https://phabricator.wikimedia.org/T340982) (owner: 10Elukey) [09:54:51] (03CR) 10Hnowlan: [C: 03+2] api-gateway: set memory limit for ratelimit container [deployment-charts] - 10https://gerrit.wikimedia.org/r/933084 (owner: 10Hnowlan) [09:55:08] !log jnuche@deploy1002 Pruned MediaWiki: 1.41.0-wmf.13 (duration: 02m 11s) [09:55:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42198/console" [puppet] - 10https://gerrit.wikimedia.org/r/935387 (owner: 10Jbond) [09:55:45] (03CR) 10Jbond: [V: 03+1 C: 03+2] profile::base: add boolean to manage timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/935387 (owner: 10Jbond) [09:55:47] (03Merged) 10jenkins-bot: api-gateway: set memory limit for ratelimit container [deployment-charts] - 10https://gerrit.wikimedia.org/r/933084 (owner: 10Hnowlan) [09:57:06] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935388 (https://phabricator.wikimedia.org/T340244) [09:57:08] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935388 (https://phabricator.wikimedia.org/T340244) (owner: 10TrainBranchBot) [09:57:53] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935388 (https://phabricator.wikimedia.org/T340244) (owner: 10TrainBranchBot) [09:58:09] moving train to group0, it will run over the train deployment window for a few minutes [10:00:06] Deploy window MediaWiki 
infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230704T1000) [10:01:10] (03PS1) 10Majavah: P:toolforge: mail: blackhole noreply@ [puppet] - 10https://gerrit.wikimedia.org/r/935389 [10:01:26] (03CR) 10Vgutierrez: [C: 03+1] "NOOP at haproxy level in production. Take into account deployment-prep environment as well: https://puppet-compiler.wmflabs.org/output/935" [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:03:50] (03CR) 10Jelto: [C: 03+1] miscweb: add statictendril release to miscweb staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [10:04:46] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.16 refs T340244 [10:04:49] T340244: 1.41.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T340244 [10:05:16] done now [10:05:33] 10SRE, 10RESTBase, 10RESTBase-API, 10Traffic: REST API is not invalidating caches after template and/or module changes - https://phabricator.wikimedia.org/T335770 (10akosiaris) @Brycehughes it does, albeit I think you shouldn't be able to reproduce now. This was possibly the result of cache purging taking... 
[10:05:35] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:05:51] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:05:53] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:05:57] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [10:05:59] PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:13] PROBLEM - Hadoop Namenode - Primary on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [10:06:51] (03CR) 10Alexandros Kosiaris: [C: 03+1] services: raise anoymous traffic limit for liftwing endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/935379 (https://phabricator.wikimedia.org/T340982) (owner: 10Elukey) [10:08:37] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/935389 (owner: 10Majavah) [10:09:20] (03CR) 10Giuseppe Lavagetto: "> Patch Set 1: Code-Review-1" [alerts] - 10https://gerrit.wikimedia.org/r/884039 (owner: 10Giuseppe Lavagetto) [10:10:37] RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:49] RECOVERY - Hadoop Namenode - Primary on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [10:11:41] (03CR) 10David Caro: [C: 03+2] P:toolforge: mail: blackhole noreply@ [puppet] - 10https://gerrit.wikimedia.org/r/935389 
(owner: 10Majavah) [10:12:44] (03CR) 10Alexandros Kosiaris: [C: 04-1] "15 minute sounds more plausible for a failsafe. If 3 is derived from historical data, add a comment as to how it was derived so that other" [alerts] - 10https://gerrit.wikimedia.org/r/884039 (owner: 10Giuseppe Lavagetto) [10:20:53] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/rest-gateway: -i apply [10:20:54] !log jayme@deploy1002 helmfile [staging] FAIL (3) helmfile.d/services/mw-api-int: -i apply [10:21:12] (03CR) 10Hashar: "Via T340814 I have found the root cause being cloud-init overwriting the /etc/apt/sources.list we provided by Puppet when creating the ins" [puppet] - 10https://gerrit.wikimedia.org/r/934409 (owner: 10Majavah) [10:22:32] (03PS6) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) [10:23:24] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (10SLyngshede-WMF) [10:24:00] (03CR) 10Slyngshede: [C: 03+2] P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [10:24:28] (03Abandoned) 10Slyngshede: C:beta:autoupdate Move wmf-beta-autoupdate to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/931499 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [10:28:37] RECOVERY - Check systemd state on puppetboard1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:32] (03PS1) 10Btullis: Configure the confluent7 component for reprepro updates [puppet] - 10https://gerrit.wikimedia.org/r/935391 (https://phabricator.wikimedia.org/T329514) [10:32:51] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:33:11] PROBLEM - Check systemd state on puppetboard1003 is CRITICAL: CRITICAL - degraded: The following units failed: uwsgi-puppetboard.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:16] (03CR) 10Btullis: [C: 03+2] Configure the confluent7 component for reprepro updates [puppet] - 10https://gerrit.wikimedia.org/r/935391 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [10:36:42] (03CR) 10Arturo Borrero Gonzalez: "LGTM. I can merge it, let me know when you are ready." [puppet] - 10https://gerrit.wikimedia.org/r/934505 (https://phabricator.wikimedia.org/T340070) (owner: 10Hashar) [10:41:42] (03PS8) 10Arturo Borrero Gonzalez: apt: add package_from_bpo define [puppet] - 10https://gerrit.wikimedia.org/r/935047 [10:42:27] (03CR) 10Hashar: [V: 03+1] "Tested by cherry picking it to the integration standalone Puppet master." 
[puppet] - 10https://gerrit.wikimedia.org/r/934505 (https://phabricator.wikimedia.org/T340070) (owner: 10Hashar) [10:42:35] (03PS2) 10Majavah: P:toolforge: mailrelay: reject outbound emails without a sender [puppet] - 10https://gerrit.wikimedia.org/r/935093 (https://phabricator.wikimedia.org/T337259) [10:43:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] contint: parameterize the docker lvm disk size [puppet] - 10https://gerrit.wikimedia.org/r/934505 (https://phabricator.wikimedia.org/T340070) (owner: 10Hashar) [10:44:42] (03CR) 10Effie Mouzeli: [C: 03+2] Convert kubestagemaster from CNAME to A record (#7) [dns] - 10https://gerrit.wikimedia.org/r/935035 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [10:48:56] (03Abandoned) 10Clément Goubert: termbox: Migrate from staging-test to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/914275 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [10:52:15] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/rest-gateway: -i apply [10:52:45] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/eventgate-logging-external: -i apply [10:52:47] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/developer-portal: -i apply [10:52:50] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/sessionstore: -i apply [10:52:52] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/machinetranslation: -i apply [10:52:55] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/recommendation-api: -i apply [10:52:57] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/image-suggestion: -i apply [10:52:59] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/echostore: -i apply [10:53:11] well...this is annoying - sorry [10:53:27] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/eventstreams: -i apply [10:56:59] (03PS7) 10Fabfur: haproxy: support different actions for tls and http frontend 
[puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) [10:58:01] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): PKI: add the new puppet CA to the pki infrastructre - https://phabricator.wikimedia.org/T340557 (10jbond) > I will update the pki orchestration so that we create a cert file with made up of > ` > $ cat $(facte... [10:58:12] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): PKI: add the new puppet CA to the pki infrastructre - https://phabricator.wikimedia.org/T340557 (10jbond) 05Open→03Resolved [10:58:15] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [10:58:20] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42204/console" [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [11:00:39] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Configuer SRV records for new puppet infrastructre - https://phabricator.wikimedia.org/T341053 (10jbond) [11:00:48] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Configuer SRV records for new puppet infrastructre - https://phabricator.wikimedia.org/T341053 (10jbond) 05Open→03In progress p:05Triage→03Medium [11:00:56] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [11:01:37] (03PS1) 10Btullis: Correct a problem with the confluent7 component updating [puppet] - 10https://gerrit.wikimedia.org/r/935395 (https://phabricator.wikimedia.org/T329514) [11:02:51] RECOVERY - Check systemd state on kubestagemaster2002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:21] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [11:03:22] (03CR) 10Btullis: [C: 03+2] Correct a problem with the confluent7 component updating [puppet] - 10https://gerrit.wikimedia.org/r/935395 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [11:03:25] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [11:04:18] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [11:04:22] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [11:08:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] apt: add package_from_bpo define (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935047 (owner: 10Arturo Borrero Gonzalez) [11:08:03] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [11:08:05] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [11:10:13] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/rdf-streaming-updater: apply [11:10:14] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Configuer SRV records for new puppet infrastructre - https://phabricator.wikimedia.org/T341053 (10jbond) [11:12:38] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/push-notifications: apply [11:13:19] (03PS1) 10Jbond: wmnet: add SRV records for puppet compileres and CA [dns] - 10https://gerrit.wikimedia.org/r/935396 (https://phabricator.wikimedia.org/T341053) [11:14:13] (03CR) 10CI reject: [V: 04-1] wmnet: add SRV records for puppet compileres and CA [dns] - 10https://gerrit.wikimedia.org/r/935396 (https://phabricator.wikimedia.org/T341053) (owner: 10Jbond) [11:14:39] RECOVERY - Check whether ferm is active by checking the default 
input chain on kubestagemaster2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:17:22] (03PS2) 10Jbond: wmnet: add SRV records for puppet compileres and CA [dns] - 10https://gerrit.wikimedia.org/r/935396 (https://phabricator.wikimedia.org/T341053) [11:17:51] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [11:17:54] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [11:18:14] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Configuer SRV records for new puppet infrastructre - https://phabricator.wikimedia.org/T341053 (10akosiaris) I am not sure how the is now with that setting, but back in 2016, when we tried enabling this, it ended up in more tha... [11:18:27] (03CR) 10CI reject: [V: 04-1] wmnet: add SRV records for puppet compileres and CA [dns] - 10https://gerrit.wikimedia.org/r/935396 (https://phabricator.wikimedia.org/T341053) (owner: 10Jbond) [11:18:29] (03PS2) 10Clément Goubert: termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/935397 (https://phabricator.wikimedia.org/T334064) [11:23:30] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Configuer SRV records for new puppet infrastructre - https://phabricator.wikimedia.org/T341053 (10jbond) >>! In T341053#8987717, @akosiaris wrote: > I am not sure how the is now with that setting, but back in 2016, when we trie... [11:25:37] (03CR) 10Gmodena: "FYI: I deployed the new docker image on staging. Application PODs restarted, with no apparent issue." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/935153 (https://phabricator.wikimedia.org/T338169) (owner: 10Gmodena) [11:26:15] !log jayme@deploy1002 helmfile [staging] FAIL (1) helmfile.d/services/similar-users: apply [11:26:43] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/blubberoid: apply [11:27:16] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/termbox: apply [11:28:00] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/wikifeeds: apply [11:28:23] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): volatile: We need to configure the volatile endpoint on puppetserveres - https://phabricator.wikimedia.org/T341056 (10jbond) [11:28:41] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): volatile: We need to configure the volatile endpoint on puppetserveres - https://phabricator.wikimedia.org/T341056 (10jbond) p:05Triage→03Medium [11:28:41] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/api-gateway: apply [11:29:04] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [11:29:44] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/shellbox-constraints: apply [11:31:37] (03PS3) 10Jbond: wmnet: add SRV records for puppet compileres and CA [dns] - 10https://gerrit.wikimedia.org/r/935396 (https://phabricator.wikimedia.org/T341053) [11:31:44] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/shellbox-timeline: apply [11:32:13] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/tegola-vector-tiles: apply [11:33:13] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/apertium: apply [11:33:49] (03CR) 10Jbond: "please review" [dns] - 10https://gerrit.wikimedia.org/r/935396 (https://phabricator.wikimedia.org/T341053) (owner: 10Jbond) [11:37:42] (03PS1) 10Effie Mouzeli: Add 
kubestagemaster1002 and 2002 to k8s_staging eBGP config (#8) [homer/public] - 10https://gerrit.wikimedia.org/r/935402 (https://phabricator.wikimedia.org/T329827) [11:38:35] (03CR) 10CI reject: [V: 04-1] Add kubestagemaster1002 and 2002 to k8s_staging eBGP config (#8) [homer/public] - 10https://gerrit.wikimedia.org/r/935402 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [11:39:44] (03PS6) 10Jbond: P:systemd::timesyncd: automate generation of ntp_servers list [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [11:39:46] (03PS2) 10Jbond: P:dns::recursor: automatically generate resolv.conf for DNS hosts [puppet] - 10https://gerrit.wikimedia.org/r/933497 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [11:39:48] (03PS1) 10Jbond: site_nearest_core: Add an data element to indicate the closest core [puppet] - 10https://gerrit.wikimedia.org/r/935403 (https://phabricator.wikimedia.org/T340479) [11:39:50] (03CR) 10Jbond: P:systemd::timesyncd: automate generation of ntp_servers list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [11:40:04] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/eventgate-analytics-external: apply [11:40:52] (03PS1) 10JMeybohm: similar-users: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/935404 (https://phabricator.wikimedia.org/T300324) [11:41:08] (03PS2) 10Effie Mouzeli: Add kubestagemaster1002 and 2002 to k8s_staging eBGP config (#8) [homer/public] - 10https://gerrit.wikimedia.org/r/935402 (https://phabricator.wikimedia.org/T329827) [11:41:09] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/cxserver: apply [11:41:45] (03PS1) 10Hashar: ci: setup dockervolume before Docker daemon [puppet] - 10https://gerrit.wikimedia.org/r/935405 (https://phabricator.wikimedia.org/T341051) [11:41:59] !log jayme@deploy1002 helmfile [staging] 
OK helmfile.d/services/zotero: apply [11:42:43] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/shellbox-media: apply [11:43:06] (03PS3) 10Effie Mouzeli: Add kubestagemaster1002 and 2002 to k8s_staging eBGP config (#8) [homer/public] - 10https://gerrit.wikimedia.org/r/935402 (https://phabricator.wikimedia.org/T329827) [11:43:11] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/citoid: apply [11:43:46] (03CR) 10Cathal Mooney: "LGTM. I think for our more permanent setup eqiad still makes sense, but this will help with the new server installs in knams for sure. +" [cookbooks] - 10https://gerrit.wikimedia.org/r/933434 (https://phabricator.wikimedia.org/T340465) (owner: 10Volans) [11:43:49] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/device-analytics: apply [11:44:35] (03CR) 10JMeybohm: [C: 03+1] Add kubestagemaster1002 and 2002 to k8s_staging eBGP config (#8) [homer/public] - 10https://gerrit.wikimedia.org/r/935402 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [11:45:18] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/mobileapps: apply [11:45:31] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/linkrecommendation: apply [11:46:02] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/eventgate-analytics: apply [11:46:46] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/shellbox-syntaxhighlight: apply [11:47:16] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/eventgate-main: apply [11:47:56] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/eventstreams-internal: apply [11:48:12] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/shellbox: apply [11:48:36] (03CR) 10Cathal Mooney: [C: 03+1] Add kubestagemaster1002 and 2002 to k8s_staging eBGP config (#8) [homer/public] - 10https://gerrit.wikimedia.org/r/935402 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [11:48:51] !log 
jayme@deploy1002 helmfile [staging] OK helmfile.d/services/toolhub: apply [11:49:03] (03CR) 10JMeybohm: [C: 03+2] similar-users: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/935404 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [11:49:52] (03Merged) 10jenkins-bot: similar-users: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/935404 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [11:50:43] (03CR) 10Majavah: [C: 04-1] replica_cnf_api: refactor to use multiple backends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [11:53:34] !log jayme@deploy1002 helmfile [staging] FAIL (1) helmfile.d/services/miscweb: apply [11:55:32] !log jayme@deploy1002 helmfile [staging] OK helmfile.d/services/similar-users: apply [11:55:52] (03CR) 10Effie Mouzeli: [C: 03+2] Add kubestagemaster1002 and 2002 to k8s_staging eBGP config (#8) [homer/public] - 10https://gerrit.wikimedia.org/r/935402 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [11:57:06] (03PS1) 10Hnowlan: device-analytics: add etag header support [deployment-charts] - 10https://gerrit.wikimedia.org/r/935406 (https://phabricator.wikimedia.org/T340735) [11:57:28] (03Merged) 10jenkins-bot: Add kubestagemaster1002 and 2002 to k8s_staging eBGP config (#8) [homer/public] - 10https://gerrit.wikimedia.org/r/935402 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [11:57:35] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) [11:57:46] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) [11:59:54] (03PS1) 10Jbond: puppet::agent: add support for srv records [puppet] - 10https://gerrit.wikimedia.org/r/935407 
(https://phabricator.wikimedia.org/T341053) [12:01:23] (03CR) 10Jbond: [C: 03+1] wmnet: add SRV records for puppet compileres and CA [dns] - 10https://gerrit.wikimedia.org/r/935396 (https://phabricator.wikimedia.org/T341053) (owner: 10Jbond) [12:01:39] (03CR) 10Jbond: [C: 03+1] "done" [puppet] - 10https://gerrit.wikimedia.org/r/929726 (https://phabricator.wikimedia.org/T337829) (owner: 10Cathal Mooney) [12:01:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42207/console" [puppet] - 10https://gerrit.wikimedia.org/r/935407 (https://phabricator.wikimedia.org/T341053) (owner: 10Jbond) [12:01:48] (03CR) 10Slyngshede: [C: 03+2] Add user samtar to shell group wmcs-roots [puppet] - 10https://gerrit.wikimedia.org/r/929726 (https://phabricator.wikimedia.org/T337829) (owner: 10Cathal Mooney) [12:02:41] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops (or wmcs-roots) for TheresNoTime - https://phabricator.wikimedia.org/T337829 (10SLyngshede-WMF) 05Open→03Resolved [12:05:34] (HelmReleaseBadStatus) firing: (2) Helm release miscweb/annualreport on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:07:01] my fault...gonna fix in a bit [12:10:09] (03PS1) 10Slyngshede: P:backup::host unique id is not available in facter3 [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) [12:11:46] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42208/console" [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [12:12:32] !log jayme@deploy1002 helmfile [staging] 
START helmfile.d/services/miscweb: apply [12:13:45] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [12:15:34] (HelmReleaseBadStatus) resolved: (2) Helm release miscweb/annualreport on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:18:46] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/apertium: apply [12:19:45] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/blubberoid: apply [12:20:23] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/citoid: apply [12:20:27] (03PS1) 10Jbond: puppet::agent: configure srv_domain based on site [puppet] - 10https://gerrit.wikimedia.org/r/935409 (https://phabricator.wikimedia.org/T341053) [12:20:57] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/cxserver: apply [12:21:30] (03CR) 10Btullis: [C: 03+2] Fix the script to install the spark3 yarn shuffler jar symlink [puppet] - 10https://gerrit.wikimedia.org/r/922585 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [12:21:40] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/developer-portal: apply [12:22:23] (03CR) 10CI reject: [V: 04-1] puppet::agent: configure srv_domain based on site [puppet] - 10https://gerrit.wikimedia.org/r/935409 (https://phabricator.wikimedia.org/T341053) (owner: 10Jbond) [12:22:28] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/device-analytics: apply [12:23:14] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/echostore: apply [12:24:07] (03PS2) 10Slyngshede: P:backup::host unique id is not available in facter3 [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) [12:24:23] !log jayme@deploy1002 helmfile [codfw] OK 
helmfile.d/services/eventgate-analytics: apply [12:25:04] (HelmReleaseBadStatus) firing: (2) Helm release miscweb/annualreport on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:25:17] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42210/console" [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [12:25:22] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/eventgate-analytics-external: apply [12:26:16] (03CR) 10Klausman: [C: 03+1] services: raise anoymous traffic limit for liftwing endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/935379 (https://phabricator.wikimedia.org/T340982) (owner: 10Elukey) [12:26:56] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [12:27:09] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [12:27:31] (03CR) 10Elukey: [C: 03+2] services: raise anoymous traffic limit for liftwing endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/935379 (https://phabricator.wikimedia.org/T340982) (owner: 10Elukey) [12:27:50] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/eventgate-logging-external: apply [12:28:19] (HelmReleaseBadStatus) resolved: (2) Helm release miscweb/annualreport on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:28:26] !log jayme@deploy1002 helmfile [codfw] OK 
helmfile.d/services/eventgate-main: apply [12:29:05] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/eventstreams: apply [12:29:11] (03PS8) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) [12:29:18] (03CR) 10Hashar: [C: 04-1] "I have cherry picked it on the integration standalone Puppet master. I will confirm whether it works properly after I have created a brand" [puppet] - 10https://gerrit.wikimedia.org/r/935405 (https://phabricator.wikimedia.org/T341051) (owner: 10Hashar) [12:29:40] (03CR) 10CI reject: [V: 04-1] haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [12:29:55] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/eventstreams-internal: apply [12:30:54] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/image-suggestion: apply [12:30:58] (03CR) 10Jcrespo: [C: 04-1] "This won't work as is- changing all the days backups are done instantly will be a huge coordination and monitoring problem. Needs more thi" [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [12:31:06] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [12:31:16] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [12:33:01] (03CR) 10Jcrespo: [C: 04-1] "As in, if we started from 0 or eventually, this is ok, but it should be migrated progresively." 
[puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [12:34:25] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/linkrecommendation: apply [12:36:38] (03PS1) 10Btullis: Deploy a new version of the datahub images [deployment-charts] - 10https://gerrit.wikimedia.org/r/935413 (https://phabricator.wikimedia.org/T329514) [12:39:21] (03PS2) 10Jbond: puppet::agent: configure srv_domain based on site [puppet] - 10https://gerrit.wikimedia.org/r/935409 (https://phabricator.wikimedia.org/T341053) [12:40:44] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/machinetranslation: apply [12:41:01] (03CR) 10CI reject: [V: 04-1] puppet::agent: configure srv_domain based on site [puppet] - 10https://gerrit.wikimedia.org/r/935409 (https://phabricator.wikimedia.org/T341053) (owner: 10Jbond) [12:41:27] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/miscweb: apply [12:42:00] (03PS9) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) [12:42:52] (03PS2) 10Jgiannelos: wikifeeds: Add CSP headers for restbase sunset [deployment-charts] - 10https://gerrit.wikimedia.org/r/935051 (https://phabricator.wikimedia.org/T340769) [12:42:55] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/mobileapps: apply [12:43:25] (03PS1) 10Kosta Harlan: ipoid: Debug staging deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/935414 [12:44:19] (03CR) 10Btullis: [C: 03+2] Deploy a new version of the datahub images [deployment-charts] - 10https://gerrit.wikimedia.org/r/935413 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [12:44:23] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/proton: apply [12:45:01] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/push-notifications: apply [12:45:20] (03Merged) 10jenkins-bot: Deploy a new version 
of the datahub images [deployment-charts] - 10https://gerrit.wikimedia.org/r/935413 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [12:45:38] (03PS2) 10Effie Mouzeli: ipoid: Use config.dev.yaml for debugging purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/935414 (owner: 10Kosta Harlan) [12:46:03] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/rdf-streaming-updater: apply [12:47:03] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/recommendation-api: apply [12:47:05] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42213/console" [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [12:47:20] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: Use config.dev.yaml for debugging purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/935414 (owner: 10Kosta Harlan) [12:47:31] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/shellbox: apply [12:47:56] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/shellbox-constraints: apply [12:48:14] (03Merged) 10jenkins-bot: ipoid: Use config.dev.yaml for debugging purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/935414 (owner: 10Kosta Harlan) [12:48:20] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/shellbox-media: apply [12:48:43] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [12:48:45] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/shellbox-syntaxhighlight: apply [12:48:47] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [12:48:55] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [12:48:59] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [12:49:10] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/shellbox-timeline: apply [12:49:16] 
!log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [12:49:28] (03PS10) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) [12:49:52] (03PS3) 10ArielGlenn: dumps: Update documentation [puppet] - 10https://gerrit.wikimedia.org/r/902738 (owner: 10Meno25) [12:50:14] (03PS9) 10Jbond: apt: add package_from_bpo define [puppet] - 10https://gerrit.wikimedia.org/r/935047 (owner: 10Arturo Borrero Gonzalez) [12:50:17] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/similar-users: apply [12:50:29] (03CR) 10ArielGlenn: [C: 03+2] dumps: Update documentation [puppet] - 10https://gerrit.wikimedia.org/r/902738 (owner: 10Meno25) [12:50:39] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [12:50:42] (03CR) 10Jcrespo: [C: 04-1] "I think the easiest solution would be to hardcode the day for existing backups for migration (leaving the default for new configured backu" [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [12:50:53] (03CR) 10FNegri: [C: 03+2] Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [12:51:28] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [12:51:49] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [12:52:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:23] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/tegola-vector-tiles: apply [12:53:56] (03CR) 10Jbond: "lgtm but see comment re notice" [puppet] - 10https://gerrit.wikimedia.org/r/935047 (owner: 10Arturo Borrero Gonzalez) 
[12:54:46] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/termbox: apply
[12:55:18] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/blubberoid: apply
[12:55:56] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/toolhub: apply
[12:56:08] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/citoid: apply
[12:57:15] (PS10) Arturo Borrero Gonzalez: apt: add package_from_bpo define [puppet] - https://gerrit.wikimedia.org/r/935047
[12:57:16] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/cxserver: apply
[12:57:20] (CR) Arturo Borrero Gonzalez: apt: add package_from_bpo define (1 comment) [puppet] - https://gerrit.wikimedia.org/r/935047 (owner: Arturo Borrero Gonzalez)
[12:58:02] (PS1) Aklapper: phabricator: quarterly_metrics.sh: Improve Bitergia instructions [puppet] - https://gerrit.wikimedia.org/r/935416 (https://phabricator.wikimedia.org/T341064)
[12:58:14] (PS1) Jaime Nuche: releases-jenkins: fix access control [puppet] - https://gerrit.wikimedia.org/r/935417 (https://phabricator.wikimedia.org/T338071)
[12:59:21] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/developer-portal: apply
[13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230704T1300).
[13:00:04] No Gerrit patches in the queue for this window AFAICS.
[13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230704T1300) [13:00:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:11] * Lucas_WMDE around [13:00:14] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/device-analytics: apply [13:00:22] yup, nothing to deploy it seems [13:00:31] yay [13:00:33] I might have a config change later but no idea if it’ll get done in time, we’ll see [13:01:02] (03CR) 10Jbond: [C: 03+1] "cheers lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/935047 (owner: 10Arturo Borrero Gonzalez) [13:01:13] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/echostore: apply [13:01:20] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/wikifeeds: apply [13:01:59] !log jayme@deploy1002 helmfile [codfw] OK helmfile.d/services/zotero: apply [13:02:40] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/eventgate-analytics: apply [13:02:42] (03CR) 10Jgiannelos: "@MSantos regarding the interim between switching over, restbase already adds the headers so its should be ok" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935051 (https://phabricator.wikimedia.org/T340769) (owner: 10Jgiannelos) [13:03:06] (03PS1) 10Hashar: labs_lvm: wipe fs signatures when creating logical volume [puppet] - 10https://gerrit.wikimedia.org/r/935418 (https://phabricator.wikimedia.org/T300002) [13:03:17] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/eventgate-analytics-external: apply [13:03:40] Lucas_WMDE: Urbanecm cherry-picked one of my patch to wmf.15 and 16, not sure if they wants it deployed? 
[13:03:56] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Creat cookbook to migrate serveres from the puppetmasters to puppetservers - https://phabricator.wikimedia.org/T340739 (10jbond) p:05Triage→03Medium [13:04:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] apt: add package_from_bpo define [puppet] - 10https://gerrit.wikimedia.org/r/935047 (owner: 10Arturo Borrero Gonzalez) [13:05:47] (03PS1) 10Arturo Borrero Gonzalez: dynamicproxy: api: migrate python3-flask-sqlalchemy to apt::package_from_bpo [puppet] - 10https://gerrit.wikimedia.org/r/935419 [13:05:48] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/eventgate-logging-external: apply [13:05:52] (03CR) 10Jgiannelos: wikifeeds: Add CSP headers for restbase sunset (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935051 (https://phabricator.wikimedia.org/T340769) (owner: 10Jgiannelos) [13:06:17] Func: would be great to backport, but I cannot do it now. If Lucas_WMDE wants to, I'd appreciate that. 
[13:06:41] I’m actually in a meeting at the moment, sorry [13:06:45] but I can maybe do it later [13:07:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:38] (03PS1) 10Jbond: puppetboard1003: force puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/935420 [13:08:03] (03PS1) 10Arturo Borrero Gonzalez: keepalived: migrate to apt::package_from_bpo [puppet] - 10https://gerrit.wikimedia.org/r/935421 [13:08:23] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/eventgate-main: apply [13:08:24] (03CR) 10Jbond: [C: 03+2] puppetboard1003: force puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/935420 (owner: 10Jbond) [13:08:45] (03PS1) 10Hashar: labs_lvm: add `.sh` extension to shell scripts [puppet] - 10https://gerrit.wikimedia.org/r/935422 [13:08:53] (03PS3) 10Slyngshede: P:backup::host unique id is not available in facter3 [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) [13:09:15] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/eventstreams-internal: apply [13:09:46] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Configure SRV records for new puppet infrastructure - https://phabricator.wikimedia.org/T341053 (10Aklapper) [13:10:33] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42214/console" [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [13:10:37] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42215/console" [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [13:10:42] !log jayme@deploy1002 helmfile [eqiad] OK 
helmfile.d/services/image-suggestion: apply [13:11:18] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/linkrecommendation: apply [13:12:00] (03CR) 10Hashar: "Somehow shellcheck reports the issue but passes CI:" [puppet] - 10https://gerrit.wikimedia.org/r/935422 (owner: 10Hashar) [13:12:07] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/935421/42216/" [puppet] - 10https://gerrit.wikimedia.org/r/935421 (owner: 10Arturo Borrero Gonzalez) [13:12:22] (03PS1) 10Jbond: Revert "puppetboard1003: force puppet7" [puppet] - 10https://gerrit.wikimedia.org/r/935123 [13:12:28] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "puppetboard1003: force puppet7" [puppet] - 10https://gerrit.wikimedia.org/r/935123 (owner: 10Jbond) [13:12:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] dynamicproxy: api: migrate python3-flask-sqlalchemy to apt::package_from_bpo [puppet] - 10https://gerrit.wikimedia.org/r/935419 (owner: 10Arturo Borrero Gonzalez) [13:12:35] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] keepalived: migrate to apt::package_from_bpo [puppet] - 10https://gerrit.wikimedia.org/r/935421 (owner: 10Arturo Borrero Gonzalez) [13:13:07] arturo: feel free to merge mine [13:13:08] (03PS4) 10Slyngshede: P:backup::host unique id is not available in facter3 [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) [13:13:23] nevermind i got it [13:13:26] jbond: it was already merged [13:13:58] alright, I can deploy now I think [13:14:05] * Lucas_WMDE looks for the change [13:14:22] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42219/console" [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [13:14:35] (03PS1) 10Hashar: labs_lvm: pass shellcheck on scripts [puppet] - 10https://gerrit.wikimedia.org/r/935423 [13:14:58] (03CR) 10Fabfur: 
[V: 03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42218/console" [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [13:15:09] (03PS1) 10Jbond: puppetboard::bookworm: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/935424 (https://phabricator.wikimedia.org/T340739) [13:15:16] Lucas_WMDE: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/935121 and https://gerrit.wikimedia.org/r/c/mediawiki/core/+/935122 on wmf.15 and 16 respectively, I am around and can test. [13:15:43] (03PS1) 10Btullis: Temporarily disable gobblin jobs on the analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/935425 (https://phabricator.wikimedia.org/T332765) [13:15:45] is it easier to test on wmf.16 or wmf.15? [13:15:45] (03PS1) 10Btullis: Temporarily disable the spark jobs that are running on an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/935426 (https://phabricator.wikimedia.org/T332765) [13:15:47] (03PS1) 10Btullis: Upgrade the spark shuffler service from version 2 to version 3 [puppet] - 10https://gerrit.wikimedia.org/r/935427 (https://phabricator.wikimedia.org/T332765) [13:16:08] Lucas_WMDE: same [13:16:14] ok [13:16:17] (03CR) 10Hashar: [C: 03+1] "The rake task only fails on shellcheck errors :]" [puppet] - 10https://gerrit.wikimedia.org/r/935422 (owner: 10Hashar) [13:16:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:28] then doing .16 first [13:16:45] (03CR) 10Btullis: [C: 04-1] "Not to be merged until 2023-07-05 as part of the upgrade."
[puppet] - 10https://gerrit.wikimedia.org/r/935425 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [13:16:53] (03CR) 10Jbond: [C: 03+2] puppetboard::bookworm: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/935424 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [13:16:55] (03CR) 10Btullis: [C: 04-1] "Not to be merged until 2023-07-05 as part of the upgrade." [puppet] - 10https://gerrit.wikimedia.org/r/935426 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [13:16:58] * Lucas_WMDE looks at the diff and screams a little bit [13:17:03] (03CR) 10Slyngshede: [V: 03+1] P:backup::host unique id is not available in facter3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [13:17:05] (03CR) 10Btullis: [C: 04-1] "Not to be merged until 2023-07-05 as part of the upgrade." [puppet] - 10https://gerrit.wikimedia.org/r/935427 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [13:17:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935121 (https://phabricator.wikimedia.org/T341002) (owner: 10Urbanecm) [13:17:27] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/machinetranslation: apply [13:18:15] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/miscweb: apply [13:19:13] (03PS1) 10Hnowlan: api-gateway: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/935428 [13:19:39] (03CR) 10Elukey: [C: 03+1] api-gateway: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/935428 (owner: 10Hnowlan) [13:19:44] (03PS2) 10Hashar: labs_lvm: wipe fs signatures when creating logical volume [puppet] - 10https://gerrit.wikimedia.org/r/935418 (https://phabricator.wikimedia.org/T300002) [13:20:05] (03CR) 10Ssingh: [C: 03+1] "Doing a separate commit make sense!" 
[puppet] - 10https://gerrit.wikimedia.org/r/935403 (https://phabricator.wikimedia.org/T340479) (owner: 10Jbond) [13:20:50] (03CR) 10Jbond: [C: 03+2] site_nearest_core: Add an data element to indicate the closest core (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935403 (https://phabricator.wikimedia.org/T340479) (owner: 10Jbond) [13:21:20] (03PS3) 10Hashar: labs_lvm: wipe fs signatures when creating logical volume [puppet] - 10https://gerrit.wikimedia.org/r/935418 (https://phabricator.wikimedia.org/T300002) [13:21:44] (03CR) 10Hnowlan: [C: 03+2] api-gateway: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/935428 (owner: 10Hnowlan) [13:21:58] 10SRE, 10Observability-Alerting, 10Traffic, 10serviceops: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (10fgiunchedi) [13:22:15] (03CR) 10Slyngshede: [V: 03+1] P:backup::host unique id is not available in facter3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [13:22:30] (03Merged) 10jenkins-bot: api-gateway: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/935428 (owner: 10Hnowlan) [13:22:35] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/mobileapps: apply [13:22:39] (03PS1) 10Jbond: puppetboard::bookworm: switch server to new puppet infrastructre [puppet] - 10https://gerrit.wikimedia.org/r/935429 (https://phabricator.wikimedia.org/T340739) [13:22:56] (03CR) 10Ssingh: P:systemd::timesyncd: automate generation of ntp_servers list (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [13:24:08] (03CR) 10Jbond: [C: 03+2] puppetboard::bookworm: switch server to new puppet infrastructre [puppet] - 10https://gerrit.wikimedia.org/r/935429 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [13:25:16] (03CR) 10Vgutierrez: [C: 04-1] haproxy: support different actions for 
tls and http frontend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [13:26:42] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/proton: apply [13:27:33] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/push-notifications: apply [13:27:42] (03PS4) 10Hashar: labs_lvm: wipe fs signatures when creating logical volume [puppet] - 10https://gerrit.wikimedia.org/r/935418 (https://phabricator.wikimedia.org/T300002) [13:28:35] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [13:28:37] 10SRE, 10Observability-Alerting, 10Traffic, 10serviceops: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (10fgiunchedi) [13:28:44] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [13:28:55] (03CR) 10Ssingh: "Essentially what this commit is doing is automating the data in /etc/systemd/timesyncd.conf, instead of handling it manually. Example:" [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [13:29:56] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Creat cookbook to migrate serveres from the puppetmasters to puppetservers - https://phabricator.wikimedia.org/T340739 (10jbond) Theses are the manual steps i made to migrate puppetboard1003 * Agent: [[ https://gerrit.wikimedi... [13:31:14] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [13:31:33] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [13:32:18] (03CR) 10Hashar: "I had the issue on integration-docker-agent-1040. 
It has two logical volume, the first /var/lib/docker was set to 24G, I have deleted it a" [puppet] - 10https://gerrit.wikimedia.org/r/935418 (https://phabricator.wikimedia.org/T300002) (owner: 10Hashar) [13:32:24] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [13:32:29] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/rdf-streaming-updater: apply [13:32:40] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [13:33:27] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/recommendation-api: apply [13:33:32] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "starting the gate-and-submit already" [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/935122 (https://phabricator.wikimedia.org/T341002) (owner: 10Urbanecm) [13:34:00] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/shellbox-constraints: apply [13:34:20] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/shellbox-media: apply [13:34:31] (03CR) 10Ssingh: "Thanks for the review! Addressing one nit in the follow-up patch and leaving the other for later." 
[puppet] - 10https://gerrit.wikimedia.org/r/933497 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [13:34:35] (03PS11) 10Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) [13:34:42] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/shellbox-syntaxhighlight: apply [13:35:01] (03Merged) 10jenkins-bot: DeleteAction: Avoid displaying the form unconditionally [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935121 (https://phabricator.wikimedia.org/T341002) (owner: 10Urbanecm) [13:35:04] (03CR) 10Joal: [C: 03+1] Temporarily disable gobblin jobs on the analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/935425 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [13:35:08] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/shellbox-timeline: apply [13:35:25] (03PS2) 10Hashar: labs_lvm: add `.sh` extension to shell scripts [puppet] - 10https://gerrit.wikimedia.org/r/935422 [13:35:49] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:935121|DeleteAction: Avoid displaying the form unconditionally (T341002)]] [13:35:51] T341002: Manually constructing action=delete URLs displays a deletion form instead of an error message - https://phabricator.wikimedia.org/T341002 [13:36:02] Thanks for the deployment Lucas_WMDE. 
[13:36:16] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/similar-users: apply [13:36:41] jouncebot: nowandnext [13:36:41] For the next 0 hour(s) and 23 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230704T1300) [13:36:42] For the next 0 hour(s) and 23 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230704T1300) [13:36:42] In 2 hour(s) and 23 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230704T1600) [13:36:45] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/tegola-vector-tiles: apply [13:37:00] (03PS2) 10Hashar: labs_lvm: pass shellcheck on scripts [puppet] - 10https://gerrit.wikimedia.org/r/935423 [13:37:22] I'll wait until the end of the backports to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/933911 :D [13:37:23] !log lucaswerkmeister-wmde@deploy1002 urbanecm and lucaswerkmeister-wmde: Backport for [[gerrit:935121|DeleteAction: Avoid displaying the form unconditionally (T341002)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:37:24] Seems prudent [13:37:35] Func: can you test? [13:37:36] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/termbox: apply [13:37:41] testing.. [13:37:42] claime: I can take a break between the two backports ^^ [13:37:53] Lucas_WMDE: No need, there's no rush [13:37:57] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/toolhub: apply [13:37:58] ok [13:38:00] (03CR) 10DCausse: mw-page-content-change-enrichment stream partition WIP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935153 (https://phabricator.wikimedia.org/T338169) (owner: 10Gmodena) [13:38:20] but maybe the second backport would be useful to immediately test the new rsync procedure? 
🤔 [13:38:29] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/wikifeeds: apply [13:38:41] * Lucas_WMDE doesn’t really understand what that puppet change does tho [13:38:43] Lucas_WMDE: looks good. [13:38:50] ok, syncing [13:39:00] Hey there, I would like to run a schema change on beta for T340694. May I go ahead when the deployment window is over? [13:39:01] T340694: Create new participant questions columns in beta DB - https://phabricator.wikimedia.org/T340694 [13:39:19] (03PS1) 10Jbond: ipuppetserver::g10k: dont recurse cache directory [puppet] - 10https://gerrit.wikimedia.org/r/935432 [13:39:21] (03CR) 10Hashar: "That indeed cause CI (or bundle exec rake shellcheck) to pass shellcheck on those bash scripts :)" [puppet] - 10https://gerrit.wikimedia.org/r/935422 (owner: 10Hashar) [13:39:28] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:39:30] (03PS1) 10Effie Mouzeli: admin: add kubestagemaster1002 and 2002 helmfile (#9) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935433 (https://phabricator.wikimedia.org/T329827) [13:39:53] (03CR) 10CI reject: [V: 04-1] ipuppetserver::g10k: dont recurse cache directory [puppet] - 10https://gerrit.wikimedia.org/r/935432 (owner: 10Jbond) [13:39:55] (03PS1) 10Majavah: P:toolforge: redis: drop unused profile [puppet] - 10https://gerrit.wikimedia.org/r/935434 [13:39:58] aren’t new beta tables usually created via update.php? or does that not work for wikishared? 
[13:40:07] (03PS2) 10Gmodena: mw-page-content-change-enrichment stream partition WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/935153 (https://phabricator.wikimedia.org/T338169) [13:40:09] Yeah, not working for wikishared [13:40:11] ok [13:40:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:47] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/zotero: apply [13:41:24] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Create dynamic CRL - https://phabricator.wikimedia.org/T340543 (10jbond) I have added config to set the agent to `certificate_revocation = leaf` however we will need to move this config to the main section as we also need to to do a `... [13:42:07] (03CR) 10DCausse: [C: 03+1] mw-page-content-change-enrichment stream partition WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/935153 (https://phabricator.wikimedia.org/T338169) (owner: 10Gmodena) [13:44:04] (03CR) 10JMeybohm: [C: 03+1] admin: add kubestagemaster1002 and 2002 helmfile (#9) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935433 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [13:44:14] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:935121|DeleteAction: Avoid displaying the form unconditionally (T341002)]] (duration: 08m 25s) [13:44:17] T341002: Manually constructing action=delete URLs displays a deletion form instead of an error message - https://phabricator.wikimedia.org/T341002 [13:44:25] (03CR) 10Effie Mouzeli: [C: 03+2] admin: add kubestagemaster1002 and 2002 helmfile (#9) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935433 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [13:44:36] alright, I’ll go ahead with the second backport then [13:44:53] (03CR) 10TrainBranchBot: [C: 03+2] 
"Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/935122 (https://phabricator.wikimedia.org/T341002) (owner: 10Urbanecm) [13:45:22] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/eventstreams: apply [13:45:44] Also -- one more question: for T320258, I need to put a secret API key in the config. I know that goes into PrivateSettings.php or sth like that, but what's the process for getting that done? Put the patch file in a restricted phab task so the deployer can then apply that? [13:45:45] T320258: Dashboard integration: Configure the P&E Dashboard integration in beta - https://phabricator.wikimedia.org/T320258 [13:46:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:47] (03Merged) 10jenkins-bot: admin: add kubestagemaster1002 and 2002 helmfile (#9) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935433 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [13:48:28] Daimona: something like that, yes. but looks like you're trying to contact an external API from MW directly - can I ask where that's been approved by SRE? [13:49:47] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [13:49:58] Thanks! As for your question, let me find the relevant tasks [13:50:19] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:50:22] !log jayme@deploy1002 helmfile [eqiad] OK helmfile.d/services/shellbox: apply [13:50:30] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:50:39] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:51:04] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[13:51:05] (03CR) 10Jcrespo: "Could you do a puppet run on a couple of existing hosts to make sure it works?" [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [13:52:23] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): puptet7: drop iunstances of :undef in erb files - https://phabricator.wikimedia.org/T341071 (10jbond) [13:52:59] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): puptet7: drop iunstances of :undef in erb files - https://phabricator.wikimedia.org/T341071 (10jbond) [13:53:02] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Create dynamic CRL - https://phabricator.wikimedia.org/T340543 (10jbond) [13:53:10] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [13:53:15] So, https://phabricator.wikimedia.org/T320641 is the security/privacy review. There might have been more conversations but I'm not 100% sure about that. As for SRE, I remember an old conversation where we were basically told that as long as it uses the url-downloader proxy (via MW's HttpRequestFactory) it wouldn't be an issue. I don't think it went through an "official" review process though. 
[13:53:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:03] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: sync [13:55:05] (03CR) 10Hnowlan: [C: 03+2] device-analytics: add etag header support [deployment-charts] - 10https://gerrit.wikimedia.org/r/935406 (https://phabricator.wikimedia.org/T340735) (owner: 10Hnowlan) [13:55:18] (03PS1) 10JMeybohm: mathoid: Switch back to default envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/935435 (https://phabricator.wikimedia.org/T300324) [13:55:29] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: sync [13:55:59] (03Merged) 10jenkins-bot: device-analytics: add etag header support [deployment-charts] - 10https://gerrit.wikimedia.org/r/935406 (https://phabricator.wikimedia.org/T340735) (owner: 10Hnowlan) [13:57:08] (03Merged) 10jenkins-bot: DeleteAction: Avoid displaying the form unconditionally [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/935122 (https://phabricator.wikimedia.org/T341002) (owner: 10Urbanecm) [13:57:33] (03PS11) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) [13:57:35] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:935122|DeleteAction: Avoid displaying the form unconditionally (T341002)]] [13:57:38] T341002: Manually constructing action=delete URLs displays a deletion form instead of an error message - https://phabricator.wikimedia.org/T341002 [13:57:55] (03PS1) 10Jbond: envoproxy: drop the use of :undef [puppet] - 10https://gerrit.wikimedia.org/r/935436 (https://phabricator.wikimedia.org/T341071) [13:58:10] (03CR) 10Fabfur: haproxy: support different actions for tls and http 
frontend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [13:58:33] (03PS2) 10JMeybohm: mathoid,mw-debug: Switch back to default envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/935435 (https://phabricator.wikimedia.org/T300324) [13:58:45] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [13:58:49] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [13:59:03] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and urbanecm: Backport for [[gerrit:935122|DeleteAction: Avoid displaying the form unconditionally (T341002)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:59:08] Func: ^ [13:59:46] (03CR) 10JMeybohm: [C: 03+2] mathoid,mw-debug: Switch back to default envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/935435 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [14:00:05] Lucas_WMDE: tested on enwiki, good to go [14:00:06] (03CR) 10Ladsgroup: [C: 03+1] Stop setting $wgCommentTempTableSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929997 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [14:00:16] ok thanks [14:00:19] syncing [14:00:37] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Creat cookbook to migrate serveres from the puppetmasters to puppetservers - https://phabricator.wikimedia.org/T340739 (10Volans) >>! In T340739#8988130, @jbond wrote: > Theses are the manual steps i made to migrate puppetboard1... 
[14:00:47] (03CR) 10CI reject: [V: 04-1] envoproxy: drop the use of :undef [puppet] - 10https://gerrit.wikimedia.org/r/935436 (https://phabricator.wikimedia.org/T341071) (owner: 10Jbond) [14:00:51] (03Merged) 10jenkins-bot: mathoid,mw-debug: Switch back to default envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/935435 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [14:00:53] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply [14:00:53] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [14:00:55] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [14:00:57] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [14:01:11] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply [14:01:21] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [14:01:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:34] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply [14:02:16] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:02:21] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [14:02:34] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [14:02:50] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:03:04] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [14:03:12] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:03:36] (03CR) 10Jbond: "lgtm but i would like to stop using the $::ntp_peers directly, i would like to get rid of global variable usage as much as possible. 
empt" [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [14:03:56] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:04:52] oops, we’re already overrunning the window [14:05:12] (03CR) 10FNegri: "Is there any way to check that this profile is not used anywhere, including VPS hosts?" [puppet] - 10https://gerrit.wikimedia.org/r/935434 (owner: 10Majavah) [14:05:51] (03PS2) 10Jbond: envoproxy: drop the use of :undef [puppet] - 10https://gerrit.wikimedia.org/r/935436 (https://phabricator.wikimedia.org/T341071) [14:06:03] (03CR) 10Majavah: P:toolforge: redis: drop unused profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935434 (owner: 10Majavah) [14:06:36] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:07:09] hm, my scap isn’t printing a lot of output at the moment [14:07:11] (last was “14:02:23 Finished Running helmfile…”) [14:07:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42221/console" [puppet] - 10https://gerrit.wikimedia.org/r/935436 (https://phabricator.wikimedia.org/T341071) (owner: 10Jbond) [14:07:20] maybe some image is taking a long time to download in k8s [14:07:29] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) All nodes and most k8s deployments have been updated to run 1.23.10, only exceptions are api-gateway and rest-gateway which still run 1.18 as well as da... 
[14:07:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:09] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): puptet7: drop instances of :undef in erb files - https://phabricator.wikimedia.org/T341071 (10Aklapper) [14:08:59] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Create cookbook to migrate servers from the puppetmasters to puppetservers - https://phabricator.wikimedia.org/T340739 (10Aklapper) [14:09:09] 10SRE, 10envoy, 10serviceops: Refactor envoy max_requests_per_connection from Cluster to HttpProtocolOptions - https://phabricator.wikimedia.org/T304124 (10JMeybohm) 05Stalled→03Open [14:09:14] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) [14:11:26] Lucas_WMDE: Does it say which helmfile ? 
[14:11:36] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:11:46] “14:02:23 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int (duration: 00m 58s)” [14:11:49] is the last line it printed [14:11:54] ack [14:11:55] wait, I should be able to [14:11:55] oh [14:11:59] it just timed out I guess [14:12:08] “Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web (duration: 10m 27s)” [14:12:21] : Deployment of mw-web-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1. [14:12:26] *codfw: Deployment… [14:12:30] ok I'll check out mw-web on codfw, don't worry about it [14:12:34] and Rolling back to prior state [14:12:43] bit tricky to copy+paste while the terminal keeps moving [14:12:44] yeah [14:12:44] ok thanks! [14:12:46] it's non blocking [14:12:58] I guess the next scap would push the change to k8s either way [14:12:58] Lucas_WMDE: Are you in a tmux ? 
ctrl-b q [14:13:05] too lazy ;) [14:13:08] (but yeah) [14:13:11] Lucas_WMDE: It would yeah, but I'd like to check why it timed out [14:13:28] ok one sec [14:13:44] (03CR) 10Gmodena: mw-page-content-change-enrichment stream partition WIP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935153 (https://phabricator.wikimedia.org/T338169) (owner: 10Gmodena) [14:14:12] (03CR) 10Gmodena: [C: 03+2] mw-page-content-change-enrichment stream partition WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/935153 (https://phabricator.wikimedia.org/T338169) (owner: 10Gmodena) [14:14:49] claime: https://phabricator.wikimedia.org/P49503 [14:14:58] (03Merged) 10jenkins-bot: mw-page-content-change-enrichment stream partition WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/935153 (https://phabricator.wikimedia.org/T338169) (owner: 10Gmodena) [14:15:11] (c-space, m-w, :paste into cat) [14:15:22] thanks [14:15:43] php-fpm-restart is running now, so soon it’ll be over as far as I’m concerned, I guess :) [14:15:44] I think it just took too long to roll forward [14:16:05] Ah no, limits issue [14:16:07] Great [14:16:09] (03CR) 10Ssingh: P:systemd::timesyncd: automate generation of ntp_servers list (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [14:16:16] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:935122|DeleteAction: Avoid displaying the form unconditionally (T341002)]] (duration: 18m 41s) [14:16:20] T341002: Manually constructing action=delete URLs displays a deletion form instead of an error message - https://phabricator.wikimedia.org/T341002 [14:16:34] urbanecm, Func: ^ fyi, mostly done, claime is looking into k8s issue [14:16:43] !log UTC afternoon backport+config window done [14:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:47] !log gmodena@deploy1002 helmfile [staging] START 
helmfile.d/services/mw-page-content-change-enrich: apply [14:16:51] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:17:02] thanks [14:17:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:40] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:18:47] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:19:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 57 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42222/console" [puppet] - 10https://gerrit.wikimedia.org/r/935436 (https://phabricator.wikimedia.org/T341071) (owner: 10Jbond) [14:19:47] (03PS1) 10Clément Goubert: admin_ng: Raise mw-web limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/935440 [14:20:10] (03PS1) 10Effie Mouzeli: ipoid: add NO_PROXY env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/935441 [14:20:15] !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [14:20:19] !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:22:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:57] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:23:02] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:24:38] (03CR) 10Effie Mouzeli: [C: 03+1] admin_ng: Raise mw-web 
limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/935440 (owner: 10Clément Goubert) [14:24:47] (03CR) 10Clément Goubert: [C: 03+2] admin_ng: Raise mw-web limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/935440 (owner: 10Clément Goubert) [14:27:08] (03Merged) 10jenkins-bot: admin_ng: Raise mw-web limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/935440 (owner: 10Clément Goubert) [14:27:18] (03CR) 10Jbond: [V: 03+1 C: 03+2] envoproxy: drop the use of :undef [puppet] - 10https://gerrit.wikimedia.org/r/935436 (https://phabricator.wikimedia.org/T341071) (owner: 10Jbond) [14:27:22] !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:27:30] !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:27:36] (03PS2) 10Jbond: ipuppetserver::g10k: dont recurse cache directory [puppet] - 10https://gerrit.wikimedia.org/r/935432 [14:28:03] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): puppet7: drop instances of :undef in erb files - https://phabricator.wikimedia.org/T341071 (10Aklapper) [14:29:05] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:29:11] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:29:14] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: add NO_PROXY env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/935441 (owner: 10Effie Mouzeli) [14:29:17] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:29:23] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:29:29] (03PS7) 10Ssingh: P:systemd::timesyncd: automate generation of ntp_servers list [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) [14:30:07] RECOVERY - Check systemd state on an-launcher1002 is 
OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:30] (03Merged) 10jenkins-bot: ipoid: add NO_PROXY env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/935441 (owner: 10Effie Mouzeli) [14:31:46] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [14:32:06] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [14:32:21] (03PS2) 10Giuseppe Lavagetto: Add missing paging alert for high backend errors in trafficserver [alerts] - 10https://gerrit.wikimedia.org/r/884039 [14:32:58] (03PS1) 10Stevemunene: Create spark3 local directory [puppet] - 10https://gerrit.wikimedia.org/r/935444 (https://phabricator.wikimedia.org/T332765) [14:33:17] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:33:23] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:33:56] (03CR) 10CI reject: [V: 04-1] Add missing paging alert for high backend errors in trafficserver [alerts] - 10https://gerrit.wikimedia.org/r/884039 (owner: 10Giuseppe Lavagetto) [14:33:59] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:34:31] (03CR) 10Krinkle: api-gateway: Switch to mw-api-int-async on k8s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert) [14:35:24] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[14:36:12] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:36:18] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:36:29] (03CR) 10Jbond: [C: 03+2] ipuppetserver::g10k: dont recurse cache directory [puppet] - 10https://gerrit.wikimedia.org/r/935432 (owner: 10Jbond) [14:36:37] (03PS3) 10Jbond: envoproxy: drop the use of :undef [puppet] - 10https://gerrit.wikimedia.org/r/935436 (https://phabricator.wikimedia.org/T341071) [14:37:04] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:37:09] 10SRE-swift-storage, 10Wikimedia-Site-requests, 10serviceops: Cleanup cirrus keys in $wmfSwiftEqiadConfig - https://phabricator.wikimedia.org/T199220 (10MatthewVernon) @dcausse Are you able to confirm I can dispose of the `search:backup` ms-swift account, please? Or if not do you know who can give the OK? [14:37:34] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:37:35] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42226/console" [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [14:37:55] (03PS2) 10Krinkle: webperf: Enable `recurse_submodules` for performance/docroot clone [puppet] - 10https://gerrit.wikimedia.org/r/934710 [14:38:35] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:38:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:58] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:39:01] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:39:25] (03PS12) 10Clément Goubert: api-gateway: Switch to mw-api-int on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) [14:39:29] (03CR) 10Clément Goubert: api-gateway: Switch to mw-api-int on k8s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert) [14:39:41] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard1003 is OK: HTTP OK: HTTP/1.1 200 OK - 9544 bytes in 0.270 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [14:40:02] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [14:40:03] (03CR) 10Vgutierrez: haproxy: support different actions for tls and http frontend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [14:40:05] !log cgoubert@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mw-web: apply [14:40:06] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [14:40:09] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [14:40:21] Heh, idjit. That's not how you deploy mw-on-k8s. [14:41:08] !log redeploying mw-on-k8s [14:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:11] !log cgoubert@deploy1002 Started scap: (no justification provided) [14:41:19] (03PS3) 10Giuseppe Lavagetto: Add missing paging alert for high backend errors in trafficserver [alerts] - 10https://gerrit.wikimedia.org/r/884039 [14:42:04] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [14:42:07] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:42:36] (03CR) 10Jelto: ci/zuul: switch gearman server from contint2001 to contint2002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867705 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [14:42:48] (03PS3) 10Ssingh: P:dns::recursor: automatically generate resolv.conf for DNS hosts [puppet] - 10https://gerrit.wikimedia.org/r/933497 (https://phabricator.wikimedia.org/T340479) [14:43:23] !log cgoubert@deploy1002 Finished scap: (no justification provided) (duration: 02m 12s) [14:43:42] Lucas_WMDE: ^ all redeployed :) [14:43:47] \o/ [14:43:50] (03CR) 10Giuseppe Lavagetto: [C: 03+1] api-gateway: Switch to mw-api-int on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert) [14:46:15] !log jbond@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=puppetboard-next,name=codfw [14:46:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:16] (03PS1) 10Arturo Borrero Gonzalez: 
private.eqiad.wikimedia.cloud: introduce support for new zone [dns] - 10https://gerrit.wikimedia.org/r/935446 (https://phabricator.wikimedia.org/T341063) [14:48:10] (03CR) 10CI reject: [V: 04-1] private.eqiad.wikimedia.cloud: introduce support for new zone [dns] - 10https://gerrit.wikimedia.org/r/935446 (https://phabricator.wikimedia.org/T341063) (owner: 10Arturo Borrero Gonzalez) [14:49:39] (03PS1) 10Krinkle: mw-cli-wrapper: fix own dc reference in Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/935448 [14:49:47] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [14:49:50] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:50:18] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [14:50:22] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:50:31] 10SRE-swift-storage, 10Wikimedia-Site-requests, 10serviceops: Cleanup cirrus keys in $wmfSwiftEqiadConfig - https://phabricator.wikimedia.org/T199220 (10dcausse) @MatthewVernon yes we can delete this account and containers (cc @EBernhardson) [14:51:06] (03PS6) 10Lucas Werkmeister (WMDE): foundationwiki: Enable WikibaseClient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547 (https://phabricator.wikimedia.org/T321967) (owner: 10Varnent) [14:52:14] (03CR) 10Krinkle: "Adding reviewers based on last authors (2021)." 
[puppet] - 10https://gerrit.wikimedia.org/r/935448 (owner: 10Krinkle) [14:52:24] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] deployment: Use rsync::quickdatacopy, enable encryption [puppet] - 10https://gerrit.wikimedia.org/r/933911 (https://phabricator.wikimedia.org/T289857) (owner: 10Clément Goubert) [14:52:29] (03PS7) 10Lucas Werkmeister (WMDE): foundationwiki: Enable WikibaseClient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547 (https://phabricator.wikimedia.org/T321967) (owner: 10Varnent) [14:52:55] !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [14:52:57] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/933473/42228/" [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [14:52:58] !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:53:16] (03CR) 10Krinkle: "I imagine changing wikimedia-cluster is more difficult than changing the etcd key in Beta to match wikimedia-cluster. 
What would it take t" [puppet] - 10https://gerrit.wikimedia.org/r/935448 (owner: 10Krinkle) [14:53:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [14:53:55] !log Deploying encrypted rsync to deployment servers - T289857 [14:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:58] T289857: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 [14:54:49] (03PS1) 10Jaime Nuche: releases-jenkins: block LDAP users page [puppet] - 10https://gerrit.wikimedia.org/r/935453 (https://phabricator.wikimedia.org/T341074) [14:55:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:51] !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [14:56:54] !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:58:21] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [14:58:29] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:58:35] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:59:59] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 132 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [15:00:40] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudsw-c8.eqiad.codfw.wikimedia.cloud - aborrero@cumin1001" [15:01:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:25] !log aborrero@cumin1001 END (PASS) - 
Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudsw-c8.eqiad.codfw.wikimedia.cloud - aborrero@cumin1001" [15:01:25] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:01:35] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [15:02:55] 10SRE, 10Infrastructure-Foundations, 10Datacenter-Switchover, 10User-fgiunchedi: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (10Clement_Goubert) [15:03:34] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudsw-c8.private.eqiad.wikimedia.cloud - aborrero@cumin1001" [15:03:38] 10SRE, 10serviceops, 10Patch-For-Review: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (10Clement_Goubert) 05In progress→03Resolved Deployed, data transfer works between deploy2002 and deploy1002. Resolving. [15:04:07] (03PS8) 10Lucas Werkmeister (WMDE): foundationwiki: Enable WikibaseClient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547 (https://phabricator.wikimedia.org/T321967) (owner: 10Varnent) [15:04:09] (03PS1) 10Lucas Werkmeister (WMDE): outreachwiki: Set wmgWikibaseSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935455 [15:04:17] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudsw-c8.private.eqiad.wikimedia.cloud - aborrero@cumin1001" [15:04:17] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:06:49] (03CR) 10Hashar: [C: 04-1] "The issue might be the service { enable => true } is invoked BEFORE the Docker package is installed. Gotta give it a bit more thoughts." 
[puppet] - 10https://gerrit.wikimedia.org/r/935405 (https://phabricator.wikimedia.org/T341051) (owner: 10Hashar) [15:07:53] (03PS1) 10Effie Mouzeli: ipoid: enable staging in ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/935456 [15:08:14] (03PS2) 10Effie Mouzeli: ipoid: enable staging in ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/935456 [15:08:49] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache puppetboard-next.wikimedia.org on all recursors [15:08:52] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard-next.wikimedia.org on all recursors [15:09:43] (03PS3) 10Effie Mouzeli: ipoid: enable staging in ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/935456 [15:09:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:00] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: enable staging in ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/935456 (owner: 10Effie Mouzeli) [15:11:56] (03Merged) 10jenkins-bot: ipoid: enable staging in ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/935456 (owner: 10Effie Mouzeli) [15:12:02] (03PS1) 10Hnowlan: api-gateway: add native AQS1-style routes for AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/935457 (https://phabricator.wikimedia.org/T338916) [15:12:26] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [15:13:27] (03PS1) 10Krinkle: Profiler: Switch xhgui backend in beta cluster to deployment-db11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935459 [15:14:58] (03PS2) 10Krinkle: Profiler: Switch xhgui backend in beta cluster to deployment-db11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935459 [15:15:14] (03CR) 10Krinkle: [C: 03+2] Profiler: Switch xhgui backend in beta cluster to 
deployment-db11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935459 (owner: 10Krinkle) [15:15:36] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 0.5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341078 (10Clement_Goubert) [15:15:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:54] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [15:16:06] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 0.5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341078 (10Clement_Goubert) 05Open→03In progress p:05Triage→03High [15:16:16] (03CR) 10Btullis: "Can you do a pcc run for this please?" [puppet] - 10https://gerrit.wikimedia.org/r/935444 (https://phabricator.wikimedia.org/T332765) (owner: 10Stevemunene) [15:16:26] (03Merged) 10jenkins-bot: Profiler: Switch xhgui backend in beta cluster to deployment-db11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935459 (owner: 10Krinkle) [15:16:44] (03PS1) 10FNegri: cloudcumin: don't send logs prod IRC [puppet] - 10https://gerrit.wikimedia.org/r/935461 (https://phabricator.wikimedia.org/T325756) [15:17:13] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [15:17:40] (03PS1) 10Slyngshede: Forgot username [software/bitu] - 10https://gerrit.wikimedia.org/r/935462 [15:19:17] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [15:19:47] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) As you can see from the 
messages above, logging is working correctly from cloudcum... [15:21:16] (03PS1) 10Daimona Eaytoy: beta: Enable wgCampaignEventsProgramsAndEventsDashboardInstance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935463 (https://phabricator.wikimedia.org/T320258) [15:21:21] (03PS2) 10FNegri: cloudcumin: don't send logs prod IRC [puppet] - 10https://gerrit.wikimedia.org/r/935461 (https://phabricator.wikimedia.org/T325756) [15:21:33] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/935461 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [15:22:01] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): puppet7: drop instances of :undef in erb files - https://phabricator.wikimedia.org/T341071 (10jbond) p:05Triage→03Medium [15:23:09] (03PS1) 10Hnowlan: trafficserver: add gateway routing script, route device-analytics [puppet] - 10https://gerrit.wikimedia.org/r/935464 (https://phabricator.wikimedia.org/T320967) [15:23:59] (03CR) 10FNegri: [C: 03+2] P:toolforge: redis: drop unused profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935434 (owner: 10Majavah) [15:24:29] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [15:25:00] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [15:25:44] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [15:26:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:23] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [15:26:42] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [15:27:17] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [15:28:47] (03CR) 
10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/935466 (https://phabricator.wikimedia.org/T341078) (owner: 10Clément Goubert) [15:29:41] (03PS2) 10Clément Goubert: mw-on-k8s: Redirect 0.5% of all traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/935466 (https://phabricator.wikimedia.org/T341078) [15:30:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:45] 10SRE, 10ExternalGuidance, 10Traffic-Icebox: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10Aklapper) What exactly is left to do in this open task? [15:37:15] Hi all, is there a deployer willing to deploy a beta config change for T320258? [15:37:16] T320258: Dashboard integration: Configure the P&E Dashboard integration in beta - https://phabricator.wikimedia.org/T320258 [15:41:01] 10SRE-swift-storage, 10serviceops: Remove search:backup swift account and storage - https://phabricator.wikimedia.org/T341081 (10MatthewVernon) [15:41:50] 10SRE-swift-storage, 10Wikimedia-Site-requests, 10serviceops: Cleanup cirrus keys in $wmfSwiftEqiadConfig - https://phabricator.wikimedia.org/T199220 (10MatthewVernon) Thanks for confirming; I'll track that work on the new subtask (and remove the swift storage tag from this one). 
[15:46:42] (03CR) 10FNegri: "Hmm unfortunately this is less straightforward than I thought, because Spicerack only checks for "None" and not for an empty string, but t" [puppet] - 10https://gerrit.wikimedia.org/r/935461 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [15:46:46] !log delete swift container global-data-elastic-backups in AUTH_search account T341081 [15:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:49] T341081: Remove search:backup swift account and storage - https://phabricator.wikimedia.org/T341081 [15:55:52] (03PS1) 10Hashar: ci: enabling docker require the docker-ce package [puppet] - 10https://gerrit.wikimedia.org/r/935471 (https://phabricator.wikimedia.org/T341051) [15:56:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:56] !log jbond@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=puppetboard,name=codfw [15:57:09] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache puppetboard.wikimedia.org on all recursors [15:57:12] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.wikimedia.org on all recursors [15:58:50] (03PS3) 10FNegri: cloudcumin: don't send logs to prod IRC [puppet] - 10https://gerrit.wikimedia.org/r/935461 (https://phabricator.wikimedia.org/T325756) [16:00:05] jbond and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230704T1600) [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:01:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:05] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/935461 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [16:02:33] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache puppetboard.discovery.wmnet on all recursors [16:02:36] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.discovery.wmnet on all recursors [16:03:16] !log jbond@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=puppetboard,name=codfw [16:03:21] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache puppetboard.discovery.wmnet on all recursors [16:03:24] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.discovery.wmnet on all recursors [16:07:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:07:57] (03CR) 10Volans: "I don't think that this is a good solution as from the cloudcumin you could also run normal cookbooks on the cloud hosts and those should " [puppet] - 10https://gerrit.wikimedia.org/r/935461 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [16:09:05] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/933473/42230/" [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [16:09:07] (03PS1) 10Jbond: puppetboard-next: correct the site ip's [puppet] - 10https://gerrit.wikimedia.org/r/935473 [16:09:31] (03CR) 10Ssingh: [C: 03+1] puppetboard-next: correct the site ip's [puppet] - 10https://gerrit.wikimedia.org/r/935473 (owner: 10Jbond) [16:10:43] (03CR) 10Jbond: [C: 03+2] puppetboard-next: correct the site ip's [puppet] - 
https://gerrit.wikimedia.org/r/935473 (owner: Jbond)
[16:12:35] (PS4) FNegri: cloudcumin: don't send logs to prod IRC [puppet] - https://gerrit.wikimedia.org/r/935461 (https://phabricator.wikimedia.org/T325756)
[16:14:38] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache puppetboard.discovery.wmnet on all recursors
[16:14:41] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.discovery.wmnet on all recursors
[16:14:47] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache puppetboard-next.discovery.wmnet on all recursors
[16:14:50] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard-next.discovery.wmnet on all recursors
[16:15:30] (PS5) FNegri: cloudcumin: don't send logs to prod IRC [puppet] - https://gerrit.wikimedia.org/r/935461 (https://phabricator.wikimedia.org/T325756)
[16:16:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:17:41] (CR) FNegri: cloudcumin: don't send logs to prod IRC (1 comment) [puppet] - https://gerrit.wikimedia.org/r/935461 (https://phabricator.wikimedia.org/T325756) (owner: FNegri)
[16:18:20] (CR) FNegri: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/935461 (https://phabricator.wikimedia.org/T325756) (owner: FNegri)
[16:22:19] (PS1) Krinkle: Profiler: Actually switch xhgui backend in beta to deployment-db11 [mediawiki-config] - https://gerrit.wikimedia.org/r/935475
[16:22:36] (CR) Krinkle: [C: +2] Profiler: Actually switch xhgui backend in beta to deployment-db11 [mediawiki-config] - https://gerrit.wikimedia.org/r/935475 (owner: Krinkle)
[16:22:58] (CR) FNegri: "I'm not sure if this is the most idiomatic way to make that variable optional, but at least the PCC is now compiling correctly."
[puppet] - https://gerrit.wikimedia.org/r/935461 (https://phabricator.wikimedia.org/T325756) (owner: FNegri)
[16:24:55] (Merged) jenkins-bot: Profiler: Actually switch xhgui backend in beta to deployment-db11 [mediawiki-config] - https://gerrit.wikimedia.org/r/935475 (owner: Krinkle)
[16:28:54] (CR) Stevemunene: Create spark3 local directory (1 comment) [puppet] - https://gerrit.wikimedia.org/r/935444 (https://phabricator.wikimedia.org/T332765) (owner: Stevemunene)
[16:32:38] (PS1) Jbond: puppetdb2003: move to new role [puppet] - https://gerrit.wikimedia.org/r/935478 (https://phabricator.wikimedia.org/T321783)
[16:33:23] (CR) Jbond: [C: +2] puppetdb2003: move to new role [puppet] - https://gerrit.wikimedia.org/r/935478 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[16:36:24] (CR) Btullis: Create spark3 local directory (1 comment) [puppet] - https://gerrit.wikimedia.org/r/935444 (https://phabricator.wikimedia.org/T332765) (owner: Stevemunene)
[16:37:46] (CR) Stevemunene: Create spark3 local directory (1 comment) [puppet] - https://gerrit.wikimedia.org/r/935444 (https://phabricator.wikimedia.org/T332765) (owner: Stevemunene)
[16:40:29] (CR) Ssingh: [C: +1] wmnet: add SRV records for puppet compileres and CA [dns] - https://gerrit.wikimedia.org/r/935396 (https://phabricator.wikimedia.org/T341053) (owner: Jbond)
[16:46:06] (PS1) Fabfur: users: add new user (fabfur) [homer/public] - https://gerrit.wikimedia.org/r/935479
[16:54:57] (PS1) Jbond: pupetserver::git: ensure we build all parent directories [puppet] - https://gerrit.wikimedia.org/r/935480 (https://phabricator.wikimedia.org/T321783)
[16:57:13] (CR) CI reject: [V: -1] pupetserver::git: ensure we build all parent directories [puppet] - https://gerrit.wikimedia.org/r/935480 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[16:58:36] (PS12) Fabfur: haproxy: support different actions for tls and http
frontend [puppet] - https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983)
[16:58:59] (CR) CI reject: [V: -1] haproxy: support different actions for tls and http frontend [puppet] - https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230704T1700)
[17:09:11] (PS2) Jbond: pupetserver::git: ensure we build all parent directories [puppet] - https://gerrit.wikimedia.org/r/935480 (https://phabricator.wikimedia.org/T321783)
[17:09:15] SRE, RESTBase, RESTBase-API, Traffic: REST API is not invalidating caches after template and/or module changes - https://phabricator.wikimedia.org/T335770 (Brycehughes) @akosiaris Yep all clear now from Georgia (the country). However, this lasted much more than "several minutes". What do you thin...
[17:12:11] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:13:29] (CR) Jbond: [C: +2] pupetserver::git: ensure we build all parent directories [puppet] - https://gerrit.wikimedia.org/r/935480 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[17:16:11] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:20:22] (PS1) Jbond: pupetserver::git: Ensure we build the git repo if using init [puppet] - https://gerrit.wikimedia.org/r/935483 (https://phabricator.wikimedia.org/T321783)
[17:22:38] (CR) CI reject: [V: -1] pupetserver::git: Ensure we build the git repo if using init [puppet] - https://gerrit.wikimedia.org/r/935483 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[17:23:03] (PS2) Jbond: pupetserver::git: Ensure we build the git repo if using init [puppet] - https://gerrit.wikimedia.org/r/935483 (https://phabricator.wikimedia.org/T321783)
[17:25:18] (CR) CI reject: [V: -1] pupetserver::git: Ensure we build the git repo if using init [puppet] - https://gerrit.wikimedia.org/r/935483 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[17:29:12] (PS3) Jbond: pupetserver::git: Ensure we build the git repo if using init [puppet] - https://gerrit.wikimedia.org/r/935483 (https://phabricator.wikimedia.org/T321783)
[17:30:18] (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42236/console" [puppet] - https://gerrit.wikimedia.org/r/935483 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[17:31:56] (PS4) Jbond: pupetserver::git: Ensure we build the git repo if using init [puppet] - https://gerrit.wikimedia.org/r/935483 (https://phabricator.wikimedia.org/T321783)
[17:32:51] (CR) Ssingh: [C: +1] users: add new user (fabfur) [homer/public] - https://gerrit.wikimedia.org/r/935479 (owner: Fabfur)
[17:34:12] (CR) CI reject: [V: -1] pupetserver::git: Ensure we build the git repo if using init [puppet] - https://gerrit.wikimedia.org/r/935483 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[17:34:16] (CR) Ssingh: [C: +2] users: add new user (fabfur) [homer/public] - https://gerrit.wikimedia.org/r/935479 (owner: Fabfur)
[17:36:37] !log homer "*" commit "Gerrit: 935479 add fabur"
[17:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:47] !log [correction] homer "*" commit "Gerrit: 935479 add fabfur"
[17:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:16] (PS5) Jbond: pupetserver::git: Ensure we build the git repo if using init [puppet] - https://gerrit.wikimedia.org/r/935483
(https://phabricator.wikimedia.org/T321783)
[17:40:16] (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42238/console" [puppet] - https://gerrit.wikimedia.org/r/935483 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[17:41:43] (CR) Jbond: [V: +1 C: +2] pupetserver::git: Ensure we build the git repo if using init [puppet] - https://gerrit.wikimedia.org/r/935483 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[17:52:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: Primary switchover s8 T339223
[17:52:55] T339223: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T339223
[17:53:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s8 T339223
[17:56:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2161 with weight 0 T339223', diff saved to https://phabricator.wikimedia.org/P49506 and previous config saved to /var/cache/conftool/dbconfig/20230704-175604-ladsgroup.json
[17:59:09] (PS1) Jbond: puppetserver: codfw update puppetdb url [puppet] - https://gerrit.wikimedia.org/r/935507 (https://phabricator.wikimedia.org/T321783)
[17:59:32] (CR) CI reject: [V: -1] puppetserver: codfw update puppetdb url [puppet] - https://gerrit.wikimedia.org/r/935507 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[17:59:40] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42239/console" [puppet] - https://gerrit.wikimedia.org/r/935507 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[18:00:06] hashar and brennen: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230704T1800).
[18:01:36] (PS2) Jbond: puppetserver: codfw update puppetdb url [puppet] - https://gerrit.wikimedia.org/r/935507 (https://phabricator.wikimedia.org/T321783)
[18:01:47] !log disable puppet on A:wikidough to roll out CR 863295
[18:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:02] (CR) Ssingh: [C: +2] wikidough: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/863295 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[18:02:04] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42240/console" [puppet] - https://gerrit.wikimedia.org/r/935507 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[18:02:19] (CR) Jbond: [V: +1 C: +2] puppetserver: codfw update puppetdb url [puppet] - https://gerrit.wikimedia.org/r/935507 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[18:02:37] jbond: ok to merge your change?
[18:02:41] - - 'https://puppetdb2003.codfw.wmnet:8443'
[18:02:41] + - 'https://puppetdb2003.codfw.wmnet'
[18:03:04] ok looks like you merged mine too, thanks!
[18:03:08] sukhe: was just coming to say i have merged yours :)
[18:03:14] look safe
[18:03:18] yep, thanks! all good
[18:03:44] cool cheers
[18:05:58] (CR) Krinkle: "Confirmed in beta, works as intended."
[puppet] - https://gerrit.wikimedia.org/r/934710 (owner: Krinkle)
[18:06:06] !log enable puppet on A:wikidough to roll out CR 863295
[18:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:39] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:09:29] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:15:51] (PS1) Btullis: Remove the GMS SSL and port options from the datahub GMS chart [deployment-charts] - https://gerrit.wikimedia.org/r/935508 (https://phabricator.wikimedia.org/T329514)
[18:16:55] (PS1) Jbond: puppetmaster: add new puppetsrver [puppet] - https://gerrit.wikimedia.org/r/935509 (https://phabricator.wikimedia.org/T321783)
[18:17:51] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:20:53] (CR) Btullis: [C: +2] Remove the GMS SSL and port options from the datahub GMS chart [deployment-charts] - https://gerrit.wikimedia.org/r/935508 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[18:21:05] (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 3 NOOP 24): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42241/console" [puppet] - https://gerrit.wikimedia.org/r/935509 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[18:21:18] (PS1) Jbond: README: test puppet-merge functionality [puppet] - https://gerrit.wikimedia.org/r/935510
[18:21:20] (CR) Jbond: [V: +1 C: +2] puppetmaster: add new puppetsrver [puppet] -
https://gerrit.wikimedia.org/r/935509 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[18:21:54] (Merged) jenkins-bot: Remove the GMS SSL and port options from the datahub GMS chart [deployment-charts] - https://gerrit.wikimedia.org/r/935508 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[18:25:01] (CR) Jbond: [C: +2] README: test puppet-merge functionality [puppet] - https://gerrit.wikimedia.org/r/935510 (owner: Jbond)
[18:25:34] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[18:30:36] (PS2) Ladsgroup: mariadb: Promote db2161 to s8 master [puppet] - https://gerrit.wikimedia.org/r/929764 (https://phabricator.wikimedia.org/T339223) (owner: Gerrit maintenance bot)
[18:30:39] (CR) Ladsgroup: [C: +2] mariadb: Promote db2161 to s8 master [puppet] - https://gerrit.wikimedia.org/r/929764 (https://phabricator.wikimedia.org/T339223) (owner: Gerrit maintenance bot)
[18:30:42] (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db2161 to s8 master [puppet] - https://gerrit.wikimedia.org/r/929764 (https://phabricator.wikimedia.org/T339223) (owner: Gerrit maintenance bot)
[18:31:37] !log finished running homer for adding fabfur [pushed to all 55 devices successfully]
[18:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:43] !log Starting s8 codfw failover from db2165 to db2161 - T339223
[18:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:47] T339223: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T339223
[18:34:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2161 to s8 primary T339223', diff saved to https://phabricator.wikimedia.org/P49507 and previous config saved to /var/cache/conftool/dbconfig/20230704-183434-ladsgroup.json
[18:37:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2165 T339223', diff saved to
https://phabricator.wikimedia.org/P49508 and previous config saved to /var/cache/conftool/dbconfig/20230704-183748-ladsgroup.json
[18:37:53] T339223: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T339223
[18:38:44] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[18:39:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance
[18:39:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance
[18:41:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2165 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P49509 and previous config saved to /var/cache/conftool/dbconfig/20230704-184132-ladsgroup.json
[18:42:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:52:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:56:37] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[18:56:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2165 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P49510 and previous config saved to /var/cache/conftool/dbconfig/20230704-185637-ladsgroup.json
[18:59:14] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[19:02:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:07:00] !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[19:07:04] !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[19:09:56] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[19:11:03] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2003 is OK: HTTP OK: HTTP/1.1 200 OK - 10433 bytes in 0.486 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[19:11:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2165 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P49511 and previous config saved to /var/cache/conftool/dbconfig/20230704-191142-ladsgroup.json
[19:19:29] (PS1) Krinkle: webperf: Add XHGui credentials as ENV variable to performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/935512
[19:19:52] (CR) CI reject: [V: -1] webperf: Add XHGui credentials as ENV variable to performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/935512 (owner: Krinkle)
[19:19:59] (PS2) Krinkle: webperf: Add XHGui credentials as ENV variable to performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/935512
[19:20:03] (PS3) Krinkle: webperf: Add XHGui credentials as ENV variable to performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/935512
[19:20:27] (CR) CI reject: [V: -1] webperf: Add XHGui credentials as ENV variable to performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/935512 (owner: Krinkle)
[19:20:53] (PS4) Krinkle: webperf: Add XHGui credentials as ENV variable to performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/935512
[19:21:02] (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/935512 (owner: Krinkle)
[19:21:14] (CR) CI reject: [V: -1] webperf: Add XHGui credentials as ENV variable to performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/935512 (owner: Krinkle)
[19:21:37] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[19:22:40] (PS5) Krinkle: webperf: Add XHGui credentials as ENV variable to performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/935512
[19:23:05] (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/935512 (owner: Krinkle)
[19:23:32] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[19:23:38] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[19:26:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2165 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P49512 and previous config saved to /var/cache/conftool/dbconfig/20230704-192646-ladsgroup.json
[19:28:13] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Puppet7: Update documentation - https://phabricator.wikimedia.org/T341095 (jbond)
[19:28:54] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Puppet7: Update documentation - https://phabricator.wikimedia.org/T341095 (jbond) p:Triage→Medium
[19:29:38] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[19:29:45] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[19:30:18] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes -
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:33:28] SRE-swift-storage, Commons, MediaWiki-Action-API, Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007 (Umherirrender)
[19:33:31] (PS1) Jbond: puppetdb: add classification back [puppet] - https://gerrit.wikimedia.org/r/935514 (https://phabricator.wikimedia.org/T321783)
[19:35:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:37:17] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[19:37:23] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[19:38:01] (CR) Jbond: [C: +2] puppetdb: add classification back [puppet] - https://gerrit.wikimedia.org/r/935514 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[19:38:31] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[19:38:37] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[19:43:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:43:51] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:43:55] (PS1) Jbond: puppetmaster::frontend: add ssh key for puppetserver2001 [puppet] - https://gerrit.wikimedia.org/r/935515
[19:44:13] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!"
[puppet] - https://gerrit.wikimedia.org/r/934710 (owner: Krinkle)
[19:44:17] (CR) CI reject: [V: -1] puppetmaster::frontend: add ssh key for puppetserver2001 [puppet] - https://gerrit.wikimedia.org/r/935515 (owner: Jbond)
[19:45:29] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:45:38] (PS2) Jbond: puppetmaster::frontend: add ssh key for puppetserver2001 [puppet] - https://gerrit.wikimedia.org/r/935515
[19:47:15] (CR) Andrea Denisse: Add missing build dependencies for the Debian package (2 comments) [software/librenms] - https://gerrit.wikimedia.org/r/928659 (https://phabricator.wikimedia.org/T278309) (owner: Andrea Denisse)
[19:47:29] SRE-swift-storage, Commons, MediaWiki-Action-API, Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007 (Umherirrender) It is better to avoid publish your token used on api request (this could be an issue with the...
[19:47:39] (CR) Andrea Denisse: Add missing build dependencies for the Debian package (1 comment) [software/librenms] - https://gerrit.wikimedia.org/r/928659 (https://phabricator.wikimedia.org/T278309) (owner: Andrea Denisse)
[19:48:02] (CR) Jbond: [C: +2] puppetmaster::frontend: add ssh key for puppetserver2001 [puppet] - https://gerrit.wikimedia.org/r/935515 (owner: Jbond)
[19:49:30] (PS1) Krinkle: Update vendor to Ice8d2e9b6e538aebca [software/xhgui] (wmf_deploy) - https://gerrit.wikimedia.org/r/935517
[19:49:42] (CR) Krinkle: [V: +2 C: +2] Update vendor to Ice8d2e9b6e538aebca [software/xhgui] (wmf_deploy) - https://gerrit.wikimedia.org/r/935517 (owner: Krinkle)
[19:51:13] SRE-swift-storage, MediaWiki-Maintenance-system: clean_upload_stash maintenance cron fails to delete files on commons wiki (backend-fail-delete) - https://phabricator.wikimedia.org/T230179 (Umherirrender)
[19:51:41] SRE-swift-storage, MediaWiki-Maintenance-system, Privacy: commonswiki.uploadstash table has unexpectedly old data - https://phabricator.wikimedia.org/T130478 (Umherirrender)
[19:57:14] (PS6) Krinkle: webperf: Add XHGui credentials as ENV variable to performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/935512
[19:57:24] (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/935512 (owner: Krinkle)
[19:58:10] (CR) Andrea Denisse: [C: +2] webperf: Enable `recurse_submodules` for performance/docroot clone [puppet] - https://gerrit.wikimedia.org/r/934710 (owner: Krinkle)
[20:12:19] (PS7) Krinkle: webperf: Add XHGui credentials as ENV variable to performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/935512
[20:12:50] (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/935512 (owner: Krinkle)
[20:12:54] (PS3) Andrea Denisse: Add missing build dependencies for the Debian package [software/librenms] -
https://gerrit.wikimedia.org/r/928659 (https://phabricator.wikimedia.org/T278309)
[20:15:50] (PS8) Krinkle: webperf: Add XHGui credentials as ENV variable to performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/935512
[20:22:49] (CR) Krinkle: "Tested in beta cluster through URLs like https://performance.wikimedia.beta.wmflabs.org/xhgui/ for the index, and https://performance.wiki" [puppet] - https://gerrit.wikimedia.org/r/935512 (owner: Krinkle)
[20:31:16] (PS9) Krinkle: webperf: Provision XHGui directly on performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/935512
[20:33:46] (CR) Andrea Denisse: [C: +2] "PCC results: https://puppet-compiler.wmflabs.org/output/934710/42244/webperf1003.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/934710 (owner: Krinkle)
[20:47:47] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50277 bytes in 0.246 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:47:47] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:48:38] (PS10) Krinkle: webperf: Add XHGui credentials as ENV variable to performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/935512
[20:48:40] (PS1) Krinkle: xhgui: remove 'xhgui' module, role, profile, and host mapping [puppet] - https://gerrit.wikimedia.org/r/935522
[20:48:42] (PS1) Krinkle: webperf,arclamp: Rename to clarify as separate roles [puppet] - https://gerrit.wikimedia.org/r/935523
[20:48:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.319 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:49:21] (PS11) Krinkle: webperf: Provision XHGui directly on performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/935512
[20:49:23] (PS2) Krinkle: xhgui: remove 'xhgui' module, role, profile, and host mapping [puppet] - https://gerrit.wikimedia.org/r/935522
[20:49:25] (PS2) Krinkle: webperf,arclamp: Rename to clarify as separate roles [puppet] - https://gerrit.wikimedia.org/r/935523
[20:58:41] SRE, Wikimedia-IRC-RC-Server: Spam in PMs - https://phabricator.wikimedia.org/T341097 (taavi)
[21:02:16] SRE, Wikimedia-IRC-RC-Server: Spam in PMs - https://phabricator.wikimedia.org/T341097 (stwalkerster) I've advised on IRC that it's possible to set `/mode +g` to prevent receiving PMs. I don't know if it's possible to set that as a default umode in ratbox - a quick glance at the [[ https://gerrit....
[21:08:45] SRE, Wikimedia-IRC-RC-Server: Spam in PMs - https://phabricator.wikimedia.org/T341097 (stwalkerster) After a brief look through the code, a quick and dirty fix might be something like this: `name=operations/debs/ircd-ratbox.git diff --git a/src/s_user.c b/src/s_user.c --- a/src/s_user.c (revision 2c1ff7...
[21:13:02] SRE, Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (Peachey88)
[21:14:01] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!!" [puppet] - https://gerrit.wikimedia.org/r/935512 (owner: Krinkle)
[21:16:20] (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/935512 (owner: Krinkle)
[22:24:08] SRE, Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (Urbanecm) I think ideally, we should be disabling private messages on irc.wikimedia.org, or at least, setting +g by default.
[22:34:37] jouncebot: nowandnext
[22:34:37] No deployments scheduled for the next 7 hour(s) and 25 minute(s)
[22:34:37] In 7 hour(s) and 25 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230705T0600)
[22:36:05] (PS2) Zabe: Stop setting $wgCommentTempTableSchemaMigrationStage [mediawiki-config] - https://gerrit.wikimedia.org/r/929997 (https://phabricator.wikimedia.org/T299954)
[22:36:23] (CR) Zabe: [C: +2] Stop setting $wgCommentTempTableSchemaMigrationStage [mediawiki-config] - https://gerrit.wikimedia.org/r/929997 (https://phabricator.wikimedia.org/T299954) (owner: Zabe)
[22:37:12] (Merged) jenkins-bot: Stop setting $wgCommentTempTableSchemaMigrationStage [mediawiki-config] - https://gerrit.wikimedia.org/r/929997 (https://phabricator.wikimedia.org/T299954) (owner: Zabe)
[22:38:05] !log zabe@deploy1002 Started scap: Backport for [[gerrit:929997|Stop setting $wgCommentTempTableSchemaMigrationStage (T299954)]]
[22:38:10] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954
[22:39:39] !log zabe@deploy1002 zabe: Backport for [[gerrit:929997|Stop setting $wgCommentTempTableSchemaMigrationStage (T299954)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet,
mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[22:44:12] (CR) Zabe: [C: +1] Remove migrateStewards.php reference [mediawiki-config] - https://gerrit.wikimedia.org/r/934686 (owner: Majavah)
[22:46:01] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:929997|Stop setting $wgCommentTempTableSchemaMigrationStage (T299954)]] (duration: 07m 56s)
[22:46:06] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954
[22:46:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:48:23] zabe: if you're already deploying, want to ship that migrateStewards config patch?
[22:49:15] sure
[22:49:18] (CR) Zabe: [C: +2] Remove migrateStewards.php reference [mediawiki-config] - https://gerrit.wikimedia.org/r/934686 (owner: Majavah)
[22:50:04] (Merged) jenkins-bot: Remove migrateStewards.php reference [mediawiki-config] - https://gerrit.wikimedia.org/r/934686 (owner: Majavah)
[22:50:32] !log zabe@deploy1002 Started scap: Backport for [[gerrit:934686|Remove migrateStewards.php reference]]
[22:51:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:52:17] !log zabe@deploy1002 taavi and zabe: Backport for [[gerrit:934686|Remove migrateStewards.php reference]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[22:54:03] thx
[22:57:56] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:934686|Remove migrateStewards.php reference]] (duration: 07m 23s)
[23:02:51]
(JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:05:43] SRE, Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (Ferien) +g is alright as it stops the spam from getting through, but I'm still getting notifications for message attempts. Private messages should just be disabled entirely.
[23:16:57] (PS1) Zabe: Initial configuration for gpewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/935524 (https://phabricator.wikimedia.org/T335969)
[23:19:02] (CR) Zabe: [C: +2] Initial configuration for gpewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/935524 (https://phabricator.wikimedia.org/T335969) (owner: Zabe)
[23:19:46] (Merged) jenkins-bot: Initial configuration for gpewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/935524 (https://phabricator.wikimedia.org/T335969) (owner: Zabe)
[23:30:15] * zabe is doing most of the addWiki.php stuff manually now since it break half way through
[23:50:20] !log create Wikipedia Ghanaian Pidgin # T335969
[23:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:24] T335969: Create Wikipedia Ghanaian Pidgin - https://phabricator.wikimedia.org/T335969
[23:50:47] !log zabe@deploy1002 Started scap: T335969
[23:52:15] !log zabe@deploy1002 zabe: T335969 synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[23:58:28] !log zabe@deploy1002 Finished scap: T335969 (duration: 07m 40s)
[23:58:31] T335969: Create Wikipedia Ghanaian Pidgin - https://phabricator.wikimedia.org/T335969