[00:38:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/932813 [00:38:51] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/932813 (owner: 10TrainBranchBot) [00:58:01] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/932813 (owner: 10TrainBranchBot) [01:03:34] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T340501 (10phaultfinder) [01:03:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:08:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:47:16] RECOVERY - dump of s6 in eqiad on backupmon1001 is OK: Last dump for s6 at eqiad (db1140) taken on 2023-06-27 00:00:06 (71 GiB, +0.5 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:50:20] RECOVERY - dump of s6 in codfw on backupmon1001 is OK: Last dump for s6 at codfw (db2141) taken on 2023-06-27 00:00:09 (71 GiB, +0.5 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T0200) [02:07:17] (03PS1) 10TrainBranchBot: Branch commit for 
wmf/1.41.0-wmf.15 [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/932814 (https://phabricator.wikimedia.org/T340243) [02:07:23] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.15 [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/932814 (https://phabricator.wikimedia.org/T340243) (owner: 10TrainBranchBot) [02:07:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:27:49] (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.15 [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/932814 (https://phabricator.wikimedia.org/T340243) (owner: 10TrainBranchBot) [02:32:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:46:48] (03PS1) 10RLazarus: opentelemetry-collector: Vendor 0.61.0 as 0.61.0-wmf.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933210 [02:53:12] (03PS2) 10RLazarus: opentelemetry-collector: Vendor 0.61.0 as 0.61.0-wmf.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933210 (https://phabricator.wikimedia.org/T324117) [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T0300) [03:04:46] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:09:44] 10SRE, 10Wikimedia-Mailing-lists: Create wikija-g 
mailing list - https://phabricator.wikimedia.org/T340380 (10Sai10ukazuki) Confirmation that the listing has been created. [03:30:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:35:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:35:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:40:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:56:16] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:56:40] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 
208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:09:05] (03PS1) 10Marostegui: production-m1.sql.erb: Replace dbproxy1012 with dbproxy1022 [puppet] - 10https://gerrit.wikimedia.org/r/933215 (https://phabricator.wikimedia.org/T337812) [05:11:43] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/933215 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [05:11:46] (03CR) 10Marostegui: [C: 03+2] production-m1.sql.erb: Replace dbproxy1012 with dbproxy1022 [puppet] - 10https://gerrit.wikimedia.org/r/933215 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [05:13:42] (03PS1) 10Marostegui: mariadb: Productionize dbproxy1024 [puppet] - 10https://gerrit.wikimedia.org/r/933216 (https://phabricator.wikimedia.org/T337812) [05:27:17] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize dbproxy1024 [puppet] - 10https://gerrit.wikimedia.org/r/933216 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [05:29:03] (03CR) 10Hashar: [C: 03+1] "+1 Eoghan we can deploy it during our day if you want. 
It is rather low impact and Apache can be restarted at anytime (it has no effect on" [puppet] - 10https://gerrit.wikimedia.org/r/932435 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [05:58:26] (03PS1) 10Abijeet Patro: Display the language button on pages without languages [extensions/UniversalLanguageSelector] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/933154 (https://phabricator.wikimedia.org/T315036) [05:59:36] (03PS1) 10KartikMistry: Update MinT to 2023-06-27-053706-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/933221 (https://phabricator.wikimedia.org/T339896) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T0600) [06:00:04] kormat, marostegui, and Amir1: Dear deployers, time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T0600). [06:18:38] (03PS1) 10Elukey: cassandra::instance: use the instance's fqdn as TLS PKI CN [puppet] - 10https://gerrit.wikimedia.org/r/933224 (https://phabricator.wikimedia.org/T288470) [06:19:48] (03PS1) 10Marostegui: report_users.sh: Add dbproxy1024 [software] - 10https://gerrit.wikimedia.org/r/933225 (https://phabricator.wikimedia.org/T337812) [06:20:13] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42012/console" [puppet] - 10https://gerrit.wikimedia.org/r/933224 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [06:20:42] (03PS1) 10Marostegui: wmnet: Failover m1 master [dns] - 10https://gerrit.wikimedia.org/r/933226 (https://phabricator.wikimedia.org/T337812) [06:21:10] (03CR) 10Marostegui: [C: 03+2] report_users.sh: Add dbproxy1024 [software] - 10https://gerrit.wikimedia.org/r/933225 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [06:21:40] (03CR) 10Marostegui: 
[C: 03+2] wmnet: Failover m1 master [dns] - 10https://gerrit.wikimedia.org/r/933226 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [06:21:42] (03Merged) 10jenkins-bot: report_users.sh: Add dbproxy1024 [software] - 10https://gerrit.wikimedia.org/r/933225 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [06:22:07] !log Failover m1-master to dbproxy1022 T337812 [06:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:11] T337812: Productionize dbproxy10[22-27] - https://phabricator.wikimedia.org/T337812 [06:26:37] (03PS1) 10Marostegui: dbproxy1022: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/933228 (https://phabricator.wikimedia.org/T337812) [06:27:16] (03CR) 10Marostegui: [C: 03+2] dbproxy1022: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/933228 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [06:30:44] (03PS1) 10Marostegui: production-m1.sql.erb: Add dbproxy1024 IP [puppet] - 10https://gerrit.wikimedia.org/r/933235 (https://phabricator.wikimedia.org/T337812) [06:32:15] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/933235 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [06:32:51] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:33:15] (03CR) 10Marostegui: [C: 03+2] production-m1.sql.erb: Add dbproxy1024 IP [puppet] - 10https://gerrit.wikimedia.org/r/933235 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [06:38:38] (03PS1) 10Marostegui: mariadb: Productionize dbproxy1023 [puppet] - 10https://gerrit.wikimedia.org/r/933322 (https://phabricator.wikimedia.org/T337812) [06:39:39] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize dbproxy1023 
[puppet] - 10https://gerrit.wikimedia.org/r/933322 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [06:41:06] (03PS1) 10KartikMistry: Update cxserver to 2023-06-27-053435-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/933369 (https://phabricator.wikimedia.org/T339105) [06:41:29] (03PS1) 10Marostegui: report_users: Add dbproxy1023 [software] - 10https://gerrit.wikimedia.org/r/933372 (https://phabricator.wikimedia.org/T337812) [06:42:37] (03CR) 10CI reject: [V: 04-1] report_users: Add dbproxy1023 [software] - 10https://gerrit.wikimedia.org/r/933372 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [06:43:44] (03PS1) 10Marostegui: production-m2.sql.erb: Add dbproxy1023 [puppet] - 10https://gerrit.wikimedia.org/r/933373 (https://phabricator.wikimedia.org/T337812) [06:45:18] (03CR) 10Marostegui: "recheck" [software] - 10https://gerrit.wikimedia.org/r/933372 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [06:46:08] (03CR) 10CI reject: [V: 04-1] production-m2.sql.erb: Add dbproxy1023 [puppet] - 10https://gerrit.wikimedia.org/r/933373 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [06:46:13] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/933373 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [06:52:27] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/933373 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [06:52:37] (03CR) 10CI reject: [V: 04-1] production-m2.sql.erb: Add dbproxy1023 [puppet] - 10https://gerrit.wikimedia.org/r/933373 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [06:55:43] (03PS1) 10Muehlenhoff: Remove expiry data/contact and update email address [puppet] - 10https://gerrit.wikimedia.org/r/933380 [06:55:52] (03PS2) 10Muehlenhoff: Remove expiry data/contact and update email address [puppet] - 10https://gerrit.wikimedia.org/r/933380 [06:55:54] (03CR) 10CI 
reject: [V: 04-1] Remove expiry data/contact and update email address [puppet] - 10https://gerrit.wikimedia.org/r/933380 (owner: 10Muehlenhoff) [07:00:05] Amir1, Urbanecm, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T0700). [07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:02:07] (03PS1) 10Marostegui: production-m2.sql.erb: Add dbproxy1023 [puppet] - 10https://gerrit.wikimedia.org/r/933381 (https://phabricator.wikimedia.org/T337812) [07:03:09] That's me only. [07:03:23] marostegui: OK to backport, right? [07:03:27] yep [07:04:09] (03CR) 10Marostegui: [C: 03+2] production-m2.sql.erb: Add dbproxy1023 [puppet] - 10https://gerrit.wikimedia.org/r/933381 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [07:04:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933125 (https://phabricator.wikimedia.org/T338123) (owner: 10KartikMistry) [07:04:21] (03CR) 10CI reject: [V: 04-1] Enable Content and Section Translation for 4 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933125 (https://phabricator.wikimedia.org/T338123) (owner: 10KartikMistry) [07:04:26] (03CR) 10Muehlenhoff: [C: 03+2] Remove expiry data/contact and update email address [puppet] - 10https://gerrit.wikimedia.org/r/933380 (owner: 10Muehlenhoff) [07:05:35] (03CR) 10KartikMistry: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933125 (https://phabricator.wikimedia.org/T338123) (owner: 10KartikMistry) [07:05:42] (03CR) 10CI reject: [V: 04-1] Enable Content and Section Translation for 4 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933125 
(https://phabricator.wikimedia.org/T338123) (owner: 10KartikMistry) [07:06:28] What's up with CI? [07:06:55] (03PS1) 10Marostegui: wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/933382 (https://phabricator.wikimedia.org/T337812) [07:07:01] kart_: we are checking [07:07:02] (03CR) 10CI reject: [V: 04-1] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/933382 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [07:07:17] marostegui: OK. It seems to be failing for all patches. [07:09:39] Yeah, created https://phabricator.wikimedia.org/T340518 [07:15:01] !log `sudo kill `pgrep -u paramd`` on stat1005 to unblock puppet [07:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:34] (03PS2) 10Elukey: cassandra::instance::monitoring: move cql check to Prometheus for PKI [puppet] - 10https://gerrit.wikimedia.org/r/933134 (https://phabricator.wikimedia.org/T288470) [07:16:45] (03CR) 10CI reject: [V: 04-1] cassandra::instance::monitoring: move cql check to Prometheus for PKI [puppet] - 10https://gerrit.wikimedia.org/r/933134 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [07:17:01] (03PS2) 10KartikMistry: Update MinT to 2023-06-27-053706-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/933221 (https://phabricator.wikimedia.org/T339896) [07:17:08] (03CR) 10CI reject: [V: 04-1] Update MinT to 2023-06-27-053706-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/933221 (https://phabricator.wikimedia.org/T339896) (owner: 10KartikMistry) [07:19:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [07:22:31] (03CR) 10Abijeet Patro: [C: 03+1] Display the language button on pages without languages [extensions/UniversalLanguageSelector] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/933154 (https://phabricator.wikimedia.org/T315036) (owner: 
10Abijeet Patro) [07:49:27] Just saw all the pings, akosiaris sorry you had to deal with that [07:49:42] (03CR) 10Hashar: "recheck after restarting Zuul (T340518)" [puppet] - 10https://gerrit.wikimedia.org/r/933373 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [07:50:59] (03CR) 10Marostegui: "recheck" [software] - 10https://gerrit.wikimedia.org/r/933372 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [07:51:11] (03CR) 10Marostegui: [C: 03+2] production-m2.sql.erb: Add dbproxy1023 [puppet] - 10https://gerrit.wikimedia.org/r/933373 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [07:52:51] (03CR) 10KartikMistry: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933125 (https://phabricator.wikimedia.org/T338123) (owner: 10KartikMistry) [07:53:47] Not yet :/ [07:54:15] !log uploaded openjdk-8 8u372-ga-1~deb11u1 to component/jdk8 for bullseye (forward port of Java 8 for Buster) [07:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:50] hashar: Should it be good to go with deployment as CI is back to normal? [07:56:40] OK. 
I'll go ahead and try :) [07:56:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933125 (https://phabricator.wikimedia.org/T338123) (owner: 10KartikMistry) [07:57:29] (03PS3) 10Elukey: cassandra::instance::monitoring: move cql check to Prometheus for PKI [puppet] - 10https://gerrit.wikimedia.org/r/933134 (https://phabricator.wikimedia.org/T288470) [07:57:37] (03Merged) 10jenkins-bot: Enable Content and Section Translation for 4 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933125 (https://phabricator.wikimedia.org/T338123) (owner: 10KartikMistry) [07:58:10] !log kartik@deploy1002 Started scap: Backport for [[gerrit:933125|Enable Content and Section Translation for 4 Wikipedias (T338123)]] [07:58:14] T338123: Enable MinT, Content and Section Translation for a 4th group of languages previously lacking machine translation - https://phabricator.wikimedia.org/T338123 [07:58:20] (03PS4) 10Elukey: cassandra::instance::monitoring: move cql check to Prometheus for PKI [puppet] - 10https://gerrit.wikimedia.org/r/933134 (https://phabricator.wikimedia.org/T288470) [08:01:48] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, let me know when good to merge" [puppet] - 10https://gerrit.wikimedia.org/r/933199 (https://phabricator.wikimedia.org/T319460) (owner: 10Dwisehaupt) [08:01:52] (03CR) 10Filippo Giunchedi: "LGTM, let me know when good to merge" [puppet] - 10https://gerrit.wikimedia.org/r/933198 (https://phabricator.wikimedia.org/T340155) (owner: 10Dwisehaupt) [08:02:04] !log kartik@deploy1002 kartik: Backport for [[gerrit:933125|Enable Content and Section Translation for 4 Wikipedias (T338123)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [08:02:21] (03PS5) 10Elukey: cassandra::instance::monitoring: move cql check to Prometheus for PKI [puppet] - 10https://gerrit.wikimedia.org/r/933134 
(https://phabricator.wikimedia.org/T288470) [08:03:05] !log installing openjdk-8 security updates for bullseye [08:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:32] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42015/console" [puppet] - 10https://gerrit.wikimedia.org/r/933134 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [08:05:34] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/933134 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [08:08:46] (03CR) 10Elukey: [V: 03+1] cassandra::instance::monitoring: move cql check to Prometheus for PKI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933134 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [08:08:51] (03CR) 10KartikMistry: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/933221 (https://phabricator.wikimedia.org/T339896) (owner: 10KartikMistry) [08:14:28] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:933125|Enable Content and Section Translation for 4 Wikipedias (T338123)]] (duration: 16m 17s) [08:14:32] T338123: Enable MinT, Content and Section Translation for a 4th group of languages previously lacking machine translation - https://phabricator.wikimedia.org/T338123 [08:16:54] I'm also deploying cxserver and MinT in a few minutes. 
[08:23:00] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-06-27-053435-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/933369 (https://phabricator.wikimedia.org/T339105) (owner: 10KartikMistry) [08:23:19] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/933180 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [08:23:59] (03Merged) 10jenkins-bot: Update cxserver to 2023-06-27-053435-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/933369 (https://phabricator.wikimedia.org/T339105) (owner: 10KartikMistry) [08:24:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:24:49] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [08:25:07] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [08:25:47] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/933202 (https://phabricator.wikimedia.org/T340491) (owner: 10Btullis) [08:27:51] (03PS3) 10KartikMistry: Update MinT to 2023-06-27-053706-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/933221 (https://phabricator.wikimedia.org/T339896) [08:28:12] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [08:28:44] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [08:29:03] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/932441 (https://phabricator.wikimedia.org/T258686) (owner: 10Dzahn) [08:29:11] !log Failover m2-master to dbproxy1022 T337812 
[08:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:14] T337812: Productionize dbproxy10[22-27] - https://phabricator.wikimedia.org/T337812 [08:29:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:29:39] (03PS2) 10Marostegui: wmnet: Failover m2-master [dns] - 10https://gerrit.wikimedia.org/r/933382 (https://phabricator.wikimedia.org/T337812) [08:29:47] (03CR) 10Marostegui: [C: 03+2] report_users: Add dbproxy1023 [software] - 10https://gerrit.wikimedia.org/r/933372 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [08:30:15] (03CR) 10Marostegui: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/933382 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [08:30:42] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m2-master [dns] - 10https://gerrit.wikimedia.org/r/933382 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [08:30:53] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Neil P. Quinn-WMF out of all services on: 1265 hosts [08:30:57] (03Merged) 10jenkins-bot: report_users: Add dbproxy1023 [software] - 10https://gerrit.wikimedia.org/r/933372 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [08:31:40] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Neil P. Quinn-WMF out of all services on: 1265 hosts [08:32:07] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Neil P. 
Quinn-WMF out of all services on: 767 hosts [08:32:21] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [08:32:30] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Neil P. Quinn-WMF out of all services on: 767 hosts [08:32:58] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [08:33:03] (03PS1) 10Stevemunene: analytics: Exclude analytics1061_1069 from HDFS and YARN [puppet] - 10https://gerrit.wikimedia.org/r/933386 (https://phabricator.wikimedia.org/T317861) [08:33:05] (03PS1) 10Stevemunene: analytics: Remove analytics1064_1069 from hdfs net_topology [puppet] - 10https://gerrit.wikimedia.org/r/933387 (https://phabricator.wikimedia.org/T317861) [08:33:47] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Neil P. Quinn-WMF out of all services on: 19 hosts [08:33:52] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Neil P. Quinn-WMF out of all services on: 19 hosts [08:35:22] (03CR) 10Elukey: [C: 03+1] analytics: Exclude analytics1061_1069 from HDFS and YARN [puppet] - 10https://gerrit.wikimedia.org/r/933386 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [08:37:36] (03PS1) 10Muehlenhoff: Remove access for neilpquinn-wmf (old account name of nshahquinn-wmf) [puppet] - 10https://gerrit.wikimedia.org/r/933388 (https://phabricator.wikimedia.org/T337591) [08:38:32] (03CR) 10CI reject: [V: 04-1] Remove access for neilpquinn-wmf (old account name of nshahquinn-wmf) [puppet] - 10https://gerrit.wikimedia.org/r/933388 (https://phabricator.wikimedia.org/T337591) (owner: 10Muehlenhoff) [08:38:42] !log revoked puppet cert for 'varnishkafka' and cleaned up its cergen's files in puppet private - T337825 [08:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:46] T337825: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 [08:39:42] 10SRE, 10Data-Engineering, 
10Data-Platform-SRE, 10Traffic: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (10elukey) 05Open→03Resolved a:03elukey [08:40:13] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-06-27-053706-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/933221 (https://phabricator.wikimedia.org/T339896) (owner: 10KartikMistry) [08:40:36] (ProbeDown) firing: (2) Service releases1003:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:8080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:41:11] (03Merged) 10jenkins-bot: Update MinT to 2023-06-27-053706-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/933221 (https://phabricator.wikimedia.org/T339896) (owner: 10KartikMistry) [08:41:22] !log Updated cxserver to 2023-06-27-053435-production (T339105) [08:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:26] T339105: Enable MinT for languages where mobile translation is supported - https://phabricator.wikimedia.org/T339105 [08:41:43] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo [08:41:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks for this!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/933179 (https://phabricator.wikimedia.org/T340483) (owner: 10Reedy) [08:42:00] (03CR) 10Gmodena: [C: 03+2] mw-page-content-change-enrich: HA in main [deployment-charts] - 10https://gerrit.wikimedia.org/r/931307 (https://phabricator.wikimedia.org/T338233) (owner: 10Gmodena) [08:42:03] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo [08:42:21] 10SRE, 10Patch-For-Review, 10Platform Team Initiatives (PHP7 (TEC4)), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10TheDJ) [08:42:40] (03Merged) 10jenkins-bot: mw-page-content-change-enrich: HA in main [deployment-charts] - 10https://gerrit.wikimedia.org/r/931307 (https://phabricator.wikimedia.org/T338233) (owner: 10Gmodena) [08:42:57] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [08:44:35] (03PS2) 10Muehlenhoff: Remove access for neilpquinn-wmf (old account name of nshahquinn-wmf) [puppet] - 10https://gerrit.wikimedia.org/r/933388 (https://phabricator.wikimedia.org/T337591) [08:45:08] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [08:47:04] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [08:47:36] (03CR) 10Filippo Giunchedi: [C: 03+1] cassandra::instance::monitoring: move cql check to Prometheus for PKI [puppet] - 10https://gerrit.wikimedia.org/r/933134 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [08:48:57] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for neilpquinn-wmf (old account name of nshahquinn-wmf) [puppet] - 10https://gerrit.wikimedia.org/r/933388 (https://phabricator.wikimedia.org/T337591) (owner: 10Muehlenhoff) [08:51:22] (03CR) 10Elukey: [V: 03+1 C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/933134 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [08:51:31] claime: Did we change people.w.o IP again? Codfw deployment having issue downloading models with MinT like it had earlier: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/931086/ [08:52:07] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [08:53:23] !log akosiaris@deploy1002 Synchronized wmf-config/CommonSettings.php: (no justification provided) (duration: 07m 21s) [08:53:35] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:55:44] ah not sure what's wrong. Helm says it is deployed. [08:56:01] (03CR) 10Jbond: ferm: Allow passing sets to an srange or drange (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:58:00] 10SRE, 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, 10User-aborrero: codfw1dev: OpenStack services can only sort of talk to memacached on cloudcontrols - https://phabricator.wikimedia.org/T340488 (10aborrero) likely the problem is that those addresses need to be the `.private.codfw.wikimedia.cloud` o... 
[08:58:00] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes200[0-9].codfw.wmnet [08:58:12] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes100[0-9].eqiad.wmnet [08:58:27] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: service=thumbor,name=kubernetes100[0-9].eqiad.wmnet [08:58:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Transfer Neil Shah-Quinn's production access to new developer account - https://phabricator.wikimedia.org/T337591 (10MoritzMuehlenhoff) 05Open→03Resolved I have removed SSH access for your neilpquinn-wmf account and removed it from the "wmf" LDAP group. T... [08:58:35] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:58:36] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: service=thumbor,name=kubernetes200[0-9].codfw.wmnet [08:58:52] (03PS1) 10Marostegui: dbproxy1023: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/933389 (https://phabricator.wikimedia.org/T337812) [08:59:10] (03CR) 10Muehlenhoff: ferm: Allow passing sets to an srange or drange (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:59:27] (03CR) 10Marostegui: [C: 03+2] dbproxy1023: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/933389 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [09:00:08] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo [09:00:34] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo [09:01:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good 
(one typo inline)" [software/bitu] - 10https://gerrit.wikimedia.org/r/932831 (owner: 10Slyngshede) [09:02:09] I'll go ahead with eqiad and see. [09:02:12] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [09:03:43] (03Abandoned) 10Stevemunene: analytics: Remove analytics106[4-6] from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/930583 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [09:03:54] (03Abandoned) 10Stevemunene: analytics: Decommission analytics106[7-8] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/930584 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [09:04:08] (03Abandoned) 10Stevemunene: analytics: Remove analytics106[7-8] from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/930585 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [09:04:17] (03Abandoned) 10Stevemunene: analytics: Decommission analytics1069 from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/930606 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [09:04:28] (03Abandoned) 10Stevemunene: analytics: Remove analytics1069 from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/930607 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [09:07:20] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [09:07:59] (03PS1) 10JMeybohm: mesh.configuration: Do not require charts to define .Values.mesh.admin [deployment-charts] - 10https://gerrit.wikimedia.org/r/933391 (https://phabricator.wikimedia.org/T337405) [09:08:01] (03PS1) 10JMeybohm: Update all charts to mesh.configuration 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933392 (https://phabricator.wikimedia.org/T337405) [09:09:18] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [09:09:41] !log hnowlan@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/rest-gateway: apply [09:09:49] !log repool cp1082 [09:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:05] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [09:10:09] (03CR) 10Jaime Nuche: [C: 03+1] releases-jenkins: replace Apache 2.2 with 2.4 syntax for access control (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [09:10:23] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [09:11:34] !log Updated MinT to 2023-06-27-053706-production (T339896, T340236) [09:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:39] T339896: Enable MinT for all languages supported by IndicTrans2 - https://phabricator.wikimedia.org/T339896 [09:11:39] T340236: MinT translates to English when Hindi-Santali or any other language-Santali is selected - https://phabricator.wikimedia.org/T340236 [09:11:40] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[09:11:56] jouncebot: nowandnext [09:11:57] No deployments scheduled for the next 0 hour(s) and 48 minute(s) [09:11:57] In 0 hour(s) and 48 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T1000) [09:12:05] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin and not P{cp5032*} and A:cp [09:12:18] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin [09:13:54] (03PS1) 10Marostegui: mariadb: Productionize dbproxy1025 [puppet] - 10https://gerrit.wikimedia.org/r/933393 (https://phabricator.wikimedia.org/T337812) [09:13:56] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: refresh cloudcontrol FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/933394 (https://phabricator.wikimedia.org/T340488) [09:14:24] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize dbproxy1025 [puppet] - 10https://gerrit.wikimedia.org/r/933393 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [09:15:42] (03PS1) 10Marostegui: report_users.sh: Add dbproxy1025 [software] - 10https://gerrit.wikimedia.org/r/933395 (https://phabricator.wikimedia.org/T337812) [09:16:58] (03CR) 10Marostegui: [C: 03+2] report_users.sh: Add dbproxy1025 [software] - 10https://gerrit.wikimedia.org/r/933395 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [09:17:32] (03Merged) 10jenkins-bot: report_users.sh: Add dbproxy1025 [software] - 10https://gerrit.wikimedia.org/r/933395 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [09:17:42] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10elukey) @MatthewVernon @Eevans do we have a timeline for MOSS? 
:) [09:18:08] (03PS1) 10Marostegui: production-m2.sql.erb: Add dbproxy1025 [puppet] - 10https://gerrit.wikimedia.org/r/933396 (https://phabricator.wikimedia.org/T337812) [09:19:18] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [09:20:32] !log installing libvirt bugfix updates from Bullseye point release [09:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:51] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-worker1002.eqiad.wmnet with OS bullseye [09:22:20] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC shows a huge diff in cloudcontrol and cloudnet servers, which is expected https://puppet-compiler.wmflabs.org/output/933394/42018/" [puppet] - 10https://gerrit.wikimedia.org/r/933394 (https://phabricator.wikimedia.org/T340488) (owner: 10Arturo Borrero Gonzalez) [09:23:32] (03CR) 10Marostegui: [C: 03+2] production-m2.sql.erb: Add dbproxy1025 [puppet] - 10https://gerrit.wikimedia.org/r/933396 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [09:23:55] (03CR) 10Jbond: "-1: from pcc" [puppet] - 10https://gerrit.wikimedia.org/r/932459 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [09:25:15] (03PS1) 10Alexandros Kosiaris: CommonsSettings.php: Use $wgCopyUploadProxy, not $wgLocalHTTPProxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933397 [09:26:31] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@1ddd94b] (releasing): (no justification provided) [09:27:20] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10MatthewVernon) The extra hardware needed is due to arrive in Q1; so I expect getting MOSS going will be a KR for Q2 [obviously I can't promise that at this point!] 
[09:27:22] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@1ddd94b] (releasing): (no justification provided) (duration: 00m 51s) [09:27:24] (03CR) 10Alexandros Kosiaris: [C: 03+2] CommonsSettings.php: Use $wgCopyUploadProxy, not $wgLocalHTTPProxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933397 (owner: 10Alexandros Kosiaris) [09:27:57] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/932183 (owner: 10Slyngshede) [09:28:49] (03Merged) 10jenkins-bot: CommonsSettings.php: Use $wgCopyUploadProxy, not $wgLocalHTTPProxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933397 (owner: 10Alexandros Kosiaris) [09:29:35] (03PS2) 10Slyngshede: SUL Account: Allow users to dismiss account linking. [software/bitu] - 10https://gerrit.wikimedia.org/r/932831 [09:29:47] (03CR) 10Slyngshede: SUL Account: Allow users to dismiss account linking. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/932831 (owner: 10Slyngshede) [09:30:22] (03CR) 10Slyngshede: [C: 03+2] LDAP Attributes: Move actions and tooltip to templatetag [software/bitu] - 10https://gerrit.wikimedia.org/r/932183 (owner: 10Slyngshede) [09:30:26] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] LDAP Attributes: Move actions and tooltip to templatetag [software/bitu] - 10https://gerrit.wikimedia.org/r/932183 (owner: 10Slyngshede) [09:30:34] (03CR) 10Muehlenhoff: [C: 03+1] SUL Account: Allow users to dismiss account linking. [software/bitu] - 10https://gerrit.wikimedia.org/r/932831 (owner: 10Slyngshede) [09:30:36] (ProbeDown) resolved: (2) Service releases1003:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:8080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:30:41] (03CR) 10Slyngshede: [C: 03+2] SUL Account: Allow users to dismiss account linking. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/932831 (owner: 10Slyngshede) [09:30:43] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] SUL Account: Allow users to dismiss account linking. [software/bitu] - 10https://gerrit.wikimedia.org/r/932831 (owner: 10Slyngshede) [09:34:07] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1002.eqiad.wmnet with reason: host reimage [09:35:35] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin and not P{cp5032*} and A:cp [09:36:32] !log akosiaris@deploy1002 Synchronized wmf-config/CommonSettings.php: (no justification provided) (duration: 07m 16s) [09:37:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1002.eqiad.wmnet with reason: host reimage [09:38:00] (03PS1) 10Jbond: pybal: update check to conform to the nagios plugin api [puppet] - 10https://gerrit.wikimedia.org/r/933398 (https://phabricator.wikimedia.org/T322377) [09:38:10] (03CR) 10Stevemunene: [C: 03+2] analytics: Exclude analytics1061_1069 from HDFS and YARN [puppet] - 10https://gerrit.wikimedia.org/r/933386 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [09:38:24] (03CR) 10CI reject: [V: 04-1] pybal: update check to conform to the nagios plugin api [puppet] - 10https://gerrit.wikimedia.org/r/933398 (https://phabricator.wikimedia.org/T322377) (owner: 10Jbond) [09:41:39] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin [09:43:33] (03PS2) 10Jbond: pybal: update check to conform to the nagios plugin api [puppet] - 10https://gerrit.wikimedia.org/r/933398 (https://phabricator.wikimedia.org/T322377) [09:48:00] (03CR) 10Jbond: [C: 03+1] pybal: Fix hostnames not being sent on alert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) (owner: 
10BCornwall) [09:48:27] (03PS1) 10Jgiannelos: mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/933399 [09:49:27] (03PS3) 10Jbond: promethus: switch to using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) [09:50:33] (03CR) 10Klausman: [C: 03+1] cassandra::instance: use the instance's fqdn as TLS PKI CN [puppet] - 10https://gerrit.wikimedia.org/r/933224 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [09:51:02] (03CR) 10Jbond: promethus: switch to using cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) (owner: 10Jbond) [09:52:35] (03CR) 10Jbond: [C: 03+2] tlsproxy::envoy: update support for profile::tlsproxy::envoy::services [puppet] - 10https://gerrit.wikimedia.org/r/930184 (https://phabricator.wikimedia.org/T326657) (owner: 10Jbond) [09:54:21] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:54:34] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:55:56] (03PS1) 10Jbond: cfssl: add default to wmcs as well [puppet] - 10https://gerrit.wikimedia.org/r/933400 [09:56:13] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:56:17] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:56:42] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:56:45] (03CR) 10Jbond: [C: 03+2] cfssl: add default to wmcs as well [puppet] - 10https://gerrit.wikimedia.org/r/933400 (owner: 10Jbond) [09:58:36] (03PS1) 10Alexandros Kosiaris: url_downloader: Remove the esams entries marked TODO [puppet] - 10https://gerrit.wikimedia.org/r/933401 [09:58:38] (03PS1) 10Alexandros Kosiaris: urldownloader: Simplify wikimedia variable [puppet] - 10https://gerrit.wikimedia.org/r/933402 [10:00:04] Deploy window MediaWiki infrastucture 
(UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T1000) [10:01:24] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:01:40] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:01:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-worker1002.eqiad.wmnet with OS bullseye [10:03:35] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:03:36] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:04:27] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:05:15] (03CR) 10Elukey: [V: 03+1 C: 03+2] cassandra::instance: use the instance's fqdn as TLS PKI CN [puppet] - 10https://gerrit.wikimedia.org/r/933224 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [10:05:41] PROBLEM - Host ripe-atlas-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:05:41] PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [10:05:54] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:06:10] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:06:21] (03CR) 10Vgutierrez: [C: 03+1] "LGTM, thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [10:06:42] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:06:58] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:07:07] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:07:30] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:10:09] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:10:57] (03PS2) 10JMeybohm: Update all charts to mesh.configuration 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933392 (https://phabricator.wikimedia.org/T337405) [10:10:59] (03PS1) 10JMeybohm: Update mathoid to mesh.configuration 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933403 (https://phabricator.wikimedia.org/T337405) [10:11:53] (03CR) 10JMeybohm: [C: 03+2] mesh.configuration: Do not require charts to define .Values.mesh.admin [deployment-charts] - 10https://gerrit.wikimedia.org/r/933391 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm) [10:12:26] (03Merged) 10jenkins-bot: mesh.configuration: Do not require charts to define .Values.mesh.admin [deployment-charts] - 10https://gerrit.wikimedia.org/r/933391 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm) [10:12:34] (03CR) 10JMeybohm: [C: 03+2] Update mathoid to mesh.configuration 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933403 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm) [10:13:25] (03Merged) 10jenkins-bot: Update mathoid to mesh.configuration 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933403 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm) [10:20:07] (03CR) 10Jbond: [C: 04-1] "-1: for the name collision otherwise lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/933192 
(https://phabricator.wikimedia.org/T218900) (owner: 10Ahmon Dancy) [10:21:10] jouncebot: nowandnext [10:21:10] For the next 0 hour(s) and 38 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T1000) [10:21:10] In 2 hour(s) and 38 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T1300) [10:21:10] In 2 hour(s) and 38 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T1300) [10:21:27] (03PS2) 10Alexandros Kosiaris: url_downloader: Remove the esams entries marked TODO [puppet] - 10https://gerrit.wikimedia.org/r/933401 [10:21:29] (03PS2) 10Alexandros Kosiaris: urldownloader: Simplify wikimedia variable [puppet] - 10https://gerrit.wikimedia.org/r/933402 [10:21:31] (03PS1) 10Alexandros Kosiaris: urldownloader: Remove FTP from safe ports list [puppet] - 10https://gerrit.wikimedia.org/r/933404 [10:21:33] (03PS1) 10Alexandros Kosiaris: network: Split $mwappserver_networks to public/private [puppet] - 10https://gerrit.wikimedia.org/r/933405 [10:21:35] (03PS1) 10Alexandros Kosiaris: urldownloader: Switch $towikimedia to $mw_appserver_networks_private [puppet] - 10https://gerrit.wikimedia.org/r/933426 [10:22:35] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:22:58] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42020/console" [puppet] - 10https://gerrit.wikimedia.org/r/933401 (owner: 10Alexandros Kosiaris) [10:23:02] !log btullis@deploy1002 helmfile [staging] DONE 
helmfile.d/services/datahub: sync on main [10:25:15] (03CR) 10Jbond: [C: 03+1] "ahh that make senses" [puppet] - 10https://gerrit.wikimedia.org/r/933180 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [10:25:38] (03CR) 10Jbond: [C: 03+1] Record the fact that cjming is now kerberos enabled [puppet] - 10https://gerrit.wikimedia.org/r/933202 (https://phabricator.wikimedia.org/T340491) (owner: 10Btullis) [10:25:41] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply [10:26:07] (03PS1) 10Hnowlan: rest-gateway: add domain list for restbase parity [deployment-charts] - 10https://gerrit.wikimedia.org/r/933427 (https://phabricator.wikimedia.org/T324678) [10:26:17] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [10:27:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:27:35] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:27:49] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] url_downloader: Remove the esams entries marked TODO [puppet] - 10https://gerrit.wikimedia.org/r/933401 (owner: 10Alexandros Kosiaris) [10:28:11] (03CR) 10Jbond: [C: 03+1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [10:29:24] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply [10:29:41] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42021/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/933402 (owner: 10Alexandros Kosiaris) [10:30:05] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [10:30:54] !log elukey@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Roll restart to pick up new certs and openjdk version - elukey@cumin1001 [10:32:00] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [10:33:42] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [10:37:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:41:38] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=ats-be,name=cp2037.codfw.wmnet [10:43:41] !log disabling puppet on A:cp-text to test rollout of r/929674 [10:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:33] PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:12] kart_: Is it ok for MinT ? [10:48:09] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (10WDoranWMF) Not sure if it is still needed but as Clare's manager I approve this access. 
[10:48:33] !log elukey@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Roll restart to pick up new certs and openjdk version - elukey@cumin1001 [10:48:45] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 132 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:48:47] (03CR) 10Hnowlan: [C: 03+2] trafficserver: route proton requests via the API gateway [puppet] - 10https://gerrit.wikimedia.org/r/929674 (https://phabricator.wikimedia.org/T324678) (owner: 10Hnowlan) [10:49:31] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/933402 (owner: 10Alexandros Kosiaris) [10:51:32] (03PS18) 10Muehlenhoff: ferm: Allow passing sets to an srange or drange [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [10:51:39] (03CR) 10Muehlenhoff: ferm: Allow passing sets to an srange or drange (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:53:55] (03PS1) 10Muehlenhoff: Extend access [puppet] - 10https://gerrit.wikimedia.org/r/933429 [10:55:41] !log joal@deploy1002 Started deploy [analytics/refinery@259c5e2]: Regular analytics weekly train [analytics/refinery@259c5e2] [10:57:00] (03CR) 10Muehlenhoff: [C: 03+2] Extend access [puppet] - 10https://gerrit.wikimedia.org/r/933429 (owner: 10Muehlenhoff) [10:58:34] (03CR) 10Btullis: [C: 03+2] Record the fact that cjming is now kerberos enabled [puppet] - 10https://gerrit.wikimedia.org/r/933202 (https://phabricator.wikimedia.org/T340491) (owner: 10Btullis) [11:01:01] PROBLEM - Hadoop NodeManager on analytics1067 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:01:07] PROBLEM - Check systemd state on analytics1067 is CRITICAL: 
CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:29] (03PS1) 10Alexandros Kosiaris: network: Narrow down what mw_appserver_networks means [puppet] - 10https://gerrit.wikimedia.org/r/933430 [11:02:34] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-worker1003.eqiad.wmnet with OS bullseye [11:02:42] (03PS2) 10Alexandros Kosiaris: network: Narrow down what mw_appserver_networks means [puppet] - 10https://gerrit.wikimedia.org/r/933430 [11:04:05] !log joal@deploy1002 Finished deploy [analytics/refinery@259c5e2]: Regular analytics weekly train [analytics/refinery@259c5e2] (duration: 08m 23s) [11:06:03] !log joal@deploy1002 Started deploy [analytics/refinery@259c5e2] (thin): Regular analytics weekly train THIN [analytics/refinery@259c5e2] [11:06:08] !log joal@deploy1002 Finished deploy [analytics/refinery@259c5e2] (thin): Regular analytics weekly train THIN [analytics/refinery@259c5e2] (duration: 00m 04s) [11:06:12] (03PS1) 10Volans: constants: add knams as supported PoP datacenter [software/pywmflib] - 10https://gerrit.wikimedia.org/r/933431 (https://phabricator.wikimedia.org/T340465) [11:06:20] !log joal@deploy1002 Started deploy [analytics/refinery@259c5e2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@259c5e2] [11:06:37] (03PS1) 10Btullis: Update the hadoop-worker-canary cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/933432 (https://phabricator.wikimedia.org/T338227) [11:08:03] !log joal@deploy1002 Finished deploy [analytics/refinery@259c5e2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@259c5e2] (duration: 01m 43s) [11:08:27] (03CR) 10Clément Goubert: [C: 03+1] Parsoid: Disable PC writes on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933184 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [11:08:34] I'm planning to deploy a config change 
in a few minutes. Any objections? See --^ [11:09:21] (03PS1) 10Hnowlan: Revert "trafficserver: route proton requests via the API gateway" [puppet] - 10https://gerrit.wikimedia.org/r/933408 [11:10:04] (03CR) 10AOkoth: [C: 03+1] vrts::web: replace OTRS with VRTS in comments [puppet] - 10https://gerrit.wikimedia.org/r/932319 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [11:10:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933184 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [11:10:46] (03CR) 10Volans: [C: 04-2] "To be merged and deployed only close to the knams migration" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/933431 (https://phabricator.wikimedia.org/T340465) (owner: 10Volans) [11:10:50] (03PS3) 10Daniel Kinzler: Parsoid: Disable PC writes on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933184 (https://phabricator.wikimedia.org/T339867) [11:11:02] (03CR) 10TrainBranchBot: "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933184 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [11:11:52] (03PS19) 10Jbond: ferm: Allow passing sets to an srange or drange [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:12:24] (03Merged) 10jenkins-bot: Parsoid: Disable PC writes on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933184 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [11:12:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:12:41] !log daniel@deploy1002 Started scap: Backport for [[gerrit:933184|Parsoid: Disable PC writes on dewiki (T339867)]] [11:12:45] T339867: RESTbase: Turn off pre-generation and caching for 
parsoid endpoints - https://phabricator.wikimedia.org/T339867 [11:13:00] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/933432 (https://phabricator.wikimedia.org/T338227) (owner: 10Btullis) [11:14:09] !log daniel@deploy1002 daniel: Backport for [[gerrit:933184|Parsoid: Disable PC writes on dewiki (T339867)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [11:14:30] (03PS1) 10Volans: sre.discovery: add support for knams as PoP DC [cookbooks] - 10https://gerrit.wikimedia.org/r/933433 (https://phabricator.wikimedia.org/T340465) [11:14:32] (03PS1) 10Volans: Install hosts: fallback to esams [cookbooks] - 10https://gerrit.wikimedia.org/r/933434 (https://phabricator.wikimedia.org/T340465) [11:16:07] (03CR) 10Vgutierrez: [C: 03+1] "regex_map only matches regex on the Host part of the URL, we need to explore other venues to solve this" [puppet] - 10https://gerrit.wikimedia.org/r/933408 (owner: 10Hnowlan) [11:16:10] (03CR) 10Volans: [C: 04-2] "To be merged only close to the esams->knams migration" [cookbooks] - 10https://gerrit.wikimedia.org/r/933433 (https://phabricator.wikimedia.org/T340465) (owner: 10Volans) [11:16:38] (03CR) 10Hnowlan: [C: 03+2] Revert "trafficserver: route proton requests via the API gateway" [puppet] - 10https://gerrit.wikimedia.org/r/933408 (owner: 10Hnowlan) [11:16:43] (03CR) 10Volans: [C: 04-2] "To be merged only close to the esams->knams migration" [cookbooks] - 10https://gerrit.wikimedia.org/r/933434 (https://phabricator.wikimedia.org/T340465) (owner: 10Volans) [11:19:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42023/console" [puppet] - 10https://gerrit.wikimedia.org/r/931275 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:19:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932286 
(https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:20:02] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=ats-be,name=cp2037.codfw.wmnet [11:21:16] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:933184|Parsoid: Disable PC writes on dewiki (T339867)]] (duration: 08m 34s) [11:21:20] T339867: RESTbase: Turn off pre-generation and caching for parsoid endpoints - https://phabricator.wikimedia.org/T339867 [11:22:32] (03CR) 10Clément Goubert: [C: 03+1] opentelemetry-collector: Vendor 0.61.0 as 0.61.0-wmf.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933210 (https://phabricator.wikimedia.org/T324117) (owner: 10RLazarus) [11:26:00] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/933404 (owner: 10Alexandros Kosiaris) [11:26:21] (03PS2) 10Jbond: puppetserver: Add new puppet server to block [puppet] - 10https://gerrit.wikimedia.org/r/931275 (https://phabricator.wikimedia.org/T330490) [11:26:23] (03PS1) 10Jbond: private-repo: update repo hook to deal with different directories [puppet] - 10https://gerrit.wikimedia.org/r/933435 [11:27:32] (03CR) 10Btullis: [C: 03+2] Add Airflow configuration to connect to DataHub [puppet] - 10https://gerrit.wikimedia.org/r/919019 (https://phabricator.wikimedia.org/T333004) (owner: 10Aqu) [11:27:59] (03CR) 10Volans: [C: 03+1] cumin: Properly set connect_timeout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [11:29:35] (03PS2) 10Jbond: private-repo: update repo hook to deal with different directories [puppet] - 10https://gerrit.wikimedia.org/r/933435 [11:29:37] (03PS3) 10Jbond: puppetserver: Add new puppet server to block [puppet] - 10https://gerrit.wikimedia.org/r/931275 (https://phabricator.wikimedia.org/T330490) [11:31:25] (03PS1) 10Fabfur: hiera: Removed unused cache instances from deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/933436 
(https://phabricator.wikimedia.org/T327742) [11:31:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42025/console" [puppet] - 10https://gerrit.wikimedia.org/r/931275 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:35:46] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [11:36:44] (03CR) 10Alexandros Kosiaris: "I 've looked into https://puppet-compiler.wmflabs.org/output/933430/42022/mw2267.codfw.wmnet/index.html diff and what is indeed getting ex" [puppet] - 10https://gerrit.wikimedia.org/r/933430 (owner: 10Alexandros Kosiaris) [11:36:49] (03CR) 10Alexandros Kosiaris: [C: 03+1] network: Narrow down what mw_appserver_networks means [puppet] - 10https://gerrit.wikimedia.org/r/933430 (owner: 10Alexandros Kosiaris) [11:37:25] (03PS1) 10Daniel Kinzler: Parsoid: Disable PC writes on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933437 (https://phabricator.wikimedia.org/T339867) [11:37:34] (03CR) 10CI reject: [V: 04-1] Parsoid: Disable PC writes on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933437 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [11:40:31] (03PS2) 10Daniel Kinzler: Parsoid: Disable PC writes on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933437 (https://phabricator.wikimedia.org/T339867) [11:41:25] (03CR) 10Clément Goubert: [C: 03+1] Parsoid: Disable PC writes on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933437 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [11:42:05] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/932816 [11:42:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933437 
(https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [11:42:58] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/932816 (owner: 10PipelineBot) [11:43:02] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Create dynamic CRL - https://phabricator.wikimedia.org/T340543 (10jbond) [11:43:10] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Create dynamic CRL - https://phabricator.wikimedia.org/T340543 (10jbond) p:05Triage→03Medium [11:43:17] (03Merged) 10jenkins-bot: Parsoid: Disable PC writes on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933437 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [11:43:29] !log daniel@deploy1002 Started scap: Backport for [[gerrit:933437|Parsoid: Disable PC writes on enwiki (T339867)]] [11:43:32] RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:33] T339867: RESTbase: Turn off pre-generation and caching for parsoid endpoints - https://phabricator.wikimedia.org/T339867 [11:43:53] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/932816 (owner: 10PipelineBot) [11:44:55] !log daniel@deploy1002 daniel: Backport for [[gerrit:933437|Parsoid: Disable PC writes on enwiki (T339867)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [11:48:24] (03PS1) 10Jbond: puppet:agent: force certificate_revocation leaf on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/933438 (https://phabricator.wikimedia.org/T340543) [11:48:48] (03CR) 10CI reject: [V: 04-1] puppet:agent: force certificate_revocation leaf on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/933438 (https://phabricator.wikimedia.org/T340543) (owner: 10Jbond) 
[11:48:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42026/console" [puppet] - 10https://gerrit.wikimedia.org/r/933438 (https://phabricator.wikimedia.org/T340543) (owner: 10Jbond) [11:49:50] (03PS5) 10Muehlenhoff: Fix migration when "plain" instances are involved [cookbooks] - 10https://gerrit.wikimedia.org/r/932237 (https://phabricator.wikimedia.org/T203964) [11:50:51] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [11:51:01] (03PS2) 10Jbond: puppet:agent: force certificate_revocation leaf on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/933438 (https://phabricator.wikimedia.org/T340543) [11:51:07] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [11:51:25] (03CR) 10CI reject: [V: 04-1] puppet:agent: force certificate_revocation leaf on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/933438 (https://phabricator.wikimedia.org/T340543) (owner: 10Jbond) [11:52:18] RECOVERY - Hadoop NodeManager on analytics1067 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:53:04] (03PS3) 10Jbond: puppet:agent: force certificate_revocation leaf on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/933438 (https://phabricator.wikimedia.org/T340543) [11:54:34] (03CR) 10CI reject: [V: 04-1] puppet:agent: force certificate_revocation leaf on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/933438 (https://phabricator.wikimedia.org/T340543) (owner: 10Jbond) [11:55:35] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:933437|Parsoid: Disable PC writes on enwiki (T339867)]] (duration: 12m 06s) [11:55:39] T339867: RESTbase: Turn off pre-generation and caching for parsoid endpoints - https://phabricator.wikimedia.org/T339867 [11:56:44] (03PS4) 
10Jbond: puppet:agent: force certificate_revocation leaf on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/933438 (https://phabricator.wikimedia.org/T340543) [11:57:38] (03CR) 10FNegri: cumin: Properly set connect_timeout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [11:57:48] (03CR) 10FNegri: [C: 03+2] cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [11:58:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42029/console" [puppet] - 10https://gerrit.wikimedia.org/r/933438 (https://phabricator.wikimedia.org/T340543) (owner: 10Jbond) [12:00:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC moved on to compiling stuff for cloud, a had a look at various parts of PCC output for production, it looked pretty identical and sane" [puppet] - 10https://gerrit.wikimedia.org/r/933430 (owner: 10Alexandros Kosiaris) [12:03:03] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppet:agent: force certificate_revocation leaf on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/933438 (https://phabricator.wikimedia.org/T340543) (owner: 10Jbond) [12:03:21] 10SRE-tools, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2022/2023-Q4): WMCS Cookbook Automation FY2022-23 Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) [12:09:46] (03PS20) 10Muehlenhoff: ferm: Allow passing sets to an srange or drange [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [12:12:15] claime: MinT seems OK so far. Need to check errors. 
[12:17:12] (03PS1) 10Jaime Nuche: doc: grant access to Jenkins releases instances [puppet] - 10https://gerrit.wikimedia.org/r/933439 (https://phabricator.wikimedia.org/T336168) [12:18:03] (03PS2) 10Alexandros Kosiaris: urldownloader: Remove FTP from safe ports list [puppet] - 10https://gerrit.wikimedia.org/r/933404 [12:18:05] (03PS3) 10Alexandros Kosiaris: urldownloader: Simplify wikimedia variable [puppet] - 10https://gerrit.wikimedia.org/r/933402 [12:18:07] (03PS2) 10Alexandros Kosiaris: network: Split $mwappserver_networks to public/private [puppet] - 10https://gerrit.wikimedia.org/r/933405 [12:18:09] (03PS2) 10Alexandros Kosiaris: urldownloader: Switch $towikimedia to $mw_appserver_networks_private [puppet] - 10https://gerrit.wikimedia.org/r/933426 [12:20:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:20:31] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2022/2023-Q4): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) p:05Triage→03High a:03fnegri [12:20:39] (03CR) 10Muehlenhoff: "Tested with test-cookbooks, ready for review." [cookbooks] - 10https://gerrit.wikimedia.org/r/932237 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [12:20:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/933404 (owner: 10Alexandros Kosiaris) [12:20:53] 10SRE-tools, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2022/2023-Q4): WMCS Cookbook Automation FY2022-23 Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) [12:21:08] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2022/2023-Q4): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) 05Open→03In progress [12:22:09] (03PS1) 10Jbond: puppetserver: open up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/933440 (https://phabricator.wikimedia.org/T330490) [12:25:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:25:41] (03CR) 10Jgreen: [C: 03+2] Add frmon1002 to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/933199 (https://phabricator.wikimedia.org/T319460) (owner: 10Dwisehaupt) [12:25:57] (03CR) 10Jgreen: [C: 03+2] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/933199 (https://phabricator.wikimedia.org/T319460) (owner: 10Dwisehaupt) [12:26:05] (03PS1) 10Muehlenhoff: Point eqiad URL downloaders to bullseye host [dns] - 10https://gerrit.wikimedia.org/r/933441 (https://phabricator.wikimedia.org/T329945) [12:26:22] (03CR) 10Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/933439/42030/" [puppet] - 10https://gerrit.wikimedia.org/r/933439 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [12:26:31] (03CR) 10Jgreen: [C: 03+2] Remove hosts to be decommissioned. 
[puppet] - 10https://gerrit.wikimedia.org/r/933198 (https://phabricator.wikimedia.org/T340155) (owner: 10Dwisehaupt) [12:26:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:26:39] (03CR) 10Jgreen: [C: 03+2] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/933198 (https://phabricator.wikimedia.org/T340155) (owner: 10Dwisehaupt) [12:27:51] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:29:39] (03PS1) 10Marostegui: mariadb: Productionize dbproxy1026 [puppet] - 10https://gerrit.wikimedia.org/r/933443 (https://phabricator.wikimedia.org/T337812) [12:30:41] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize dbproxy1026 [puppet] - 10https://gerrit.wikimedia.org/r/933443 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [12:33:34] (03PS2) 10Majavah: Remove -pwb images in favour of upcoming buildservice images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/916795 (https://phabricator.wikimedia.org/T249787) [12:33:51] (03CR) 10Majavah: [C: 03+2] Remove -pwb images in favour of upcoming buildservice images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/916795 (https://phabricator.wikimedia.org/T249787) (owner: 10Majavah) [12:34:25] (03Merged) 10jenkins-bot: Remove -pwb images in favour of upcoming buildservice images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/916795 (https://phabricator.wikimedia.org/T249787) (owner: 10Majavah) [12:34:44] (03PS1) 10Marostegui: production-m3.sql: Replace dbproxy1016 with dbproxy1026 [puppet] - 10https://gerrit.wikimedia.org/r/933445 
(https://phabricator.wikimedia.org/T337812) [12:35:21] (03PS1) 10Marostegui: report_users.sh: Add dbproxy1026 [software] - 10https://gerrit.wikimedia.org/r/933446 (https://phabricator.wikimedia.org/T337812) [12:35:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10Jclark-ctr) [12:36:19] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Replace RAID controller battery in an-worker1092 - https://phabricator.wikimedia.org/T340204 (10Jclark-ctr) 05Open→03Resolved @BTullis Replaced Battery host is booting now. Thanks for your assistance [12:36:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:36:38] (03CR) 10Marostegui: [C: 03+2] report_users.sh: Add dbproxy1026 [software] - 10https://gerrit.wikimedia.org/r/933446 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [12:36:49] (03CR) 10Marostegui: [C: 03+2] production-m3.sql: Replace dbproxy1016 with dbproxy1026 [puppet] - 10https://gerrit.wikimedia.org/r/933445 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [12:37:11] (03Merged) 10jenkins-bot: report_users.sh: Add dbproxy1026 [software] - 10https://gerrit.wikimedia.org/r/933446 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [12:37:50] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T340400 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Replaced cable link returned to idrac [12:41:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:42:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - 
https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:45:06] (03PS1) 10Marostegui: report_users.sh: Add dbproxy1027 [software] - 10https://gerrit.wikimedia.org/r/933448 (https://phabricator.wikimedia.org/T337812) [12:46:02] (03CR) 10Marostegui: [C: 03+2] report_users.sh: Add dbproxy1027 [software] - 10https://gerrit.wikimedia.org/r/933448 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [12:46:17] (03PS1) 10Majavah: Add bookworm-sssd, python311-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/933449 (https://phabricator.wikimedia.org/T335507) [12:46:18] (03Merged) 10jenkins-bot: report_users.sh: Add dbproxy1027 [software] - 10https://gerrit.wikimedia.org/r/933448 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [12:46:52] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:47:20] (03PS1) 10Marostegui: mariadb: Productionize dbproxy1027 [puppet] - 10https://gerrit.wikimedia.org/r/933450 (https://phabricator.wikimedia.org/T337812) [12:48:14] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize dbproxy1027 [puppet] - 10https://gerrit.wikimedia.org/r/933450 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [12:48:35] (03PS1) 10Arturo Borrero Gonzalez: wmcs: prometheus: increase scrape frequency for openstack APIs [puppet] - 10https://gerrit.wikimedia.org/r/933451 (https://phabricator.wikimedia.org/T335943) [12:52:49] (03PS1) 10Daniel Kinzler: Disable PC writes for parsoid endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933453 (https://phabricator.wikimedia.org/T339867) [12:53:03] (03CR) 10EoghanGaffney: [C: 
03+1] doc: grant access to Jenkins releases instances [puppet] - 10https://gerrit.wikimedia.org/r/933439 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [12:53:32] (03CR) 10EoghanGaffney: [C: 03+2] doc: grant access to Jenkins releases instances [puppet] - 10https://gerrit.wikimedia.org/r/933439 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [12:55:42] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops (or wmcs-roots) for TheresNoTime - https://phabricator.wikimedia.org/T337829 (10TheresNoTime) (Is this waiting on anything from my end?) [12:56:13] (03CR) 10David Caro: [C: 03+1] wmcs: prometheus: increase scrape frequency for openstack APIs [puppet] - 10https://gerrit.wikimedia.org/r/933451 (https://phabricator.wikimedia.org/T335943) (owner: 10Arturo Borrero Gonzalez) [12:57:08] (03PS1) 10Marostegui: wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/933454 (https://phabricator.wikimedia.org/T337812) [12:57:23] !log Failover m3-master to dbproxy1026 T337812 [12:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:27] T337812: Productionize dbproxy10[22-27] - https://phabricator.wikimedia.org/T337812 [12:57:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:57:57] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:58:07] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/933454 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [12:58:15] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T1300) [13:00:05] duesen: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T1300) [13:00:16] (03PS1) 10Marostegui: dbproxy1026: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/933455 (https://phabricator.wikimedia.org/T337812) [13:00:58] (03CR) 10Marostegui: [C: 03+2] dbproxy1026: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/933455 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [13:06:00] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:07:13] (03PS1) 10Jbond: puppetserver: add pcc facts upload functionality [puppet] - 10https://gerrit.wikimedia.org/r/933457 (https://phabricator.wikimedia.org/T330490) [13:08:04] * TheresNoTime unable to deploy for 30mins [13:10:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:11:22] 10SRE, 10Wikimedia-Mailing-lists: Create wikija-g mailing list - https://phabricator.wikimedia.org/T340380 (10Sai10ukazuki) 05In progress→03Resolved a:03Sai10ukazuki [13:12:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:15:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:16:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 22): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42034/console" [puppet] - 10https://gerrit.wikimedia.org/r/933435 (owner: 10Jbond) [13:18:09] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [13:20:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42035/console" [puppet] - 10https://gerrit.wikimedia.org/r/931275 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:21:47] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/932237 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [13:22:02] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T340501 (10Jhancock.wm) @wiki_willy the is a ticket that I can't move. 
[13:22:14] (03CR) 10Jbond: [C: 03+2] puppetserver: open up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/933440 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:22:29] (03CR) 10Jbond: [C: 03+2] puppetserver: add pcc facts upload functionality [puppet] - 10https://gerrit.wikimedia.org/r/933457 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:26:23] !log joal@deploy1002 Started deploy [airflow-dags/analytics@9eca77f]: Regular analytics weekly train [airflow-dags/analytics@9eca77f7] [13:26:26] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-worker1003.eqiad.wmnet with OS bullseye [13:26:33] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@9eca77f]: Regular analytics weekly train [airflow-dags/analytics@9eca77f7] (duration: 00m 09s) [13:26:57] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42036/console" [puppet] - 10https://gerrit.wikimedia.org/r/933402 (owner: 10Alexandros Kosiaris) [13:27:46] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-worker1003.eqiad.wmnet with OS bullseye [13:31:35] (03CR) 10Marostegui: [C: 03+2] "The commit message was wrong, this was meant to be: enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/933455 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [13:32:03] !log expand ml-staging200[12] kubelet partitions - T339231 [13:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:07] T339231: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 [13:37:26] (03PS1) 10Jbond: puppetserver: Add support for providing hardcoded list of servers [puppet] - 10https://gerrit.wikimedia.org/r/933459 (https://phabricator.wikimedia.org/T330490) [13:39:25] (03CR) 10CI reject: [V: 04-1] puppetserver: Add support for 
providing hardcoded list of servers [puppet] - 10https://gerrit.wikimedia.org/r/933459 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:40:18] (03CR) 10Muehlenhoff: ferm: Allow passing sets to an srange or drange (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:40:21] (03CR) 10Muehlenhoff: [C: 03+2] ferm: Allow passing sets to an srange or drange [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:41:36] (03PS1) 10Bartosz Dziewoński: Remove unused config $wgVisualEditorAllowLossySwitching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933460 (https://phabricator.wikimedia.org/T339871) [13:46:15] (03PS4) 10Alexandros Kosiaris: urldownloader: Simplify wikimedia variable [puppet] - 10https://gerrit.wikimedia.org/r/933402 [13:46:17] (03PS3) 10Alexandros Kosiaris: network: Split $mwappserver_networks to public/private [puppet] - 10https://gerrit.wikimedia.org/r/933405 [13:46:19] (03PS3) 10Alexandros Kosiaris: urldownloader: Switch $towikimedia to $mw_appserver_networks_private [puppet] - 10https://gerrit.wikimedia.org/r/933426 [13:47:30] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42037/console" [puppet] - 10https://gerrit.wikimedia.org/r/933402 (owner: 10Alexandros Kosiaris) [13:52:11] (03PS2) 10Jbond: puppetserver: Add support for providing hardcoded list of servers [puppet] - 10https://gerrit.wikimedia.org/r/933459 (https://phabricator.wikimedia.org/T330490) [13:53:52] (03PS1) 10Muehlenhoff: nftables: Also write out empty sets if no ipv4 or ipv6 addresses are present [puppet] - 10https://gerrit.wikimedia.org/r/933462 (https://phabricator.wikimedia.org/T336497) [13:56:13] (03CR) 10Vgutierrez: [C: 03+1] hiera: Removed unused cache instances from deployment-prep [puppet] - 
10https://gerrit.wikimedia.org/r/933436 (https://phabricator.wikimedia.org/T327742) (owner: 10Fabfur) [13:56:36] (03PS1) 10Fabfur: [beta] Update wgCdnServersNoPurge to remove unused cache servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933463 (https://phabricator.wikimedia.org/T327742) [13:56:42] (03CR) 10Jbond: [C: 03+2] puppetserver: Add support for providing hardcoded list of servers [puppet] - 10https://gerrit.wikimedia.org/r/933459 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:57:18] (03CR) 10Fabfur: [C: 03+2] hiera: Removed unused cache instances from deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/933436 (https://phabricator.wikimedia.org/T327742) (owner: 10Fabfur) [13:58:24] (03PS1) 10Andrew Bogott: wmcs-backup: remove a dangling comma [puppet] - 10https://gerrit.wikimedia.org/r/933464 [13:59:05] (03PS1) 10Elukey: role::ml_cache::storage: remove legacy_ssl_storage_port_enabled setting [puppet] - 10https://gerrit.wikimedia.org/r/933465 (https://phabricator.wikimedia.org/T288470) [14:00:30] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42038/console" [puppet] - 10https://gerrit.wikimedia.org/r/933465 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [14:01:20] (03CR) 10Muehlenhoff: [C: 03+2] Fix migration when "plain" instances are involved [cookbooks] - 10https://gerrit.wikimedia.org/r/932237 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [14:02:01] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::ml_cache::storage: remove legacy_ssl_storage_port_enabled setting [puppet] - 10https://gerrit.wikimedia.org/r/933465 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [14:03:04] (03PS1) 10Clément Goubert: mw-on-k8s: Redirect office.wikimedia.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933467 (https://phabricator.wikimedia.org/T337490) [14:03:06] (03PS1) 10Clément Goubert: Revert 
"mw-on-k8s: Redirect office.wikimedia.org to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/933468 [14:03:49] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-backup: remove a dangling comma [puppet] - 10https://gerrit.wikimedia.org/r/933464 (owner: 10Andrew Bogott) [14:04:17] !log elukey@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Roll restart to pick up new certs and openjdk version - elukey@cumin1001 [14:07:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:51] (03PS1) 10Clément Goubert: mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933469 (https://phabricator.wikimedia.org/T337490) [14:07:52] (03PS1) 10Clément Goubert: Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/933470 [14:08:18] (ProbeDown) firing: Service ml-cache2001-a:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#ml-cache2001-a:7001 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:08:30] (03PS1) 10MVernon: swift: roll object_expirer into cluster_info (remove profile) [puppet] - 10https://gerrit.wikimedia.org/r/933471 (https://phabricator.wikimedia.org/T229584) [14:08:44] (03PS1) 10Jbond: gitpuppet: fix authorize keys file [puppet] - 10https://gerrit.wikimedia.org/r/933472 [14:11:20] (03CR) 10Jbond: "lgtm but we should fix the highlighted issue, missed from previous reviews" [puppet] - 10https://gerrit.wikimedia.org/r/933462 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:11:38] jouncebot: nowandnext [14:11:38] No deployments 
scheduled for the next 1 hour(s) and 48 minute(s) [14:11:38] In 1 hour(s) and 48 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T1600) [14:11:44] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (10Clement_Goubert) [14:11:54] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/933471 (https://phabricator.wikimedia.org/T229584) (owner: 10MVernon) [14:12:39] (03CR) 10Jbond: [C: 03+2] gitpuppet: fix authorize keys file [puppet] - 10https://gerrit.wikimedia.org/r/933472 (owner: 10Jbond) [14:13:18] (ProbeDown) firing: (2) Service ml-cache2001-a:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:13:22] (03PS1) 10Ssingh: P:systemd::timesyncd: automate generation of ntp_servers list [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) [14:14:16] (03CR) 10Jbond: [V: 03+1 C: 03+2] private-repo: update repo hook to deal with different directories [puppet] - 10https://gerrit.wikimedia.org/r/933435 (owner: 10Jbond) [14:14:20] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate group1 to Kubernetes - https://phabricator.wikimedia.org/T340549 (10Clement_Goubert) [14:14:46] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate group1 to Kubernetes - https://phabricator.wikimedia.org/T340549 (10Clement_Goubert) p:05Triage→03Medium [14:14:56] (03CR) 10CI reject: [V: 04-1] P:systemd::timesyncd: automate generation of ntp_servers list [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [14:15:01] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T340550 (10phaultfinder) [14:15:02] (03PS4) 10Jbond: 
puppetserver: Add new puppet server to block [puppet] - 10https://gerrit.wikimedia.org/r/931275 (https://phabricator.wikimedia.org/T330490) [14:16:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [14:16:45] (03PS1) 10Clément Goubert: mw-on-k8s: Redirect vrt-wiki.wikimedia.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933474 (https://phabricator.wikimedia.org/T340549) [14:16:47] (03PS1) 10Clément Goubert: Revert "mw-on-k8s: Redirect vrt-wiki.wikimedia.org to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/933475 [14:17:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [14:17:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:06] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "PCC in https://puppet-compiler.wmflabs.org/output/933402/42037/urldownloader1001.wikimedia.org/index.html just has adds and they are the r" [puppet] - 10https://gerrit.wikimedia.org/r/933402 (owner: 10Alexandros Kosiaris) [14:19:02] (03PS1) 10Btullis: Configure the datahub-upgrade jobs to use TLS to contact the GMS [deployment-charts] - 10https://gerrit.wikimedia.org/r/933476 (https://phabricator.wikimedia.org/T329514) [14:20:37] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42039/console" [puppet] - 10https://gerrit.wikimedia.org/r/933405 (owner: 10Alexandros Kosiaris) [14:21:32] (03PS6) 10Herron: profile::pyrra::filesystem: add profile [puppet] - 10https://gerrit.wikimedia.org/r/929731 (https://phabricator.wikimedia.org/T302995) [14:21:38] (03CR) 10Btullis: [C: 03+2] Configure the datahub-upgrade jobs to use TLS to 
contact the GMS [deployment-charts] - 10https://gerrit.wikimedia.org/r/933476 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [14:21:54] !log elukey@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Roll restart to pick up new certs and openjdk version - elukey@cumin1001 [14:22:19] (03PS2) 10Muehlenhoff: nftables: Also write out empty sets if no ipv4 or ipv6 addresses are present [puppet] - 10https://gerrit.wikimedia.org/r/933462 (https://phabricator.wikimedia.org/T336497) [14:22:31] (03Merged) 10jenkins-bot: Configure the datahub-upgrade jobs to use TLS to contact the GMS [deployment-charts] - 10https://gerrit.wikimedia.org/r/933476 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [14:22:49] (03CR) 10Muehlenhoff: nftables: Also write out empty sets if no ipv4 or ipv6 addresses are present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933462 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:23:18] (ProbeDown) firing: (3) Service ml-cache2001-a:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:23:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC at https://puppet-compiler.wmflabs.org/output/933405/42039/mw1356.eqiad.wmnet/index.html says just order changes, so merging. 
This wil" [puppet] - 10https://gerrit.wikimedia.org/r/933405 (owner: 10Alexandros Kosiaris) [14:23:46] !log elukey@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Roll restart to pick up new certs and openjdk version - elukey@cumin1001 [14:24:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [14:24:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2002.codfw.wmnet [14:25:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: prometheus: increase scrape frequency for openstack APIs [puppet] - 10https://gerrit.wikimedia.org/r/933451 (https://phabricator.wikimedia.org/T335943) (owner: 10Arturo Borrero Gonzalez) [14:27:14] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42040/console" [puppet] - 10https://gerrit.wikimedia.org/r/933426 (owner: 10Alexandros Kosiaris) [14:27:31] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:28:18] (ProbeDown) firing: (4) Service ml-cache1001-a:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:28] (03CR) 10JHathaway: [C: 03+2] dev env: sshd, allow for user CA based auth [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [14:31:18] (03PS1) 10Arturo Borrero Gonzalez: team-wmcs: add openstack_apis_response.yaml [alerts] - 10https://gerrit.wikimedia.org/r/933477 (https://phabricator.wikimedia.org/T339152) [14:33:11] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC at https://puppet-compiler.wmflabs.org/output/933426/42040/urldownloader1001.wikimedia.org/index.html says that indeed, the only IP ra" 
[puppet] - 10https://gerrit.wikimedia.org/r/933426 (owner: 10Alexandros Kosiaris) [14:33:18] (ProbeDown) firing: (5) Service ml-cache1001-a:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:33:20] (03CR) 10CI reject: [V: 04-1] team-wmcs: add openstack_apis_response.yaml [alerts] - 10https://gerrit.wikimedia.org/r/933477 (https://phabricator.wikimedia.org/T339152) (owner: 10Arturo Borrero Gonzalez) [14:33:55] (03PS1) 10MVernon: swift: roll object_expirer into cluster_info [puppet] - 10https://gerrit.wikimedia.org/r/933478 (https://phabricator.wikimedia.org/T229584) [14:34:21] (03CR) 10CI reject: [V: 04-1] swift: roll object_expirer into cluster_info [puppet] - 10https://gerrit.wikimedia.org/r/933478 (https://phabricator.wikimedia.org/T229584) (owner: 10MVernon) [14:35:24] (03PS1) 10Btullis: Bump the version of the datahub chart that is deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/933479 (https://phabricator.wikimedia.org/T329514) [14:35:39] the ml-cache probes failed are me, nothing on fire [14:35:43] (03PS2) 10MVernon: swift: roll object_expirer into cluster_info [puppet] - 10https://gerrit.wikimedia.org/r/933478 (https://phabricator.wikimedia.org/T229584) [14:36:31] (03PS2) 10JHathaway: admin: ensure dates are quoted [puppet] - 10https://gerrit.wikimedia.org/r/933180 (https://phabricator.wikimedia.org/T337972) [14:37:09] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/933180 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [14:38:18] (ProbeDown) firing: (7) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:41:22] !log elukey@cumin1001 END (PASS) - 
Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Roll restart to pick up new certs and openjdk version - elukey@cumin1001 [14:43:18] (ProbeDown) firing: (7) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:43:44] (03CR) 10Btullis: [C: 03+2] Bump the version of the datahub chart that is deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/933479 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [14:44:32] (03Merged) 10jenkins-bot: Bump the version of the datahub chart that is deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/933479 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [14:45:29] (03PS1) 10Muehlenhoff: Don't reboot Ganeti master nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/933482 (https://phabricator.wikimedia.org/T203964) [14:46:40] (03PS1) 10Elukey: role::ml_cache::storage: fix tls port to monitor [puppet] - 10https://gerrit.wikimedia.org/r/933483 [14:46:40] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [14:46:43] (03CR) 10JHathaway: [C: 03+2] admin: ensure dates are quoted [puppet] - 10https://gerrit.wikimedia.org/r/933180 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [14:47:22] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet [14:48:38] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42041/console" [puppet] - 10https://gerrit.wikimedia.org/r/933483 (owner: 10Elukey) [14:48:42] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::ml_cache::storage: fix tls port to monitor [puppet] - 10https://gerrit.wikimedia.org/r/933483 (owner: 10Elukey) 
[14:49:23] (03PS3) 10JHathaway: stdlib: upgrade to v8.6.2 [puppet] - 10https://gerrit.wikimedia.org/r/932459 (https://phabricator.wikimedia.org/T337972) [14:49:48] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932459 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [14:50:01] (03CR) 10Clément Goubert: [C: 03+1] urldownloader: Switch $towikimedia to $mw_appserver_networks_private [puppet] - 10https://gerrit.wikimedia.org/r/933426 (owner: 10Alexandros Kosiaris) [14:51:22] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/933426 (owner: 10Alexandros Kosiaris) [14:52:55] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@5e77b01]: (no justification provided) [14:53:05] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@5e77b01]: (no justification provided) (duration: 00m 10s) [14:53:18] (ProbeDown) firing: (8) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:56:38] 10SRE, 10SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (10cjming) @BTullis thank you \o/ [14:56:51] (03PS1) 10Jbond: merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933485 (https://phabricator.wikimedia.org/T330490) [14:58:18] (ProbeDown) firing: (10) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:58:49] (03CR) 10CI reject: [V: 04-1] merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933485 
(https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:59:27] (03CR) 10Jbond: [C: 03+1] nftables: Also write out empty sets if no ipv4 or ipv6 addresses are present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933462 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:00:49] (03CR) 10BryanDavis: [C: 03+1] Add bookworm-sssd, python311-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/933449 (https://phabricator.wikimedia.org/T335507) (owner: 10Majavah) [15:01:13] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:03:48] (03CR) 10BCornwall: [C: 03+2] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [15:04:37] (03PS2) 10Jbond: merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933485 (https://phabricator.wikimedia.org/T330490) [15:05:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42045/console" [puppet] - 10https://gerrit.wikimedia.org/r/933485 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:05:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/933482 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [15:07:20] (03CR) 10Jbond: [C: 03+2] puppetserver: Add new puppet server to block [puppet] - 10https://gerrit.wikimedia.org/r/931275 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:07:37] (03CR) 10Jbond: [V: 03+1 C: 03+2] merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933485 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:08:01] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/933478 
(https://phabricator.wikimedia.org/T229584) (owner: 10MVernon) [15:09:24] (03CR) 10Volans: admin: ensure dates are quoted (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933180 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:11:27] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): PKI: add the new puppet CA to the pki infrastructre - https://phabricator.wikimedia.org/T340557 (10jbond) [15:12:19] (03CR) 10Ottomata: [C: 03+1] "One nit, but +1, merge at will." [deployment-charts] - 10https://gerrit.wikimedia.org/r/933143 (https://phabricator.wikimedia.org/T338380) (owner: 10Gmodena) [15:12:38] (03PS7) 10Herron: profile::pyrra::filesystem: add profile [puppet] - 10https://gerrit.wikimedia.org/r/929731 (https://phabricator.wikimedia.org/T302995) [15:13:06] (03PS3) 10MVernon: swift: roll object_expirer into cluster_info [puppet] - 10https://gerrit.wikimedia.org/r/933478 (https://phabricator.wikimedia.org/T229584) [15:13:28] (03PS1) 10Jbond: merge_cli: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/933488 [15:13:36] (03PS8) 10Herron: profile::pyrra::filesystem: add profile [puppet] - 10https://gerrit.wikimedia.org/r/929731 (https://phabricator.wikimedia.org/T302995) [15:14:14] (03CR) 10JHathaway: [C: 03+2] admin: ensure dates are quoted (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933180 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:16:11] (03PS4) 10Gmodena: mw-page-content-change-enrich: version bump docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/933143 (https://phabricator.wikimedia.org/T338380) [15:16:51] (03CR) 10Herron: profile::pyrra::filesystem: add profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929731 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:16:56] (03CR) 10JHathaway: "@jbond this patch is ready for an initial review, at least whether the general change is acceptable. 
I think the PCC run is a noop, other " [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:17:01] (03CR) 10Jbond: [C: 03+2] merge_cli: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/933488 (owner: 10Jbond) [15:18:11] (03PS1) 10Jbond: Revert "merge_cli: fix typo" [puppet] - 10https://gerrit.wikimedia.org/r/933419 [15:18:18] (ProbeDown) resolved: (6) Service ml-cache1001-a:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:18:45] (03PS1) 10Jbond: Revert "merge_cli: Make the paths a parameter and add them to a config file" [puppet] - 10https://gerrit.wikimedia.org/r/933420 [15:18:48] (ProbeDown) firing: (3) Service ml-cache1001-a:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:18:56] (03CR) 10Jbond: [C: 03+2] Revert "merge_cli: fix typo" [puppet] - 10https://gerrit.wikimedia.org/r/933419 (owner: 10Jbond) [15:19:03] (ProbeDown) resolved: (3) Service ml-cache2001-a:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:19:04] (03CR) 10Gmodena: mw-page-content-change-enrich: version bump docker image (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/933143 (https://phabricator.wikimedia.org/T338380) (owner: 10Gmodena) [15:19:11] (03PS2) 10Jbond: Revert "merge_cli: Make the paths a parameter and add them to a config file" [puppet] - 10https://gerrit.wikimedia.org/r/933420 [15:19:13] (03CR) 10CI reject: [V: 04-1] Revert "merge_cli: Make the paths a parameter and add them to a config file" [puppet] - 
10https://gerrit.wikimedia.org/r/933420 (owner: 10Jbond) [15:19:19] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "merge_cli: Make the paths a parameter and add them to a config file" [puppet] - 10https://gerrit.wikimedia.org/r/933420 (owner: 10Jbond) [15:20:25] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [15:21:57] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet [15:23:05] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [15:23:09] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet [15:24:08] !log puppet-merge temporarily broken [15:24:08] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [15:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:58] !log root@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [15:30:03] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T340550 (10Jhancock.wm) power supply is dead. tried reseating and draining. borrowed a PSU from a decoded server to see if that works. it did. since the spare is from an out of warranty server, I put in a dispatch with Dell Direct Tech to replac... [15:30:15] (03PS1) 10Jbond: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/933489 [15:30:30] (03PS1) 10Majavah: ssh::server: define $aliases before using it [puppet] - 10https://gerrit.wikimedia.org/r/933490 [15:30:54] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T340501 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact. resolving.
[15:30:56] (03CR) 10Jbond: [C: 03+2] test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/933489 (owner: 10Jbond) [15:31:53] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:00] !log root@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [15:32:00] !log root@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2001.codfw.wmnet [15:33:14] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [15:33:17] !log root@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2001.codfw.wmnet [15:34:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [15:34:38] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2002.codfw.wmnet [15:34:44] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [15:34:47] (03PS1) 10Jbond: Revert "test puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/933421 [15:34:51] (03CR) 10Volans: admin: ensure dates are quoted (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933180 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:34:55] (03CR) 10Jbond: [C: 03+2] Revert "test puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/933421 (owner: 10Jbond) [15:35:08] (03CR) 10JHathaway: [C: 03+1] ssh::server: define $aliases before using it [puppet] - 10https://gerrit.wikimedia.org/r/933490 (owner: 10Majavah) [15:35:24] !log root@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2001.codfw.wmnet 
[15:36:22] !log puppet-merge fixed again [15:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:05] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:37:09] (03PS1) 10Jbond: merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) [15:37:33] (03CR) 10JHathaway: [C: 03+2] ssh::server: define $aliases before using it [puppet] - 10https://gerrit.wikimedia.org/r/933490 (owner: 10Majavah) [15:39:04] (03PS2) 10Muehlenhoff: Don't reboot Ganeti master nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/933482 (https://phabricator.wikimedia.org/T203964) [15:39:10] (03CR) 10CI reject: [V: 04-1] merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:40:13] (03CR) 10Majavah: merge_cli: Make the paths a parameter and add them to a config file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:40:52] (03CR) 10JHathaway: [C: 03+2] admin: ensure dates are quoted (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933180 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:41:33] (03CR) 10Muehlenhoff: Don't reboot Ganeti master nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/933482 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [15:42:33] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:02] (03CR) 10JHathaway: "@jbond, I think this is ready to review, PCC looks like a clean noop?" 
[puppet] - 10https://gerrit.wikimedia.org/r/932459 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:44:00] (03PS2) 10Jbond: merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) [15:44:48] (03PS2) 10Ssingh: P:systemd::timesyncd: automate generation of ntp_servers list [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) [15:45:32] (03CR) 10Volans: admin: ensure dates are quoted (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933180 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:46:01] (03CR) 10CI reject: [V: 04-1] merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:46:44] (03PS3) 10Jbond: merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) [15:47:00] 10SRE-OnFire, 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Update WDQS Runbook following update lag incident - https://phabricator.wikimedia.org/T336577 (10Gehel) p:05Triage→03High [15:47:37] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:47:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 22): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42046/console" [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:48:43] (03CR) 10CI reject: [V: 04-1] merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 
(https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:49:50] (03PS4) 10Jbond: merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) [15:50:39] (03PS3) 10Ssingh: P:systemd::timesyncd: automate generation of ntp_servers list [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) [15:51:29] (03CR) 10Jbond: "thanks updated" [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:51:44] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:51:48] (03PS5) 10Jbond: merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) [15:51:50] (03CR) 10CI reject: [V: 04-1] merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:53:49] (03CR) 10CI reject: [V: 04-1] merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:57:38] (03CR) 10Ssingh: "This is ready for review https://puppet-compiler.wmflabs.org/output/933473/42047/. 
The full change catalog seems to confirm that ntp_serve" [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [15:58:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:00:05] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:03:25] (03CR) 10Jbond: [C: 03+1] "LGTM: although i'm a little concerned that this will drift again. we should try and add some linting." [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:03:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:03:48] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [16:08:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:09:47] (03PS6) 10Jbond: merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 
(https://phabricator.wikimedia.org/T330490) [16:12:18] (03CR) 10Jbond: [C: 03+1] "LGTM, while merging id disable puppet globally and first run this on one buster, bullseye and bookworm server just to make sure there is n" [puppet] - 10https://gerrit.wikimedia.org/r/932459 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:12:54] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/932818 [16:15:19] (03PS1) 10Ssingh: P:dns::recursor: automatically generate resolv.conf for DNS hosts [puppet] - 10https://gerrit.wikimedia.org/r/933497 (https://phabricator.wikimedia.org/T340479) [16:15:47] (03CR) 10Jbond: [C: 03+1] Don't reboot Ganeti master nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/933482 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [16:20:49] (03CR) 10Ssingh: "PCC looks good, ensured NOOP on both P:dns::recursor and non-P:dns::recursor hosts. https://puppet-compiler.wmflabs.org/output/933497/4204" [puppet] - 10https://gerrit.wikimedia.org/r/933497 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [16:21:17] (03CR) 10Ssingh: [C: 03+1] "Ready for review. 
From the resolv.conf for dns4004 after this change:" [puppet] - 10https://gerrit.wikimedia.org/r/933497 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [16:22:05] (03CR) 10Majavah: [C: 03+2] Add bookworm-sssd, python311-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/933449 (https://phabricator.wikimedia.org/T335507) (owner: 10Majavah) [16:22:55] (03Merged) 10jenkins-bot: Add bookworm-sssd, python311-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/933449 (https://phabricator.wikimedia.org/T335507) (owner: 10Majavah) [16:35:49] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933500 (https://phabricator.wikimedia.org/T340243) [16:35:53] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933500 (https://phabricator.wikimedia.org/T340243) (owner: 10TrainBranchBot) [16:36:20] !log train 1.41.0-wmf.15: re-running scap stage-train (T340243) [16:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:24] T340243: 1.41.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T340243 [16:36:43] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933500 (https://phabricator.wikimedia.org/T340243) (owner: 10TrainBranchBot) [16:37:11] !log brennen@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.15 refs T340243 [16:37:51] (03PS1) 10Majavah: Add node18-sssd, ruby31-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/933503 (https://phabricator.wikimedia.org/T335507) [16:37:54] (03PS1) 10Majavah: mariadb-sssd: build on bookworm [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/933504 [16:38:39] (03CR) 10Dzahn: [C: 03+2] webperf: replace Apache 2.2 with modern syntax for access control [puppet] - 10https://gerrit.wikimedia.org/r/932441 
(https://phabricator.wikimedia.org/T258686) (owner: Dzahn)
[16:40:01] (CR) BryanDavis: [C: +1] mariadb-sssd: build on bookworm [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/933504 (owner: Majavah)
[16:42:45] (CR) BryanDavis: [C: +1] Add node18-sssd, ruby31-sssd (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/933503 (https://phabricator.wikimedia.org/T335507) (owner: Majavah)
[16:45:02] (CR) Majavah: [C: +2] Add node18-sssd, ruby31-sssd (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/933503 (https://phabricator.wikimedia.org/T335507) (owner: Majavah)
[16:45:06] (CR) Majavah: [C: +2] mariadb-sssd: build on bookworm [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/933504 (owner: Majavah)
[16:45:35] (Merged) jenkins-bot: Add node18-sssd, ruby31-sssd [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/933503 (https://phabricator.wikimedia.org/T335507) (owner: Majavah)
[16:45:39] (Merged) jenkins-bot: mariadb-sssd: build on bookworm [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/933504 (owner: Majavah)
[16:49:11] (CR) Dzahn: [C: +2] "thanks for the review.
I merged, ran puppet on webperf*, restarted apache on both webperf machines and I can still open https://performanc" [puppet] - https://gerrit.wikimedia.org/r/932441 (https://phabricator.wikimedia.org/T258686) (owner: Dzahn)
[16:49:39] !log webperf1003/2003 restarted apache after deploying gerrit:932441
[16:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:04] SRE, Data-Engineering, Event-Platform Value Stream, serviceops, Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (Ottomata)
[16:50:34] SRE, Data-Engineering, Event-Platform Value Stream, serviceops, Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (Ottomata)
[16:52:56] (PS1) Jbond: puppetserver: authorise puppet server to upload files [puppet] - https://gerrit.wikimedia.org/r/933506
[16:53:59] (CR) Dzahn: [C: +2] releases-jenkins: replace Apache 2.2 with 2.4 syntax for access control (1 comment) [puppet] - https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[16:55:01] (CR) Dzahn: [C: +2] "I could also stop puppet temporarily, put the previous config back in place, for a quick test with a full revert."
[puppet] - https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[16:55:03] (PS2) Jbond: puppetserver: authorise puppet server to upload files [puppet] - https://gerrit.wikimedia.org/r/933506
[16:55:14] (CR) Dzahn: [C: +2] "s/with/without" [puppet] - https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[16:56:49] SRE, SRE-Access-Requests, Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (Jdforrester-WMF)
[16:58:19] (CR) Dzahn: [C: +2] "comments-only" [puppet] - https://gerrit.wikimedia.org/r/932319 (https://phabricator.wikimedia.org/T280392) (owner: Dzahn)
[16:59:57] (CR) Dzahn: [C: +2] "oops, was not actually deployed yet. but now it is, repeated same steps, still works" [puppet] - https://gerrit.wikimedia.org/r/932441 (https://phabricator.wikimedia.org/T258686) (owner: Dzahn)
[17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T1700)
[17:00:34] (CR) Jbond: [C: +2] puppetserver: authorise puppet server to upload files [puppet] - https://gerrit.wikimedia.org/r/933506 (owner: Jbond)
[17:05:39] (PS1) Hnowlan: trafficserver: add lua script for gateway routing [puppet] - https://gerrit.wikimedia.org/r/933508 (https://phabricator.wikimedia.org/T324678)
[17:07:56] (PS1) Jbond: puppetserver::scripts: add proxy for production [puppet] - https://gerrit.wikimedia.org/r/933509
[17:09:10] (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42049/console" [puppet] - https://gerrit.wikimedia.org/r/933509 (owner: Jbond)
[17:10:05] (CR) Jbond: [V: +1 C: +2] puppetserver::scripts: add proxy for production [puppet] - https://gerrit.wikimedia.org/r/933509
(owner: Jbond)
[17:11:31] (CR) Andrew Bogott: [C: +2] openstack: Hide locked Developer accounts from Keystone [puppet] - https://gerrit.wikimedia.org/r/932034 (https://phabricator.wikimedia.org/T339972) (owner: BryanDavis)
[17:20:08] !log brennen@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.15 refs T340243 (duration: 42m 56s)
[17:20:13] T340243: 1.41.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T340243
[17:22:15] !log brennen@deploy1002 Pruned MediaWiki: 1.41.0-wmf.12 (duration: 02m 05s)
[17:24:18] SRE, Infrastructure-Foundations, Puppet (Puppet 7.0): puppetdb; allow connections from puppetserver over ipv6 - https://phabricator.wikimedia.org/T340563 (jbond)
[17:24:32] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): puppetdb; allow connections from puppetserver over ipv6 - https://phabricator.wikimedia.org/T340563 (jbond)
[17:24:43] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): puppetdb; allow connections from puppetserver over ipv6 - https://phabricator.wikimedia.org/T340563 (jbond) p:Triage→Medium
[17:27:35] (CR) Ottomata: [C: +1] mw-page-content-change-enrich: version bump docker image [deployment-charts] - https://gerrit.wikimedia.org/r/933143 (https://phabricator.wikimedia.org/T338380) (owner: Gmodena)
[17:29:19] (CR) Gmodena: [C: +2] mw-page-content-change-enrich: version bump docker image [deployment-charts] - https://gerrit.wikimedia.org/r/933143 (https://phabricator.wikimedia.org/T338380) (owner: Gmodena)
[17:30:21] (Merged) jenkins-bot: mw-page-content-change-enrich: version bump docker image [deployment-charts] - https://gerrit.wikimedia.org/r/933143 (https://phabricator.wikimedia.org/T338380) (owner: Gmodena)
[17:32:44] (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42050/console" [puppet] -
https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[17:42:22] Jdlrobson: i'm noticing some of https://phabricator.wikimedia.org/T340243#8966056 in logs after staging the train to testwiki
[17:42:24] care to advise?
[17:42:53] seeing for ext.citoid.wikibase, ext.citoid.wikibase.init, ext.gadget.mobileonlygadget, ext.gadget.MobileCategories
[17:43:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:44:24] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[17:44:28] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[17:45:20] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[17:45:23] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[17:45:39] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[17:45:42] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[17:48:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:57:54] (PS3) TheDJ: Remove old origin-with-crossorigin referrer policy [mediawiki-config] - https://gerrit.wikimedia.org/r/927279 (https://phabricator.wikimedia.org/T338183)
[18:00:05] brennen and jnuche: I seem to be stuck in Groundhog week. Sigh.
Time for (yet another) MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T1800).
[18:01:28] (CR) Brennen Bearnes: [C: +2] Display the language button on pages without languages [extensions/UniversalLanguageSelector] (wmf/1.41.0-wmf.15) - https://gerrit.wikimedia.org/r/933154 (https://phabricator.wikimedia.org/T315036) (owner: Abijeet Patro)
[18:02:53] (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy1002 using scap backport" [extensions/UniversalLanguageSelector] (wmf/1.41.0-wmf.15) - https://gerrit.wikimedia.org/r/933154 (https://phabricator.wikimedia.org/T315036) (owner: Abijeet Patro)
[18:05:53] (CR) Dzahn: "thank you. there is another one here, if you can do both together, would be nice: https://gerrit.wikimedia.org/r/c/operations/puppet/+/932" [puppet] - https://gerrit.wikimedia.org/r/932435 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[18:17:51] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:18:50] !log disabling puppet to test stdlib upgrade patch
[18:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:47] (PS1) Btullis: Configure datahub-upgrade to use HTTPS to communicate with the GMS [deployment-charts] - https://gerrit.wikimedia.org/r/933606 (https://phabricator.wikimedia.org/T329514)
[18:21:02] (CR) Btullis: [C: +2] Configure datahub-upgrade to use HTTPS to communicate with the GMS [deployment-charts] - https://gerrit.wikimedia.org/r/933606 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[18:22:11] (Merged) jenkins-bot: Display the language button on pages without languages [extensions/UniversalLanguageSelector] (wmf/1.41.0-wmf.15)
- https://gerrit.wikimedia.org/r/933154 (https://phabricator.wikimedia.org/T315036) (owner: Abijeet Patro)
[18:22:17] (CR) JHathaway: [C: +2] stdlib: upgrade to v8.6.2 [puppet] - https://gerrit.wikimedia.org/r/932459 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[18:22:22] (CR) CI reject: [V: -1] Configure datahub-upgrade to use HTTPS to communicate with the GMS [deployment-charts] - https://gerrit.wikimedia.org/r/933606 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[18:22:42] !log brennen@deploy1002 Started scap: Backport for [[gerrit:933154|Display the language button on pages without languages (T315036)]]
[18:22:46] T315036: Unexpected "In other languages" section in Vector 22 sidebar - https://phabricator.wikimedia.org/T315036
[18:22:59] (CR) Btullis: [C: +2] "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/933606 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[18:24:09] (Merged) jenkins-bot: Configure datahub-upgrade to use HTTPS to communicate with the GMS [deployment-charts] - https://gerrit.wikimedia.org/r/933606 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[18:24:36] !log brennen@deploy1002 abi and brennen: Backport for [[gerrit:933154|Display the language button on pages without languages (T315036)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[18:25:15] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[18:25:41] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[18:25:49] (PS1) Jbond: puppetdb: Add ability to configure secondary proxies [puppet] - https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T330490)
[18:26:13] (CR) CI reject: [V: -1] puppetdb: Add ability to configure secondary proxies [puppet] -
https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:26:53] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[18:27:40] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[18:28:30] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[18:28:36] (PS2) Jbond: puppetdb: Add ability to configure secondary proxies [puppet] - https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T330490)
[18:29:00] (CR) CI reject: [V: -1] puppetdb: Add ability to configure secondary proxies [puppet] - https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:29:19] !log puppet re-enabled, enjoy!
[18:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:05] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42052/console" [puppet] - https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:31:36] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:933154|Display the language button on pages without languages (T315036)]] (duration: 08m 53s)
[18:31:41] T315036: Unexpected "In other languages" section in Vector 22 sidebar - https://phabricator.wikimedia.org/T315036
[18:33:07] (PS3) Jbond: puppetdb: Add ability to configure secondary proxies [puppet] - https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T330490)
[18:33:39] (PS1) TrainBranchBot: group0 wikis to 1.41.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/933610 (https://phabricator.wikimedia.org/T340243)
[18:33:41] (CR) TrainBranchBot: [C: +2] group0 wikis to 1.41.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/933610
(https://phabricator.wikimedia.org/T340243) (owner: TrainBranchBot)
[18:34:28] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42054/console" [puppet] - https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:34:30] (Merged) jenkins-bot: group0 wikis to 1.41.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/933610 (https://phabricator.wikimedia.org/T340243) (owner: TrainBranchBot)
[18:35:06] (CR) CI reject: [V: -1] puppetdb: Add ability to configure secondary proxies [puppet] - https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:40:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[18:40:44] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[18:41:16] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.15 refs T340243
[18:41:20] T340243: 1.41.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T340243
[18:45:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[18:47:18] !log upgrade dns6001 to gdnsd 3.99.0~alpha2
[18:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:47] (PS1) Brennen Bearnes: Drop
redundant targets [extensions/Citoid] (wmf/1.41.0-wmf.15) - https://gerrit.wikimedia.org/r/933425 (https://phabricator.wikimedia.org/T340499)
[18:55:01] (PS1) Btullis: Add an evironment variable to datahub-gms for its port [deployment-charts] - https://gerrit.wikimedia.org/r/933611 (https://phabricator.wikimedia.org/T329514)
[18:56:54] (CR) Btullis: [C: +2] Add an evironment variable to datahub-gms for its port [deployment-charts] - https://gerrit.wikimedia.org/r/933611 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[18:57:58] (Merged) jenkins-bot: Add an evironment variable to datahub-gms for its port [deployment-charts] - https://gerrit.wikimedia.org/r/933611 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[19:07:46] (CR) Dzahn: [C: +1] "this needs a lot of rebasing now, but last time I looked it was ready to go. realistically I am not going to merge this and will be gone o" [puppet] - https://gerrit.wikimedia.org/r/850173 (owner: Jbond)
[19:08:45] (Abandoned) Ssingh: sre.hosts.reboot-cluster: simplify Icinga logic [cookbooks] - https://gerrit.wikimedia.org/r/928560 (owner: Volans)
[19:09:23] (CR) Dzahn: "I think this is good and desired. But ultimately serviceops should review and merge it and since I am soon on sabbatical and cleaning up m" [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm)
[19:10:19] (CR) Ssingh: [C: +1] "Sorry for missing the review, LGTM and can merge tomorrow."
[puppet] - https://gerrit.wikimedia.org/r/926509 (owner: Muehlenhoff)
[19:11:10] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[19:11:44] (CR) Ssingh: [C: +1] "(Will take care of the deployment.)" [puppet] - https://gerrit.wikimedia.org/r/926509 (owner: Muehlenhoff)
[19:13:17] SRE, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (leila)
[19:13:25] SRE, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (leila) I'm going to remove this task from the Backlog lane of the #Research board given that there is no task for Research...
[19:14:15] (CR) Dzahn: [C: +2] "@Jaime Could you please repeat the test one more time? I have stopped puppet on releases1003 (current prod backend), reverted to previous " [puppet] - https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[19:16:30] (CR) Muehlenhoff: gdnsd: Switch to systemd::sysuser (1 comment) [puppet] - https://gerrit.wikimedia.org/r/926509 (owner: Muehlenhoff)
[19:18:07] (PS3) Ahmon Dancy: Add 'git_tag' argument to git::clone [puppet] - https://gerrit.wikimedia.org/r/933192 (https://phabricator.wikimedia.org/T218900)
[19:18:52] jouncebot nowandnext
[19:18:52] For the next 0 hour(s) and 41 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T1800)
[19:18:52] In 0 hour(s) and 41 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T2000)
[19:19:26] (CR) Ahmon Dancy: Add 'git_tag' argument to git::clone (2 comments) [puppet] - https://gerrit.wikimedia.org/r/933192 (https://phabricator.wikimedia.org/T218900) (owner:
Ahmon Dancy)
[19:20:01] (PS4) Ahmon Dancy: Add 'git_tag' argument to git::clone [puppet] - https://gerrit.wikimedia.org/r/933192 (https://phabricator.wikimedia.org/T218900)
[19:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[19:21:51] (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy1002 using scap backport" [extensions/Citoid] (wmf/1.41.0-wmf.15) - https://gerrit.wikimedia.org/r/933425 (https://phabricator.wikimedia.org/T340499) (owner: Brennen Bearnes)
[19:23:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:23:46] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[19:25:04] (Merged) jenkins-bot: Drop redundant targets [extensions/Citoid] (wmf/1.41.0-wmf.15) - https://gerrit.wikimedia.org/r/933425 (https://phabricator.wikimedia.org/T340499) (owner: Brennen Bearnes)
[19:25:06] (PS5) Ahmon Dancy: Add 'git_tag' argument to git::clone [puppet] - https://gerrit.wikimedia.org/r/933192 (https://phabricator.wikimedia.org/T218900)
[19:25:32] !log brennen@deploy1002 Started scap: Backport for [[gerrit:933425|Drop redundant targets (T340499)]]
[19:25:35] (CR) Ahmon Dancy: Add 'git_tag' argument to git::clone (1 comment) [puppet] - https://gerrit.wikimedia.org/r/933192 (https://phabricator.wikimedia.org/T218900) (owner: Ahmon Dancy)
[19:25:37] T340499: PHP Deprecated: Use of Modules must target desktop and mobile.
Module name:ext.citoid.wikibase.init was deprecated in MediaWiki 1.41. - https://phabricator.wikimedia.org/T340499
[19:27:13] (Restored) Volans: sre.hosts.reboot-cluster: simplify Icinga logic [cookbooks] - https://gerrit.wikimedia.org/r/928560 (owner: Volans)
[19:27:17] !log brennen@deploy1002 brennen: Backport for [[gerrit:933425|Drop redundant targets (T340499)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[19:27:50] (CR) Volans: "This is actually required to remove outdated code that is now better supported by Spicerack's library." [cookbooks] - https://gerrit.wikimedia.org/r/928560 (owner: Volans)
[19:28:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:33:24] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:933425|Drop redundant targets (T340499)]] (duration: 07m 51s)
[19:33:30] T340499: PHP Deprecated: Use of Modules must target desktop and mobile. Module name:ext.citoid.wikibase.init was deprecated in MediaWiki 1.41.
- https://phabricator.wikimedia.org/T340499
[19:39:07] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[19:47:44] SRE, Security-Team, Security, affects-Miraheze: Security Issue Access Request for RhinosF1 - https://phabricator.wikimedia.org/T340572 (RhinosF1)
[19:47:54] SRE, Security-Team, Security, affects-Miraheze: Security Issue Access Request for RhinosF1 - https://phabricator.wikimedia.org/T340572 (RhinosF1) Updated
[19:48:23] (PS1) Stef Dunlap: Wikifunctions: update image name; bump tag [deployment-charts] - https://gerrit.wikimedia.org/r/933614 (https://phabricator.wikimedia.org/T297314)
[19:48:26] (PS1) Btullis: Configure the datahub-gms server to use SSL when a client of itself [deployment-charts] - https://gerrit.wikimedia.org/r/933615 (https://phabricator.wikimedia.org/T329514)
[19:49:04] SRE, Security-Team, Security, affects-Miraheze: Security Issue Access Request for RhinosF1 - https://phabricator.wikimedia.org/T340572 (sbassett) >>! In T340572#8969352, @taavi wrote: > I'm not sure if I understand this ticket. If the goal is to get access to (a subset of) security tickets, then...
[19:49:17] SRE, Security-Team, Security, affects-Miraheze: Security Issue Access Request for RhinosF1 - https://phabricator.wikimedia.org/T340572 (Dzahn) > I'm not sure if I understand this ticket. What seemed unclear about it? The goal is that RhinosF1 can read any ticket that is not public and related t...
[19:49:27] (Abandoned) Jforrester: [function-evaluator] Update image reference, now it's from GitLab [deployment-charts] - https://gerrit.wikimedia.org/r/912887 (owner: Jforrester)
[19:49:31] (Abandoned) Jforrester: [function-orchestrator] Update image reference, now it's from GitLab [deployment-charts] - https://gerrit.wikimedia.org/r/912886 (owner: Jforrester)
[19:49:56] (CR) Btullis: [C: +2] Configure the datahub-gms server to use SSL when a client of itself [deployment-charts] - https://gerrit.wikimedia.org/r/933615 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[19:50:41] (Merged) jenkins-bot: Configure the datahub-gms server to use SSL when a client of itself [deployment-charts] - https://gerrit.wikimedia.org/r/933615 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[19:50:46] (CR) Jforrester: [C: +2] Wikifunctions: update image name; bump tag [deployment-charts] - https://gerrit.wikimedia.org/r/933614 (https://phabricator.wikimedia.org/T297314) (owner: Stef Dunlap)
[19:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[19:51:21] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[19:51:33] (Merged) jenkins-bot: Wikifunctions: update image name; bump tag [deployment-charts] - https://gerrit.wikimedia.org/r/933614 (https://phabricator.wikimedia.org/T297314) (owner: Stef Dunlap)
[19:51:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:51:45]
SRE, Security-Team, Security, affects-Miraheze: Security Issue Access Request for RhinosF1 - https://phabricator.wikimedia.org/T340572 (sbassett) >>! In T340572#8969392, @Dzahn wrote: >> acl*security (acl*security_volunteer in this case?) > > No idea. The point of this ticket is to find out wh...
[19:53:10] !log kindrobot@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[19:54:54] !log kindrobot@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[19:54:59] SRE, SRE-Access-Requests, Patch-For-Review: Transfer Neil Shah-Quinn's production access to new developer account - https://phabricator.wikimedia.org/T337591 (nshahquinn-wmf) Thank you, @MoritzMuehlenhoff!
[19:55:03] SRE, Security-Team, Security, affects-Miraheze: Security Issue Access Request for RhinosF1 - https://phabricator.wikimedia.org/T340572 (taavi) >>! In T340572#8969392, @Dzahn wrote: >> I'm not sure if I understand this ticket. > > What seemed unclear about it? I'm confused why it was talking ab...
[19:55:33] !log kindrobot@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[19:56:05] (PS2) Samtar: Remove wgDiscussionToolsEnable config [mediawiki-config] - https://gerrit.wikimedia.org/r/927632 (https://phabricator.wikimedia.org/T322497) (owner: Esanders)
[19:56:08] (PS2) Samtar: Remove unused config $wgVisualEditorAllowLossySwitching [mediawiki-config] - https://gerrit.wikimedia.org/r/933460 (https://phabricator.wikimedia.org/T339871) (owner: Bartosz Dziewoński)
[19:56:18] (PS2) Samtar: Remove most DiscussionTools feature configs [mediawiki-config] - https://gerrit.wikimedia.org/r/927653 (https://phabricator.wikimedia.org/T322497) (owner: Esanders)
[19:56:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:56:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: patch application
[19:56:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: patch application
[19:57:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: patch application
[19:57:14] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: patch application
[19:57:36] * TheresNoTime sees kindrobot doing stuff on deploy1002, assumes they'll deploy in ~3m :3
[19:57:57] TheresNoTime: We should be done by then.
[19:57:57] TheresNoTime, kindrobot: y'all mind if i sling out a phabricator update before deploy window commences?
[19:58:07] brennen: Go for it on our side.
[19:58:10] Sorry that's a staging deploy for wikifunctions
[19:58:31] I will not be available for the backport unfortunately TheresNoTime :(
[19:58:31] brennen: go ahead :)
[19:58:32] thanks. this should be brief, i'll ping when done.
[19:58:42] kindrobot: awww :p
[19:59:03] !log brennen@deploy1002 Started deploy [phabricator/deployment@a25a737]: deploy latest state to phab1004
[19:59:42] !log brennen@deploy1002 Finished deploy [phabricator/deployment@a25a737]: deploy latest state to phab1004 (duration: 00m 38s)
[20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230627T2000). Please do the needful.
[20:00:04] MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:22] * TheresNoTime can deploy (once phab update is done)
[20:00:29] TheresNoTime: phab update done, go ahead, please yell at me if you see anything weird
[20:00:29] hi. my backports today are no-op cleanups
[20:00:37] SRE, Security-Team, Security, affects-Miraheze: Security Issue Access Request for RhinosF1 - https://phabricator.wikimedia.org/T340572 (RhinosF1) I'm pretty sure the only way to grant access to security tickets generally is https://www.mediawiki.org/wiki/Security/SOP/Access_to_Phabricator_Securit...
[20:00:38] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[20:00:38] brennen: ack
[20:00:56] although i'd also like to backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/933613 (to fix a train blocker) if anyone here would like to review it
[20:01:17] if you want to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/932270 too...
:D
[20:01:41] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/933460 (https://phabricator.wikimedia.org/T339871) (owner: Bartosz Dziewoński)
[20:01:43] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/927632 (https://phabricator.wikimedia.org/T322497) (owner: Esanders)
[20:02:19] (CR) Jforrester: [C: +1] "WCPGW?" [mediawiki-config] - https://gerrit.wikimedia.org/r/932270 (https://phabricator.wikimedia.org/T204193) (owner: Reedy)
[20:02:37] (Merged) jenkins-bot: Remove unused config $wgVisualEditorAllowLossySwitching [mediawiki-config] - https://gerrit.wikimedia.org/r/933460 (https://phabricator.wikimedia.org/T339871) (owner: Bartosz Dziewoński)
[20:02:40] (Merged) jenkins-bot: Remove wgDiscussionToolsEnable config [mediawiki-config] - https://gerrit.wikimedia.org/r/927632 (https://phabricator.wikimedia.org/T322497) (owner: Esanders)
[20:03:09] !log samtar@deploy1002 Started scap: Backport for [[gerrit:933460|Remove unused config $wgVisualEditorAllowLossySwitching (T339871)]], [[gerrit:927632|Remove wgDiscussionToolsEnable config (T322497)]]
[20:03:16] T339871: Remove reference to $wgVisualEditorAllowLossySwitching - https://phabricator.wikimedia.org/T339871
[20:03:17] T322497: Remove config settings for individual DiscussionTools features - https://phabricator.wikimedia.org/T322497
[20:03:20] (PS4) Samtar: Remove references to auth-api.php [mediawiki-config] - https://gerrit.wikimedia.org/r/932270 (https://phabricator.wikimedia.org/T204193) (owner: Reedy)
[20:03:26] (CR) Jdlrobson: "Thanks Brennen!"
[extensions/Citoid] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/933425 (https://phabricator.wikimedia.org/T340499) (owner: 10Brennen Bearnes) [20:03:30] (03CR) 10Reedy: Remove references to auth-api.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932270 (https://phabricator.wikimedia.org/T204193) (owner: 10Reedy) [20:03:41] (03PS3) 10Samtar: Remove most DiscussionTools feature configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927653 (https://phabricator.wikimedia.org/T322497) (owner: 10Esanders) [20:04:54] !log samtar@deploy1002 esanders and samtar and matmarex: Backport for [[gerrit:933460|Remove unused config $wgVisualEditorAllowLossySwitching (T339871)]], [[gerrit:927632|Remove wgDiscussionToolsEnable config (T322497)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:05:03] (syncing) [20:06:51] Reedy: forgot to say yes, I can deploy that :) [20:07:13] assume you *will* want to test that? :p [20:07:38] Test what? [20:07:55] 932270: Remove references to auth-api.php [20:07:59] It going from some sort of 500 to a 404? :P [20:08:25] oh to have the confidence something somewhere doesn't weirdly rely on that (: [20:08:43] As an entry point, it's been "broken" for a few months [20:08:49] ah :P [20:09:13] Hmm. Seems it's not months plural... [20:09:15] 5 weeks ish [20:10:33] That's April, May, and June. Months. 
[20:10:35] ;-) [20:10:45] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:933460|Remove unused config $wgVisualEditorAllowLossySwitching (T339871)]], [[gerrit:927632|Remove wgDiscussionToolsEnable config (T322497)]] (duration: 07m 35s) [20:10:51] T339871: Remove reference to $wgVisualEditorAllowLossySwitching - https://phabricator.wikimedia.org/T339871 [20:10:51] T322497: Remove config settings for individual DiscussionTools features - https://phabricator.wikimedia.org/T322497 [20:11:50] It was only merged in May [20:11:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927653 (https://phabricator.wikimedia.org/T322497) (owner: 10Esanders) [20:11:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932270 (https://phabricator.wikimedia.org/T204193) (owner: 10Reedy) [20:12:18] well, May and June is still multiple, so 'months' it is [20:12:39] it's >1 but less than <2 [20:12:51] (03Merged) 10jenkins-bot: Remove most DiscussionTools feature configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927653 (https://phabricator.wikimedia.org/T322497) (owner: 10Esanders) [20:13:05] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [20:13:12] (03Merged) 10jenkins-bot: Remove references to auth-api.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932270 (https://phabricator.wikimedia.org/T204193) (owner: 10Reedy) [20:13:17] (03PS1) 10Jforrester: wikifunctions: Add some more real sample values for limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/933618 (https://phabricator.wikimedia.org/T297314) [20:13:41] !log samtar@deploy1002 Started scap: Backport for [[gerrit:927653|Remove most DiscussionTools feature configs (T322497)]], [[gerrit:932270|Remove references to auth-api.php (T204193)]] [20:13:47] T204193: 
SecurePoll auth-api.php needs to be rewritten to be a normal api module - https://phabricator.wikimedia.org/T204193 [20:15:10] !log samtar@deploy1002 reedy and esanders and samtar: Backport for [[gerrit:927653|Remove most DiscussionTools feature configs (T322497)]], [[gerrit:932270|Remove references to auth-api.php (T204193)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:15:17] (syncing) [20:15:29] 10SRE, 10Security-Team, 10Security, 10affects-Miraheze: Security Issue Access Request for RhinosF1 - https://phabricator.wikimedia.org/T340572 (10RhinosF1) Cc @LSobanski as @DZahn 's manager Discussed with him to confirm that #acl*security_volunteer would solve the issue - they are happy with whatever wor... [20:16:50] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:17:00] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:18:16] MatmaRex: then we're just waiting on 933613 to merge, correct? 
[20:18:42] yeah [20:18:52] maybe i should make the backport already so it can start cooking [20:18:59] these things take forever [20:19:08] please do, yeah :) [20:19:15] (03PS1) 10Bartosz Dziewoński: Title: Fix exists() assertion in toPageRecord() [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/933626 (https://phabricator.wikimedia.org/T340568) [20:20:35] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:927653|Remove most DiscussionTools feature configs (T322497)]], [[gerrit:932270|Remove references to auth-api.php (T204193)]] (duration: 06m 53s) [20:20:41] T204193: SecurePoll auth-api.php needs to be rewritten to be a normal api module - https://phabricator.wikimedia.org/T204193 [20:20:41] T322497: Remove config settings for individual DiscussionTools features - https://phabricator.wikimedia.org/T322497 [20:20:57] 10SRE, 10Security-Team, 10Security, 10affects-Miraheze: Security Issue Access Request for RhinosF1 - https://phabricator.wikimedia.org/T340572 (10Dzahn) To clarify: I do support things that are needed to achieve the goal, to let RhinosF1 see _tickets related to phab or phorge that are not public_. Since I... [20:21:50] https://en.wikipedia.org/w/auth-api.php 404s [20:21:51] yay [20:22:01] :p [20:22:35] TheresNoTime: want to +2 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/933626 ahead of time? 
[20:22:38] brb [20:22:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/933626 (https://phabricator.wikimedia.org/T340568) (owner: 10Bartosz Dziewoński) [20:28:38] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/932444/42056/thanos-fe1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/932444 (https://phabricator.wikimedia.org/T258686) (owner: 10Dzahn) [20:30:01] (back) [20:30:20] o/ [20:30:42] i guess i can't really test this backport anyway, i couldn't reproduce the bug on testwiki [20:32:36] MatmaRex: did you just want me to run it through then, or is it worth holding on the mwdebugs to see if it breaks anything? [20:33:34] i'd just ship it, wmf.15 is only group0 now, right? [20:33:42] yeah [20:39:58] (03CR) 10CI reject: [V: 04-1] Title: Fix exists() assertion in toPageRecord() [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/933626 (https://phabricator.wikimedia.org/T340568) (owner: 10Bartosz Dziewoński) [20:40:14] 😡 [20:40:51] (03CR) 10Bartosz Dziewoński: "Test failure is a known bug in tests: https://phabricator.wikimedia.org/T334634" [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/933626 (https://phabricator.wikimedia.org/T340568) (owner: 10Bartosz Dziewoński) [20:41:06] (03Merged) 10jenkins-bot: Title: Fix exists() assertion in toPageRecord() [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/933626 (https://phabricator.wikimedia.org/T340568) (owner: 10Bartosz Dziewoński) [20:41:56] !log samtar@deploy1002 Started scap: Backport for [[gerrit:933626|Title: Fix exists() assertion in toPageRecord() (T340568)]] [20:42:01] T340568: Wikimedia\Assert\PreconditionException: Precondition failed: This Title instance does not represent an existing page: Property Simplify - https://phabricator.wikimedia.org/T340568 [20:42:29] (03CR) 10Dzahn: [C: 03+2] "deployed and actually tested that I 
still get Forbidden for /debug, on 4 hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/932444 (https://phabricator.wikimedia.org/T258686) (owner: 10Dzahn) [20:43:29] !log samtar@deploy1002 matmarex and samtar: Backport for [[gerrit:933626|Title: Fix exists() assertion in toPageRecord() (T340568)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:43:35] (syncing) [20:48:48] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:933626|Title: Fix exists() assertion in toPageRecord() (T340568)]] (duration: 06m 52s) [20:48:52] T340568: Wikimedia\Assert\PreconditionException: Precondition failed: This Title instance does not represent an existing page: Property Simplify - https://phabricator.wikimedia.org/T340568 [20:48:56] MatmaRex: live :) [20:49:36] thanks [20:50:13] !log close UTC late backport window [20:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:34] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/932443/42009/" [puppet] - 10https://gerrit.wikimedia.org/r/932443 (https://phabricator.wikimedia.org/T258686) (owner: 10Dzahn) [21:45:03] !log prometheus* - puppet and partially manual restart of apaches after deploying gerrit:932443 [21:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:21] !log prometheus4002 - sudo a2dismod access_compat ; sudo systemctl restart apache2 ; sudo apachectl configtest -> Syntax OK :) - to prove it works without the access_compat module T258686 [21:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:26] T258686: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 [22:17:10] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "URLs like https://prometheus-codfw.wikimedia.org/ops/debug and https://prometheus-codfw.wikimedia.org/ops/-/reload are still Forbidden."
[puppet] - 10https://gerrit.wikimedia.org/r/932443 (https://phabricator.wikimedia.org/T258686) (owner: 10Dzahn) [22:18:08] (03PS2) 10Dzahn: miscweb: move tests for static_tendril to k8s tests file [puppet] - 10https://gerrit.wikimedia.org/r/932338 (https://phabricator.wikimedia.org/T300171) [22:18:28] (03CR) 10Dzahn: [C: 03+2] miscweb: move tests for static_tendril to k8s tests file [puppet] - 10https://gerrit.wikimedia.org/r/932338 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [22:22:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:30:32] (Nonwrite HTTP requests with primary DB writes alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+writes+alert [22:42:16] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: installing (but not registering) magnum-ui [22:43:43] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: installing (but not registering) magnum-ui (duration: 01m 27s) [22:44:04] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for XHGui in eqiad - https://phabricator.wikimedia.org/T340595 (10andrea.denisse) [22:45:42] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: 1 VM request for XHGui in eqiad - https://phabricator.wikimedia.org/T340595 (10andrea.denisse) a:03andrea.denisse [22:48:26] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for XHGui in codfw - https://phabricator.wikimedia.org/T340596 (10andrea.denisse) [22:48:52] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: 1 VM request for XHGui in codfw - https://phabricator.wikimedia.org/T340596 (10andrea.denisse) a:03andrea.denisse 
[22:48:56] (03CR) 10Dzahn: "@Jelto, so I learned how to check if my pipeline/image build works but found it still fails. and the reason is "error: failed to solve: fa" [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [22:50:32] (Nonwrite HTTP requests with primary DB writes alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+writes+alert [22:54:17] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: 1 VM request for XHGui in codfw - https://phabricator.wikimedia.org/T340596 (10Dzahn) Ah! As the one who created xhgui2001 (T259206#6352506 et al) and upgrade with buster. I approve of this request. cpu/disk/r... [22:55:34] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: 1 VM request for XHGui in codfw - https://phabricator.wikimedia.org/T340596 (10andrea.denisse) Thanks for the suggestions @Dzahn . I'll try going straight to bookworm. :) [22:55:59] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: 1 VM request for XHGui in codfw - https://phabricator.wikimedia.org/T340596 (10Dzahn) :) very cool! 
[23:01:25] (03PS1) 10Andrea Denisse: xhgui: Add node definitions for xhgui1002 and xhgui2002 [puppet] - 10https://gerrit.wikimedia.org/r/933662 (https://phabricator.wikimedia.org/T340596) [23:01:49] (03CR) 10CI reject: [V: 04-1] xhgui: Add node definitions for xhgui1002 and xhgui2002 [puppet] - 10https://gerrit.wikimedia.org/r/933662 (https://phabricator.wikimedia.org/T340596) (owner: 10Andrea Denisse) [23:02:38] (03PS2) 10Dzahn: xhgui: Add node definitions for xhgui1002 and xhgui2002 [puppet] - 10https://gerrit.wikimedia.org/r/933662 (https://phabricator.wikimedia.org/T340596) (owner: 10Andrea Denisse) [23:03:11] (03PS3) 10Andrea Denisse: xhgui: Add node definitions for xhgui1002 and xhgui2002 [puppet] - 10https://gerrit.wikimedia.org/r/933662 (https://phabricator.wikimedia.org/T340596) [23:03:49] (03CR) 10Dzahn: [C: 03+1] xhgui: Add node definitions for xhgui1002 and xhgui2002 [puppet] - 10https://gerrit.wikimedia.org/r/933662 (https://phabricator.wikimedia.org/T340596) (owner: 10Andrea Denisse) [23:04:28] (03CR) 10Andrea Denisse: [C: 03+2] xhgui: Add node definitions for xhgui1002 and xhgui2002 [puppet] - 10https://gerrit.wikimedia.org/r/933662 (https://phabricator.wikimedia.org/T340596) (owner: 10Andrea Denisse) [23:16:29] !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host xhgui1002.eqiad.wmnet [23:16:30] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [23:18:28] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: 1 VM request for XHGui in codfw - https://phabricator.wikimedia.org/T340596 (10Dzahn) Beware of things like `mariadb/grants/production-m2.sql.erb:-- Grants for 'xhgui'@'10.64.0.135'` IP addresses in database gr... 
[23:18:35] !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host xhgui2002.codfw.wmnet [23:18:36] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM xhgui1002.eqiad.wmnet - denisse@cumin1001" [23:18:38] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [23:19:15] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM xhgui1002.eqiad.wmnet - denisse@cumin1001" [23:19:15] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:19:15] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache xhgui1002.eqiad.wmnet on all recursors [23:19:18] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) xhgui1002.eqiad.wmnet on all recursors [23:19:45] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM xhgui1002.eqiad.wmnet - denisse@cumin1001" [23:20:27] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM xhgui1002.eqiad.wmnet - denisse@cumin1001" [23:20:51] !log denisse@cumin1001 START - Cookbook sre.hosts.reimage for host xhgui1002.eqiad.wmnet with OS bookworm [23:20:59] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: 1 VM request for XHGui in eqiad - https://phabricator.wikimedia.org/T340595 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by denisse@cumin1001 for host xhgui1002.eqiad.wmnet with OS boo... 
[23:21:33] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM xhgui2002.codfw.wmnet - denisse@cumin1001" [23:22:18] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM xhgui2002.codfw.wmnet - denisse@cumin1001" [23:22:19] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:22:19] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache xhgui2002.codfw.wmnet on all recursors [23:22:22] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) xhgui2002.codfw.wmnet on all recursors [23:22:47] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM xhgui2002.codfw.wmnet - denisse@cumin1001" [23:23:30] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM xhgui2002.codfw.wmnet - denisse@cumin1001" [23:23:41] !log denisse@cumin1001 START - Cookbook sre.hosts.reimage for host xhgui2002.codfw.wmnet with OS bookworm [23:23:48] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: 1 VM request for XHGui in codfw - https://phabricator.wikimedia.org/T340596 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by denisse@cumin1001 for host xhgui2002.codfw.wmnet with OS boo... 
[23:31:49] !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on xhgui1002.eqiad.wmnet with reason: host reimage [23:34:50] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on xhgui1002.eqiad.wmnet with reason: host reimage [23:40:33] !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on xhgui2002.codfw.wmnet with reason: host reimage [23:43:38] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on xhgui2002.codfw.wmnet with reason: host reimage [23:49:49] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host xhgui1002.eqiad.wmnet with OS bookworm [23:49:49] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host xhgui1002.eqiad.wmnet [23:49:56] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: 1 VM request for XHGui in eqiad - https://phabricator.wikimedia.org/T340595 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by denisse@cumin1001 for host xhgui1002.eqiad.wmnet with OS bookwor... [23:58:11] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host xhgui2002.codfw.wmnet with OS bookworm [23:58:11] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host xhgui2002.codfw.wmnet [23:58:16] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: 1 VM request for XHGui in codfw - https://phabricator.wikimedia.org/T340596 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by denisse@cumin1001 for host xhgui2002.codfw.wmnet with OS bookwor...