[00:18:57] RECOVERY - Disk space on thanos-be2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
[00:30:52] SRE, Observability-Alerting, observability, Patch-For-Review: MD RAID: remove mdadm daily check - https://phabricator.wikimedia.org/T169564 (lmata)
[00:31:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:32:12] SRE, SRE Observability, observability: Consider making a variant of the fatalmonitor CLI tool that ignores appserver timeouts - https://phabricator.wikimedia.org/T213777 (lmata)
[00:32:57] Puppet, SRE, Infrastructure-Foundations, Observability-Alerting, observability: run-no-puppet leave puppet disabled on kill/crash - https://phabricator.wikimedia.org/T182228 (lmata)
[00:37:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:46] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:13:40] Puppet, Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (bking) Thanks again for your response. >>! In T321874#8363793, @jbond wrote: > for instance i think that whether we where to use ansible or puppet we would always want...
[02:18:46] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:59:05] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:02:09] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:14:47] PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:20:35] PROBLEM - SSH on mw1330.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:44:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:49:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:55:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:57:21] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:21:33] (PS1) Kevin Bazira: ml-services: add new fawiki model to isvc [deployment-charts] - https://gerrit.wikimedia.org/r/852944 (https://phabricator.wikimedia.org/T319373)
[05:53:21] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 7843
[05:53:44] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 7843
[05:54:07] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 20115
[05:54:31] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 20115
[05:59:53] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 25091
[06:00:03] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[06:01:15] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 25091
[06:01:57] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:05:34] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 61461
[06:06:42] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 61461
[06:09:27] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 3292
[06:09:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[06:09:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[06:09:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[06:10:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 3292
[06:10:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[06:10:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T321123)', diff saved to https://phabricator.wikimedia.org/P38170 and previous config saved to /var/cache/conftool/dbconfig/20221107-061019-marostegui.json
[06:10:26] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[06:17:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T321123)', diff saved to https://phabricator.wikimedia.org/P38171 and previous config saved to /var/cache/conftool/dbconfig/20221107-061730-marostegui.json
[06:17:34] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[06:21:25] PROBLEM - SSH on mw1312.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:25:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P38172 and previous config saved to /var/cache/conftool/dbconfig/20221107-063236-marostegui.json
[06:33:06] (CR) Ayounsi: prometheus: probe SSH on mgmt network (2 comments) [puppet] - https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) (owner: Filippo Giunchedi)
[06:35:57] (CR) Urbanecm: [C: +2] Calculate mentorship-related metrics [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/853440 (https://phabricator.wikimedia.org/T318684) (owner: Urbanecm)
[06:36:03] (CR) Urbanecm: [C: +2] Add support for gemm_mentee_is_active [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/853509 (https://phabricator.wikimedia.org/T318457) (owner: Urbanecm)
[06:44:13] (CR) Ayounsi: [C: +1] "LGTM, let me know if you need help deploying it." [homer/public] - https://gerrit.wikimedia.org/r/853374 (https://phabricator.wikimedia.org/T309407) (owner: Arturo Borrero Gonzalez)
[06:45:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es5 T322406
[06:45:55] T322406: Switchover es5 codfw master (es2024 -> es2023) - https://phabricator.wikimedia.org/T322406
[06:46:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es2023 with weight 0 T322406', diff saved to https://phabricator.wikimedia.org/P38173 and previous config saved to /var/cache/conftool/dbconfig/20221107-064608-root.json
[06:46:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es5 T322406
[06:47:39] (PS1) Marostegui: mariadb: Promote es2023 to es5 codfw master [puppet] - https://gerrit.wikimedia.org/r/853708 (https://phabricator.wikimedia.org/T322406)
[06:47:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P38174 and previous config saved to /var/cache/conftool/dbconfig/20221107-064743-marostegui.json
[06:49:11] (CR) Marostegui: [C: +2] mariadb: Promote es2023 to es5 codfw master [puppet] - https://gerrit.wikimedia.org/r/853708 (https://phabricator.wikimedia.org/T322406) (owner: Marostegui)
[06:49:59] !log Starting es5 codfw failover from es2024 to es2023 - T322406
[06:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2023 to es5 primary and set section read-write T322406', diff saved to https://phabricator.wikimedia.org/P38175 and previous config saved to /var/cache/conftool/dbconfig/20221107-065048-root.json
[06:52:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2024 T322406', diff saved to https://phabricator.wikimedia.org/P38176 and previous config saved to 
/var/cache/conftool/dbconfig/20221107-065251-root.json
[06:52:55] T322406: Switchover es5 codfw master (es2024 -> es2023) - https://phabricator.wikimedia.org/T322406
[06:54:22] (PS1) Marostegui: es2024: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/853710
[06:54:37] (Merged) jenkins-bot: Calculate mentorship-related metrics [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/853440 (https://phabricator.wikimedia.org/T318684) (owner: Urbanecm)
[06:54:49] (Merged) jenkins-bot: Add support for gemm_mentee_is_active [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/853509 (https://phabricator.wikimedia.org/T318457) (owner: Urbanecm)
[06:54:55] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/853509 (https://phabricator.wikimedia.org/T318457) (owner: Urbanecm)
[06:54:58] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:853509|Add support for gemm_mentee_is_active (T318457)]], [[gerrit:853440|Calculate mentorship-related metrics (T318684)]]
[06:55:01] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/853440 (https://phabricator.wikimedia.org/T318684) (owner: Urbanecm)
[06:55:03] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457
[06:55:03] T318684: Add mentorship numbers to the Growth team product KPIs dashboard - https://phabricator.wikimedia.org/T318684
[06:55:10] (CR) Marostegui: [C: +2] es2024: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/853710 (owner: Marostegui)
[06:55:22] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:853509|Add support for gemm_mentee_is_active (T318457)]], [[gerrit:853440|Calculate mentorship-related metrics (T318684)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[07:01:25] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:853509|Add support for gemm_mentee_is_active (T318457)]], [[gerrit:853440|Calculate mentorship-related metrics (T318684)]] (duration: 06m 27s)
[07:01:30] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457
[07:01:30] T318684: Add mentorship numbers to the Growth team product KPIs dashboard - https://phabricator.wikimedia.org/T318684
[07:02:01] !log Run `time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php --wiki=cswiki` in a tmux at mwmaint1002 (T318457)
[07:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:28] (PS1) Marostegui: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/853711 (https://phabricator.wikimedia.org/T322295)
[07:02:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T321123)', diff saved to https://phabricator.wikimedia.org/P38177 and previous config saved to /var/cache/conftool/dbconfig/20221107-070249-marostegui.json
[07:02:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[07:02:53] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[07:03:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[07:03:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T321123)', diff saved to https://phabricator.wikimedia.org/P38178 and previous config saved to /var/cache/conftool/dbconfig/20221107-070311-marostegui.json
[07:03:32] (PS1) Marostegui: pc1014: Add master role [puppet] - https://gerrit.wikimedia.org/r/853712 (https://phabricator.wikimedia.org/T322295)
[07:04:12] (CR) Marostegui: [C: +2] pc1014: Add master role [puppet] - https://gerrit.wikimedia.org/r/853712 (https://phabricator.wikimedia.org/T322295) (owner: Marostegui)
[07:04:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T321123)', diff saved to https://phabricator.wikimedia.org/P38179 and previous config saved to /var/cache/conftool/dbconfig/20221107-070418-marostegui.json
[07:05:03] !log Run `time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php --wiki=bnwiki` in a tmux at mwmaint1002 (T318457)
[07:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on pc2011.codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: Primary switchover
[07:07:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: Primary switchover
[07:08:52] urbanecm: you done with scap?
[07:09:00] marostegui: yes!
[07:09:10] great thanks!!
[07:09:41] Duh. I added my patch in wrong date :/
[07:09:54] (CR) Urbanecm: "script is now live" [puppet] - https://gerrit.wikimedia.org/r/853265 (https://phabricator.wikimedia.org/T318684) (owner: Urbanecm)
[07:09:57] marostegui: can I andd and deploy config patch?
[07:10:04] kart_: go for it!
[07:10:22] kart_: the official window's in an hour though :)
[07:10:49] ahhh.
[07:11:00] marostegui: sorry for noise.
[07:11:05] urbanecm: thanks :)
[07:11:13] np
[07:11:21] no problem, I will proceed then :)
[07:11:29] (CR) Marostegui: [C: +2] ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/853711 (https://phabricator.wikimedia.org/T322295) (owner: Marostegui)
[07:12:40] (Merged) jenkins-bot: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/853711 (https://phabricator.wikimedia.org/T322295) (owner: Marostegui)
[07:12:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[07:12:49] (CR) TrainBranchBot: [C: +2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/853711 (https://phabricator.wikimedia.org/T322295) (owner: Marostegui)
[07:13:00] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:853711|ProductionServices.php: Promote pc1014 to pc1 master (T322295)]]
[07:13:03] T322295: Migrate pc1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T322295
[07:13:19] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:853711|ProductionServices.php: Promote pc1014 to pc1 master (T322295)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[07:14:26] (PS2) KartikMistry: Set ContentTranslation MT threshold to 75 in Japanese WP [mediawiki-config] - https://gerrit.wikimedia.org/r/852924 (https://phabricator.wikimedia.org/T321819)
[07:14:29] (PS1) Marostegui: pc1011: Install MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/853714 (https://phabricator.wikimedia.org/T322295)
[07:17:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[07:17:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[07:17:29] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:853711|ProductionServices.php: Promote pc1014 to pc1 master (T322295)]] (duration: 04m 29s)
[07:18:13] RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:19:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P38180 and previous config saved to /var/cache/conftool/dbconfig/20221107-071925-marostegui.json
[07:19:38] (CR) Marostegui: [C: +2] pc1011: Install MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/853714 (https://phabricator.wikimedia.org/T322295) (owner: Marostegui)
[07:20:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[07:21:53] RECOVERY - SSH on mw1312.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:26:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[07:28:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[07:28:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[07:31:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[07:34:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P38181 and previous config saved to /var/cache/conftool/dbconfig/20221107-073431-marostegui.json
[07:35:25] (PS1) Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/853528
[07:35:54] (CR) Elukey: [C: +2] ml-services: add new fawiki model to isvc [deployment-charts] - https://gerrit.wikimedia.org/r/852944 (https://phabricator.wikimedia.org/T319373) (owner: Kevin Bazira)
[07:37:52] !log `elukey@aux-k8s-worker1002:~$ sudo systemctl reset-failed ifup@ens13.service`
[07:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:19] (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/853528 (owner: Marostegui)
[07:42:01] (Merged) jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/853528 (owner: Marostegui)
[07:42:14] (CR) TrainBranchBot: [C: +2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/853528 (owner: Marostegui)
[07:42:26] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:853528|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]]
[07:42:45] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:853528|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[07:43:15] (PS1) Marostegui: pc1014: Remove master role [puppet] - https://gerrit.wikimedia.org/r/853909
[07:44:20] (CR) Marostegui: [C: +2] pc1014: Remove master role [puppet] - https://gerrit.wikimedia.org/r/853909 (owner: Marostegui)
[07:44:29] !log Run `time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php --wiki=frwiki` in a tmux at mwmaint1002 (T318457)
[07:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:32] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457
[07:46:11] RECOVERY - Check systemd state on aux-k8s-worker1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:46:37] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[07:47:20] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:853528|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] (duration: 04m 53s)
[07:49:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T321123)', diff saved to https://phabricator.wikimedia.org/P38182 and previous config saved to /var/cache/conftool/dbconfig/20221107-074938-marostegui.json
[07:49:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[07:49:42] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[07:49:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[07:50:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T321123)', diff saved to https://phabricator.wikimedia.org/P38183 and previous config saved to /var/cache/conftool/dbconfig/20221107-074959-marostegui.json
[07:50:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[07:50:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[07:51:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T321123)', diff saved to https://phabricator.wikimedia.org/P38184 and previous config saved to /var/cache/conftool/dbconfig/20221107-075106-marostegui.json
[07:51:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[07:53:43] (PS1) Marostegui: ProductionServices.php: Promote pc2014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/853910
[07:54:05] (PS1) Marostegui: pc2013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/853911
[07:54:36] (CR) Marostegui: [C: +2] ProductionServices.php: Promote pc2014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/853910 (owner: Marostegui)
[07:54:53] (CR) Marostegui: [C: +2] pc2013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/853911 (owner: Marostegui)
[07:55:27] (Merged) jenkins-bot: ProductionServices.php: Promote pc2014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/853910 (owner: Marostegui)
[07:55:36] (CR) TrainBranchBot: [C: +2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/853910 (owner: Marostegui)
[07:55:47] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:853910|ProductionServices.php: Promote pc2014 to pc3 master]]
[07:56:07] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:853910|ProductionServices.php: Promote pc2014 to pc3 master]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[07:58:50] Puppet, Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (MoritzMuehlenhoff) >>! In T321874#8372962, @bking wrote: > and (to me anyway) Puppet is the main explanation. The problems of deployment-prep are a matter of resourcing,...
[08:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221107T0800).
[08:00:05] duesen and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:05] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:853910|ProductionServices.php: Promote pc2014 to pc3 master]] (duration: 04m 18s)
[08:00:15] o/
[08:00:21] o/
[08:00:34] marostegui: hi, are you done with scap? or should we wait with the window for a bit?
[08:01:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[08:02:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[08:02:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[08:02:56] here
[08:03:14] i guess i can start
[08:03:16] urbanecm: I am done!
[08:03:20] great!
[08:03:40] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/845058 (https://phabricator.wikimedia.org/T320531) (owner: Daniel Kinzler)
[08:03:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[08:04:01] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 184 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:04:28] (Merged) jenkins-bot: Set VisualEditorDefaultParsoidClient for dewiki-beta mad testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/845058 (https://phabricator.wikimedia.org/T320531) (owner: Daniel Kinzler)
[08:04:42] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:845058|Set VisualEditorDefaultParsoidClient for dewiki-beta mad testwiki (T320531)]]
[08:04:44] T320531: Configure VE backend to use Parsoid directly on the beta cluster and testwiki - https://phabricator.wikimedia.org/T320531
[08:05:01] !log urbanecm@deploy1002 urbanecm and daniel: Backport for [[gerrit:845058|Set VisualEditorDefaultParsoidClient for dewiki-beta mad testwiki 
(T320531)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[08:05:12] urbanecm: ugh, I just spotted the silly typo in the commit message :D
[08:05:29] duesen: hopefully it won't get too mad :)
[08:05:36] Anyway, can you test at mwdebug1001?
[08:05:40] hopefully...
[08:05:44] yea, on it
[08:06:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P38185 and previous config saved to /var/cache/conftool/dbconfig/20221107-080613-marostegui.json
[08:06:46] urbanecm: actually - I think we'll have to wait five minutes for ResourceLoader cache to expire
[08:07:07] Ok.
[08:09:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[08:09:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[08:09:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[08:10:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[08:12:31] urbanecm: Let me know when you are done (no rush at all)
[08:12:46] (PS1) Marostegui: Revert "pc2013: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/853529
[08:13:01] I've another minor config patch, will add it..
[08:13:12] kart_: sure thing
[08:14:57] (CR) Marostegui: [C: +2] Revert "pc2013: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/853529 (owner: Marostegui)
[08:15:00] duesen: 5 minutes passed, how is it looking?
[08:15:28] (PS1) KartikMistry: ContentTranslation: Move haw, ps and xh Wikipedias out of Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/853914
[08:15:48] * duesen is still testing
[08:16:31] ack
[08:19:25] urbanecm: ok, looks good. VE on wikitech is still broken, the fix for that should go out with the train.
[08:19:33] PROBLEM - SSH on an-coord1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:19:35] duesen: okay, syncing
[08:19:45] (PS3) KartikMistry: Set ContentTranslation MT threshold to 75 in Japanese WP [mediawiki-config] - https://gerrit.wikimedia.org/r/852924 (https://phabricator.wikimedia.org/T321819)
[08:19:48] ftr i don't think wikitech is mwdebug-able yet
[08:20:02] (CR) Urbanecm: [C: +2] Set ContentTranslation MT threshold to 75 in Japanese WP [mediawiki-config] - https://gerrit.wikimedia.org/r/852924 (https://phabricator.wikimedia.org/T321819) (owner: KartikMistry)
[08:20:08] urbanecm: oh really? and neither is beta, I suppose
[08:20:33] urbanecm: and testwiki is lacking a setting it seems...
[08:20:44] I'll cook up a quick patch for testwiki
[08:20:48] sounds good
[08:21:00] (PS2) KartikMistry: ContentTranslation: Move haw, ps and xh Wikipedias out of Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/853914 (https://phabricator.wikimedia.org/T317289)
[08:21:15] (Merged) jenkins-bot: Set ContentTranslation MT threshold to 75 in Japanese WP [mediawiki-config] - https://gerrit.wikimedia.org/r/852924 (https://phabricator.wikimedia.org/T321819) (owner: KartikMistry)
[08:21:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P38188 and previous config saved to /var/cache/conftool/dbconfig/20221107-082120-marostegui.json
[08:21:26] https://wikitech.wikimedia.org/wiki/WikimediaDebug#Limitations still says it doesn't work on wikitech, at least
[08:23:36] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:845058|Set VisualEditorDefaultParsoidClient for dewiki-beta mad testwiki (T320531)]] (duration: 18m 54s)
[08:23:39] T320531: Configure VE backend to use Parsoid directly on the beta cluster and testwiki - https://phabricator.wikimedia.org/T320531
[08:23:42] duesen: synced
[08:24:05] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/852924 (https://phabricator.wikimedia.org/T321819) (owner: KartikMistry)
[08:24:16] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:852924|Set ContentTranslation MT threshold to 75 in Japanese WP (T321819)]]
[08:24:19] T321819: Modify Machine Translation in Japanese Wikipedia by 75% or more to publish a translation - https://phabricator.wikimedia.org/T321819
[08:24:36] !log urbanecm@deploy1002 urbanecm and kartik: Backport for [[gerrit:852924|Set ContentTranslation MT threshold to 75 in Japanese WP (T321819)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[08:24:47] kart_: can you test ^^ at mwdebug1001 please?
[08:25:01] sure. testing..
[08:25:03] (PS1) Marostegui: add_cuc_private_T321130.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/853916 (https://phabricator.wikimedia.org/T321130)
[08:25:05] RECOVERY - SSH on mw1330.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:25:24] urbanecm: ty!
[08:25:27] np
[08:26:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[08:26:13] (PS1) Vgutierrez: swift: drain ms-be06 ASAP and remove ms-be05 [puppet] - https://gerrit.wikimedia.org/r/853917 (https://phabricator.wikimedia.org/T322231)
[08:26:42] urbanecm: looks good. Please deploy..
[08:26:46] syncing
[08:27:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[08:27:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[08:27:11] (PS3) Urbanecm: ContentTranslation: Move haw, ps and xh Wikipedias out of Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/853914 (https://phabricator.wikimedia.org/T317289) (owner: KartikMistry)
[08:27:14] (CR) Urbanecm: [C: +2] ContentTranslation: Move haw, ps and xh Wikipedias out of Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/853914 (https://phabricator.wikimedia.org/T317289) (owner: KartikMistry)
[08:28:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[08:28:08] (Merged) jenkins-bot: ContentTranslation: Move haw, ps and xh Wikipedias out of Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/853914 (https://phabricator.wikimedia.org/T317289) (owner: KartikMistry)
[08:29:49] (CR) Vgutierrez: [C: +2] swift: drain ms-be06 ASAP and remove ms-be05 [puppet] - https://gerrit.wikimedia.org/r/853917 (https://phabricator.wikimedia.org/T322231) (owner: Vgutierrez)
[08:30:28] can someone with op here either op me or change the topic to put me on clinic duty ?
thank you [08:30:41] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:852924|Set ContentTranslation MT threshold to 75 in Japanese WP (T321819)]] (duration: 06m 24s) [08:30:44] T321819: Modify Machine Translation in Japanese Wikipedia by 75% or more to publish a translation - https://phabricator.wikimedia.org/T321819 [08:30:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853914 (https://phabricator.wikimedia.org/T317289) (owner: 10KartikMistry) [08:31:09] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:853914|ContentTranslation: Move haw, ps and xh Wikipedias out of Beta]] [08:31:22] godog: doing it [08:31:29] !log urbanecm@deploy1002 urbanecm and kartik: Backport for [[gerrit:853914|ContentTranslation: Move haw, ps and xh Wikipedias out of Beta]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [08:31:46] kart_: can you test ^^ at mwdebug1001 please? [08:32:09] marostegui: thank you <3 [08:32:17] godog: <3 [08:33:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:34:05] urbanecm: Looks good. Please deploy.. 
[08:34:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:34:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:34:09] syncing [08:35:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:36:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T321123)', diff saved to https://phabricator.wikimedia.org/P38189 and previous config saved to /var/cache/conftool/dbconfig/20221107-083626-marostegui.json [08:36:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1143.eqiad.wmnet with reason: Maintenance [08:36:30] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [08:36:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1143.eqiad.wmnet with reason: Maintenance [08:36:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T321123)', diff saved to https://phabricator.wikimedia.org/P38190 and previous config saved to /var/cache/conftool/dbconfig/20221107-083648-marostegui.json [08:38:06] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:853914|ContentTranslation: Move haw, ps and xh Wikipedias out of Beta]] (duration: 06m 56s) [08:38:11] kart_: and, synced [08:38:14] anything else? [08:38:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T321123)', diff saved to https://phabricator.wikimedia.org/P38191 and previous config saved to /var/cache/conftool/dbconfig/20221107-083855-marostegui.json [08:40:20] urbanecm: not for now, I need to chat to Bartosz about the config on enwiki [08:40:28] ack [08:40:35] !log UTC morning B&C window done [08:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:13] urbanecm: can I proceed? 
[08:41:18] marostegui: yup, go ahead [08:41:39] thanks! [08:41:42] (03CR) 10Marostegui: [C: 03+2] add_cuc_private_T321130.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/853916 (https://phabricator.wikimedia.org/T321130) (owner: 10Marostegui) [08:42:02] (03CR) 10Marostegui: add_cuc_private_T321130.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/853916 (https://phabricator.wikimedia.org/T321130) (owner: 10Marostegui) [08:42:16] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853530 [08:44:33] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc2014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853530 (owner: 10Marostegui) [08:45:15] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc2014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853530 (owner: 10Marostegui) [08:45:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853530 (owner: 10Marostegui) [08:45:33] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:853530|Revert "ProductionServices.php: Promote pc2014 to pc3 master"]] [08:45:52] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:853530|Revert "ProductionServices.php: Promote pc2014 to pc3 master"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [08:47:07] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:47:09] (03CR) 10Ladsgroup: [C: 03+1] "😄" 
[software/schema-changes] - 10https://gerrit.wikimedia.org/r/853916 (https://phabricator.wikimedia.org/T321130) (owner: 10Marostegui) [08:47:18] urbanecm: hm actually, I think I do have a follow-up that I want to try... Give me five minutes [08:47:32] (03CR) 10Marostegui: [C: 03+2] add_cuc_private_T321130.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/853916 (https://phabricator.wikimedia.org/T321130) (owner: 10Marostegui) [08:47:57] (03Merged) 10jenkins-bot: add_cuc_private_T321130.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/853916 (https://phabricator.wikimedia.org/T321130) (owner: 10Marostegui) [08:48:14] duesen: sure, but i think it'd be better to do it in a next window (at 14:00 UTC), to have more time for it, if that's fine :) [08:48:37] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:49:56] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:853530|Revert "ProductionServices.php: Promote pc2014 to pc3 master"]] (duration: 04m 22s) [08:50:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:50:53] (03CR) 10Muehlenhoff: Set profile::contacts::role_contacts for contint* to ServiceOps-Collab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/852832 (owner: 10Muehlenhoff) [08:50:58] (03PS3) 10Muehlenhoff: Set profile::contacts::role_contacts for contint* to ServiceOps-Collab [puppet] - 10https://gerrit.wikimedia.org/r/852832 [08:51:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:51:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:52:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:52:43] (03PS1) 10Marostegui: pc2104: Move it to pc1 [puppet] - 
10https://gerrit.wikimedia.org/r/853918 (https://phabricator.wikimedia.org/T322295) [08:53:55] (03PS1) 10Daniel Kinzler: Set wmgVisualEditorAccessRestbaseDirectly = false for testwiki and dewiki.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853919 (https://phabricator.wikimedia.org/T320531) [08:54:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P38192 and previous config saved to /var/cache/conftool/dbconfig/20221107-085402-marostegui.json [08:54:04] (03CR) 10CI reject: [V: 04-1] Set wmgVisualEditorAccessRestbaseDirectly = false for testwiki and dewiki.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853919 (https://phabricator.wikimedia.org/T320531) (owner: 10Daniel Kinzler) [08:54:21] (03CR) 10Marostegui: [C: 03+2] pc2104: Move it to pc1 [puppet] - 10https://gerrit.wikimedia.org/r/853918 (https://phabricator.wikimedia.org/T322295) (owner: 10Marostegui) [08:54:35] vgutierrez: ok to merge your change? [08:54:50] (03PS2) 10Daniel Kinzler: Set wmgVisualEditorAccessRestbaseDirectly = false for testwiki and dewiki.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853919 (https://phabricator.wikimedia.org/T320531) [08:55:15] urbanecm: --^ [08:56:13] marostegui: ok to do one more config patch? :) [08:56:20] urbanecm: yeah, I am done for the day! 
[08:56:28] thanks [08:56:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853919 (https://phabricator.wikimedia.org/T320531) (owner: 10Daniel Kinzler) [08:56:48] (03CR) 10Muehlenhoff: [C: 03+2] Set profile::contacts::role_contacts for contint* to ServiceOps-Collab [puppet] - 10https://gerrit.wikimedia.org/r/852832 (owner: 10Muehlenhoff) [08:57:22] (03Merged) 10jenkins-bot: Set wmgVisualEditorAccessRestbaseDirectly = false for testwiki and dewiki.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853919 (https://phabricator.wikimedia.org/T320531) (owner: 10Daniel Kinzler) [08:57:34] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:853919|Set wmgVisualEditorAccessRestbaseDirectly = false for testwiki and dewiki.beta (T320531)]] [08:57:38] T320531: Configure VE backend to use Parsoid directly on the beta cluster and testwiki - https://phabricator.wikimedia.org/T320531 [08:57:51] marostegui, vgutierrez: ok to merge your patches along? (move pc2104 and drain ms-be06) [08:57:54] !log urbanecm@deploy1002 urbanecm and daniel: Backport for [[gerrit:853919|Set wmgVisualEditorAccessRestbaseDirectly = false for testwiki and dewiki.beta (T320531)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [08:57:59] moritzm: mine is ok, I am waiting for vgutierrez to confirm [08:58:10] duesen: it's at mwdebug1001, can you check? [08:58:13] uh? [08:58:16] oh sorry [08:58:18] please go ahead [08:58:27] moritzm: merging [08:58:37] Ah, no, moritzm has the lock :) [08:58:48] urbanecm: on it. i hope resource loader cache won't interfere [08:58:53] we'll see [08:58:59] merging now :-) [08:59:03] thanks! 
[08:59:16] done [09:00:20] thanks [09:00:51] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Avoid marking origin servers down/dead [puppet] - 10https://gerrit.wikimedia.org/r/853321 (https://phabricator.wikimedia.org/T322420) (owner: 10Vgutierrez) [09:02:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [09:02:22] urbanecm: ok, testwiki now shows the same bug as wikitech, which is good. I'll test beta once the patch is synced. [09:02:37] okay, so, syncing [09:03:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:03:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [09:03:29] duesen: fwiw beta's not actually depending on production sync. you can monitor the beta deployment at https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/416708/console if you want [09:04:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [09:04:34] 10SRE, 10Traffic, 10Patch-For-Review: ATS flags origin servers as down during 60 seconds after a connect timeout - https://phabricator.wikimedia.org/T322420 (10Vgutierrez) 05Open→03Resolved [09:05:40] 10SRE, 10Infrastructure-Foundations, 10Packaging: Add support for temporary chroots to boron - https://phabricator.wikimedia.org/T219977 (10MoritzMuehlenhoff) 05Open→03Declined This is really simple to do with systemd-nspawn: ` mkdir -p ~/containers/bullseye sudo debootstrap bullseye ~/containers/bullse... 
[09:06:42] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:853919|Set wmgVisualEditorAccessRestbaseDirectly = false for testwiki and dewiki.beta (T320531)]] (duration: 09m 07s) [09:06:45] T320531: Configure VE backend to use Parsoid directly on the beta cluster and testwiki - https://phabricator.wikimedia.org/T320531 [09:06:58] duesen: deployed to production [09:07:33] and to beta too, the CI job finished as well [09:08:00] !log set thanos ring replicas to 3.30 T311690 [09:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:03] T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 [09:09:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P38193 and previous config saved to /var/cache/conftool/dbconfig/20221107-090908-marostegui.json [09:10:02] (03PS2) 10Phuedx: Update Metrics Platform streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852838 (https://phabricator.wikimedia.org/T322277) [09:10:11] (03PS3) 10Phuedx: Update Metrics Platform streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852838 (https://phabricator.wikimedia.org/T322277) [09:11:37] 10SRE: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815 (10MoritzMuehlenhoff) [09:11:50] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: create notifications about user accounts that have not been used for a long time - https://phabricator.wikimedia.org/T146657 (10MoritzMuehlenhoff) 05Open→03Declined I think we can close this task, part of detecting inactive accounts is part of the wider IDM... [09:12:17] (03CR) 10Filippo Giunchedi: [V: 03+1] "Thank you for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [09:12:38] urbanecm: confirmed on beta, thanks! [09:12:46] great! 
[09:16:41] (03CR) 10David Caro: wmcs: add socks proxy support to wmcs cookbooks (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [09:18:05] !log draining ganeti1010 for eventual reimage T311687 [09:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:08] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [09:22:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:24:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T321123)', diff saved to https://phabricator.wikimedia.org/P38194 and previous config saved to /var/cache/conftool/dbconfig/20221107-092414-marostegui.json [09:24:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [09:24:19] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [09:24:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [09:24:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T321123)', diff saved to https://phabricator.wikimedia.org/P38195 and previous config saved to /var/cache/conftool/dbconfig/20221107-092436-marostegui.json [09:25:20] 10SRE, 10Infrastructure-Foundations, 10LDAP: Cross-check disabled accounts from corp LDAP against data.yaml - https://phabricator.wikimedia.org/T161003 (10MoritzMuehlenhoff) 05Open→03Declined We no longer need this, the LDAP replica will vanish soonish entirely and if we see that need again in the future... 
[09:25:22] 10SRE, 10LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158 (10MoritzMuehlenhoff) [09:25:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T321123)', diff saved to https://phabricator.wikimedia.org/P38196 and previous config saved to /var/cache/conftool/dbconfig/20221107-092543-marostegui.json [09:29:08] !log installing Django security updates [09:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:19] (03PS1) 10Ladsgroup: pruneRevData: Make it reload config [extensions/FlaggedRevs] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853531 [09:29:24] (03CR) 10Ladsgroup: [C: 03+2] pruneRevData: Make it reload config [extensions/FlaggedRevs] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853531 (owner: 10Ladsgroup) [09:30:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudmetrics1003.eqiad.wmnet [09:31:37] (03PS1) 10Vgutierrez: prometheus: Aggregation rules for ATS TTFB per crc/backend [puppet] - 10https://gerrit.wikimedia.org/r/853923 (https://phabricator.wikimedia.org/T321484) [09:32:24] (03Merged) 10jenkins-bot: pruneRevData: Make it reload config [extensions/FlaggedRevs] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853531 (owner: 10Ladsgroup) [09:33:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:33:35] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10dcaro) [09:33:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:33:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T321130)', diff saved to https://phabricator.wikimedia.org/P38197 and previous config 
saved to /var/cache/conftool/dbconfig/20221107-093352-marostegui.json [09:33:55] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 50 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:33:56] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [09:33:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/FlaggedRevs] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853531 (owner: 10Ladsgroup) [09:34:09] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:853531|pruneRevData: Make it reload config]] [09:34:29] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:853531|pruneRevData: Make it reload config]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [09:34:41] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10dcaro) [09:36:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:36:12] (03PS3) 10David Caro: wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) [09:36:23] (03CR) 10David Caro: wmcs: add socks proxy support to wmcs cookbooks (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [09:36:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:36:27] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [09:36:29] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [09:36:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T318605)', diff saved to https://phabricator.wikimedia.org/P38198 and previous config saved to /var/cache/conftool/dbconfig/20221107-093629-ladsgroup.json [09:36:33] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [09:36:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [09:36:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudmetrics1003.eqiad.wmnet [09:36:58] (KubernetesRsyslogDown) firing: rsyslog on ml-serve2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:37:01] (03CR) 10Clément Goubert: [C: 03+2] hieradata: Add usernames for mw on k8s services [puppet] - 10https://gerrit.wikimedia.org/r/850094 (https://phabricator.wikimedia.org/T321786) (owner: 10Clément Goubert) [09:37:10] (03CR) 10Clément Goubert: [C: 03+2] admin: add mw on kubernetes namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/850095 (https://phabricator.wikimedia.org/T321786) (owner: 10Clément Goubert) [09:38:34] !log restart rsyslog on ml-serve2001 [09:38:36] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:43] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [09:39:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudmetrics1004.eqiad.wmnet [09:39:27] PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:39:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T321130)', diff saved to https://phabricator.wikimedia.org/P38199 and previous config saved to /var/cache/conftool/dbconfig/20221107-093939-marostegui.json [09:39:42] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [09:39:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [09:40:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:40:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P38200 and previous config saved to /var/cache/conftool/dbconfig/20221107-094050-marostegui.json [09:40:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [09:40:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:41:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [09:41:58] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:41:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [09:42:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [09:42:19] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:853531|pruneRevData: Make it reload config]] (duration: 08m 10s) [09:42:34] !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:42:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance [09:43:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance [09:43:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T318955)', diff saved to https://phabricator.wikimedia.org/P38201 and previous config saved to /var/cache/conftool/dbconfig/20221107-094315-ladsgroup.json [09:43:18] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [09:43:30] (03PS6) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [09:44:01] !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[09:44:07] (03CR) 10CI reject: [V: 04-1] Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [09:44:15] (03CR) 10Vgutierrez: "I've rebased this CR on top of production and resolved some merge conflicts due to some work performed on analytics.inc.vcl.erb" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [09:45:45] !log cgoubert@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:45:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:46:10] !log cgoubert@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:46:22] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[09:46:24] (03PS1) 10Marostegui: Revert "es2024: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/853532 [09:46:41] (03CR) 10Ladsgroup: [C: 03+1] db_inventory: Cleanup zarcillo database grants [puppet] - 10https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: 10Jcrespo) [09:46:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P38202 and previous config saved to /var/cache/conftool/dbconfig/20221107-094650-root.json [09:47:25] (03CR) 10Marostegui: [C: 03+2] Revert "es2024: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/853532 (owner: 10Marostegui) [09:47:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudmetrics1004.eqiad.wmnet [09:48:19] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:48:29] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:49:38] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[09:50:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T318955)', diff saved to https://phabricator.wikimedia.org/P38203 and previous config saved to /var/cache/conftool/dbconfig/20221107-095030-ladsgroup.json [09:50:34] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [09:51:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1112.eqiad.wmnet with reason: Maintenance [09:51:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1112.eqiad.wmnet with reason: Maintenance [09:51:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:51:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:51:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T318955)', diff saved to https://phabricator.wikimedia.org/P38204 and previous config saved to /var/cache/conftool/dbconfig/20221107-095149-ladsgroup.json [09:53:58] (03PS3) 10Muehlenhoff: Enable profile::auto_restarts::service for jwt-authorizer on docker registry [puppet] - 10https://gerrit.wikimedia.org/r/852831 (https://phabricator.wikimedia.org/T135991) [09:54:13] (03PS2) 10Muehlenhoff: ci: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/850476 (https://phabricator.wikimedia.org/T308013) [09:54:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P38205 and previous config saved to /var/cache/conftool/dbconfig/20221107-095445-marostegui.json [09:55:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T318955)', diff saved to 
https://phabricator.wikimedia.org/P38206 and previous config saved to /var/cache/conftool/dbconfig/20221107-095542-ladsgroup.json [09:55:45] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [09:55:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P38207 and previous config saved to /var/cache/conftool/dbconfig/20221107-095556-marostegui.json [09:59:58] (03CR) 10Muehlenhoff: [C: 03+2] ci: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/850476 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:00:18] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [10:01:01] (03CR) 10Muehlenhoff: [C: 03+2] paws: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842757 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:01:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P38208 and previous config saved to /var/cache/conftool/dbconfig/20221107-100155-root.json [10:02:39] (03PS1) 10Clément Goubert: mw-on-k8s: Add dummy service data [labs/private] - 10https://gerrit.wikimedia.org/r/853930 (https://phabricator.wikimedia.org/T321786) [10:04:30] (03PS6) 10Clément Goubert: P:kubernetes::deployment_server: absent services [puppet] - 10https://gerrit.wikimedia.org/r/852775 (https://phabricator.wikimedia.org/T322298) [10:04:45] (03PS6) 10Clément Goubert: mwdebug: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/852777 (https://phabricator.wikimedia.org/T321201) [10:05:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db2105', diff saved to https://phabricator.wikimedia.org/P38209 and previous config saved to /var/cache/conftool/dbconfig/20221107-100536-ladsgroup.json [10:05:53] (03PS2) 10Clément Goubert: admin: Remove stale mwdebug stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/852186 (https://phabricator.wikimedia.org/T321201) [10:09:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P38210 and previous config saved to /var/cache/conftool/dbconfig/20221107-100952-marostegui.json [10:10:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P38211 and previous config saved to /var/cache/conftool/dbconfig/20221107-101048-ladsgroup.json [10:11:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T321123)', diff saved to https://phabricator.wikimedia.org/P38212 and previous config saved to /var/cache/conftool/dbconfig/20221107-101102-marostegui.json [10:11:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [10:11:06] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [10:11:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [10:11:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [10:11:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [10:11:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T321123)', diff saved to https://phabricator.wikimedia.org/P38213 and previous config saved to 
/var/cache/conftool/dbconfig/20221107-101140-marostegui.json [10:13:09] (03PS8) 10Jcrespo: db_inventory: Cleanup zarcillo database grants [puppet] - 10https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) [10:13:54] 10SRE, 10LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158 (10LSobanski) 05Open→03Declined Closing as declined based on https://phabricator.wikimedia.org/T161003#8373403. [10:13:56] 10SRE: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815 (10LSobanski) [10:14:42] 10SRE, 10Infrastructure-Foundations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815 (10LSobanski) [10:15:15] (03PS2) 10Stevemunene: Add stevemunene to ops and analytics [puppet] - 10https://gerrit.wikimedia.org/r/853300 (https://phabricator.wikimedia.org/T322339) [10:17:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P38214 and previous config saved to /var/cache/conftool/dbconfig/20221107-101700-root.json [10:17:02] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 5398 [10:17:16] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 5398 [10:17:44] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 6661 [10:17:51] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6661 [10:19:03] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 7459 [10:19:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7459 [10:20:35] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 12400 [10:20:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P38215 and previous config saved to /var/cache/conftool/dbconfig/20221107-102043-ladsgroup.json [10:20:57] RECOVERY - SSH on an-coord1002.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:20:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 12400 [10:21:21] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 17511 [10:21:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 17511 [10:23:02] 10SRE, 10Maps: Requesting access to maps for mbsantos and jgiannelos - https://phabricator.wikimedia.org/T269357 (10LSobanski) p:05High→03Medium [10:23:12] 10SRE, 10Maps: Requesting access to maps for mbsantos and jgiannelos - https://phabricator.wikimedia.org/T269357 (10LSobanski) @MSantos This is request is almost 2 years old, is it still relevant? [10:23:50] 10SRE, 10Continuous-Integration-Config: cergen CI fails to run on Debian Stretch because cryptography dependency cannot be built against newer openssl version - https://phabricator.wikimedia.org/T212395 (10LSobanski) 05Open→03Resolved a:03LSobanski With no response in 2 years and the fact that we've migr... 
[10:24:05] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 3214 [10:24:12] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3214 [10:24:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T321130)', diff saved to https://phabricator.wikimedia.org/P38216 and previous config saved to /var/cache/conftool/dbconfig/20221107-102458-marostegui.json [10:25:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1098.eqiad.wmnet with reason: Maintenance [10:25:02] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [10:25:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1098.eqiad.wmnet with reason: Maintenance [10:25:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T321130)', diff saved to https://phabricator.wikimedia.org/P38217 and previous config saved to /var/cache/conftool/dbconfig/20221107-102509-marostegui.json [10:25:11] (03PS9) 10Jcrespo: db_inventory: Cleanup zarcillo database grants [puppet] - 10https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) [10:25:12] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 46416 [10:25:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46416 [10:25:42] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 30058 [10:25:52] (03CR) 10Muehlenhoff: [C: 03+2] Set profile::contacts::role_contacts for role::dns::auth [puppet] - 10https://gerrit.wikimedia.org/r/852918 (owner: 10Muehlenhoff) [10:25:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P38218 and previous config saved to 
/var/cache/conftool/dbconfig/20221107-102555-ladsgroup.json [10:26:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 30058 [10:26:14] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 35598 [10:26:21] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 35598 [10:26:56] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4817 [10:27:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4817 [10:28:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/852961 (https://phabricator.wikimedia.org/T322344) (owner: 10Dduvall) [10:28:23] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 59796 [10:28:45] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 59796 [10:29:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (10EChetty) [10:29:19] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 399338 [10:29:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 399338 [10:29:48] (03PS1) 10Elukey: Upgrade to 1.15.3 [debs/istio] - 10https://gerrit.wikimedia.org/r/853936 (https://phabricator.wikimedia.org/T322193) [10:30:01] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 30103 [10:30:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T321130)', diff saved to https://phabricator.wikimedia.org/P38219 and previous config saved to /var/cache/conftool/dbconfig/20221107-103056-marostegui.json [10:31:00] T321130: Add column 
cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [10:31:27] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 30103 [10:31:41] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 5398 [10:32:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P38220 and previous config saved to /var/cache/conftool/dbconfig/20221107-103205-root.json [10:32:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 5398 [10:35:11] (03CR) 10Muehlenhoff: prometheus: probe SSH on mgmt network (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [10:35:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T318955)', diff saved to https://phabricator.wikimedia.org/P38221 and previous config saved to /var/cache/conftool/dbconfig/20221107-103549-ladsgroup.json [10:35:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2109.codfw.wmnet with reason: Maintenance [10:35:53] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [10:36:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2109.codfw.wmnet with reason: Maintenance [10:36:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T318955)', diff saved to https://phabricator.wikimedia.org/P38222 and previous config saved to /var/cache/conftool/dbconfig/20221107-103622-ladsgroup.json [10:37:32] 10SRE, 10Znuny, 10serviceops-collab, 10User-Matthewrbowker: Proposal: Centralize OTRS login methodology - https://phabricator.wikimedia.org/T133476 (10LSobanski) p:05Medium→03Low [10:38:12] 10SRE, 
10Znuny, 10serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10LSobanski) [10:40:17] (03PS1) 10Filippo Giunchedi: mr: allow prometheus_group SSH access to mgmt [homer/public] - 10https://gerrit.wikimedia.org/r/853938 (https://phabricator.wikimedia.org/T310266) [10:40:19] RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:41:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T318955)', diff saved to https://phabricator.wikimedia.org/P38223 and previous config saved to /var/cache/conftool/dbconfig/20221107-104101-ladsgroup.json [10:41:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [10:41:05] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [10:41:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [10:43:08] (03PS7) 10Filippo Giunchedi: prometheus: probe SSH on mgmt network [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) [10:43:13] (03PS5) 10Giuseppe Lavagetto: monitoring: introduce exclude list for checking systemd units [puppet] - 10https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) [10:43:15] (03PS5) 10Giuseppe Lavagetto: check_systemd_state: consume exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/849929 (https://phabricator.wikimedia.org/T303253) [10:43:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (10EChetty) [10:43:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db2109 (T318955)', diff saved to https://phabricator.wikimedia.org/P38224 and previous config saved to /var/cache/conftool/dbconfig/20221107-104338-ladsgroup.json [10:46:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P38225 and previous config saved to /var/cache/conftool/dbconfig/20221107-104603-marostegui.json [10:46:16] (03PS1) 10Giuseppe Lavagetto: vopsbot: always restart the service via systemd [puppet] - 10https://gerrit.wikimedia.org/r/853939 [10:46:57] (03PS2) 10Giuseppe Lavagetto: vopsbot: always restart the service via systemd [puppet] - 10https://gerrit.wikimedia.org/r/853939 [10:47:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P38226 and previous config saved to /var/cache/conftool/dbconfig/20221107-104710-root.json [10:48:38] (03CR) 10Jcrespo: [C: 03+2] db_inventory: Cleanup zarcillo database grants [puppet] - 10https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: 10Jcrespo) [10:49:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [10:49:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [10:49:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:50:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:50:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T318955)', diff saved to https://phabricator.wikimedia.org/P38227 and previous config saved to /var/cache/conftool/dbconfig/20221107-105015-ladsgroup.json [10:50:19] T318955: Drop fr_comment and 
fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [10:51:06] (03CR) 10Clément Goubert: [C: 03+1] vopsbot: always restart the service via systemd [puppet] - 10https://gerrit.wikimedia.org/r/853939 (owner: 10Giuseppe Lavagetto) [10:56:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1002.eqiad.wmnet to drbd [10:56:53] 10SRE, 10Data Pipelines, 10Data-Engineering-Planning, 10Traffic: Add a rolled-up cache_status field to druid webrequest_sampled_128 - https://phabricator.wikimedia.org/T319344 (10EChetty) [10:57:55] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [10:57:59] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [10:58:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P38228 and previous config saved to /var/cache/conftool/dbconfig/20221107-105844-ladsgroup.json [10:59:30] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [10:59:32] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [11:01:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T318955)', diff saved to https://phabricator.wikimedia.org/P38229 and previous config saved to /var/cache/conftool/dbconfig/20221107-110109-ladsgroup.json [11:01:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P38230 and previous config saved to /var/cache/conftool/dbconfig/20221107-110110-marostegui.json [11:01:11] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [11:01:13] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:01:31] !log hnowlan@deploy1002 helmfile [staging] DONE 
helmfile.d/services/thumbor: sync [11:01:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cr-cloud: enable openstack heat API TCP port [homer/public] - 10https://gerrit.wikimedia.org/r/853374 (https://phabricator.wikimedia.org/T309407) (owner: 10Arturo Borrero Gonzalez) [11:02:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P38231 and previous config saved to /var/cache/conftool/dbconfig/20221107-110215-root.json [11:03:02] (03CR) 10Vgutierrez: "This CR needs to add some VTCs checking DP headers" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [11:04:49] <_joe_> !log manually started dump_cloud_ip_ranges.service [11:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:59] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:10] (03PS3) 10Volans: sre.hosts.decommission: use mgmt IP if no DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/852955 (https://phabricator.wikimedia.org/T320721) [11:06:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1002.eqiad.wmnet to drbd [11:06:07] PROBLEM - Host ml-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:06:27] !log running homer on cr-eqiad/cr-codfw for https://gerrit.wikimedia.org/r/c/operations/homer/public/+/853374 (T321220, T309407) [11:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:31] T309407: Install OpenStack Heat for cloud-vps - https://phabricator.wikimedia.org/T309407 [11:06:31] T321220: Subnet for magnum - https://phabricator.wikimedia.org/T321220 [11:06:39] RECOVERY - Host ml-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [11:09:12] (03CR) 10Vgutierrez: Varnish 
analytics: support differential privacy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [11:10:39] (03CR) 10Volans: [C: 03+2] Remove outdated TODO comment in wmnet template [dns] - 10https://gerrit.wikimedia.org/r/853384 (owner: 10Zabe) [11:10:55] (03CR) 10Volans: [C: 03+2] "Thanks for spotting this" [dns] - 10https://gerrit.wikimedia.org/r/853384 (owner: 10Zabe) [11:11:22] (03PS3) 10Clément Goubert: mwdebug: Remove old mwdebug deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/850184 (https://phabricator.wikimedia.org/T321201) [11:11:34] (03PS3) 10Clément Goubert: admin: Remove stale mwdebug stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/852186 (https://phabricator.wikimedia.org/T321201) [11:11:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T321123)', diff saved to https://phabricator.wikimedia.org/P38232 and previous config saved to /var/cache/conftool/dbconfig/20221107-111156-marostegui.json [11:12:00] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [11:13:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P38233 and previous config saved to /var/cache/conftool/dbconfig/20221107-111351-ladsgroup.json [11:14:35] (03CR) 10Awight: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/853941 (https://phabricator.wikimedia.org/T321887) (owner: 10Awight) [11:14:46] (03PS1) 10Hnowlan: Allow additional parameters to be passed to prod entrypoint [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853944 (https://phabricator.wikimedia.org/T233196) [11:15:25] (03CR) 10Awight: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/853941 (https://phabricator.wikimedia.org/T321887) (owner: 10Awight) [11:16:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P38234 and previous config saved to /var/cache/conftool/dbconfig/20221107-111615-ladsgroup.json [11:16:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T321130)', diff saved to https://phabricator.wikimedia.org/P38235 and previous config saved to /var/cache/conftool/dbconfig/20221107-111616-marostegui.json [11:16:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1113.eqiad.wmnet with reason: Maintenance [11:16:21] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [11:16:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1113.eqiad.wmnet with reason: Maintenance [11:16:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T321130)', diff saved to https://phabricator.wikimedia.org/P38236 and previous config saved to /var/cache/conftool/dbconfig/20221107-111637-marostegui.json [11:17:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P38237 and previous config saved to /var/cache/conftool/dbconfig/20221107-111719-root.json [11:19:00] (03PS2) 10Elukey: Upgrade to 1.15.3 [debs/istio] - 10https://gerrit.wikimedia.org/r/853936 (https://phabricator.wikimedia.org/T322193) 
[11:19:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1002.eqiad.wmnet to plain [11:21:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1002.eqiad.wmnet to plain [11:22:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T321130)', diff saved to https://phabricator.wikimedia.org/P38238 and previous config saved to /var/cache/conftool/dbconfig/20221107-112219-marostegui.json [11:22:24] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [11:24:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T318605)', diff saved to https://phabricator.wikimedia.org/P38239 and previous config saved to /var/cache/conftool/dbconfig/20221107-112423-ladsgroup.json [11:24:27] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [11:25:46] (03PS1) 10Arturo Borrero Gonzalez: cr-cloud: enable openstack magnum API TCP port [homer/public] - 10https://gerrit.wikimedia.org/r/853947 (https://phabricator.wikimedia.org/T309407) [11:27:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P38240 and previous config saved to /var/cache/conftool/dbconfig/20221107-112702-marostegui.json [11:27:49] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [11:28:16] (03PS2) 10Vgutierrez: varnish: Add sessioncookie bit to X-Analytics [puppet] - 
10https://gerrit.wikimedia.org/r/839512 (https://phabricator.wikimedia.org/T319324) [11:28:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T318955)', diff saved to https://phabricator.wikimedia.org/P38241 and previous config saved to /var/cache/conftool/dbconfig/20221107-112857-ladsgroup.json [11:29:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [11:29:01] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:29:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [11:29:27] (03CR) 10Ayounsi: "1 comment, lgtm otherwise." [homer/public] - 10https://gerrit.wikimedia.org/r/853947 (https://phabricator.wikimedia.org/T309407) (owner: 10Arturo Borrero Gonzalez) [11:29:45] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [11:30:49] (03PS2) 10Arturo Borrero Gonzalez: cr-cloud: enable openstack magnum API TCP port [homer/public] - 10https://gerrit.wikimedia.org/r/853947 (https://phabricator.wikimedia.org/T309407) [11:31:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P38242 and previous config saved to /var/cache/conftool/dbconfig/20221107-113122-ladsgroup.json [11:32:12] (03CR) 10Arturo Borrero Gonzalez: cr-cloud: enable openstack magnum API TCP port (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/853947 (https://phabricator.wikimedia.org/T309407) (owner: 10Arturo Borrero Gonzalez) [11:32:14] (03CR) 10Ayounsi: [C: 03+1] cr-cloud: enable openstack magnum API TCP port [homer/public] - 10https://gerrit.wikimedia.org/r/853947 (https://phabricator.wikimedia.org/T309407) (owner: 10Arturo Borrero Gonzalez) [11:32:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P38243 and previous config saved to /var/cache/conftool/dbconfig/20221107-113224-root.json [11:32:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cr-cloud: enable openstack magnum API TCP port [homer/public] - 10https://gerrit.wikimedia.org/r/853947 (https://phabricator.wikimedia.org/T309407) (owner: 10Arturo Borrero Gonzalez) [11:34:04] (03CR) 10Vlad.shapik: [C: 03+1] "LGTM" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853944 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:34:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance [11:34:45] (03CR) 10Vgutierrez: [V: 
03+1 C: 03+1] prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [11:34:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance [11:34:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T318955)', diff saved to https://phabricator.wikimedia.org/P38244 and previous config saved to /var/cache/conftool/dbconfig/20221107-113452-ladsgroup.json [11:34:55] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:34:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/853283 (https://phabricator.wikimedia.org/T320431) (owner: 10Slyngshede) [11:36:05] (03Abandoned) 10Vgutierrez: acme_chief: Improve OCSPResponse error handling [software/acme-chief] - 10https://gerrit.wikimedia.org/r/689068 (https://phabricator.wikimedia.org/T282490) (owner: 10Vgutierrez) [11:36:13] (03Abandoned) 10Vgutierrez: Release 0.30 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/689756 (https://phabricator.wikimedia.org/T282490) (owner: 10Vgutierrez) [11:36:16] !log running homer on cr-eqiad/cr-codfw for https://gerrit.wikimedia.org/r/853947 (T321220, T309407) [11:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:20] T309407: Install OpenStack Heat for cloud-vps - https://phabricator.wikimedia.org/T309407 [11:36:20] T321220: Openstack Magnum network setup - https://phabricator.wikimedia.org/T321220 [11:37:07] (03PS2) 10Clément Goubert: mediawiki: Create new mw-api-int deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853933 (https://phabricator.wikimedia.org/T321895) [11:37:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to 
https://phabricator.wikimedia.org/P38245 and previous config saved to /var/cache/conftool/dbconfig/20221107-113726-marostegui.json [11:39:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P38246 and previous config saved to /var/cache/conftool/dbconfig/20221107-113929-ladsgroup.json [11:39:33] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [11:40:54] 10SRE, 10Maps: Requesting access to maps for mbsantos and jgiannelos - https://phabricator.wikimedia.org/T269357 (10MoritzMuehlenhoff) 05Stalled→03Resolved a:03MoritzMuehlenhoff @Jgiannelos and @MSantos were added to the maps-root group back in September 2021, closing this task. [11:41:35] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [11:42:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P38247 and previous config saved to /var/cache/conftool/dbconfig/20221107-114209-marostegui.json [11:42:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T318955)', diff saved to https://phabricator.wikimedia.org/P38248 and previous config saved to /var/cache/conftool/dbconfig/20221107-114217-ladsgroup.json [11:42:21] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:42:32] (03PS1) 10Jcrespo: zarcillo: Remove access to non-primary dc prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/853950 (https://phabricator.wikimedia.org/T146149) [11:42:52] (03CR) 10Muehlenhoff: [C: 03+1] "Patch looks good (only approval by Olja needed via https://phabricator.wikimedia.org/T322339)" [puppet] - 10https://gerrit.wikimedia.org/r/853300 (https://phabricator.wikimedia.org/T322339) (owner: 10Stevemunene) [11:44:00] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 112 [11:44:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 112 [11:45:02] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/853010 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [11:46:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T318955)', diff saved to https://phabricator.wikimedia.org/P38249 and previous config saved to /var/cache/conftool/dbconfig/20221107-114628-ladsgroup.json [11:46:30] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:46:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:46:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T318955)', diff saved to https://phabricator.wikimedia.org/P38250 and previous config saved to /var/cache/conftool/dbconfig/20221107-114649-ladsgroup.json [11:47:31] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: use mgmt IP if no DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/852955 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans) [11:48:09] (03CR) 10Hnowlan: [C: 03+2] Allow additional parameters to be passed to prod entrypoint [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853944 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:49:17] (03CR) 10Ayounsi: [C: 03+1] mr: allow prometheus_group SSH access to mgmt [homer/public] - 10https://gerrit.wikimedia.org/r/853938 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [11:51:46] (03CR) 10Jcrespo: "10.64.32.25 is cumin1001. 
DBAs, can you think a reason why to provide read only access to zarcillo from cumin under the prometheus-mysqld-" [puppet] - 10https://gerrit.wikimedia.org/r/853950 (https://phabricator.wikimedia.org/T146149) (owner: 10Jcrespo) [11:52:29] (03PS2) 10Jcrespo: zarcillo: Remove access to non-primary dc prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/853950 (https://phabricator.wikimedia.org/T146149) [11:52:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P38251 and previous config saved to /var/cache/conftool/dbconfig/20221107-115232-marostegui.json [11:52:35] (03Merged) 10jenkins-bot: sre.hosts.decommission: use mgmt IP if no DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/852955 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans) [11:52:53] (03PS3) 10Jcrespo: zarcillo: Remove access to non-primary dc prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/853950 (https://phabricator.wikimedia.org/T146149) [11:53:10] (03Merged) 10jenkins-bot: Allow additional parameters to be passed to prod entrypoint [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853944 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:54:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P38252 and previous config saved to /var/cache/conftool/dbconfig/20221107-115436-ladsgroup.json [11:57:14] (03PS1) 10Vgutierrez: debian: Add release 0.35 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/853951 (https://phabricator.wikimedia.org/T244232) [11:57:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T321123)', diff saved to https://phabricator.wikimedia.org/P38253 and previous config saved to /var/cache/conftool/dbconfig/20221107-115715-marostegui.json [11:57:17] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1147.eqiad.wmnet with reason: Maintenance [11:57:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P38254 and previous config saved to /var/cache/conftool/dbconfig/20221107-115723-ladsgroup.json [11:57:25] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [11:57:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1147.eqiad.wmnet with reason: Maintenance [11:57:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T321123)', diff saved to https://phabricator.wikimedia.org/P38255 and previous config saved to /var/cache/conftool/dbconfig/20221107-115737-marostegui.json [11:58:44] (03CR) 10Marostegui: [C: 03+1] "I don't think I can, maybe it was some sort of old testing or need." [puppet] - 10https://gerrit.wikimedia.org/r/853950 (https://phabricator.wikimedia.org/T146149) (owner: 10Jcrespo) [11:59:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T321123)', diff saved to https://phabricator.wikimedia.org/P38256 and previous config saved to /var/cache/conftool/dbconfig/20221107-115944-marostegui.json [12:00:12] (03CR) 10Kosta Harlan: [C: 03+1] growthexperiments.pp: Run updateMetrics.php daily [puppet] - 10https://gerrit.wikimedia.org/r/853265 (https://phabricator.wikimedia.org/T318684) (owner: 10Urbanecm) [12:00:13] !log volans@cumin1001 START - Cookbook sre.hosts.decommission for hosts ganeti4003.ulsfo.wmnet [12:02:06] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.35 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/853951 (https://phabricator.wikimedia.org/T244232) (owner: 10Vgutierrez) [12:03:39] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:04:01] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:04:33] !log volans@cumin1001 START - Cookbook sre.dns.netbox [12:04:45] (03CR) 10Clément Goubert: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/853952 (https://phabricator.wikimedia.org/T321896) (owner: 10Clément Goubert) [12:05:31] (03CR) 10Jcrespo: [C: 03+2] "Thank you! Merging." [puppet] - 10https://gerrit.wikimedia.org/r/853950 (https://phabricator.wikimedia.org/T146149) (owner: 10Jcrespo) [12:05:44] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:05:44] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti4003.ulsfo.wmnet [12:05:51] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin1001 for hosts: `ganeti4003.ulsfo.wmnet` - ganeti4003.ulsfo.wmnet (**FAIL**... 
[12:05:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [12:05:59] !log testing acme-chief 0.35 in acmechief-test1001 [12:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [12:06:15] (03CR) 10Muehlenhoff: [C: 03+2] Point profile::contacts::role_contacts for clouddumps to WMCS SREs [puppet] - 10https://gerrit.wikimedia.org/r/852127 (owner: 10Muehlenhoff) [12:06:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T318605)', diff saved to https://phabricator.wikimedia.org/P38257 and previous config saved to /var/cache/conftool/dbconfig/20221107-120614-ladsgroup.json [12:06:17] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [12:06:51] jynus: can I merge your patch for zarcillo along? 
[12:06:56] yes [12:07:03] I was about to do it, thank you [12:07:17] ack, doing that now [12:07:31] done [12:07:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T321130)', diff saved to https://phabricator.wikimedia.org/P38258 and previous config saved to /var/cache/conftool/dbconfig/20221107-120739-marostegui.json [12:07:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1131.eqiad.wmnet with reason: Maintenance [12:07:43] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [12:07:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1131.eqiad.wmnet with reason: Maintenance [12:08:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T321130)', diff saved to https://phabricator.wikimedia.org/P38259 and previous config saved to /var/cache/conftool/dbconfig/20221107-120800-marostegui.json [12:08:45] (03PS1) 10Ssingh: package_builder: remove deprecated Varnish6 hooks [puppet] - 10https://gerrit.wikimedia.org/r/853954 (https://phabricator.wikimedia.org/T321309) [12:09:27] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48974 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:09:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T318605)', diff saved to https://phabricator.wikimedia.org/P38260 and previous config saved to /var/cache/conftool/dbconfig/20221107-120942-ladsgroup.json [12:09:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [12:09:49] (03PS1) 10Urbanecm: Rename QuitMentorship to ReassignMentees [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853955 (https://phabricator.wikimedia.org/T321382) [12:09:51] RECOVERY - mailman list info on 
lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:09:52] (03PS1) 10Urbanecm: ReassignMentees: Pass the actual performer to ChangeMentor [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853956 (https://phabricator.wikimedia.org/T321382) [12:09:54] (03PS1) 10Urbanecm: ManageMentorsRemoveMentor: Reassign mentees to a different mentor [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853957 (https://phabricator.wikimedia.org/T321382) [12:09:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [12:10:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T318605)', diff saved to https://phabricator.wikimedia.org/P38261 and previous config saved to /var/cache/conftool/dbconfig/20221107-121004-ladsgroup.json [12:10:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T321130)', diff saved to https://phabricator.wikimedia.org/P38262 and previous config saved to /var/cache/conftool/dbconfig/20221107-121009-marostegui.json [12:11:27] (03PS1) 10Clément Goubert: mediawiki: Create new mw-jobrunner deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853958 (https://phabricator.wikimedia.org/T321897) [12:11:57] (03PS1) 10Volans: sre.hosts.decommission: power off only if not off [cookbooks] - 10https://gerrit.wikimedia.org/r/853959 [12:12:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P38263 and previous config saved to /var/cache/conftool/dbconfig/20221107-121230-ladsgroup.json [12:13:22] (03CR) 10Btullis: [C: 03+1] Add stevemunene to ops and analytics [puppet] - 10https://gerrit.wikimedia.org/r/853300 (https://phabricator.wikimedia.org/T322339) (owner: 
10Stevemunene) [12:13:53] !log reprepro -C main include bullseye-wikimedia prometheus-varnishkafka-exporter_0.1-2_amd64.changes: T321309 [12:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:56] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [12:14:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P38264 and previous config saved to /var/cache/conftool/dbconfig/20221107-121451-marostegui.json [12:17:33] (03PS1) 10Ssingh: Release 1.9-2 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/853962 (https://phabricator.wikimedia.org/T321309) [12:19:21] (03CR) 10Vgutierrez: [C: 03+1] Release 1.9-2 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/853962 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [12:21:39] (03PS2) 10Urbanecm: Rename QuitMentorship to ReassignMentees [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853955 (https://phabricator.wikimedia.org/T321382) [12:21:53] (03PS2) 10Urbanecm: ReassignMentees: Pass the actual performer to ChangeMentor [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853956 (https://phabricator.wikimedia.org/T321382) [12:21:57] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/851126 (owner: 10Andrew Bogott) [12:22:00] (03PS3) 10Urbanecm: ReassignMentees: Pass the actual performer to ChangeMentor [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853956 (https://phabricator.wikimedia.org/T321382) [12:22:09] (03PS2) 10Urbanecm: ManageMentorsRemoveMentor: Reassign mentees to a different mentor [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853957 (https://phabricator.wikimedia.org/T321382) [12:22:16] (03PS3) 10Urbanecm: 
ManageMentorsRemoveMentor: Reassign mentees to a different mentor [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853957 (https://phabricator.wikimedia.org/T321382) [12:22:28] (03PS1) 10Hokwelum: Add poincare.acc.umu.se to ipv4 and ipv6 config [puppet] - 10https://gerrit.wikimedia.org/r/853965 [12:23:31] (03CR) 10Vgutierrez: [C: 03+1] package_builder: remove deprecated Varnish6 hooks [puppet] - 10https://gerrit.wikimedia.org/r/853954 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [12:23:45] (03PS1) 10Ssingh: Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) [12:24:56] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/853959 (owner: 10Volans) [12:25:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P38265 and previous config saved to /var/cache/conftool/dbconfig/20221107-122516-marostegui.json [12:25:54] (03CR) 10CI reject: [V: 04-1] Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [12:27:01] (03PS1) 10Ssingh: Release 1.5.3-2 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/853974 (https://phabricator.wikimedia.org/T321309) [12:27:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T318955)', diff saved to https://phabricator.wikimedia.org/P38266 and previous config saved to /var/cache/conftool/dbconfig/20221107-122737-ladsgroup.json [12:27:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance [12:27:41] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:27:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 
10https://gerrit.wikimedia.org/r/853954 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [12:27:46] (03CR) 10CI reject: [V: 04-1] Release 1.5.3-2 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/853974 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [12:27:51] (03PS1) 10Clément Goubert: mediawiki: Create new mw-web deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853975 (https://phabricator.wikimedia.org/T321900) [12:27:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance [12:27:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [12:28:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [12:28:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T318955)', diff saved to https://phabricator.wikimedia.org/P38267 and previous config saved to /var/cache/conftool/dbconfig/20221107-122814-ladsgroup.json [12:29:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P38268 and previous config saved to /var/cache/conftool/dbconfig/20221107-122957-marostegui.json [12:30:37] (03PS1) 10Ssingh: Release 1.1.0-2 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/853987 (https://phabricator.wikimedia.org/T321309) [12:30:49] (03CR) 10Ssingh: [C: 03+2] package_builder: remove deprecated Varnish6 hooks [puppet] - 10https://gerrit.wikimedia.org/r/853954 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [12:32:22] (03CR) 10Majavah: "looks good! 
one small issue inline" [puppet] - 10https://gerrit.wikimedia.org/r/850592 (https://phabricator.wikimedia.org/T288108) (owner: 10Andrew Bogott) [12:32:34] (03CR) 10Majavah: [C: 03+1] "looks good, but didn't test" [puppet] - 10https://gerrit.wikimedia.org/r/850540 (https://phabricator.wikimedia.org/T288108) (owner: 10Andrew Bogott) [12:33:35] (03CR) 10Ssingh: "Similar to I54b7d1ccbf2eb3ac21f5059659bda105ae9e01c9, I guess we need to remove the -1 manually here." [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/853974 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [12:34:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T318955)', diff saved to https://phabricator.wikimedia.org/P38269 and previous config saved to /var/cache/conftool/dbconfig/20221107-123528-ladsgroup.json [12:35:29] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/853973 (owner: 10L10n-bot) [12:35:33] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:35:42] (03CR) 10Jbond: dumps/distribution: add more data types to parameters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [12:37:48] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting, 10User-fgiunchedi: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10lmata) [12:37:55] (03PS2) 10Vgutierrez: Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [12:38:25] 10SRE, 10Observability-Metrics, 10Goal, 10MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), and 2 others: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10lmata) [12:39:14] (03CR) 10CI reject: [V: 04-1] Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [12:39:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops and analytics for stevemunene - https://phabricator.wikimedia.org/T322339 (10Ottomata) Approved! Since this is ops/sre/root(?) access, is there any approval that needs to happen from SRE? [12:40:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P38270 and previous config saved to /var/cache/conftool/dbconfig/20221107-124022-marostegui.json [12:42:20] urbanecm: is this high baseline of exceptions known? 
I couldn't find anything on phabricator: https://logstash.wikimedia.org/goto/4db7e6fe954418db256123840a06d98c [12:42:39] looking [12:42:55] uhh [12:43:00] I am guessing related to your scap [12:43:16] yeah, that's definitely related [12:43:18] checking [12:43:26] it should be "just" logspam [12:45:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T321123)', diff saved to https://phabricator.wikimedia.org/P38271 and previous config saved to /var/cache/conftool/dbconfig/20221107-124504-marostegui.json [12:45:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1148.eqiad.wmnet with reason: Maintenance [12:45:09] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [12:45:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1148.eqiad.wmnet with reason: Maintenance [12:45:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T321123)', diff saved to https://phabricator.wikimedia.org/P38272 and previous config saved to /var/cache/conftool/dbconfig/20221107-124526-marostegui.json [12:46:02] not too worried about it, it was mostly to open a task for what you say - awareness it in case other more impacting issues start: https://grafana.wikimedia.org/goto/Xaj4_FDVk?orgId=1 [12:46:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:36] yep, thanks for the ping jynus. not sure how did i miss this :/ [12:46:49] (03PS1) 10Esanders: Keep DiscussionTools "Share feedback..." 
links on WMF wikis for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853991 (https://phabricator.wikimedia.org/T322494) [12:47:03] you're welcome [12:47:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318955)', diff saved to https://phabricator.wikimedia.org/P38273 and previous config saved to /var/cache/conftool/dbconfig/20221107-124706-ladsgroup.json [12:47:09] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:47:20] filled as T322538, will fix in prod soon :). [12:47:20] T322538: LogicException: MentorStore::isMenteeActive was called, but GEMentorshipUseIsActiveFlag is false - https://phabricator.wikimedia.org/T322538 [12:47:41] no worries or rush, just making sure it was known is enough! [12:48:18] (I noticed by chance while I was checking other past issue) [12:48:24] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Create mw-api-int helmfile deployment - https://phabricator.wikimedia.org/T321895 (10Clement_Goubert) p:05Triage→03High [12:48:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T321123)', diff saved to https://phabricator.wikimedia.org/P38274 and previous config saved to /var/cache/conftool/dbconfig/20221107-124833-marostegui.json [12:48:43] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Create mw-jobrunner helmfile deployment - https://phabricator.wikimedia.org/T321897 (10Clement_Goubert) p:05Triage→03High [12:48:56] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Create mw-web helmfile deployment - https://phabricator.wikimedia.org/T321900 (10Clement_Goubert) p:05Triage→03High [12:49:11] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Create mw-web helmfile deployment - https://phabricator.wikimedia.org/T321900 (10Clement_Goubert) 05Open→03In progress [12:49:23] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Deploy mediawiki 
kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [12:50:11] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Create mw-api-int helmfile deployment - https://phabricator.wikimedia.org/T321895 (10Clement_Goubert) 05Open→03In progress [12:50:23] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [12:50:33] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Create mw-api-ext helmfile deployment - https://phabricator.wikimedia.org/T321896 (10Clement_Goubert) 05Open→03In progress [12:50:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P38275 and previous config saved to /var/cache/conftool/dbconfig/20221107-125035-ladsgroup.json [12:50:45] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [12:50:59] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Create mw-jobrunner helmfile deployment - https://phabricator.wikimedia.org/T321897 (10Clement_Goubert) 05Open→03In progress [12:51:11] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [12:51:31] !log Add NAT for frmon2001 - T321735 [12:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:34] T321735: Investigate why frmon-codfw.wikimedia.org is not accessible from untrust zone. - https://phabricator.wikimedia.org/T321735 [12:53:14] 10SRE, 10Infrastructure-Foundations, 10netops: Investigate why frmon-codfw.wikimedia.org is not accessible from untrust zone. - https://phabricator.wikimedia.org/T321735 (10ayounsi) 05Open→03Resolved a:03ayounsi Indeed. It's now live: `nc -zv 208.80.152.235 443` works. 
[12:54:10] (03PS1) 10Urbanecm: MentorHooks: Add missing check for GEMentorshipUseIsActiveFlag [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853537 (https://phabricator.wikimedia.org/T322538) [12:54:56] (03PS1) 10JMeybohm: k8s: Add version switching where needed [puppet] - 10https://gerrit.wikimedia.org/r/853995 (https://phabricator.wikimedia.org/T307943) [12:54:58] (03PS1) 10JMeybohm: k8s: Use the K8s::Core::V1Taint type [puppet] - 10https://gerrit.wikimedia.org/r/853996 (https://phabricator.wikimedia.org/T307943) [12:55:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T321130)', diff saved to https://phabricator.wikimedia.org/P38276 and previous config saved to /var/cache/conftool/dbconfig/20221107-125529-marostegui.json [12:55:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:55:33] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [12:55:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:59:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1165.eqiad.wmnet with reason: Maintenance [12:59:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1165.eqiad.wmnet with reason: Maintenance [12:59:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:59:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:59:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T321130)', diff saved to https://phabricator.wikimedia.org/P38277 and 
previous config saved to /var/cache/conftool/dbconfig/20221107-125946-marostegui.json [13:00:00] (03CR) 10Filippo Giunchedi: [C: 03+2] growthexperiments.pp: Run updateMetrics.php daily [puppet] - 10https://gerrit.wikimedia.org/r/853265 (https://phabricator.wikimedia.org/T318684) (owner: 10Urbanecm) [13:00:22] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 18): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37980/console" [puppet] - 10https://gerrit.wikimedia.org/r/853995 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [13:01:28] (03PS1) 10Muehlenhoff: Remove obsolete vp9 hook [puppet] - 10https://gerrit.wikimedia.org/r/853997 [13:01:48] (03PS2) 10Muehlenhoff: Remove obsolete vp9 hook [puppet] - 10https://gerrit.wikimedia.org/r/853997 [13:01:51] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:01:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T321130)', diff saved to https://phabricator.wikimedia.org/P38278 and previous config saved to /var/cache/conftool/dbconfig/20221107-130155-marostegui.json [13:01:59] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [13:02:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P38279 and previous config saved to /var/cache/conftool/dbconfig/20221107-130212-ladsgroup.json [13:03:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P38280 and previous config saved to /var/cache/conftool/dbconfig/20221107-130340-marostegui.json [13:05:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P38281 and previous config saved to /var/cache/conftool/dbconfig/20221107-130541-ladsgroup.json [13:07:51] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete vp9 hook [puppet] - 10https://gerrit.wikimedia.org/r/853997 (owner: 10Muehlenhoff) [13:09:05] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 13 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37983/console" [puppet] - 10https://gerrit.wikimedia.org/r/853996 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [13:09:50] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops and analytics for stevemunene - https://phabricator.wikimedia.org/T322339 (10fgiunchedi) Not AFAIK, we'll just need stamp of approval from @odimitrijevic and we're good to go I think [13:10:48] (03PS1) 10Muehlenhoff: Remove hook for icu63 [puppet] - 10https://gerrit.wikimedia.org/r/853999 [13:11:34] (03CR) 10CI reject: [V: 04-1] Remove hook for icu63 [puppet] - 10https://gerrit.wikimedia.org/r/853999 (owner: 10Muehlenhoff) [13:11:58] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: power off only if not off [cookbooks] - 10https://gerrit.wikimedia.org/r/853959 (owner: 10Volans) [13:12:10] (03PS2) 10Muehlenhoff: Remove hook for icu63 [puppet] - 10https://gerrit.wikimedia.org/r/853999 [13:15:52] (03CR) 10Muehlenhoff: [C: 03+2] Remove hook for icu63 [puppet] - 10https://gerrit.wikimedia.org/r/853999 (owner: 10Muehlenhoff) [13:16:12] (03Merged) 10jenkins-bot: sre.hosts.decommission: power off only if not off [cookbooks] - 10https://gerrit.wikimedia.org/r/853959 (owner: 10Volans) [13:17:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P38282 and previous config saved to /var/cache/conftool/dbconfig/20221107-131701-marostegui.json [13:17:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1166', diff saved to https://phabricator.wikimedia.org/P38283 and previous config saved to /var/cache/conftool/dbconfig/20221107-131718-ladsgroup.json [13:18:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P38284 and previous config saved to /var/cache/conftool/dbconfig/20221107-131846-marostegui.json [13:18:51] (03PS1) 10Muehlenhoff: Remove obsolete spicerack hook [puppet] - 10https://gerrit.wikimedia.org/r/854001 [13:20:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T318955)', diff saved to https://phabricator.wikimedia.org/P38285 and previous config saved to /var/cache/conftool/dbconfig/20221107-132048-ladsgroup.json [13:20:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2177.codfw.wmnet with reason: Maintenance [13:20:52] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [13:21:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2177.codfw.wmnet with reason: Maintenance [13:21:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T318955)', diff saved to https://phabricator.wikimedia.org/P38286 and previous config saved to /var/cache/conftool/dbconfig/20221107-132109-ladsgroup.json [13:26:20] (03CR) 10Volans: [C: 03+1] "LGTM, I don't think this affects WMCS in any way, but if unsure check with them." 
[puppet] - 10https://gerrit.wikimedia.org/r/854001 (owner: 10Muehlenhoff) [13:26:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T318605)', diff saved to https://phabricator.wikimedia.org/P38287 and previous config saved to /var/cache/conftool/dbconfig/20221107-132643-ladsgroup.json [13:26:48] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:28:08] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (10Ottomata) Approve! [13:28:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T318955)', diff saved to https://phabricator.wikimedia.org/P38288 and previous config saved to /var/cache/conftool/dbconfig/20221107-132824-ladsgroup.json [13:28:28] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [13:29:31] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1010.eqiad.wmnet with reason: Remove from cluster for eventual reimage [13:29:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1010.eqiad.wmnet with reason: Remove from cluster for eventual reimage [13:30:34] (03PS2) 10David Caro: global: replace labsproject by wmcs_project [puppet] - 10https://gerrit.wikimedia.org/r/849473 [13:32:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P38289 and previous config saved to /var/cache/conftool/dbconfig/20221107-133208-marostegui.json [13:32:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318955)', diff saved to https://phabricator.wikimedia.org/P38290 and previous config saved to 
/var/cache/conftool/dbconfig/20221107-133225-ladsgroup.json [13:32:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:32:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:32:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T318955)', diff saved to https://phabricator.wikimedia.org/P38291 and previous config saved to /var/cache/conftool/dbconfig/20221107-133246-ladsgroup.json [13:33:40] (03PS1) 10David Caro: Revert "toolforge k8s: add a PodSecurityPolicy to be used by buildpacks" [puppet] - 10https://gerrit.wikimedia.org/r/853539 [13:33:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T321123)', diff saved to https://phabricator.wikimedia.org/P38292 and previous config saved to /var/cache/conftool/dbconfig/20221107-133353-marostegui.json [13:33:55] (03CR) 10David Caro: [C: 03+2] global: replace labsproject by wmcs_project [puppet] - 10https://gerrit.wikimedia.org/r/849473 (owner: 10David Caro) [13:33:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance [13:33:56] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [13:34:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance [13:34:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T321123)', diff saved to https://phabricator.wikimedia.org/P38293 and previous config saved to /var/cache/conftool/dbconfig/20221107-133414-marostegui.json [13:34:17] (03CR) 10CI reject: [V: 04-1] Revert "toolforge k8s: add a PodSecurityPolicy to be used by buildpacks" [puppet] - 10https://gerrit.wikimedia.org/r/853539 (owner: 10David Caro) [13:35:01] (03CR) 
10Volans: [C: 03+2] constants: use CORE_DATACENTERS from wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/852902 (owner: 10Volans) [13:35:16] (03CR) 10Volans: [C: 03+2] ipmi: clarify that the target can also be an IP [software/spicerack] - 10https://gerrit.wikimedia.org/r/852903 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans) [13:35:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T321123)', diff saved to https://phabricator.wikimedia.org/P38294 and previous config saved to /var/cache/conftool/dbconfig/20221107-133522-marostegui.json [13:35:47] (03PS1) 10Jbond: nodegen: only analyse manifets files for auto selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854002 [13:36:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T318955)', diff saved to https://phabricator.wikimedia.org/P38295 and previous config saved to /var/cache/conftool/dbconfig/20221107-133638-ladsgroup.json [13:36:42] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [13:37:49] (03CR) 10CI reject: [V: 04-1] nodegen: only analyse manifets files for auto selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854002 (owner: 10Jbond) [13:39:50] (03CR) 10David Caro: [C: 03+1] "LGTM, not sure why it's failing though" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852284 (https://phabricator.wikimedia.org/T224977) (owner: 10Jbond) [13:41:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P38296 and previous config saved to /var/cache/conftool/dbconfig/20221107-134150-ladsgroup.json [13:43:00] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/849473 (owner: 10David Caro) [13:43:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 
(T318605)', diff saved to https://phabricator.wikimedia.org/P38297 and previous config saved to /var/cache/conftool/dbconfig/20221107-134313-ladsgroup.json [13:43:18] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:43:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P38298 and previous config saved to /var/cache/conftool/dbconfig/20221107-134331-ladsgroup.json [13:45:26] (03Merged) 10jenkins-bot: constants: use CORE_DATACENTERS from wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/852902 (owner: 10Volans) [13:45:28] (03Merged) 10jenkins-bot: ipmi: clarify that the target can also be an IP [software/spicerack] - 10https://gerrit.wikimedia.org/r/852903 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans) [13:46:10] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete spicerack hook [puppet] - 10https://gerrit.wikimedia.org/r/854001 (owner: 10Muehlenhoff) [13:47:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T321130)', diff saved to https://phabricator.wikimedia.org/P38299 and previous config saved to /var/cache/conftool/dbconfig/20221107-134714-marostegui.json [13:47:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:47:19] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [13:47:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1010.eqiad.wmnet with OS bullseye [13:47:26] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Metrics: wmflib.prometheus: add support for thanos backend - https://phabricator.wikimedia.org/T295498 (10Volans) 05Open→03Resolved This has been solved long time ago in [[ https://doc.wikimedia.org/wmflib/master/release.html#v1-2-0-2022-04-04 |... 
[13:47:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:47:30] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1010.eqiad.wmnet with OS bullseye [13:47:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T321130)', diff saved to https://phabricator.wikimedia.org/P38300 and previous config saved to /var/cache/conftool/dbconfig/20221107-134735-marostegui.json [13:47:46] (03CR) 10Btullis: [C: 03+2] Add a namespace for the stream-enrichment-poc on dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/851063 (https://phabricator.wikimedia.org/T321682) (owner: 10Btullis) [13:48:34] (03PS2) 10Jbond: nodegen: only analyse manifets files for auto selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854002 [13:49:26] (03PS5) 10Clare Ming: testwiki: Add config for Visual Editor Feature Use instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852254 (https://phabricator.wikimedia.org/T309602) [13:49:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T321130)', diff saved to https://phabricator.wikimedia.org/P38301 and previous config saved to /var/cache/conftool/dbconfig/20221107-134944-marostegui.json [13:49:46] (03CR) 10CI reject: [V: 04-1] nodegen: only analyse manifets files for auto selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854002 (owner: 10Jbond) [13:50:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P38302 and previous config saved to /var/cache/conftool/dbconfig/20221107-135028-marostegui.json [13:51:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1175', diff saved to https://phabricator.wikimedia.org/P38303 and previous config saved to /var/cache/conftool/dbconfig/20221107-135144-ladsgroup.json [13:51:51] (03Merged) 10jenkins-bot: Add a namespace for the stream-enrichment-poc on dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/851063 (https://phabricator.wikimedia.org/T321682) (owner: 10Btullis) [13:53:45] 10SRE-tools, 10Infrastructure-Foundations: Netbox check: the Uncommitted DNS changes in Netbox should recover more quickly - https://phabricator.wikimedia.org/T293206 (10Volans) 05Open→03Resolved As the timer runs every 5 minutes I think this is an acceptable recovery time. Resolving for now, feel free to... [13:54:27] (03PS3) 10Jbond: nodegen: only analyse manifets files for auto selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854002 [13:55:23] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Decommisioning a VM failed with a key error when generating DNS - https://phabricator.wikimedia.org/T278523 (10Volans) 05Open→03Resolved Boldly resolving as we've not seen this recently and Netbox has been upgraded multiple times since then. [13:55:52] (03CR) 10CI reject: [V: 04-1] nodegen: only analyse manifets files for auto selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854002 (owner: 10Jbond) [13:56:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P38304 and previous config saved to /var/cache/conftool/dbconfig/20221107-135656-ladsgroup.json [13:57:09] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: debmonitor-client: urllib3 deprecation warning on Bullseye - https://phabricator.wikimedia.org/T284647 (10Volans) 05Open→03Resolved This was resolved with the above patches. 
[13:57:34] jouncebot: next [13:57:34] In 0 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221107T1400) [13:57:54] (03CR) 10Urbanecm: [C: 03+2] MentorHooks: Add missing check for GEMentorshipUseIsActiveFlag [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853537 (https://phabricator.wikimedia.org/T322538) (owner: 10Urbanecm) [13:58:02] (03CR) 10Urbanecm: [C: 03+2] Rename QuitMentorship to ReassignMentees [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853955 (https://phabricator.wikimedia.org/T321382) (owner: 10Urbanecm) [13:58:08] (03CR) 10Urbanecm: [C: 03+2] ReassignMentees: Pass the actual performer to ChangeMentor [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853956 (https://phabricator.wikimedia.org/T321382) (owner: 10Urbanecm) [13:58:14] (03CR) 10Urbanecm: [C: 03+2] ManageMentorsRemoveMentor: Reassign mentees to a different mentor [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853957 (https://phabricator.wikimedia.org/T321382) (owner: 10Urbanecm) [13:58:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P38305 and previous config saved to /var/cache/conftool/dbconfig/20221107-135819-ladsgroup.json [13:58:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P38306 and previous config saved to /var/cache/conftool/dbconfig/20221107-135837-ladsgroup.json [13:58:58] (03PS3) 10Vgutierrez: Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: How many deployers does it take to do UTC afternoon backport window deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221107T1400). [14:00:05] cjming, aanzx, and Urbanecm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:18] (03CR) 10CI reject: [V: 04-1] Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:00:18] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:00:24] o/ [14:00:29] i can deploy today [14:00:30] o/ [14:00:52] urbanecm: thank you [14:00:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852254 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming) [14:01:00] no problem cjming [14:01:35] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1010.eqiad.wmnet with reason: host reimage [14:01:38] (03Merged) 10jenkins-bot: testwiki: Add config for Visual Editor Feature Use instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852254 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming) [14:01:49] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:852254|testwiki: Add config for Visual Editor Feature Use instrument (T309602)]] [14:01:52] T309602: VisualEditorFeatureUse Migration to MP - https://phabricator.wikimedia.org/T309602 [14:02:09] !log urbanecm@deploy1002 urbanecm and cjming: Backport for [[gerrit:852254|testwiki: Add config for Visual Editor Feature Use instrument (T309602)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, 
mwdebug1002.eqiad.wmnet [14:02:19] cjming: can you test at mwdebug1001 (or a diff debug server)? [14:02:25] checking [14:02:40] perfect - lgtm [14:03:22] (03PS1) 10Phuedx: wgWMESchemaEditAttemptStepSamplingRate to 1 on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854005 (https://phabricator.wikimedia.org/T312016) [14:03:24] (03PS1) 10Phuedx: wgWMESchemaEditAttemptStepSamplingRate to 1 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854006 (https://phabricator.wikimedia.org/T312016) [14:03:54] (03PS1) 10Btullis: Add kubectl files for stream-enrichment-poc on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/854007 (https://phabricator.wikimedia.org/T321682) [14:04:08] urbanecm: greenlight to sync [14:04:13] syncing! [14:04:39] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853437 (https://phabricator.wikimedia.org/T322472) (owner: 10Anzx) [14:04:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P38307 and previous config saved to /var/cache/conftool/dbconfig/20221107-140451-marostegui.json [14:04:56] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853436 (https://phabricator.wikimedia.org/T322471) (owner: 10Anzx) [14:05:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1010.eqiad.wmnet with reason: host reimage [14:05:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:05:26] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:05:31] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:05:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P38308 and previous config saved to /var/cache/conftool/dbconfig/20221107-140535-marostegui.json [14:06:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:06:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:06:13] (03PS2) 10Btullis: Add kubectl files for stream-enrichment-poc on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/854007 (https://phabricator.wikimedia.org/T321682) [14:06:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P38309 and previous config saved to /var/cache/conftool/dbconfig/20221107-140651-ladsgroup.json [14:06:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:07:04] (03PS4) 10Urbanecm: Enable flood flag on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853437 (https://phabricator.wikimedia.org/T322472) (owner: 10Anzx) [14:07:08] (03CR) 10Urbanecm: [C: 03+2] Enable flood flag on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853437 (https://phabricator.wikimedia.org/T322472) (owner: 10Anzx) [14:07:57] (03Merged) 10jenkins-bot: Enable flood flag on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853437 (https://phabricator.wikimedia.org/T322472) (owner: 10Anzx) [14:08:33] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:852254|testwiki: Add config for Visual Editor Feature Use instrument (T309602)]] (duration: 06m 43s) [14:08:36] cjming: should be live! 
[14:08:36] T309602: VisualEditorFeatureUse Migration to MP - https://phabricator.wikimedia.org/T309602 [14:08:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853437 (https://phabricator.wikimedia.org/T322472) (owner: 10Anzx) [14:08:48] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:853437|Enable flood flag on knwiki (T322472)]] [14:08:51] T322472: Enable flood flag on knwiki - https://phabricator.wikimedia.org/T322472 [14:08:57] urbanecm: ty! [14:09:00] np [14:09:07] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@3bb99c2]: Deploying to Airflow platform_eng instance [14:09:07] !log urbanecm@deploy1002 urbanecm and anzx: Backport for [[gerrit:853437|Enable flood flag on knwiki (T322472)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:09:10] aanzx: your patch is now available at mwdebug1001. Can you test it there, please? 
[14:09:28] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@3bb99c2]: Deploying to Airflow platform_eng instance (duration: 00m 20s) [14:09:45] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37984/console" [puppet] - 10https://gerrit.wikimedia.org/r/854007 (https://phabricator.wikimedia.org/T321682) (owner: 10Btullis) [14:09:57] urbanecm: working [14:10:00] great, syncing [14:10:37] (03PS4) 10Urbanecm: Set timezone for knwiki , knwiktionary , knwikiquote and knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853436 (https://phabricator.wikimedia.org/T322471) (owner: 10Anzx) [14:10:43] (03CR) 10Urbanecm: [C: 03+2] Set timezone for knwiki , knwiktionary , knwikiquote and knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853436 (https://phabricator.wikimedia.org/T322471) (owner: 10Anzx) [14:11:32] (03Merged) 10jenkins-bot: Set timezone for knwiki , knwiktionary , knwikiquote and knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853436 (https://phabricator.wikimedia.org/T322471) (owner: 10Anzx) [14:11:34] (03CR) 10Elukey: [C: 03+1] Release 1.1.0-2 (031 comment) [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/853987 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:12:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:12:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T318605)', diff saved to https://phabricator.wikimedia.org/P38310 and previous config saved to /var/cache/conftool/dbconfig/20221107-141203-ladsgroup.json [14:12:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [14:12:08] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:12:18] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [14:12:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T318605)', diff saved to https://phabricator.wikimedia.org/P38311 and previous config saved to /var/cache/conftool/dbconfig/20221107-141224-ladsgroup.json [14:13:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:13:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:13:05] 10SRE-swift-storage: Update Debian rclone package to 1.60.0 - https://phabricator.wikimedia.org/T322547 (10MatthewVernon) [14:13:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P38312 and previous config saved to /var/cache/conftool/dbconfig/20221107-141326-ladsgroup.json [14:13:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T318955)', diff saved to https://phabricator.wikimedia.org/P38313 and previous config saved to /var/cache/conftool/dbconfig/20221107-141344-ladsgroup.json [14:13:47] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:13:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:13:58] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:853437|Enable flood flag on knwiki (T322472)]] (duration: 05m 10s) [14:14:01] T322472: Enable flood flag on knwiki - https://phabricator.wikimedia.org/T322472 [14:14:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853436 (https://phabricator.wikimedia.org/T322471) (owner: 10Anzx) [14:14:21] aanzx: first patch's live, second one in progress [14:14:22] !log urbanecm@deploy1002 Started scap: 
Backport for [[gerrit:853436|Set timezone for knwiki , knwiktionary , knwikiquote and knwikisource (T322471)]] [14:14:25] T322471: Set timezone for knwiki, knwiktionary, knwikiquote and knwikisource - https://phabricator.wikimedia.org/T322471 [14:14:41] !log urbanecm@deploy1002 urbanecm and anzx: Backport for [[gerrit:853436|Set timezone for knwiki , knwiktionary , knwikiquote and knwikisource (T322471)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:14:50] aanzx: can you check the second one at mwdebug1001, please? [14:15:39] (03Merged) 10jenkins-bot: MentorHooks: Add missing check for GEMentorshipUseIsActiveFlag [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853537 (https://phabricator.wikimedia.org/T322538) (owner: 10Urbanecm) [14:15:59] (03Merged) 10jenkins-bot: Rename QuitMentorship to ReassignMentees [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853955 (https://phabricator.wikimedia.org/T321382) (owner: 10Urbanecm) [14:16:09] urbanecm: second is also working [14:16:13] great, syncing [14:16:29] 10SRE-swift-storage: pristine-tar handles complex filenames badly - https://phabricator.wikimedia.org/T322549 (10MatthewVernon) [14:16:38] (03Merged) 10jenkins-bot: ReassignMentees: Pass the actual performer to ChangeMentor [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853956 (https://phabricator.wikimedia.org/T321382) (owner: 10Urbanecm) [14:16:50] 10SRE-swift-storage: pristine-tar handles complex filenames badly - https://phabricator.wikimedia.org/T322549 (10MatthewVernon) p:05Triage→03High [14:17:04] 10SRE-swift-storage: Update Debian rclone package to 1.60.0 - https://phabricator.wikimedia.org/T322547 (10MatthewVernon) p:05Triage→03High [14:17:10] (03Merged) 10jenkins-bot: ManageMentorsRemoveMentor: Reassign mentees to a different mentor [extensions/GrowthExperiments] 
(wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853957 (https://phabricator.wikimedia.org/T321382) (owner: 10Urbanecm) [14:17:20] and CI finished just in time :) [14:18:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:19:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:19:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:19:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P38314 and previous config saved to /var/cache/conftool/dbconfig/20221107-141958-marostegui.json [14:20:24] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:853436|Set timezone for knwiki , knwiktionary , knwikiquote and knwikisource (T322471)]] (duration: 06m 01s) [14:20:27] T322471: Set timezone for knwiki, knwiktionary, knwikiquote and knwikisource - https://phabricator.wikimedia.org/T322471 [14:20:31] aanzx: and second patch live [14:20:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T321123)', diff saved to https://phabricator.wikimedia.org/P38315 and previous config saved to /var/cache/conftool/dbconfig/20221107-142041-marostegui.json [14:20:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [14:20:45] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [14:20:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:20:54] urbanecm: thanks [14:20:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [14:20:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance 
[14:21:08] !log urbanecm@deploy1002 Backport cancelled. [14:21:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance [14:21:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T321123)', diff saved to https://phabricator.wikimedia.org/P38316 and previous config saved to /var/cache/conftool/dbconfig/20221107-142118-marostegui.json [14:21:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853537 (https://phabricator.wikimedia.org/T322538) (owner: 10Urbanecm) [14:21:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853955 (https://phabricator.wikimedia.org/T321382) (owner: 10Urbanecm) [14:21:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853956 (https://phabricator.wikimedia.org/T321382) (owner: 10Urbanecm) [14:21:54] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:853537|MentorHooks: Add missing check for GEMentorshipUseIsActiveFlag (T322538)]], [[gerrit:853955|Rename QuitMentorship to ReassignMentees (T321382)]], [[gerrit:853956|ReassignMentees: Pass the actual performer to ChangeMentor (T321382)]], [[gerrit:853957|ManageMentorsRemoveMentor: Reassign mentees to a different mentor (T321382)]] [14:21:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1010.eqiad.wmnet with OS bullseye [14:21:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853957 (https://phabricator.wikimedia.org/T321382) (owner: 
10Urbanecm) [14:21:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T318955)', diff saved to https://phabricator.wikimedia.org/P38317 and previous config saved to /var/cache/conftool/dbconfig/20221107-142157-ladsgroup.json [14:21:59] T322538: LogicException: MentorStore::isMenteeActive was called, but GEMentorshipUseIsActiveFlag is false - https://phabricator.wikimedia.org/T322538 [14:21:59] T321382: When a mentor is removed via Speical:ManageMentors, no mentees are reassigned - https://phabricator.wikimedia.org/T321382 [14:21:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1179.eqiad.wmnet with reason: Maintenance [14:22:04] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:22:04] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1010.eqiad.wmnet with OS bullseye completed: - ganeti1010 (**PASS**) - Downtimed on... 
[14:22:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1179.eqiad.wmnet with reason: Maintenance [14:22:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T318955)', diff saved to https://phabricator.wikimedia.org/P38318 and previous config saved to /var/cache/conftool/dbconfig/20221107-142219-ladsgroup.json [14:22:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T321123)', diff saved to https://phabricator.wikimedia.org/P38319 and previous config saved to /var/cache/conftool/dbconfig/20221107-142226-marostegui.json [14:22:51] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:24:23] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:25:26] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add kubectl files for stream-enrichment-poc on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/854007 (https://phabricator.wikimedia.org/T321682) (owner: 10Btullis) [14:26:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T318955)', diff saved to https://phabricator.wikimedia.org/P38320 and previous config saved to /var/cache/conftool/dbconfig/20221107-142610-ladsgroup.json [14:26:40] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:853537|MentorHooks: Add missing check for GEMentorshipUseIsActiveFlag (T322538)]], [[gerrit:853955|Rename QuitMentorship to ReassignMentees (T321382)]], [[gerrit:853956|ReassignMentees: Pass the actual performer to ChangeMentor (T321382)]], [[gerrit:853957|ManageMentorsRemoveMentor: Reassign mentees to a different mentor (T321382)]] synced to the testse [14:26:40] rvers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, 
mwdebug2001.codfw.wmnet [14:27:59] (03CR) 10Elukey: "Everything looks good but I left a comment since I didn't follow a lookup in one of the modified erb." [puppet] - 10https://gerrit.wikimedia.org/r/853995 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [14:28:05] (03PS4) 10Vgutierrez: Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:28:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T318605)', diff saved to https://phabricator.wikimedia.org/P38321 and previous config saved to /var/cache/conftool/dbconfig/20221107-142832-ladsgroup.json [14:28:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [14:28:38] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:28:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [14:29:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T318605)', diff saved to https://phabricator.wikimedia.org/P38322 and previous config saved to /var/cache/conftool/dbconfig/20221107-142904-ladsgroup.json [14:29:34] (03CR) 10CI reject: [V: 04-1] Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:31:26] (03CR) 10Elukey: [C: 03+1] k8s: Use the K8s::Core::V1Taint type (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/853996 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [14:34:20] (03CR) 10Ssingh: [C: 03+2] prometheus: Alter node_ats_config class to include [puppet] - 10https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [14:35:03] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) 
firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:35:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T321130)', diff saved to https://phabricator.wikimedia.org/P38323 and previous config saved to /var/cache/conftool/dbconfig/20221107-143504-marostegui.json [14:35:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1180.eqiad.wmnet with reason: Maintenance [14:35:09] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [14:35:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1180.eqiad.wmnet with reason: Maintenance [14:35:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T321130)', diff saved to https://phabricator.wikimedia.org/P38324 and previous config saved to /var/cache/conftool/dbconfig/20221107-143526-marostegui.json [14:35:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:36:51] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:853537|MentorHooks: Add missing check for GEMentorshipUseIsActiveFlag (T322538)]], [[gerrit:853955|Rename QuitMentorship to ReassignMentees (T321382)]], [[gerrit:853956|ReassignMentees: Pass the actual performer to ChangeMentor (T321382)]], [[gerrit:853957|ManageMentorsRemoveMentor: Reassign mentees to a different mentor (T321382)]] (duration: 14m 56s) [14:36:55] T322538: LogicException: MentorStore::isMenteeActive was called, but GEMentorshipUseIsActiveFlag is false - https://phabricator.wikimedia.org/T322538 [14:36:55] T321382: When a mentor is removed via Speical:ManageMentors, no mentees are reassigned - https://phabricator.wikimedia.org/T321382 [14:36:57] and, all done [14:37:03] !log UTC afternoon B&C window done [14:37:04] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P38326 and previous config saved to /var/cache/conftool/dbconfig/20221107-143732-marostegui.json [14:37:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T321130)', diff saved to https://phabricator.wikimedia.org/P38327 and previous config saved to /var/cache/conftool/dbconfig/20221107-143741-marostegui.json [14:38:26] (03PS5) 10Ssingh: Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) [14:39:50] (03CR) 10CI reject: [V: 04-1] Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:40:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:40:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:41:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P38328 and previous config saved to /var/cache/conftool/dbconfig/20221107-144117-ladsgroup.json [14:41:30] (03PS2) 10JMeybohm: k8s: Add version switching where needed [puppet] - 10https://gerrit.wikimedia.org/r/853995 (https://phabricator.wikimedia.org/T307943) [14:41:32] (03PS2) 10JMeybohm: k8s: Use the K8s::Core::V1Taint type [puppet] - 10https://gerrit.wikimedia.org/r/853996 (https://phabricator.wikimedia.org/T307943) [14:41:35] (03PS1) 10JMeybohm: Add --service-account* flags for TokenRequest [puppet] - 10https://gerrit.wikimedia.org/r/854011 (https://phabricator.wikimedia.org/T307943) [14:42:09] (03CR) 10JMeybohm: k8s: Use the K8s::Core::V1Taint type (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/853996 
(https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [14:43:21] (03CR) 10JMeybohm: k8s: Add version switching where needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/853995 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [14:43:23] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Aggregation rules for ATS TTFB per crc/backend [puppet] - 10https://gerrit.wikimedia.org/r/853923 (https://phabricator.wikimedia.org/T321484) (owner: 10Vgutierrez) [14:43:26] (03CR) 10Filippo Giunchedi: [C: 03+2] mr: allow prometheus_group SSH access to mgmt [homer/public] - 10https://gerrit.wikimedia.org/r/853938 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [14:43:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:45:52] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:46:21] (03CR) 10JMeybohm: [C: 03+1] "I think this is good to go considering the known limitations" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [14:48:50] (03CR) 10JMeybohm: Add --service-account* flags for TokenRequest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854011 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [14:48:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1010.eqiad.wmnet [14:49:27] (03CR) 10JMeybohm: "PCC for the complete chain: https://puppet-compiler.wmflabs.org/pcc-worker1002/37985/" [puppet] - 10https://gerrit.wikimedia.org/r/854011 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [14:49:57] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 13 NOOP 3): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37985/console" [puppet] - 10https://gerrit.wikimedia.org/r/853996 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [14:51:45] 10SRE, 10SRE-Access-Requests: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (10mfossati) Thank you @SLyngshede-WMF ! [14:52:12] 10SRE, 10Infrastructure-Foundations: reprepro: automate incoming processing - https://phabricator.wikimedia.org/T215812 (10LSobanski) [14:52:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P38329 and previous config saved to /var/cache/conftool/dbconfig/20221107-145239-marostegui.json [14:52:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P38330 and previous config saved to /var/cache/conftool/dbconfig/20221107-145248-marostegui.json [14:55:03] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:55:32] (03CR) 10Btullis: [V: 03+2 C: 03+2] Update the spark and spark-operator images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [14:56:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P38331 and previous config saved to /var/cache/conftool/dbconfig/20221107-145623-ladsgroup.json [14:57:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1010.eqiad.wmnet [14:58:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/varnish/varnishkafka] (debian) - 
10https://gerrit.wikimedia.org/r/853987 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:00:21] (03CR) 10Filippo Giunchedi: prometheus: probe SSH on mgmt network (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [15:01:25] (03CR) 10Ssingh: [C: 03+2] Release 1.1.0-2 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/853987 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:01:37] 10SRE-swift-storage: Update Debian rclone package to 1.60.0 - https://phabricator.wikimedia.org/T322547 (10MatthewVernon) [I've made a bit of a start on this, based on a locally-patched pristine-tar] [15:01:50] (03CR) 10Muehlenhoff: "Looks good, two things inline." [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/853962 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:04:14] (03PS2) 10Ssingh: Release 1.9-2 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/853962 (https://phabricator.wikimedia.org/T321309) [15:04:16] (03PS6) 10Vgutierrez: Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:04:18] (03CR) 10Muehlenhoff: "Looks good, one comment inline." 
[software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/853974 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:05:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/853962 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:05:35] (03CR) 10CI reject: [V: 04-1] Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:05:43] (03CR) 10Filippo Giunchedi: [C: 03+2] "This is ready to go, Prometheus hosts now have ssh access to mgmt network" [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [15:06:38] (03CR) 10Vgutierrez: [C: 03+2] prometheus: Aggregation rules for ATS TTFB per crc/backend [puppet] - 10https://gerrit.wikimedia.org/r/853923 (https://phabricator.wikimedia.org/T321484) (owner: 10Vgutierrez) [15:07:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T321123)', diff saved to https://phabricator.wikimedia.org/P38332 and previous config saved to /var/cache/conftool/dbconfig/20221107-150745-marostegui.json [15:07:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1190.eqiad.wmnet with reason: Maintenance [15:07:50] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [15:07:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P38333 and previous config saved to /var/cache/conftool/dbconfig/20221107-150754-marostegui.json [15:08:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1190.eqiad.wmnet with reason: Maintenance [15:08:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T321123)', diff 
saved to https://phabricator.wikimedia.org/P38334 and previous config saved to /var/cache/conftool/dbconfig/20221107-150807-marostegui.json [15:08:31] 10SRE, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): wikitech-static down - https://phabricator.wikimedia.org/T295266 (10LSobanski) 05Open→03Resolved a:03LSobanski The related action item has been resolved so I'll resolve this one as well. Please reopen if you think otherwise. [15:09:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T321123)', diff saved to https://phabricator.wikimedia.org/P38335 and previous config saved to /var/cache/conftool/dbconfig/20221107-150914-marostegui.json [15:09:36] !log reprepro -C main include bullseye-wikimedia varnishkafka_1.1.0-2_amd64.changes: T321309 [15:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:40] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [15:10:02] 10SRE, 10SRE-OnFire, 10wikitech.wikimedia.org, 10Sustainability (Incident Followup), 10User-LSobanski: Incident response tools operational readiness review - https://phabricator.wikimedia.org/T290130 (10LSobanski) [15:11:02] (03PS7) 10Vgutierrez: Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:11:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T318955)', diff saved to https://phabricator.wikimedia.org/P38336 and previous config saved to /var/cache/conftool/dbconfig/20221107-151130-ladsgroup.json [15:11:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1189.eqiad.wmnet with reason: Maintenance [15:11:34] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:11:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 
db1189.eqiad.wmnet with reason: Maintenance [15:11:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T318955)', diff saved to https://phabricator.wikimedia.org/P38337 and previous config saved to /var/cache/conftool/dbconfig/20221107-151151-ladsgroup.json [15:12:24] (03CR) 10CI reject: [V: 04-1] Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:13:11] (03CR) 10Ssingh: Release 1.9-2 (031 comment) [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/853962 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:13:19] (03CR) 10Ssingh: [C: 03+2] Release 1.9-2 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/853962 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:15:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T318955)', diff saved to https://phabricator.wikimedia.org/P38338 and previous config saved to /var/cache/conftool/dbconfig/20221107-151543-ladsgroup.json [15:16:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] P:kubernetes::deployment_server: absent services [puppet] - 10https://gerrit.wikimedia.org/r/852775 (https://phabricator.wikimedia.org/T322298) (owner: 10Clément Goubert) [15:18:54] !log reprepro -C main include bullseye-wikimedia libvmod-netmapper_1.9-2_amd64.changes: T321309 [15:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:58] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [15:20:50] (03CR) 10Elukey: [C: 03+1] Add --service-account* flags for TokenRequest [puppet] - 10https://gerrit.wikimedia.org/r/854011 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [15:21:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mwdebug: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/852777 
(https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [15:22:00] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:22:25] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mwdebug: Remove old mwdebug deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/850184 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [15:22:28] (03CR) 10Elukey: [C: 03+1] k8s: Add version switching where needed [puppet] - 10https://gerrit.wikimedia.org/r/853995 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [15:22:29] (03PS8) 10Vgutierrez: Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:22:42] 10SRE, 10WMF-Communications: Feasibility of hosting podcast setup on Wikimedia servers - https://phabricator.wikimedia.org/T148061 (10LSobanski) 05Open→03Resolved a:03LSobanski It's been more than a year and a half since the last response so I'll resolve this task. If any additional insight is needed, p... 
[15:23:00] (03CR) 10Giuseppe Lavagetto: [C: 03+1] admin: Remove stale mwdebug stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/852186 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [15:23:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T321130)', diff saved to https://phabricator.wikimedia.org/P38339 and previous config saved to /var/cache/conftool/dbconfig/20221107-152301-marostegui.json [15:23:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1187.eqiad.wmnet with reason: Maintenance [15:23:05] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [15:23:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1187.eqiad.wmnet with reason: Maintenance [15:23:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T321130)', diff saved to https://phabricator.wikimedia.org/P38340 and previous config saved to /var/cache/conftool/dbconfig/20221107-152322-marostegui.json [15:23:28] (03PS1) 10Hnowlan: thumbor: enable setting log level, set staging to debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/854026 (https://phabricator.wikimedia.org/T233196) [15:23:46] (03PS1) 10Ssingh: Release 0.3 [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/854028 (https://phabricator.wikimedia.org/T321309) [15:23:48] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48974 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:24:13] (03CR) 10CI reject: [V: 04-1] Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:24:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P38341 and previous config saved 
to /var/cache/conftool/dbconfig/20221107-152421-marostegui.json [15:24:26] (03PS2) 10Hnowlan: thumbor: enable setting log level, set staging to debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/854026 (https://phabricator.wikimedia.org/T233196) [15:25:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T321130)', diff saved to https://phabricator.wikimedia.org/P38342 and previous config saved to /var/cache/conftool/dbconfig/20221107-152531-marostegui.json [15:26:41] (03PS1) 10Hnowlan: Encode before using hashlib [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/854029 (https://phabricator.wikimedia.org/T233196) [15:27:23] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [15:27:29] (03CR) 10Clément Goubert: [C: 03+2] P:kubernetes::deployment_server: absent services [puppet] - 10https://gerrit.wikimedia.org/r/852775 (https://phabricator.wikimedia.org/T322298) (owner: 10Clément Goubert) [15:27:38] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [15:29:05] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan) [15:30:11] (03CR) 10Ori: [C: 03+1] Release 0.3 [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/854028 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:30:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P38343 and previous config saved to /var/cache/conftool/dbconfig/20221107-153049-ladsgroup.json [15:30:58] (03PS2) 10Ssingh: Release 1.5.3-2 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/853974 (https://phabricator.wikimedia.org/T321309) [15:31:08] (03CR) 10CI reject: [V: 04-1] Release 1.5.3-2 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/853974 
(https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:32:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T318605)', diff saved to https://phabricator.wikimedia.org/P38344 and previous config saved to /var/cache/conftool/dbconfig/20221107-153257-ladsgroup.json [15:33:01] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [15:33:23] (03CR) 10Ssingh: Release 1.5.3-2 (032 comments) [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/853974 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:33:51] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops and analytics for stevemunene - https://phabricator.wikimedia.org/T322339 (10odimitrijevic) Approved [15:35:55] (03PS7) 10Clément Goubert: mwdebug: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/852777 (https://phabricator.wikimedia.org/T321201) [15:36:51] (03CR) 10Ssingh: [C: 03+2] Release 0.3 [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/854028 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:39:12] (03PS9) 10Vgutierrez: Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:39:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P38345 and previous config saved to /var/cache/conftool/dbconfig/20221107-153927-marostegui.json [15:40:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P38346 and previous config saved to /var/cache/conftool/dbconfig/20221107-154037-marostegui.json [15:41:19] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/854028 
(https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:42:02] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/853974 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:44:16] !log reprepro -C main include bullseye-wikimedia libvmod-querysort_0.3_amd64.changes: T321309 [15:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:20] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [15:45:28] (03PS10) 10Vgutierrez: Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:45:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P38347 and previous config saved to /var/cache/conftool/dbconfig/20221107-154556-ladsgroup.json [15:46:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [15:46:50] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [15:47:59] (03CR) 10Elukey: [V: 03+2 C: 03+2] istio: upgrade to 1.15.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/852842 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [15:48:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P38348 and previous config saved to /var/cache/conftool/dbconfig/20221107-154803-ladsgroup.json [15:49:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [15:49:44] (03CR) 10Elukey: [V: 03+2 C: 03+2] Import istioctl 1.15.3 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/852921 (https://phabricator.wikimedia.org/T322193) 
(owner: 10Elukey) [15:50:22] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [15:50:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [15:51:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [15:51:16] (03CR) 10Clément Goubert: [C: 03+2] admin: Remove stale mwdebug stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/852186 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [15:52:17] (03CR) 10Clément Goubert: [C: 03+2] mwdebug: Remove old mwdebug deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/850184 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [15:54:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T321123)', diff saved to https://phabricator.wikimedia.org/P38349 and previous config saved to /var/cache/conftool/dbconfig/20221107-155434-marostegui.json [15:54:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1199.eqiad.wmnet with reason: Maintenance [15:54:38] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [15:54:49] (03PS11) 10Vgutierrez: Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:54:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1199.eqiad.wmnet with reason: Maintenance [15:54:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1199 (T321123)', diff saved to https://phabricator.wikimedia.org/P38350 and previous config saved to /var/cache/conftool/dbconfig/20221107-155455-marostegui.json [15:55:29] !log upgrade istioctl to 
1.15.3 on apt1001 for {buster,bullseye}-wikimedia - T322193 [15:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:33] T322193: Import istio 1.1x (k8s 1.23 dependency) - https://phabricator.wikimedia.org/T322193 [15:55:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P38351 and previous config saved to /var/cache/conftool/dbconfig/20221107-155544-marostegui.json [15:56:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T321123)', diff saved to https://phabricator.wikimedia.org/P38352 and previous config saved to /var/cache/conftool/dbconfig/20221107-155603-marostegui.json [15:56:06] (03Merged) 10jenkins-bot: mwdebug: Remove old mwdebug deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/850184 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [15:56:09] (03Merged) 10jenkins-bot: admin: Remove stale mwdebug stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/852186 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [15:56:41] (03PS1) 10Filippo Giunchedi: hiera_export: add tenant information to mgmt [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854037 (https://phabricator.wikimedia.org/T310266) [15:58:16] (03CR) 10Filippo Giunchedi: "When deploying the mgmt probes in https://gerrit.wikimedia.org/r/c/operations/puppet/+/845529 I realized fr-tech mgmt interfaces can't be " [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854037 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [15:59:05] !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:00:07] (03CR) 10Aqu: "I'm adding ppls of our team as cc. Maybe sessioncookie could go in one of our dataset." 
[puppet] - 10https://gerrit.wikimedia.org/r/839512 (https://phabricator.wikimedia.org/T319324) (owner: 10Vgutierrez) [16:00:20] !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:01:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T318955)', diff saved to https://phabricator.wikimedia.org/P38353 and previous config saved to /var/cache/conftool/dbconfig/20221107-160102-ladsgroup.json [16:01:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1198.eqiad.wmnet with reason: Maintenance [16:01:06] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:01:13] !log cleaning up stale mwdebug kubernetes config [16:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1198.eqiad.wmnet with reason: Maintenance [16:01:20] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:01:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T318955)', diff saved to https://phabricator.wikimedia.org/P38354 and previous config saved to /var/cache/conftool/dbconfig/20221107-160124-ladsgroup.json [16:01:26] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10ARM support: Adoption of aarch64 (aka arm64) in WMF production? (SRE Summit 2022 Session) - https://phabricator.wikimedia.org/T320811 (10jijiki) [16:02:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [16:02:25] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[16:02:49] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [16:02:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1010.eqiad.wmnet to cluster eqiad and group C [16:03:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P38355 and previous config saved to /var/cache/conftool/dbconfig/20221107-160310-ladsgroup.json [16:03:46] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@e51ff67]: import_cirrus_indexes: set executor cores to 1 [16:03:47] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1010.eqiad.wmnet to cluster eqiad and group C [16:04:42] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:05:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T318955)', diff saved to https://phabricator.wikimedia.org/P38356 and previous config saved to /var/cache/conftool/dbconfig/20221107-160516-ladsgroup.json [16:05:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T318605)', diff saved to https://phabricator.wikimedia.org/P38357 and previous config saved to /var/cache/conftool/dbconfig/20221107-160527-ladsgroup.json [16:05:31] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:05:43] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:05:54] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:06:05] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@e51ff67]: import_cirrus_indexes: set executor cores to 1 (duration: 02m 19s) [16:06:22] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[16:06:40] (03CR) 10Ssingh: "recheck" [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/853974 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:10:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T321130)', diff saved to https://phabricator.wikimedia.org/P38358 and previous config saved to /var/cache/conftool/dbconfig/20221107-161050-marostegui.json [16:10:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1201.eqiad.wmnet with reason: Maintenance [16:10:55] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [16:11:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1201.eqiad.wmnet with reason: Maintenance [16:11:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P38359 and previous config saved to /var/cache/conftool/dbconfig/20221107-161109-marostegui.json [16:11:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T321130)', diff saved to https://phabricator.wikimedia.org/P38360 and previous config saved to /var/cache/conftool/dbconfig/20221107-161118-marostegui.json [16:11:59] (03CR) 10Filippo Giunchedi: [C: 03+2] Add stevemunene to ops and analytics [puppet] - 10https://gerrit.wikimedia.org/r/853300 (https://phabricator.wikimedia.org/T322339) (owner: 10Stevemunene) [16:12:24] (03CR) 10Ssingh: [V: 03+2 C: 03+2] "The CI failure is now the varnish tests, which is expected. Removing -1 similar to other CI failures and merging." 
[software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/853974 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:12:32] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops and analytics for stevemunene - https://phabricator.wikimedia.org/T322339 (10fgiunchedi) [16:13:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T321130)', diff saved to https://phabricator.wikimedia.org/P38361 and previous config saved to /var/cache/conftool/dbconfig/20221107-161327-marostegui.json [16:13:39] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops and analytics for stevemunene - https://phabricator.wikimedia.org/T322339 (10fgiunchedi) 05Stalled→03Resolved Thank you all! Patch is merged and will be fully effective in ~30min. Resolving task as completed, please reopen if some... [16:14:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [16:16:24] (03PS1) 10Filippo Giunchedi: base: remove check_long_procs, unused [puppet] - 10https://gerrit.wikimedia.org/r/854039 (https://phabricator.wikimedia.org/T225140) [16:16:26] (03PS1) 10Filippo Giunchedi: alertmanager: use 'site' label to route tasks for dcops [puppet] - 10https://gerrit.wikimedia.org/r/854040 (https://phabricator.wikimedia.org/T225140) [16:16:28] (03CR) 10Joal: [C: 03+1] "LGTM in terms of functionality for analytics" [puppet] - 10https://gerrit.wikimedia.org/r/839512 (https://phabricator.wikimedia.org/T319324) (owner: 10Vgutierrez) [16:16:35] (03CR) 10Ssingh: [C: 03+1] "Thanks for all the work on this!" 
[software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:18:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T318605)', diff saved to https://phabricator.wikimedia.org/P38362 and previous config saved to /var/cache/conftool/dbconfig/20221107-161816-ladsgroup.json [16:18:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [16:18:20] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:18:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [16:18:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T318605)', diff saved to https://phabricator.wikimedia.org/P38363 and previous config saved to /var/cache/conftool/dbconfig/20221107-161837-ladsgroup.json [16:19:17] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [16:20:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P38364 and previous config saved to /var/cache/conftool/dbconfig/20221107-162023-ladsgroup.json [16:20:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P38365 and previous config saved to /var/cache/conftool/dbconfig/20221107-162033-ladsgroup.json [16:21:00] !log volans@cumin1001 START - Cookbook sre.dns.netbox [16:21:37] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [16:23:04] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM %request for dispatch-be - 
https://phabricator.wikimedia.org/T322556 (10fgiunchedi) [16:23:20] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM %request for dispatch-be2001 - https://phabricator.wikimedia.org/T322556 (10fgiunchedi) [16:23:35] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:23:37] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM %request for dispatch-be2001 - https://phabricator.wikimedia.org/T322556 (10fgiunchedi) [16:23:43] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Volans) As requested by @Eevans I've added the additional IPs to those AQS hosts that didn't get it at provisioning time. [[ https://netbox.wikimedia.org/extras/cha... [16:23:46] (03PS3) 10Elukey: Upgrade to 1.15.3 [debs/istio] - 10https://gerrit.wikimedia.org/r/853936 (https://phabricator.wikimedia.org/T322193) [16:23:57] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Volans) I've also run the `sre.dns.netbox` cookbook, the DNS records are now live. [16:24:27] (03PS1) 10Vgutierrez: deployment-prep: Remove ms-be05 [puppet] - 10https://gerrit.wikimedia.org/r/854041 (https://phabricator.wikimedia.org/T322231) [16:25:55] (03CR) 10Volans: [C: 04-1] "I think we should just filter them out" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854037 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [16:26:05] (03CR) 10Elukey: "Still working on it, I am trying to add vendor deps directly to the package but then I am not able to build anymore." 
[debs/istio] - 10https://gerrit.wikimedia.org/r/853936 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [16:26:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P38366 and previous config saved to /var/cache/conftool/dbconfig/20221107-162616-marostegui.json [16:26:52] !log filippo@cumin1001 START - Cookbook sre.ganeti.makevm for new host dispatch-be2001.codfw.wmnet [16:26:53] !log filippo@cumin1001 START - Cookbook sre.dns.netbox [16:28:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P38367 and previous config saved to /var/cache/conftool/dbconfig/20221107-162834-marostegui.json [16:30:05] jan_drewniak: That opportune time is upon us again. Time for a Wikimedia Portals Update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221107T1630). [16:30:48] (03PS2) 10Filippo Giunchedi: hiera_export: skip mgmt for non-production tenants [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854037 (https://phabricator.wikimedia.org/T310266) [16:31:23] (03CR) 10Filippo Giunchedi: hiera_export: skip mgmt for non-production tenants (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854037 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [16:31:45] 10SRE, 10Observability-Alerting, 10Traffic: Use DNS name instead of IP in PyBal alerts - https://phabricator.wikimedia.org/T322377 (10bking) [16:32:59] (03CR) 10Filippo Giunchedi: "File was absented in I77bb568752" [puppet] - 10https://gerrit.wikimedia.org/r/854039 (https://phabricator.wikimedia.org/T225140) (owner: 10Filippo Giunchedi) [16:33:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [16:34:57] (03CR) 10Vgutierrez: [C: 03+2] deployment-prep: Remove ms-be05 [puppet] - 
10https://gerrit.wikimedia.org/r/854041 (https://phabricator.wikimedia.org/T322231) (owner: 10Vgutierrez) [16:35:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P38368 and previous config saved to /var/cache/conftool/dbconfig/20221107-163529-ladsgroup.json [16:35:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P38369 and previous config saved to /var/cache/conftool/dbconfig/20221107-163540-ladsgroup.json [16:35:42] !log filippo@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:35:42] !log filippo@cumin1001 START - Cookbook sre.dns.wipe-cache dispatch-be2001.codfw.wmnet on all recursors [16:35:45] !log filippo@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dispatch-be2001.codfw.wmnet on all recursors [16:38:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. purged is actually quite close to even be built from debs in the archive, only github.com/matttproud/golang_protobuf_extensio" [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:38:55] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [16:39:20] (03CR) 10Vgutierrez: [C: 03+2] Release 0.19 [software/purged] - 10https://gerrit.wikimedia.org/r/853967 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:41:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T321123)', diff saved to https://phabricator.wikimedia.org/P38370 and previous config saved to /var/cache/conftool/dbconfig/20221107-164122-marostegui.json [16:41:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:41:27] T321123: Drop old index cuc_user_time on 
cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [16:41:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:41:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2099.codfw.wmnet with reason: Maintenance [16:41:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2099.codfw.wmnet with reason: Maintenance [16:41:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2106.codfw.wmnet with reason: Maintenance [16:42:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2106.codfw.wmnet with reason: Maintenance [16:42:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T321123)', diff saved to https://phabricator.wikimedia.org/P38371 and previous config saved to /var/cache/conftool/dbconfig/20221107-164217-marostegui.json [16:42:52] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM %request for dispatch-be2001 - https://phabricator.wikimedia.org/T322556 (10fgiunchedi) Filing the task for tracking purposes, I'm creating the VM ATM [16:43:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P38372 and previous config saved to /var/cache/conftool/dbconfig/20221107-164340-marostegui.json [16:44:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T321123)', diff saved to https://phabricator.wikimedia.org/P38373 and previous config saved to /var/cache/conftool/dbconfig/20221107-164427-marostegui.json [16:44:39] (03PS1) 10Vgutierrez: deployment-prep: Add ms-fe04 to profile::swift::proxyhosts [puppet] - 10https://gerrit.wikimedia.org/r/854044 (https://phabricator.wikimedia.org/T322554) [16:47:02] (03PS2) 10Vgutierrez: deployment-prep: Add 
ms-fe04 [puppet] - 10https://gerrit.wikimedia.org/r/854044 (https://phabricator.wikimedia.org/T322554) [16:48:26] 10SRE, 10Growth-Team, 10Notifications, 10serviceops, 10Wikimedia-production-error: Failed to fetch API response from {wiki}. Error code {code} - https://phabricator.wikimedia.org/T321409 (10Joe) 05Open→03Declined Hi @Sgs there is no public incident report because we usually don't do that for incident... [16:50:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T318955)', diff saved to https://phabricator.wikimedia.org/P38374 and previous config saved to /var/cache/conftool/dbconfig/20221107-165036-ladsgroup.json [16:50:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:50:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:50:40] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:50:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T318605)', diff saved to https://phabricator.wikimedia.org/P38375 and previous config saved to /var/cache/conftool/dbconfig/20221107-165046-ladsgroup.json [16:50:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [16:50:50] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:51:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [16:51:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T318605)', diff saved to https://phabricator.wikimedia.org/P38376 and previous config saved to /var/cache/conftool/dbconfig/20221107-165108-ladsgroup.json [16:52:09] (03CR) 
10Clément Goubert: [C: 03+2] mwdebug: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/852777 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [16:55:23] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks: convert a couple of config opts to StrOpt [puppet] - 10https://gerrit.wikimedia.org/r/850540 (https://phabricator.wikimedia.org/T288108) (owner: 10Andrew Bogott) [16:55:41] (03CR) 10Andrew Bogott: [C: 03+2] wmf spec tests: Update to test Bullseye/Xena [puppet] - 10https://gerrit.wikimedia.org/r/851126 (owner: 10Andrew Bogott) [16:56:24] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854037 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [16:57:20] (03CR) 10Filippo Giunchedi: [C: 03+2] hiera_export: skip mgmt for non-production tenants [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854037 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [16:58:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T321130)', diff saved to https://phabricator.wikimedia.org/P38377 and previous config saved to /var/cache/conftool/dbconfig/20221107-165847-marostegui.json [16:58:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [16:58:51] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [16:59:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [16:59:19] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [16:59:22] !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dispatch-be2001.codfw.wmnet [16:59:28] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance [16:59:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance [16:59:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P38378 and previous config saved to /var/cache/conftool/dbconfig/20221107-165933-marostegui.json [16:59:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T318955)', diff saved to https://phabricator.wikimedia.org/P38379 and previous config saved to /var/cache/conftool/dbconfig/20221107-165943-ladsgroup.json [16:59:47] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [17:02:15] (03CR) 10Ebernhardson: [C: 03+2] team-search-platform: alert when CirrusSearch jobs are backlogged [alerts] - 10https://gerrit.wikimedia.org/r/852899 (https://phabricator.wikimedia.org/T312175) (owner: 10DCausse) [17:02:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2117.codfw.wmnet with reason: Maintenance [17:02:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2117.codfw.wmnet with reason: Maintenance [17:02:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T321130)', diff saved to https://phabricator.wikimedia.org/P38380 and previous config saved to /var/cache/conftool/dbconfig/20221107-170247-marostegui.json [17:04:18] (03CR) 10Arturo Borrero Gonzalez: wmcs: add socks proxy support to wmcs cookbooks (036 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [17:05:34] (03Merged) 10jenkins-bot: team-search-platform: alert when CirrusSearch jobs are backlogged [alerts] - 10https://gerrit.wikimedia.org/r/852899 
(https://phabricator.wikimedia.org/T312175) (owner: 10DCausse) [17:06:21] (03PS1) 10Filippo Giunchedi: install_server: add dispatch-be2001 [puppet] - 10https://gerrit.wikimedia.org/r/854052 (https://phabricator.wikimedia.org/T322556) [17:06:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T318955)', diff saved to https://phabricator.wikimedia.org/P38381 and previous config saved to /var/cache/conftool/dbconfig/20221107-170658-ladsgroup.json [17:07:02] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [17:07:52] (03PS1) 10Filippo Giunchedi: makevm: print dhcpd config snippet [cookbooks] - 10https://gerrit.wikimedia.org/r/854053 [17:07:58] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: add dispatch-be2001 [puppet] - 10https://gerrit.wikimedia.org/r/854052 (https://phabricator.wikimedia.org/T322556) (owner: 10Filippo Giunchedi) [17:08:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] create_instance_with_prefix: fix prefix guess [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/848222 (owner: 10David Caro) [17:08:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T321130)', diff saved to https://phabricator.wikimedia.org/P38382 and previous config saved to /var/cache/conftool/dbconfig/20221107-170816-marostegui.json [17:08:20] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [17:08:21] (03PS2) 10Filippo Giunchedi: makevm: print dhcpd config snippet [cookbooks] - 10https://gerrit.wikimedia.org/r/854053 [17:09:42] PROBLEM - SSH on mw1309.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:11:03] (03CR) 10Arturo Borrero Gonzalez: wmcs: add cookbook to add/remove a user to/from a project (033 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 (owner: 
10David Caro) [17:13:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.toolforge.grid: get also the job logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841930 (owner: 10David Caro) [17:13:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [17:14:17] (03CR) 10Volans: [C: 03+1] "Feel free to merge it, just FYI there is an OKR for this quarter to complete the work in:" [cookbooks] - 10https://gerrit.wikimedia.org/r/854053 (owner: 10Filippo Giunchedi) [17:14:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P38383 and previous config saved to /var/cache/conftool/dbconfig/20221107-171439-marostegui.json [17:15:55] (03CR) 10CI reject: [V: 04-1] makevm: print dhcpd config snippet [cookbooks] - 10https://gerrit.wikimedia.org/r/854053 (owner: 10Filippo Giunchedi) [17:17:25] (03CR) 10Majavah: [C: 04-1] wmcs: add socks proxy support to wmcs cookbooks (033 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [17:21:11] (03PS3) 10Filippo Giunchedi: makevm: print dhcpd config snippet [cookbooks] - 10https://gerrit.wikimedia.org/r/854053 [17:22:01] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [17:22:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P38384 and previous config saved to /var/cache/conftool/dbconfig/20221107-172204-ladsgroup.json [17:22:45] !log reprepro -C main include bullseye-wikimedia purged_0.19_amd64.changes: T321309 [17:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:51] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [17:23:23] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P38385 and previous config saved to /var/cache/conftool/dbconfig/20221107-172322-marostegui.json [17:23:31] (03PS4) 10Elukey: Upgrade to 1.15.3 [debs/istio] - 10https://gerrit.wikimedia.org/r/853936 (https://phabricator.wikimedia.org/T322193) [17:24:49] !log krinkle@deploy1002 Started deploy [performance/arc-lamp@e1ac118]: https://gerrit.wikimedia.org/r/c/825870 - T322561, T315056 [17:24:53] T315056: arclamp_generate_svgs OOMs - https://phabricator.wikimedia.org/T315056 [17:24:54] T322561: Arc Lamp stopped publishing SVGs 2022-10-26 - https://phabricator.wikimedia.org/T322561 [17:24:56] !log krinkle@deploy1002 Finished deploy [performance/arc-lamp@e1ac118]: https://gerrit.wikimedia.org/r/c/825870 - T322561, T315056 (duration: 00m 07s) [17:26:56] (03PS1) 10Filippo Giunchedi: site: add dispatch-be2001 [puppet] - 10https://gerrit.wikimedia.org/r/854056 (https://phabricator.wikimedia.org/T322556) [17:26:58] (03PS5) 10Elukey: Upgrade to 1.15.3 [debs/istio] - 10https://gerrit.wikimedia.org/r/853936 (https://phabricator.wikimedia.org/T322193) [17:27:10] (03CR) 10Filippo Giunchedi: [C: 03+2] makevm: print dhcpd config snippet [cookbooks] - 10https://gerrit.wikimedia.org/r/854053 (owner: 10Filippo Giunchedi) [17:27:44] (03CR) 10Filippo Giunchedi: [C: 03+2] site: add dispatch-be2001 [puppet] - 10https://gerrit.wikimedia.org/r/854056 (https://phabricator.wikimedia.org/T322556) (owner: 10Filippo Giunchedi) [17:27:51] (03PS2) 10Filippo Giunchedi: site: add dispatch-be2001 [puppet] - 10https://gerrit.wikimedia.org/r/854056 (https://phabricator.wikimedia.org/T322556) [17:29:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T321123)', diff saved to https://phabricator.wikimedia.org/P38386 and previous config saved to /var/cache/conftool/dbconfig/20221107-172946-marostegui.json [17:29:48] !log marostegui@cumin1001 
START - Cookbook sre.hosts.downtime for 8:00:00 on db2110.codfw.wmnet with reason: Maintenance [17:29:50] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [17:30:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2110.codfw.wmnet with reason: Maintenance [17:30:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T321123)', diff saved to https://phabricator.wikimedia.org/P38387 and previous config saved to /var/cache/conftool/dbconfig/20221107-173007-marostegui.json [17:32:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T321123)', diff saved to https://phabricator.wikimedia.org/P38388 and previous config saved to /var/cache/conftool/dbconfig/20221107-173217-marostegui.json [17:34:24] (03CR) 10Elukey: "Ready for a review :)" [debs/istio] - 10https://gerrit.wikimedia.org/r/853936 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [17:36:00] (03PS1) 10Filippo Giunchedi: access_new_install: add compat install-console [puppet] - 10https://gerrit.wikimedia.org/r/854058 [17:37:07] (03PS2) 10Filippo Giunchedi: access_new_install: add compat install-console [puppet] - 10https://gerrit.wikimedia.org/r/854058 [17:37:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P38389 and previous config saved to /var/cache/conftool/dbconfig/20221107-173711-ladsgroup.json [17:37:31] !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [17:38:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P38390 and previous config saved to /var/cache/conftool/dbconfig/20221107-173829-marostegui.json [17:38:38] (03CR) 10Filippo Giunchedi: [V: 03+1] 
"PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37988/console" [puppet] - 10https://gerrit.wikimedia.org/r/854058 (owner: 10Filippo Giunchedi) [17:38:59] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [17:39:53] (03PS3) 10Filippo Giunchedi: access_new_install: use install-console and compat symlink [puppet] - 10https://gerrit.wikimedia.org/r/854058 [17:41:13] !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [17:41:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T318605)', diff saved to https://phabricator.wikimedia.org/P38391 and previous config saved to /var/cache/conftool/dbconfig/20221107-174123-ladsgroup.json [17:41:26] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [17:42:22] (03CR) 10Filippo Giunchedi: "This might seem like something trivial, however I keep stumbling on install_console vs install-console." 
[puppet] - 10https://gerrit.wikimedia.org/r/854058 (owner: 10Filippo Giunchedi) [17:47:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P38392 and previous config saved to /var/cache/conftool/dbconfig/20221107-174724-marostegui.json [17:51:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T318605)', diff saved to https://phabricator.wikimedia.org/P38393 and previous config saved to /var/cache/conftool/dbconfig/20221107-175124-ladsgroup.json [17:51:29] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [17:51:37] (03PS3) 10ClĂ©ment Goubert: mw-debug: Add dummy service data [labs/private] - 10https://gerrit.wikimedia.org/r/852809 (https://phabricator.wikimedia.org/T321201) [17:52:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T318955)', diff saved to https://phabricator.wikimedia.org/P38394 and previous config saved to /var/cache/conftool/dbconfig/20221107-175217-ladsgroup.json [17:52:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2109.codfw.wmnet with reason: Maintenance [17:52:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2109.codfw.wmnet with reason: Maintenance [17:52:26] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [17:52:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T318955)', diff saved to https://phabricator.wikimedia.org/P38395 and previous config saved to /var/cache/conftool/dbconfig/20221107-175228-ladsgroup.json [17:53:10] (03CR) 10ClĂ©ment Goubert: [V: 03+1 C: 03+2] mw-debug: Add dummy service data [labs/private] - 10https://gerrit.wikimedia.org/r/852809 (https://phabricator.wikimedia.org/T321201) (owner: 10ClĂ©ment Goubert) [17:53:27] (03CR) 10ClĂ©ment 
Goubert: [V: 03+2 C: 03+2] mw-debug: Add dummy service data [labs/private] - 10https://gerrit.wikimedia.org/r/852809 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [17:53:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T321130)', diff saved to https://phabricator.wikimedia.org/P38396 and previous config saved to /var/cache/conftool/dbconfig/20221107-175335-marostegui.json [17:53:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2124.codfw.wmnet with reason: Maintenance [17:53:40] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [17:53:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2124.codfw.wmnet with reason: Maintenance [17:53:53] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] mwdebug: Remove dummy service data [labs/private] - 10https://gerrit.wikimedia.org/r/852811 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [17:53:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T321130)', diff saved to https://phabricator.wikimedia.org/P38397 and previous config saved to /var/cache/conftool/dbconfig/20221107-175357-marostegui.json [17:54:00] (03PS2) 10Clément Goubert: mwdebug: Remove dummy service data [labs/private] - 10https://gerrit.wikimedia.org/r/852811 (https://phabricator.wikimedia.org/T321201) [17:54:14] (03CR) 10Clément Goubert: [V: 03+2] mwdebug: Remove dummy service data [labs/private] - 10https://gerrit.wikimedia.org/r/852811 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [17:56:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P38398 and previous config saved to /var/cache/conftool/dbconfig/20221107-175629-ladsgroup.json [17:57:02] (03PS2) 10Clément Goubert: mw-on-k8s: Add dummy service data 
[labs/private] - 10https://gerrit.wikimedia.org/r/853930 (https://phabricator.wikimedia.org/T321786) [17:58:37] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:59:09] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] mw-on-k8s: Add dummy service data [labs/private] - 10https://gerrit.wikimedia.org/r/853930 (https://phabricator.wikimedia.org/T321786) (owner: 10Clément Goubert) [17:59:13] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] mw-on-k8s: Add dummy service data [labs/private] - 10https://gerrit.wikimedia.org/r/853930 (https://phabricator.wikimedia.org/T321786) (owner: 10Clément Goubert) [17:59:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T321130)', diff saved to https://phabricator.wikimedia.org/P38399 and previous config saved to /var/cache/conftool/dbconfig/20221107-175928-marostegui.json [17:59:33] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [17:59:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T318955)', diff saved to https://phabricator.wikimedia.org/P38400 and previous config saved to /var/cache/conftool/dbconfig/20221107-175943-ladsgroup.json [17:59:47] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [18:00:04] ryankemper: May I have your attention please! Wikidata Query Service weekly deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221107T1800) [18:00:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:02:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P38401 and previous config saved to /var/cache/conftool/dbconfig/20221107-180230-marostegui.json [18:04:00] (03CR) 10Herron: [C: 03+1] alertmanager: use 'site' label to route tasks for dcops [puppet] - 10https://gerrit.wikimedia.org/r/854040 (https://phabricator.wikimedia.org/T225140) (owner: 10Filippo Giunchedi) [18:04:57] (03CR) 10Vgutierrez: [C: 03+2] deployment-prep: Add ms-fe04 [puppet] - 10https://gerrit.wikimedia.org/r/854044 (https://phabricator.wikimedia.org/T322554) (owner: 10Vgutierrez) [18:06:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P38402 and previous config saved to /var/cache/conftool/dbconfig/20221107-180630-ladsgroup.json [18:08:55] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host puppetdb2003 [18:09:42] RECOVERY - SSH on mw1309.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:09:44] (03PS1) 10Ssingh: Release 1.5.3-3 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/854063 (https://phabricator.wikimedia.org/T321309) [18:09:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host puppetdb2003 [18:09:56] (03CR) 10CI reject: [V: 04-1] Release 1.5.3-3 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/854063 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [18:10:41] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host arclamp2001 [18:10:55] (03PS2) 10Ssingh: Release 1.5.3-3 
[software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/854063 (https://phabricator.wikimedia.org/T321309) [18:11:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host arclamp2001 [18:11:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P38403 and previous config saved to /var/cache/conftool/dbconfig/20221107-181135-ladsgroup.json [18:14:19] (03CR) 10CI reject: [V: 04-1] Release 1.5.3-3 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/854063 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [18:14:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P38404 and previous config saved to /var/cache/conftool/dbconfig/20221107-181435-marostegui.json [18:14:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P38405 and previous config saved to /var/cache/conftool/dbconfig/20221107-181449-ladsgroup.json [18:17:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T321123)', diff saved to https://phabricator.wikimedia.org/P38406 and previous config saved to /var/cache/conftool/dbconfig/20221107-181737-marostegui.json [18:17:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2119.codfw.wmnet with reason: Maintenance [18:17:42] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [18:17:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2119.codfw.wmnet with reason: Maintenance [18:17:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T321123)', diff saved to https://phabricator.wikimedia.org/P38407 
and previous config saved to /var/cache/conftool/dbconfig/20221107-181759-marostegui.json [18:18:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host puppetdb2003.mgmt.codfw.wmnet with reboot policy FORCED [18:19:10] (03CR) 10Ssingh: "Since the package builds fine, I am guessing the CI failure is due to:" [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/854063 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [18:20:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T321123)', diff saved to https://phabricator.wikimedia.org/P38408 and previous config saved to /var/cache/conftool/dbconfig/20221107-182009-marostegui.json [18:21:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P38409 and previous config saved to /var/cache/conftool/dbconfig/20221107-182137-ladsgroup.json [18:21:42] (03CR) 10Vgutierrez: [C: 04-1] "nope, lintian is sad:" [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/854063 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [18:22:14] PROBLEM - Check systemd state on dispatch-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:24] (03CR) 10David Caro: wmcs: add socks proxy support to wmcs cookbooks (039 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [18:24:16] (03CR) 10David Caro: wmcs: add cookbook to add/remove a user to/from a project (033 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 (owner: 10David Caro) [18:24:33] (03CR) 10Vlad.shapik: [C: 03+1] "LGTM" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/854029 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [18:24:35] (03CR) 
10David Caro: [C: 03+2] wmcs.toolforge.grid: get also the job logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841930 (owner: 10David Caro) [18:24:44] (03CR) 10CI reject: [V: 04-1] wmcs.toolforge.grid: get also the job logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841930 (owner: 10David Caro) [18:24:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetdb2003.mgmt.codfw.wmnet with reboot policy FORCED [18:24:50] (03CR) 10David Caro: [C: 03+2] create_instance_with_prefix: fix prefix guess [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/848222 (owner: 10David Caro) [18:25:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host arclamp2001.mgmt.codfw.wmnet with reboot policy FORCED [18:26:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T318605)', diff saved to https://phabricator.wikimedia.org/P38410 and previous config saved to /var/cache/conftool/dbconfig/20221107-182642-ladsgroup.json [18:26:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [18:26:46] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:26:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [18:27:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T318605)', diff saved to https://phabricator.wikimedia.org/P38411 and previous config saved to /var/cache/conftool/dbconfig/20221107-182704-ladsgroup.json [18:27:49] (03CR) 10Vlad.shapik: [C: 03+1] thumbor: enable setting log level, set staging to debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/854026 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [18:28:41] (03CR) 10Htriedman: Varnish analytics: support differential privacy (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [18:29:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P38412 and previous config saved to /var/cache/conftool/dbconfig/20221107-182941-marostegui.json [18:29:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P38413 and previous config saved to /var/cache/conftool/dbconfig/20221107-182956-ladsgroup.json [18:30:22] (03PS1) 10Vgutierrez: deployment-prep: Stop using ms-fe03 [puppet] - 10https://gerrit.wikimedia.org/r/854064 (https://phabricator.wikimedia.org/T322554) [18:33:22] (03CR) 10Vgutierrez: [C: 03+2] deployment-prep: Stop using ms-fe03 [puppet] - 10https://gerrit.wikimedia.org/r/854064 (https://phabricator.wikimedia.org/T322554) (owner: 10Vgutierrez) [18:33:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host arclamp2001.mgmt.codfw.wmnet with reboot policy FORCED [18:35:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P38414 and previous config saved to /var/cache/conftool/dbconfig/20221107-183515-marostegui.json [18:36:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [18:36:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T318605)', diff saved to https://phabricator.wikimedia.org/P38415 and previous config saved to /var/cache/conftool/dbconfig/20221107-183643-ladsgroup.json [18:36:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [18:36:47] (03CR) 10Htriedman: Varnish analytics: support differential privacy (0310 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [18:36:49] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:36:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [18:37:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:37:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:37:20] (03PS1) 10Htriedman: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/854087 (https://phabricator.wikimedia.org/T315676) [18:37:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T318605)', diff saved to https://phabricator.wikimedia.org/P38416 and previous config saved to /var/cache/conftool/dbconfig/20221107-183722-ladsgroup.json [18:37:55] (03CR) 10CI reject: [V: 04-1] Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/854087 (https://phabricator.wikimedia.org/T315676) (owner: 10Htriedman) [18:39:07] (03PS1) 10Vgutierrez: deployment-prep: Fix ms-f04 FQDN [puppet] - 10https://gerrit.wikimedia.org/r/854088 (https://phabricator.wikimedia.org/T322554) [18:43:43] (03PS2) 10Vgutierrez: deployment-prep: Fix ms-fe04 FQDN [puppet] - 10https://gerrit.wikimedia.org/r/854088 (https://phabricator.wikimedia.org/T322554) [18:44:41] (03CR) 10Htriedman: "I accidentally started a new CR while addressing the below comments (my bad, I don't often use Gerrit so I'm a bit unsure of the mechanics" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [18:44:49] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T321130)', diff saved to https://phabricator.wikimedia.org/P38417 and previous config saved to /var/cache/conftool/dbconfig/20221107-184448-marostegui.json [18:44:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2129.codfw.wmnet with reason: Maintenance [18:44:55] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [18:45:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T318955)', diff saved to https://phabricator.wikimedia.org/P38418 and previous config saved to /var/cache/conftool/dbconfig/20221107-184502-ladsgroup.json [18:45:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2129.codfw.wmnet with reason: Maintenance [18:45:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [18:45:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [18:45:10] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [18:45:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T321130)', diff saved to https://phabricator.wikimedia.org/P38419 and previous config saved to /var/cache/conftool/dbconfig/20221107-184510-marostegui.json [18:45:22] (03CR) 10Vgutierrez: [C: 03+2] deployment-prep: Fix ms-fe04 FQDN [puppet] - 10https://gerrit.wikimedia.org/r/854088 (https://phabricator.wikimedia.org/T322554) (owner: 10Vgutierrez) [18:50:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P38420 and previous config saved to /var/cache/conftool/dbconfig/20221107-185022-marostegui.json [18:50:27] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance [18:50:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance [18:50:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T318955)', diff saved to https://phabricator.wikimedia.org/P38421 and previous config saved to /var/cache/conftool/dbconfig/20221107-185035-ladsgroup.json [18:50:41] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [18:50:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T321130)', diff saved to https://phabricator.wikimedia.org/P38422 and previous config saved to /var/cache/conftool/dbconfig/20221107-185044-marostegui.json [18:50:51] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [18:51:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T318605)', diff saved to https://phabricator.wikimedia.org/P38423 and previous config saved to /var/cache/conftool/dbconfig/20221107-185105-ladsgroup.json [18:51:12] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:52:57] (03CR) 10Vgutierrez: "hmm I'm assuming this should be another patch set of https://gerrit.wikimedia.org/r/c/operations/puppet/+/824769?" 
[puppet] - 10https://gerrit.wikimedia.org/r/854087 (https://phabricator.wikimedia.org/T315676) (owner: 10Htriedman) [18:55:49] (03PS1) 10Cwhite: beta-logs: allow bullseye logstash host access to loki [puppet] - 10https://gerrit.wikimedia.org/r/854106 (https://phabricator.wikimedia.org/T321410) [18:57:41] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [18:58:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T318955)', diff saved to https://phabricator.wikimedia.org/P38424 and previous config saved to /var/cache/conftool/dbconfig/20221107-185800-ladsgroup.json [18:58:06] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [18:58:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [19:00:52] (03CR) 10Htriedman: Varnish analytics: support differential privacy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854087 (https://phabricator.wikimedia.org/T315676) (owner: 10Htriedman) [19:02:37] (03CR) 10Cwhite: [C: 03+2] beta-logs: allow bullseye logstash host access to loki [puppet] - 10https://gerrit.wikimedia.org/r/854106 (https://phabricator.wikimedia.org/T321410) (owner: 10Cwhite) [19:04:32] (03CR) 10Vgutierrez: Varnish analytics: support differential privacy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [19:05:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T321123)', diff saved to https://phabricator.wikimedia.org/P38425 and previous config saved to /var/cache/conftool/dbconfig/20221107-190528-marostegui.json [19:05:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2136.codfw.wmnet with reason: Maintenance [19:05:33] T321123: Drop 
old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [19:05:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2136.codfw.wmnet with reason: Maintenance [19:05:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T321123)', diff saved to https://phabricator.wikimedia.org/P38426 and previous config saved to /var/cache/conftool/dbconfig/20221107-190550-marostegui.json [19:05:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P38427 and previous config saved to /var/cache/conftool/dbconfig/20221107-190551-marostegui.json [19:06:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P38428 and previous config saved to /var/cache/conftool/dbconfig/20221107-190612-ladsgroup.json [19:07:13] (03PS1) 10Ahmon Dancy: Only Enable LBFactory config callback in CLI in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) [19:08:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T321123)', diff saved to https://phabricator.wikimedia.org/P38429 and previous config saved to /var/cache/conftool/dbconfig/20221107-190800-marostegui.json [19:10:46] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [19:11:28] (03CR) 10Ahmon Dancy: "I ran into an issue in train-dev recently after merging the master branch of operations/mediawiki-config into the train-dev branch. 
The e" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: 10Ahmon Dancy) [19:13:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P38430 and previous config saved to /var/cache/conftool/dbconfig/20221107-191306-ladsgroup.json [19:16:07] (03CR) 10Krinkle: [C: 03+1] Only Enable LBFactory config callback in CLI in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: 10Ahmon Dancy) [19:16:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [19:18:00] (03PS7) 10Htriedman: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [19:22:53] (03CR) 10Htriedman: Varnish analytics: support differential privacy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [19:22:53] (03CR) 10CI reject: [V: 04-1] Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [19:22:53] (03Abandoned) 10Htriedman: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/854087 (https://phabricator.wikimedia.org/T315676) (owner: 10Htriedman) [19:22:53] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (10Papaul) provision cookbook failed on first run on the new R650 redfish tried to connect to the server with the default IDRAC password (calvin) or the servers were shipp... 
[19:22:53] (03PS2) 10Ahmon Dancy: Only Enable LBFactory config callback in CLI in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) [19:22:53] (03CR) 10Ahmon Dancy: Only Enable LBFactory config callback in CLI in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: 10Ahmon Dancy) [19:24:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P38431 and previous config saved to /var/cache/conftool/dbconfig/20221107-192058-marostegui.json [19:24:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P38432 and previous config saved to /var/cache/conftool/dbconfig/20221107-192119-ladsgroup.json [19:24:07] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [19:24:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [19:24:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P38433 and previous config saved to /var/cache/conftool/dbconfig/20221107-192306-marostegui.json [19:25:06] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [19:28:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P38434 and previous config saved to /var/cache/conftool/dbconfig/20221107-192813-ladsgroup.json [19:28:21] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (10Papaul) 
@Volans new R650 failing provision cookbook with ` Failed to run cookbooks.sre.hosts.provision.ProvisionRunner._config: 'BiosBootSeq' ` and ` raise IpmiErr... [19:30:56] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Eevans) >>! In T305570#8375505, @Volans wrote: > I've also run the `sre.dns.netbox` cookbook, the DNS records are now live. Thank you! [19:32:08] (03PS3) 10Ssingh: Release 1.5.3-3 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/854063 (https://phabricator.wikimedia.org/T321309) [19:36:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T321130)', diff saved to https://phabricator.wikimedia.org/P38435 and previous config saved to /var/cache/conftool/dbconfig/20221107-193604-marostegui.json [19:36:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2141.codfw.wmnet with reason: Maintenance [19:36:10] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [19:36:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2141.codfw.wmnet with reason: Maintenance [19:36:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T318605)', diff saved to https://phabricator.wikimedia.org/P38436 and previous config saved to /var/cache/conftool/dbconfig/20221107-193625-ladsgroup.json [19:36:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [19:36:30] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [19:36:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [19:36:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling 
db1170:3317 (T318605)', diff saved to https://phabricator.wikimedia.org/P38437 and previous config saved to /var/cache/conftool/dbconfig/20221107-193646-ladsgroup.json [19:38:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P38438 and previous config saved to /var/cache/conftool/dbconfig/20221107-193813-marostegui.json [19:39:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2158.codfw.wmnet with reason: Maintenance [19:40:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2158.codfw.wmnet with reason: Maintenance [19:40:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance [19:40:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance [19:40:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T321130)', diff saved to https://phabricator.wikimedia.org/P38439 and previous config saved to /var/cache/conftool/dbconfig/20221107-194026-marostegui.json [19:41:17] (03PS1) 10Andrew Bogott: Open magnum and heat apis to the greater internet [puppet] - 10https://gerrit.wikimedia.org/r/854092 (https://phabricator.wikimedia.org/T319312) [19:43:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T318955)', diff saved to https://phabricator.wikimedia.org/P38440 and previous config saved to /var/cache/conftool/dbconfig/20221107-194319-ladsgroup.json [19:43:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance [19:43:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance [19:43:26] T318955: Drop fr_comment and fr_text from flaggedrevs table 
in production - https://phabricator.wikimedia.org/T318955 [19:43:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [19:43:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [19:43:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T318955)', diff saved to https://phabricator.wikimedia.org/P38441 and previous config saved to /var/cache/conftool/dbconfig/20221107-194335-ladsgroup.json [19:43:37] (03PS1) 10Vgutierrez: deployment-prep: Use ms-fe04 for auth requests [puppet] - 10https://gerrit.wikimedia.org/r/854093 (https://phabricator.wikimedia.org/T322554) [19:44:52] (03CR) 10Ssingh: "Addressed the Lintian E: libvmod-re2: custom-library-search-path usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_re2.so /usr/lib/x86_64-lin" [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/854063 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [19:45:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T321130)', diff saved to https://phabricator.wikimedia.org/P38442 and previous config saved to /var/cache/conftool/dbconfig/20221107-194557-marostegui.json [19:45:58] (03PS8) 10Andrew Bogott: Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 [19:46:03] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [19:46:10] (03CR) 10Ssingh: Release 1.5.3-3 (031 comment) [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/854063 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [19:49:18] (03CR) 10CI reject: [V: 04-1] Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 (owner: 10Andrew Bogott) [19:49:30] 10ops-codfw: codfw:test new Supermicro server - 
https://phabricator.wikimedia.org/T322578 (10Papaul) [19:50:32] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host sretest2002 [19:50:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T318955)', diff saved to https://phabricator.wikimedia.org/P38443 and previous config saved to /var/cache/conftool/dbconfig/20221107-195049-ladsgroup.json [19:50:55] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [19:51:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2002 [19:51:37] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [19:51:57] (03PS2) 10Andrew Bogott: add wmcs-securitygroup-backfill [puppet] - 10https://gerrit.wikimedia.org/r/850592 (https://phabricator.wikimedia.org/T288108) [19:53:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T321123)', diff saved to https://phabricator.wikimedia.org/P38444 and previous config saved to /var/cache/conftool/dbconfig/20221107-195319-marostegui.json [19:53:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance [19:53:25] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [19:53:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance [19:53:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:53:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T321123)', diff saved to https://phabricator.wikimedia.org/P38445 and previous config saved to /var/cache/conftool/dbconfig/20221107-195340-marostegui.json [19:54:35] (03CR) 10Andrew Bogott: Add upgrade_openstack_node.py (031 comment) [cookbooks] (wmcs) - 
10https://gerrit.wikimedia.org/r/852313 (owner: 10Andrew Bogott) [19:55:04] (03PS9) 10Andrew Bogott: Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 [19:55:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T321123)', diff saved to https://phabricator.wikimedia.org/P38446 and previous config saved to /var/cache/conftool/dbconfig/20221107-195550-marostegui.json [19:58:26] (03CR) 10CI reject: [V: 04-1] Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 (owner: 10Andrew Bogott) [20:01:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P38447 and previous config saved to /var/cache/conftool/dbconfig/20221107-200103-marostegui.json [20:01:48] (03CR) 10Vgutierrez: [C: 03+2] deployment-prep: Use ms-fe04 for auth requests [puppet] - 10https://gerrit.wikimedia.org/r/854093 (https://phabricator.wikimedia.org/T322554) (owner: 10Vgutierrez) [20:02:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T318605)', diff saved to https://phabricator.wikimedia.org/P38448 and previous config saved to /var/cache/conftool/dbconfig/20221107-200245-ladsgroup.json [20:02:51] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:03:37] (03CR) 10Andrew Bogott: Add upgrade_openstack_node.py (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 (owner: 10Andrew Bogott) [20:05:50] (03PS1) 10Vgutierrez: swift: Remove ms-be06 from deployment-prep cluster [puppet] - 10https://gerrit.wikimedia.org/r/854094 (https://phabricator.wikimedia.org/T322231) [20:05:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P38449 and previous config saved to 
/var/cache/conftool/dbconfig/20221107-200556-ladsgroup.json [20:06:07] (03CR) 10Andrew Bogott: "(I have not tested the exception handling, hoping Taavi will do that since you seem to have a test case in front of you)" [puppet] - 10https://gerrit.wikimedia.org/r/850592 (https://phabricator.wikimedia.org/T288108) (owner: 10Andrew Bogott) [20:06:43] (03CR) 10Vgutierrez: [C: 03+2] swift: Remove ms-be06 from deployment-prep cluster [puppet] - 10https://gerrit.wikimedia.org/r/854094 (https://phabricator.wikimedia.org/T322231) (owner: 10Vgutierrez) [20:10:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P38450 and previous config saved to /var/cache/conftool/dbconfig/20221107-201057-marostegui.json [20:14:46] (03CR) 10Bartosz Dziewoński: [C: 04-1] Enable history page visual diffs on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833831 (owner: 10Esanders) [20:15:46] (03PS1) 10Andrew Bogott: Clean up some obsolete cloudservices hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/854095 [20:15:48] (03PS1) 10Andrew Bogott: eqiad1 designate -> Yoga [puppet] - 10https://gerrit.wikimedia.org/r/854096 (https://phabricator.wikimedia.org/T305828) [20:15:51] (03PS3) 10BCornwall: prometheus: Handle inactive trafficserver service [puppet] - 10https://gerrit.wikimedia.org/r/851669 (https://phabricator.wikimedia.org/T292815) [20:15:53] (03CR) 10BCornwall: prometheus: Handle inactive trafficserver service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/851669 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [20:16:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P38451 and previous config saved to /var/cache/conftool/dbconfig/20221107-201610-marostegui.json [20:16:48] (03CR) 10CI reject: [V: 04-1] prometheus: Handle inactive trafficserver 
service [puppet] - 10https://gerrit.wikimedia.org/r/851669 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [20:17:15] (03PS1) 10Bartosz Dziewoński: ThreadItemStore: Update existing rows if possible rather than insert+delete [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854068 (https://phabricator.wikimedia.org/T321121) [20:17:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P38452 and previous config saved to /var/cache/conftool/dbconfig/20221107-201752-ladsgroup.json [20:19:06] (03CR) 10Andrew Bogott: [C: 03+2] Clean up some obsolete cloudservices hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/854095 (owner: 10Andrew Bogott) [20:20:11] (03PS1) 10Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854097 (https://phabricator.wikimedia.org/T315353) [20:21:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P38453 and previous config saved to /var/cache/conftool/dbconfig/20221107-202102-ladsgroup.json [20:21:35] (03PS2) 10Andrew Bogott: eqiad1 designate -> Yoga [puppet] - 10https://gerrit.wikimedia.org/r/854096 (https://phabricator.wikimedia.org/T305828) [20:26:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P38454 and previous config saved to /var/cache/conftool/dbconfig/20221107-202603-marostegui.json [20:31:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T321130)', diff saved to https://phabricator.wikimedia.org/P38455 and previous config saved to /var/cache/conftool/dbconfig/20221107-203116-marostegui.json [20:31:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on 
db2169.codfw.wmnet with reason: Maintenance [20:31:22] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [20:31:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2169.codfw.wmnet with reason: Maintenance [20:31:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T321130)', diff saved to https://phabricator.wikimedia.org/P38456 and previous config saved to /var/cache/conftool/dbconfig/20221107-203138-marostegui.json [20:32:29] 10SRE, 10WMF-Communications, 10serviceops-collab: Feasibility of hosting podcast setup on Wikimedia servers - https://phabricator.wikimedia.org/T148061 (10Dzahn) [20:32:51] (03PS8) 10BCornwall: prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) [20:32:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P38457 and previous config saved to /var/cache/conftool/dbconfig/20221107-203258-ladsgroup.json [20:34:40] (03CR) 10BCornwall: "Had to rebase now that I8a5c9a32a6a3d12c604717557e1bd3d48d8570ba is merged" [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [20:36:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T318955)', diff saved to https://phabricator.wikimedia.org/P38458 and previous config saved to /var/cache/conftool/dbconfig/20221107-203609-ladsgroup.json [20:36:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2177.codfw.wmnet with reason: Maintenance [20:36:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2177.codfw.wmnet with reason: Maintenance [20:36:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 
(T321130)', diff saved to https://phabricator.wikimedia.org/P38459 and previous config saved to /var/cache/conftool/dbconfig/20221107-203615-marostegui.json [20:36:17] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [20:36:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T318955)', diff saved to https://phabricator.wikimedia.org/P38460 and previous config saved to /var/cache/conftool/dbconfig/20221107-203626-ladsgroup.json [20:36:36] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [20:40:16] (03PS1) 10Bartosz Dziewoński: Simplify some redundant settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854100 [20:40:49] (03CR) 10Andrew Bogott: [C: 03+2] eqiad1 designate -> Yoga [puppet] - 10https://gerrit.wikimedia.org/r/854096 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [20:41:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T321123)', diff saved to https://phabricator.wikimedia.org/P38461 and previous config saved to /var/cache/conftool/dbconfig/20221107-204110-marostegui.json [20:41:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance [20:41:17] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [20:41:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance [20:41:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T321123)', diff saved to https://phabricator.wikimedia.org/P38462 and previous config saved to /var/cache/conftool/dbconfig/20221107-204131-marostegui.json [20:42:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T321123)', diff saved to 
https://phabricator.wikimedia.org/P38463 and previous config saved to /var/cache/conftool/dbconfig/20221107-204240-marostegui.json [20:43:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T318955)', diff saved to https://phabricator.wikimedia.org/P38464 and previous config saved to /var/cache/conftool/dbconfig/20221107-204340-ladsgroup.json [20:43:47] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [20:48:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T318605)', diff saved to https://phabricator.wikimedia.org/P38465 and previous config saved to /var/cache/conftool/dbconfig/20221107-204805-ladsgroup.json [20:48:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [20:48:11] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:48:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [20:48:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T318605)', diff saved to https://phabricator.wikimedia.org/P38466 and previous config saved to /var/cache/conftool/dbconfig/20221107-204827-ladsgroup.json [20:51:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P38467 and previous config saved to /var/cache/conftool/dbconfig/20221107-205122-marostegui.json [20:51:54] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "yep, that looks like the Hashicorp signing key" [puppet] - 10https://gerrit.wikimedia.org/r/852961 (https://phabricator.wikimedia.org/T322344) (owner: 10Dduvall) [20:52:26] (03PS6) 10BCornwall: prometheus: Add ats header/body size total metrics [puppet] - 10https://gerrit.wikimedia.org/r/845688 
(https://phabricator.wikimedia.org/T284304) [20:53:22] (03CR) 10BCornwall: "I apologize, vgutierrez... Something went awry and PS4 → PS5 was very wrong." [puppet] - 10https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) (owner: 10BCornwall) [20:55:09] (03CR) 10Dzahn: [V: 03+1 C: 03+2] aptrepo: Add thirdparty/terraform [puppet] - 10https://gerrit.wikimedia.org/r/852961 (https://phabricator.wikimedia.org/T322344) (owner: 10Dduvall) [20:55:53] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "at this moment, simply looking at "add the repo". whether the packages used will be another patch anyways." [puppet] - 10https://gerrit.wikimedia.org/r/852961 (https://phabricator.wikimedia.org/T322344) (owner: 10Dduvall) [20:57:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T318605)', diff saved to https://phabricator.wikimedia.org/P38468 and previous config saved to /var/cache/conftool/dbconfig/20221107-205735-ladsgroup.json [20:57:40] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:57:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P38469 and previous config saved to /var/cache/conftool/dbconfig/20221107-205747-marostegui.json [20:58:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P38470 and previous config saved to /var/cache/conftool/dbconfig/20221107-205847-ladsgroup.json [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221107T2100). [21:00:05] MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:27] i can deploy today [21:00:28] hi MatmaRex [21:00:31] hello [21:00:50] (03CR) 10Urbanecm: [C: 03+2] ThreadItemStore: Update existing rows if possible rather than insert+delete [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854068 (https://phabricator.wikimedia.org/T321121) (owner: 10Bartosz Dziewoński) [21:01:08] my other things need to go in order [21:01:15] okay, so backport first [21:01:32] (03PS2) 10Bartosz Dziewoński: Simplify some redundant settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854100 [21:01:34] but, while we wait, i have a no-op config change that could be merged: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/854100 ;) [21:01:44] and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/851147 is good to go as well (beta only) [21:01:55] yep [21:01:57] 10SRE, 10Znuny, 10serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10Dzahn) This can certainly be done but needs coordination with the Wikimedia ITS team. Basically we have to create a request by emailing techsupport@ to ask them to create th... [21:02:21] I'll skip mwdebug for both, as they can't be tested anyway [21:02:27] 10SRE, 10Znuny, 10serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10Dzahn) This being said, when I look at the setup of glam@wikimedia.org right now then it's already on Google. I checked like this on mx1001, prod mail server: ` [mx1001:~... 
[21:02:32] (03PS3) 10Urbanecm: Clean up wgDiscussionToolsABTest config for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851147 (owner: 10Bartosz Dziewoński) [21:02:43] !log urbanecm@deploy1002 backport aborted: (duration: 00m 02s) [21:02:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854100 (owner: 10Bartosz Dziewoński) [21:02:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851147 (owner: 10Bartosz Dziewoński) [21:03:36] (03Merged) 10jenkins-bot: Simplify some redundant settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854100 (owner: 10Bartosz Dziewoński) [21:03:37] Thanks urbanecm :) [21:03:38] (03Merged) 10jenkins-bot: Clean up wgDiscussionToolsABTest config for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851147 (owner: 10Bartosz Dziewoński) [21:03:54] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:854100|Simplify some redundant settings]], [[gerrit:851147|Clean up wgDiscussionToolsABTest config for beta cluster]] [21:04:08] (03PS1) 10BCornwall: varnish: set expandtab in vim modeline [puppet] - 10https://gerrit.wikimedia.org/r/854104 [21:04:08] no problem kindrobot (also, hi) [21:04:39] Hi! o/ [21:05:40] 10SRE, 10Znuny, 10serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10Dzahn) @Astinson @Sadads Sorry for using both users, I wasn't sure. Would you say this ticket has been resolved meanwhile? Looks like things are already on Google. Cheers! 
[21:06:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P38471 and previous config saved to /var/cache/conftool/dbconfig/20221107-210628-marostegui.json [21:06:51] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "deployed on apt1001. it should now be possible to use this" [puppet] - 10https://gerrit.wikimedia.org/r/852961 (https://phabricator.wikimedia.org/T322344) (owner: 10Dduvall) [21:07:35] (03Merged) 10jenkins-bot: ThreadItemStore: Update existing rows if possible rather than insert+delete [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854068 (https://phabricator.wikimedia.org/T321121) (owner: 10Bartosz Dziewoński) [21:08:26] will proceed with the backport once the two other patches finish [21:08:35] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:854100|Simplify some redundant settings]], [[gerrit:851147|Clean up wgDiscussionToolsABTest config for beta cluster]] (duration: 04m 40s) [21:08:45] ...which just happened [21:08:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854068 (https://phabricator.wikimedia.org/T321121) (owner: 10Bartosz Dziewoński) [21:09:07] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:854068|ThreadItemStore: Update existing rows if possible rather than insert+delete (T321121)]] [21:09:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:09:15] T321121: When storing new permalinks data, update existing rows if possible rather than insert+delete - https://phabricator.wikimedia.org/T321121 [21:09:26] !log urbanecm@deploy1002 urbanecm and matmarex: Backport for [[gerrit:854068|ThreadItemStore: Update existing rows if possible rather than insert+delete (T321121)]] synced to the testservers: mwdebug1001.eqiad.wmnet, 
mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:09:35] MatmaRex: can you test at mwdebug1001, please? [21:09:45] looking [21:10:41] seems good on testwiki [21:11:34] so, let's sync? [21:11:38] or are you testing other wikis? [21:11:48] yeah, let's sync [21:11:52] it's only enabled on testwiki [21:11:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:11:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:11:57] until that config patch [21:12:33] okay, syncing [21:12:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P38472 and previous config saved to /var/cache/conftool/dbconfig/20221107-211241-ladsgroup.json [21:12:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P38473 and previous config saved to /var/cache/conftool/dbconfig/20221107-211253-marostegui.json [21:13:05] (03CR) 10Dzahn: [C: 03+2] base: remove check_long_procs, unused [puppet] - 10https://gerrit.wikimedia.org/r/854039 (https://phabricator.wikimedia.org/T225140) (owner: 10Filippo Giunchedi) [21:13:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P38474 and previous config saved to /var/cache/conftool/dbconfig/20221107-211353-ladsgroup.json [21:13:57] (03PS2) 10Urbanecm: Enable wgDiscussionToolsEnablePermalinksBackend on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854097 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [21:14:06] (03CR) 10Urbanecm: [C: 03+2] Enable wgDiscussionToolsEnablePermalinksBackend on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854097 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [21:14:20] 
!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:14:56] (03Merged) 10jenkins-bot: Enable wgDiscussionToolsEnablePermalinksBackend on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854097 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [21:16:37] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:854068|ThreadItemStore: Update existing rows if possible rather than insert+delete (T321121)]] (duration: 07m 30s) [21:16:44] T321121: When storing new permalinks data, update existing rows if possible rather than insert+delete - https://phabricator.wikimedia.org/T321121 [21:17:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854097 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [21:17:34] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:854097|Enable wgDiscussionToolsEnablePermalinksBackend on group0 wikis (T315353)]] [21:17:39] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [21:17:54] !log urbanecm@deploy1002 urbanecm and matmarex: Backport for [[gerrit:854097|Enable wgDiscussionToolsEnablePermalinksBackend on group0 wikis (T315353)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [21:17:59] (03CR) 10Dzahn: [C: 03+2] remove phab1001-aphlict.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/853010 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:18:03] (03PS4) 10Dzahn: remove phab1001-aphlict.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/853010 (https://phabricator.wikimedia.org/T280597) [21:18:41] MatmaRex: can you test at mwdebug1001? [21:18:54] also, for the maint script which is next, do i need to wait until the sync to start it? 
[21:19:03] (03PS3) 10Jforrester: Enable history page visual diffs on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833831 (owner: 10Esanders) [21:19:03] (and, how long is it expected to run in total?) [21:19:14] urbanecm: yes, also works! [21:19:20] urbanecm: yeah, it depends on the backport [21:19:20] great, proceeding [21:19:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:19:27] MatmaRex: the backport's live [21:19:29] the config patch's not [21:19:47] but calendar says backport -> config -> maint script [21:19:51] (03CR) 10Jforrester: Enable history page visual diffs on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833831 (owner: 10Esanders) [21:20:12] urbanecm: oh sorry, you're right, it needs both [21:20:18] okay. waiting then :) [21:20:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:20:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:20:53] urbanecm: it will probably take a few hours. 
i'd need to check how many pages there are on all of these wikis [21:21:12] that's fine, i'll start it in a tmux session then [21:21:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:21:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T321130)', diff saved to https://phabricator.wikimedia.org/P38475 and previous config saved to /var/cache/conftool/dbconfig/20221107-212135-marostegui.json [21:21:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2171.codfw.wmnet with reason: Maintenance [21:21:39] the last time: https://phabricator.wikimedia.org/T315510#8200829 it took 3h 26m to process 69134 pages (on testwiki), and i have hopefully improved it since then [21:21:42] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [21:21:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2171.codfw.wmnet with reason: Maintenance [21:21:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T321130)', diff saved to https://phabricator.wikimedia.org/P38476 and previous config saved to /var/cache/conftool/dbconfig/20221107-212156-marostegui.json [21:22:04] many DISCUSSION pages there are* [21:22:48] 10SRE-swift-storage, 10Community-Tech, 10MediaWiki-extensions-Phonos, 10MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (10MusikAnimal) >>! In T320675#8368902, @Eevans wrote: > If I'm being honest, I think that were we h... 
[21:22:50] i see [21:23:22] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:854097|Enable wgDiscussionToolsEnablePermalinksBackend on group0 wikis (T315353)]] (duration: 05m 47s) [21:23:30] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [21:23:42] and, let's do the script now [21:24:43] thanks [21:24:55] MatmaRex: actually, last question before i start it. how does it behave when it runs for one wiki twice? [21:25:02] since it's a long-running script, it can happen quite easily [21:25:09] yeah. it's harmless [21:25:11] okay [21:25:38] !log DNS - removing phab1001-aphlict.eqiad.wmnet - should have no effect because we use aphlict.discovery.wmnet - but if it does, then it's Phabricator realtime notifications [21:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:55] it will still do all the work work (to verify that all the data is already there), but it's intended to be ran this way too [21:26:05] all the work* [21:26:22] gotcha [21:26:23] !log Start [urbanecm@mwmaint1002 /srv/mediawiki]$ foreachwikiindblist group0 extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all # T315510, running in mwmaint1002 at a tmux session under my name [21:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T321130)', diff saved to https://phabricator.wikimedia.org/P38477 and previous config saved to /var/cache/conftool/dbconfig/20221107-212628-marostegui.json [21:26:29] it's running now [21:26:32] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [21:26:34] i guess we'll see this when it runs on testwiki (again) [21:26:40] !log DNS - removing phab1001-aphlict.eqiad.wmnet - should have no effect because we use 
aphlict.discovery.wmnet - but if it does, then it's Phabricator realtime notifications - T280597 [21:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:44] T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597 [21:26:48] thanks [21:27:14] (03CR) 10Hashar: Add upgrade_openstack_node.py (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 (owner: 10Andrew Bogott) [21:27:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P38478 and previous config saved to /var/cache/conftool/dbconfig/20221107-212748-ladsgroup.json [21:28:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T321123)', diff saved to https://phabricator.wikimedia.org/P38479 and previous config saved to /var/cache/conftool/dbconfig/20221107-212800-marostegui.json [21:28:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [21:28:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [21:28:06] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [21:28:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2147.codfw.wmnet with reason: Maintenance [21:28:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2147.codfw.wmnet with reason: Maintenance [21:28:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T321123)', diff saved to https://phabricator.wikimedia.org/P38480 and previous config saved to /var/cache/conftool/dbconfig/20221107-212828-marostegui.json [21:29:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 
(T318955)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20221107-212900-ladsgroup.json [21:29:35] it's now at krwiki (48th wiki, out of 131 total in group0) [21:30:06] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [21:30:08] either surprisingly fast or those wikis are smaller than i thought [21:30:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T321123)', diff saved to https://phabricator.wikimedia.org/P38481 and previous config saved to /var/cache/conftool/dbconfig/20221107-213038-marostegui.json [21:30:39] are they just inactive or something? [21:30:48] group0 is all closed wikis [21:30:56] plus testwiki plus mediawiki.org [21:31:04] which it is at now :) [21:31:19] yeah, makes sense [21:31:23] will likely take a while https://usercontent.irccloud-cdn.com/file/U8pzT6Wo/image.png [21:31:50] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/854058 (owner: 10Filippo Giunchedi) [21:31:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [21:32:29] MatmaRex: fyi, the script logs some notices. 
https://logstash.wikimedia.org/goto/d69ccafa955da11e0f98788e3787f581 [21:32:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [21:32:51] (03CR) 10Dzahn: [C: 03+1] "arguably you could also rename the source because it's also confusing when source file in the repo and what it actually creates differ by " [puppet] - 10https://gerrit.wikimedia.org/r/854058 (owner: 10Filippo Giunchedi) [21:33:07] here, looking [21:33:42] urbanecm: hmm, looking [21:33:51] got that too [21:33:55] hello [21:34:44] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (10Andrew) I would not advise moving any of the cloudvirts other than 1023, since they're all likely to be decom'd next year (if not sooner) regardless. We /can/ move 1023 but my... [21:34:49] fyi i started a maintenance script a few minutes ago as part of B&C. happy to stop it if needed. https://sal.toolforge.org/log/HvD8U4QB6FQ6iqKivdCA is the SAL entry. [21:34:54] so, a maintenance job was started just now? [21:35:09] yes. [21:35:19] urbanecm: I very much doubt the maintenance job is at fault [21:35:27] urbanecm: it doesn't look to me like something a maintenance script could have caused but will get back to you [21:35:31] no need to do anything yet [21:35:41] okay, continuing then. [21:36:57] what is the alert about? the maint script does a lot of wikitext parsing (using parsoid) and database writing [21:37:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [21:38:50] MatmaRex: it failed to process an edit at mediawiki.org.
https://www.irccloud.com/pastebin/fNC2Ym6e/ [21:39:42] MatmaRex: it was about HAProxy, and it just resolved. it's most likely all unrelated [21:40:09] urbanecm: looks like Parsoid crashes on that page. it's okay for the script though [21:40:15] okay [21:40:30] if the maintenance script sends any traffic back out to the traffic layer to do its work, the HAProxy issues could have caused 500s in the maintenance script [21:40:30] https://www.mediawiki.org/w/index.php?curid=10252&veaction=edit [21:40:36] retrying any failed requests should work now [21:41:12] (but if it talks directly to the DB hosts or even the appservers, it should have been unaffected by the hiccup at the edge) [21:41:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P38482 and previous config saved to /var/cache/conftool/dbconfig/20221107-214135-marostegui.json [21:41:52] and the notices were apparently about https://ie.wikibooks.org/wiki/Wikibooks:Nospam#Pages_locked_from_recreation , i don't know what is happening here, but it's not a "real" discussion page, so this should be harmless [21:42:03] ack [21:42:04] (03PS3) 10Eevans: [DRAFT]: Bootstrap new AQS Cassandra nodes (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/812426 (https://phabricator.wikimedia.org/T307802) [21:42:29] in the worst case we can re-run it, if it turns out the same thing happens elsewhere [21:42:41] i'll file a bug about the notice though [21:42:47] thanks [21:42:52] and the crash [21:42:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T318605)', diff saved to https://phabricator.wikimedia.org/P38483 and previous config saved to /var/cache/conftool/dbconfig/20221107-214254-ladsgroup.json [21:42:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [21:43:00] T318605: Deploy new externallinks fields to 
production - https://phabricator.wikimedia.org/T318605 [21:43:05] it processed a handful of other edits, so seems to affect only some edits for some reason [21:43:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [21:43:12] currently mediawikiwiki: Processed 9600 (updated 2193) of 1402493 rows [21:44:12] (03PS4) 10Eevans: [DRAFT]: Bootstrap new AQS Cassandra nodes (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/812426 (https://phabricator.wikimedia.org/T307802) [21:45:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P38484 and previous config saved to /var/cache/conftool/dbconfig/20221107-214545-marostegui.json [21:46:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [21:47:02] (03CR) 10Eevans: "check_experimental" [puppet] - 10https://gerrit.wikimedia.org/r/812426 (https://phabricator.wikimedia.org/T307802) (owner: 10Eevans) [21:49:51] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/812426 (https://phabricator.wikimedia.org/T307802) (owner: 10Eevans) [21:50:36] urbanecm: i guess we can leave it to turn through the night then? it'd be nice if you could drop the final output on the task, with any other errors it gets [21:50:43] to run* [21:51:18] (03PS1) 10Cwhite: scap: update logstash_host for beta scap [puppet] - 10https://gerrit.wikimedia.org/r/854109 (https://phabricator.wikimedia.org/T321410) [21:51:31] yep, looks so. 
[21:52:35] i'll stay monitoring it for a while though [21:53:47] (03PS5) 10Eevans: Bootstrap new AQS Cassandra nodes (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/812426 (https://phabricator.wikimedia.org/T307802) [21:54:29] (03PS1) 10JHathaway: aux-k8s: add BGP config for calico [homer/public] - 10https://gerrit.wikimedia.org/r/854110 (https://phabricator.wikimedia.org/T321120) [21:56:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P38485 and previous config saved to /var/cache/conftool/dbconfig/20221107-215641-marostegui.json [22:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221107T2200). [22:00:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P38486 and previous config saved to /var/cache/conftool/dbconfig/20221107-220051-marostegui.json [22:07:00] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:07:28] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:07:41] !log [apt1001:~] $ sudo -E reprepro --verbose --component thirdparty/terraform update bullseye-wikimedia - T322344 [22:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:50] T322344: Move cloud runner CI jobs to trusted runners - https://phabricator.wikimedia.org/T322344 [22:08:28] mutante: right on. ty! 
[22:10:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T318605)', diff saved to https://phabricator.wikimedia.org/P38487 and previous config saved to /var/cache/conftool/dbconfig/20221107-221016-ladsgroup.json [22:10:22] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [22:11:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T321130)', diff saved to https://phabricator.wikimedia.org/P38488 and previous config saved to /var/cache/conftool/dbconfig/20221107-221148-marostegui.json [22:11:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2180.codfw.wmnet with reason: Maintenance [22:11:52] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [22:12:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2180.codfw.wmnet with reason: Maintenance [22:12:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T321130)', diff saved to https://phabricator.wikimedia.org/P38489 and previous config saved to /var/cache/conftool/dbconfig/20221107-221209-marostegui.json [22:12:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [22:12:29] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-36), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10JMcLeod_WMF) [22:14:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T321130)', diff saved to https://phabricator.wikimedia.org/P38490 and previous config saved to /var/cache/conftool/dbconfig/20221107-221423-marostegui.json [22:15:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T321123)', diff saved to 
https://phabricator.wikimedia.org/P38491 and previous config saved to /var/cache/conftool/dbconfig/20221107-221557-marostegui.json [22:16:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2155.codfw.wmnet with reason: Maintenance [22:16:02] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [22:16:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2155.codfw.wmnet with reason: Maintenance [22:16:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance [22:16:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance [22:16:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T321123)', diff saved to https://phabricator.wikimedia.org/P38492 and previous config saved to /var/cache/conftool/dbconfig/20221107-221624-marostegui.json [22:16:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [22:16:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [22:17:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [22:18:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T321123)', diff saved to https://phabricator.wikimedia.org/P38493 and previous config saved to /var/cache/conftool/dbconfig/20221107-221834-marostegui.json [22:25:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P38494 and previous config saved to /var/cache/conftool/dbconfig/20221107-222523-ladsgroup.json [22:29:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to 
https://phabricator.wikimedia.org/P38495 and previous config saved to /var/cache/conftool/dbconfig/20221107-222930-marostegui.json [22:33:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P38496 and previous config saved to /var/cache/conftool/dbconfig/20221107-223340-marostegui.json [22:40:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P38497 and previous config saved to /var/cache/conftool/dbconfig/20221107-224029-ladsgroup.json [22:42:49] (03CR) 10Dzahn: dumps/distribution: add more data types to parameters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [22:43:07] (03PS7) 10Dzahn: dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260 [22:43:43] (03CR) 10CI reject: [V: 04-1] dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [22:43:52] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37997/" [puppet] - 10https://gerrit.wikimedia.org/r/853051 (https://phabricator.wikimedia.org/T322250) (owner: 10Dzahn) [22:43:59] (03PS2) 10Dzahn: site/phabricator: move phab2001 from prod to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/853051 (https://phabricator.wikimedia.org/T322250) [22:44:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P38498 and previous config saved to /var/cache/conftool/dbconfig/20221107-224437-marostegui.json [22:44:50] !log Deployed patches for T316414 and T315123 [22:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to 
https://phabricator.wikimedia.org/P38499 and previous config saved to /var/cache/conftool/dbconfig/20221107-224847-marostegui.json [22:49:27] (03CR) 10Dzahn: "noop confirmed on prod and other phab hosts" [puppet] - 10https://gerrit.wikimedia.org/r/853051 (https://phabricator.wikimedia.org/T322250) (owner: 10Dzahn) [22:51:13] !log phab2001 - removing from production puppet role - removes ssh access, ferm rules, exim config and more T322250 [22:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:16] T322250: decom phab2001 - https://phabricator.wikimedia.org/T322250 [22:53:35] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on phab2001.codfw.wmnet with reason: T322250 [22:53:50] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on phab2001.codfw.wmnet with reason: T322250 [22:55:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [22:55:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [22:55:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T318605)', diff saved to https://phabricator.wikimedia.org/P38500 and previous config saved to /var/cache/conftool/dbconfig/20221107-225525-ladsgroup.json [22:55:29] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [22:55:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T318605)', diff saved to https://phabricator.wikimedia.org/P38501 and previous config saved to /var/cache/conftool/dbconfig/20221107-225536-ladsgroup.json [22:55:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [22:55:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [22:55:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [22:55:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [22:56:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T318605)', diff saved to https://phabricator.wikimedia.org/P38502 and previous config saved to /var/cache/conftool/dbconfig/20221107-225602-ladsgroup.json [22:56:38] (03CR) 10Dzahn: [C: 03+2] "this removed the firewall rules, the special exim setup, ssh access for releng users and more" [puppet] - 10https://gerrit.wikimedia.org/r/853051 (https://phabricator.wikimedia.org/T322250) (owner: 10Dzahn) [22:58:48] (03PS8) 10Dzahn: dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260 [22:59:23] (03CR) 10CI reject: [V: 04-1] dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [22:59:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T321130)', diff saved to https://phabricator.wikimedia.org/P38503 and previous config saved to /var/cache/conftool/dbconfig/20221107-225943-marostegui.json [22:59:48] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [23:02:04] (03CR) 10Dzahn: "Syntax error at '::Yes_no' (file: /srv/workspace/puppet/modules/wmflib/types/dumps/mirror.pp, line: 12" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [23:03:52] (03PS9) 10Dzahn: dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260 [23:03:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T321123)', diff saved to 
https://phabricator.wikimedia.org/P38504 and previous config saved to /var/cache/conftool/dbconfig/20221107-230353-marostegui.json [23:03:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2172.codfw.wmnet with reason: Maintenance [23:03:57] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [23:04:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2172.codfw.wmnet with reason: Maintenance [23:04:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T321123)', diff saved to https://phabricator.wikimedia.org/P38505 and previous config saved to /var/cache/conftool/dbconfig/20221107-230414-marostegui.json [23:04:27] (03CR) 10CI reject: [V: 04-1] dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [23:06:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T321123)', diff saved to https://phabricator.wikimedia.org/P38506 and previous config saved to /var/cache/conftool/dbconfig/20221107-230624-marostegui.json [23:08:24] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 739 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:09:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T318605)', diff saved to https://phabricator.wikimedia.org/P38507 and previous config saved to /var/cache/conftool/dbconfig/20221107-230940-ladsgroup.json [23:09:45] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [23:18:08] (03CR) 10Dzahn: "looks like Phabricator isn't on the list yet either" [puppet] - 10https://gerrit.wikimedia.org/r/853454 (owner: 
10Dzahn) [23:21:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P38508 and previous config saved to /var/cache/conftool/dbconfig/20221107-232131-marostegui.json [23:21:44] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:24:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P38509 and previous config saved to /var/cache/conftool/dbconfig/20221107-232447-ladsgroup.json [23:26:25] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (10dasm) [23:29:05] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (10Dzahn) [23:33:06] PROBLEM - SSH on an-coord1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:33:50] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (10Dzahn) @dasm Thanks, confirmed you have already signed L3 and looks like you are providing all the needed info. Kicking off this request. @KFrancis Is this ('working for a con... 
[23:34:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (10Dzahn) 05Open→03In progress [23:36:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P38510 and previous config saved to /var/cache/conftool/dbconfig/20221107-233637-marostegui.json [23:39:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20221107-233954-ladsgroup.json [23:40:29] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [23:42:18] (ProbeDown) firing: (11) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:42:18] (ProbeDown) firing: (12) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:42:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [23:42:45] (03Abandoned) 10Jdlrobson: Remove logo setting in YAML files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843514 (owner: 10Jdlrobson) [23:42:46] looking [23:42:49] here [23:43:58] I was briefly getting 504s at metawiki, but seems back now? 
[23:44:53] quiddity: my guess is the event is still underway for another few minutes but you might get intermittent errors until it resolves [23:47:18] (ProbeDown) resolved: (12) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:47:18] (ProbeDown) resolved: (13) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:47:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [23:51:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T321123)', diff saved to https://phabricator.wikimedia.org/P38511 and previous config saved to /var/cache/conftool/dbconfig/20221107-235144-marostegui.json [23:51:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2179.codfw.wmnet with reason: Maintenance [23:51:48] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [23:52:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2179.codfw.wmnet with reason: Maintenance [23:52:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T321123)', diff saved to https://phabricator.wikimedia.org/P38512 and previous config saved to /var/cache/conftool/dbconfig/20221107-235206-marostegui.json [23:54:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T321123)', diff saved to https://phabricator.wikimedia.org/P38513 and previous config saved to /var/cache/conftool/dbconfig/20221107-235415-marostegui.json 
[23:55:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T318605)', diff saved to https://phabricator.wikimedia.org/P38514 and previous config saved to /var/cache/conftool/dbconfig/20221107-235505-ladsgroup.json [23:55:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [23:55:08] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [23:55:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [23:55:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T318605)', diff saved to https://phabricator.wikimedia.org/P38515 and previous config saved to /var/cache/conftool/dbconfig/20221107-235526-ladsgroup.json