[00:18:17] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:18:24] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:26:33] RECOVERY - Disk space on centrallog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops
[00:38:55] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1009946
[00:38:57] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1009946 (owner: TrainBranchBot)
[00:49:55] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[01:00:45] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1009946 (owner: TrainBranchBot)
[01:01:56] (SystemdUnitFailed) resolved: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:35:48] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:35:54] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:15:55] RECOVERY - Host ripe-atlas-ulsfo is UP: PING WARNING - Packet loss = 77%, RTA = 33.11 ms
[02:22:11] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status
- https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:22:19] PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [02:36:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 49.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:37:14] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 49.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:11:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:12:14] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:32:25] (SystemdUnitFailed) 
firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:04:06] (PS1) KartikMistry: Update cxserver to 2024-03-11-035839-production [deployment-charts] - https://gerrit.wikimedia.org/r/1009941 (https://phabricator.wikimedia.org/T350773)
[04:07:06] SRE-swift-storage, Commons, MediaWiki-File-management: Uploads fail due to 401 error from swift on wednesdays - https://phabricator.wikimedia.org/T358830#9618833 (tstarling) a:tstarling
[04:31:34] (CR) KartikMistry: [C: +2] Update cxserver to 2024-03-11-035839-production [deployment-charts] - https://gerrit.wikimedia.org/r/1009941 (https://phabricator.wikimedia.org/T350773) (owner: KartikMistry)
[04:32:42] (Merged) jenkins-bot: Update cxserver to 2024-03-11-035839-production [deployment-charts] - https://gerrit.wikimedia.org/r/1009941 (https://phabricator.wikimedia.org/T350773) (owner: KartikMistry)
[04:37:31] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:37:38] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:38:46] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[04:39:20] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[04:46:31] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[04:47:03] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[04:47:46] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[04:48:22] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[04:49:55] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api -
https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [04:52:28] !log Updated cxserver to 2024-03-11-035839-production (T350773) [04:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:52:34] T350773: Remove preq and use node fetch - https://phabricator.wikimedia.org/T350773 [05:25:09] PROBLEM - OpenSearch unassigned shard check - 9200 on logstash1024 is CRITICAL: CRITICAL - logstash-default-1-7.0.0-1-2023.12.26[0](2024-03-08T03:44:46.007Z), logstash-syslog-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.003Z), logstash-mediawiki-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.007Z), logstash-default-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.006Z), logstash-webrequest-1-7.0.0-1-2023.12.18[0](2024-03-08T03:44:46.002Z), logstas [05:25:09] iki-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.003Z), logstash-webrequest-1-7.0.0-1-2023.12.30[0](2024-03-08T03:44:46.004Z), logstash-k8s-1-7.0.0-1-2023.12.29[0](2024-03-08T03:44:46.003Z), logstash-default-1-7.0.0-1-2023.12.30[0](2024-03-08T03:44:46.005Z), logstash-syslog-1-7.0.0-1-2023.12.17[0](2024-03-08T03:44:46.003Z), logstash-deploy-1-7.0.0-1-2023.12.16[0](2024-03-08T03:44:46.007Z), logstash-mediawiki-1-7.0.0-1-2023.12.17[0](2024-03 [05:25:09] 4:46.005Z), logstash-syslog-1-7.0.0-1-2023.12.12[0](2024-03-08T03:44:46.002Z), logstash-mediawiki-1-7.0.0-1-2023.12.21[0](2024-03-08T03:44:46.003Z), logstash-webrequest-1-7.0.0-1-2023.12.17[0](2024-03-08 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:25:09] PROBLEM - OpenSearch unassigned shard check - 9200 on logstash1031 is CRITICAL: CRITICAL - logstash-webrequest-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.007Z), logstash-deploy-1-7.0.0-1-2023.12.26[0](2024-03-08T03:44:46.006Z), logstash-syslog-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.003Z), logstash-k8s-1-7.0.0-1-2023.12.25[0](2024-03-08T03:44:46.006Z), logstash-k8s-1-7.0.0-1-2023.12.21[0](2024-03-08T03:44:46.004Z), 
logstash-default-1 [05:25:09] -2023.12.29[0](2024-03-08T03:44:46.003Z), logstash-webrequest-1-7.0.0-1-2023.12.18[0](2024-03-08T03:44:46.002Z), logstash-default-1-7.0.0-1-2023.12.27[0](2024-03-08T03:44:46.006Z), logstash-webrequest-1-7.0.0-1-2023.12.21[0](2024-03-08T03:44:46.003Z), logstash-mediawiki-1-7.0.0-1-2023.12.17[0](2024-03-08T03:44:46.005Z), logstash-mediawiki-1-7.0.0-1-2023.12.25[0](2024-03-08T03:44:46.002Z), logstash-mediawiki-1-7.0.0-1-2023.12.15[0](2024-03 [05:25:10] 4:46.003Z), logstash-default-1-7.0.0-1-2023.12.26[0](2024-03-08T03:44:46.007Z), logstash-k8s-1-7.0.0-1-2023.12.20[0](2024-03-08T03:44:46.002Z), logstash-default-1-7.0.0-1-2023.12.18[0](2024-03-08T03:44:4 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:31:11] PROBLEM - OpenSearch unassigned shard check - 9200 on logstash1023 is CRITICAL: CRITICAL - logstash-k8s-1-7.0.0-1-2023.12.18[0](2024-03-08T03:44:46.006Z), logstash-mediawiki-1-7.0.0-1-2023.12.21[0](2024-03-08T03:44:46.003Z), logstash-syslog-1-7.0.0-1-2023.12.27[0](2024-03-08T03:44:46.004Z), logstash-webrequest-1-7.0.0-1-2023.12.14[0](2024-03-08T03:44:46.005Z), logstash-mediawiki-1-7.0.0-1-2023.12.14[0](2024-03-08T03:44:46.005Z), logstash- [05:31:11] 0.0-1-2023.12.20[0](2024-03-08T03:44:46.002Z), logstash-syslog-1-7.0.0-1-2023.12.18[0](2024-03-08T03:44:46.003Z), logstash-k8s-1-7.0.0-1-2023.12.26[0](2024-03-08T03:44:46.006Z), logstash-k8s-1-7.0.0-1-2023.12.22[0](2024-03-08T03:44:46.002Z), logstash-default-1-7.0.0-1-2023.12.23[0](2024-03-08T03:44:46.003Z), logstash-webrequest-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.007Z), logstash-webrequest-1-7.0.0-1-2023.12.24[0](2024-03-08T03:44: [05:31:11] , logstash-deploy-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.004Z), logstash-syslog-1-7.0.0-1-2023.12.26[0](2024-03-08T03:44:46.004Z), logstash-default-1-7.0.0-1-2023.12.25[0](2024-03-08T03:44:46.003Z), https://wikitech.wikimedia.org/wiki/Search%23Administration [05:31:11] PROBLEM - OpenSearch unassigned shard check - 9200 on 
logstash1030 is CRITICAL: CRITICAL - logstash-deploy-1-7.0.0-1-2023.12.24[0](2024-03-08T03:44:46.002Z), logstash-deploy-1-7.0.0-1-2023.12.16[0](2024-03-08T03:44:46.007Z), logstash-k8s-1-7.0.0-1-2023.12.25[0](2024-03-08T03:44:46.006Z), logstash-k8s-1-7.0.0-1-2023.12.26[0](2024-03-08T03:44:46.006Z), logstash-webrequest-1-7.0.0-1-2023.12.20[0](2024-03-08T03:44:46.007Z), logstash-k8s-1-7.0 [05:31:11] 4.01.02[0](2024-03-08T03:44:46.007Z), logstash-mediawiki-1-7.0.0-1-2023.12.21[0](2024-03-08T03:44:46.003Z), logstash-default-1-7.0.0-1-2023.12.16[0](2024-03-08T03:44:46.003Z), logstash-k8s-1-7.0.0-1-2023.12.20[0](2024-03-08T03:44:46.002Z), logstash-default-1-7.0.0-1-2023.12.25[0](2024-03-08T03:44:46.003Z), logstash-syslog-1-7.0.0-1-2023.12.31[0](2024-03-08T03:44:46.004Z), logstash-mediawiki-1-7.0.0-1-2024.01.02[0](2024-03-08T03:44:46.004Z [05:31:12] ash-webrequest-1-7.0.0-1-2023.12.14[0](2024-03-08T03:44:46.005Z), logstash-default-1-7.0.0-1-2023.12.29[0](2024-03-08T03:44:46.003Z), logstash-webrequest-1-7.0.0-1-2023.12.29[0](2024-03-08T03:44:46.004Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [05:32:25] (SystemdUnitFailed) resolved: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:34:11] PROBLEM - OpenSearch unassigned shard check - 9200 on logstash1037 is CRITICAL: CRITICAL - logstash-k8s-1-7.0.0-1-2023.12.23[0](2024-03-08T03:44:46.005Z), logstash-mediawiki-1-7.0.0-1-2023.12.25[0](2024-03-08T03:44:46.002Z), logstash-syslog-1-7.0.0-1-2023.12.16[0](2024-03-08T03:44:46.007Z), logstash-webrequest-1-7.0.0-1-2023.12.27[0](2024-03-08T03:44:46.007Z), logstash-webrequest-1-7.0.0-1-2023.12.15[0](2024-03-08T03:44:46.004Z), logstash [05:34:11] ki-1-7.0.0-1-2023.12.21[0](2024-03-08T03:44:46.003Z), logstash-default-1-7.0.0-1-2023.12.26[0](2024-03-08T03:44:46.007Z), 
logstash-deploy-1-7.0.0-1-2023.12.21[0](2024-03-08T03:44:46.004Z), logstash-webrequest-1-7.0.0-1-2023.12.14[0](2024-03-08T03:44:46.005Z), logstash-syslog-1-7.0.0-1-2023.12.17[0](2024-03-08T03:44:46.003Z), logstash-mediawiki-1-7.0.0-1-2023.12.31[0](2024-03-08T03:44:46.004Z), logstash-mediawiki-1-7.0.0-1-2023.12.12[0](20 [05:34:11] T03:44:46.005Z), logstash-deploy-1-7.0.0-1-2023.12.25[0](2024-03-08T03:44:46.002Z), logstash-webrequest-1-7.0.0-1-2023.12.20[0](2024-03-08T03:44:46.007Z), logstash-mediawiki-1-7.0.0-1-2023.12.24[0](2024- https://wikitech.wikimedia.org/wiki/Search%23Administration [05:37:09] PROBLEM - OpenSearch unassigned shard check - 9200 on logstash1036 is CRITICAL: CRITICAL - logstash-k8s-1-7.0.0-1-2023.12.23[0](2024-03-08T03:44:46.005Z), logstash-webrequest-1-7.0.0-1-2023.12.18[0](2024-03-08T03:44:46.002Z), logstash-deploy-1-7.0.0-1-2023.12.26[0](2024-03-08T03:44:46.006Z), logstash-webrequest-1-7.0.0-1-2023.12.21[0](2024-03-08T03:44:46.003Z), logstash-mediawiki-1-7.0.0-1-2023.12.31[0](2024-03-08T03:44:46.004Z), logstash [05:37:09] -1-7.0.0-1-2023.12.24[0](2024-03-08T03:44:46.003Z), logstash-default-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.006Z), logstash-webrequest-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.002Z), logstash-webrequest-1-7.0.0-1-2023.12.14[0](2024-03-08T03:44:46.005Z), logstash-mediawiki-1-7.0.0-1-2023.12.14[0](2024-03-08T03:44:46.005Z), logstash-default-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.003Z), logstash-mediawiki-1-7.0.0-1-2023.12.13[0] [05:37:09] -08T03:44:46.003Z), logstash-syslog-1-7.0.0-1-2023.12.27[0](2024-03-08T03:44:46.004Z), logstash-default-1-7.0.0-1-2023.12.30[0](2024-03-08T03:44:46.005Z), logstash-k8s-1-7.0.0-1-2023.12.24[0](2024-03-08T https://wikitech.wikimedia.org/wiki/Search%23Administration [05:37:11] PROBLEM - OpenSearch unassigned shard check - 9200 on logstash1029 is CRITICAL: CRITICAL - logstash-k8s-1-7.0.0-1-2024.01.02[0](2024-03-08T03:44:46.007Z), 
logstash-deploy-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.004Z), logstash-deploy-1-7.0.0-1-2023.12.25[0](2024-03-08T03:44:46.002Z), logstash-webrequest-1-7.0.0-1-2023.12.17[0](2024-03-08T03:44:46.002Z), logstash-syslog-1-7.0.0-1-2023.12.26[0](2024-03-08T03:44:46.004Z), logstash-k8s-1- [05:37:11] 2023.12.13[0](2024-03-08T03:44:46.006Z), logstash-syslog-1-7.0.0-1-2023.12.14[0](2024-03-08T03:44:46.006Z), logstash-mediawiki-1-7.0.0-1-2024.01.01[0](2024-03-08T03:44:46.007Z), logstash-default-1-7.0.0-1-2023.12.18[0](2024-03-08T03:44:46.006Z), logstash-mediawiki-1-7.0.0-1-2023.12.29[0](2024-03-08T03:44:46.004Z), logstash-syslog-1-7.0.0-1-2023.12.12[0](2024-03-08T03:44:46.002Z), logstash-mediawiki-1-7.0.0-1-2023.12.24[0](2024-03-08T03:44 [05:37:11] ), logstash-k8s-1-7.0.0-1-2023.12.20[0](2024-03-08T03:44:46.002Z), logstash-webrequest-1-7.0.0-1-2023.12.14[0](2024-03-08T03:44:46.005Z), logstash-webrequest-1-7.0.0-1-2023.12.26[0](2024-03-08T03:44:46.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:37:11] PROBLEM - OpenSearch unassigned shard check - 9200 on logstash1012 is CRITICAL: CRITICAL - logstash-mediawiki-1-7.0.0-1-2023.12.27[0](2024-03-08T03:44:46.006Z), logstash-syslog-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.003Z), logstash-webrequest-1-7.0.0-1-2023.12.24[0](2024-03-08T03:44:46.007Z), logstash-k8s-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.004Z), logstash-syslog-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.006Z), logstash-web [05:37:11] 1-7.0.0-1-2023.12.20[0](2024-03-08T03:44:46.007Z), logstash-mediawiki-1-7.0.0-1-2023.12.31[0](2024-03-08T03:44:46.004Z), logstash-default-1-7.0.0-1-2023.12.29[0](2024-03-08T03:44:46.003Z), logstash-default-1-7.0.0-1-2023.12.25[0](2024-03-08T03:44:46.003Z), logstash-webrequest-1-7.0.0-1-2023.12.17[0](2024-03-08T03:44:46.002Z), logstash-syslog-1-7.0.0-1-2023.12.20[0](2024-03-08T03:44:46.005Z), logstash-syslog-1-7.0.0-1-2023.12.28[0](2024-03 [05:37:12] 4:46.006Z), 
logstash-webrequest-1-7.0.0-1-2023.12.30[0](2024-03-08T03:44:46.004Z), logstash-k8s-1-7.0.0-1-2023.12.23[0](2024-03-08T03:44:46.005Z), logstash-default-1-7.0.0-1-2023.12.18[0](2024-03-08T03:4 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:37:55] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:40:11] PROBLEM - OpenSearch unassigned shard check - 9200 on logstash1035 is CRITICAL: CRITICAL - logstash-deploy-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.003Z), logstash-syslog-1-7.0.0-1-2023.12.25[0](2024-03-08T03:44:46.002Z), logstash-default-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.006Z), logstash-mediawiki-1-7.0.0-1-2023.12.24[0](2024-03-08T03:44:46.003Z), logstash-syslog-1-7.0.0-1-2023.12.20[0](2024-03-08T03:44:46.005Z), logstash-dep [05:40:11] 0.0-1-2023.12.24[0](2024-03-08T03:44:46.002Z), logstash-syslog-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.006Z), logstash-mediawiki-1-7.0.0-1-2023.12.27[0](2024-03-08T03:44:46.006Z), logstash-default-1-7.0.0-1-2023.12.16[0](2024-03-08T03:44:46.003Z), logstash-webrequest-1-7.0.0-1-2023.12.14[0](2024-03-08T03:44:46.005Z), logstash-mediawiki-1-7.0.0-1-2023.12.25[0](2024-03-08T03:44:46.002Z), logstash-deploy-1-7.0.0-1-2023.12.15[0](2024-03-0 [05:40:11] 46.004Z), logstash-mediawiki-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.007Z), logstash-syslog-1-7.0.0-1-2023.12.28[0](2024-03-08T03:44:46.006Z), logstash-webrequest-1-7.0.0-1-2023.12.18[0](2024-03-08T0 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:40:11] PROBLEM - OpenSearch unassigned shard check - 9200 on logstash1028 is CRITICAL: CRITICAL - logstash-deploy-1-7.0.0-1-2023.12.14[0](2024-03-08T03:44:46.002Z), logstash-webrequest-1-7.0.0-1-2023.12.30[0](2024-03-08T03:44:46.004Z), 
logstash-syslog-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.003Z), logstash-webrequest-1-7.0.0-1-2023.12.14[0](2024-03-08T03:44:46.005Z), logstash-mediawiki-1-7.0.0-1-2023.12.15[0](2024-03-08T03:44:46.003Z), logst [05:40:11] 1-7.0.0-1-2024.01.02[0](2024-03-08T03:44:46.007Z), logstash-syslog-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.006Z), logstash-k8s-1-7.0.0-1-2023.12.22[0](2024-03-08T03:44:46.002Z), logstash-default-1-7.0.0-1-2023.12.27[0](2024-03-08T03:44:46.006Z), logstash-mediawiki-1-7.0.0-1-2023.12.14[0](2024-03-08T03:44:46.005Z), logstash-mediawiki-1-7.0.0-1-2023.12.12[0](2024-03-08T03:44:46.005Z), logstash-webrequest-1-7.0.0-1-2023.12.29[0](2024-03- [05:40:12] :46.004Z), logstash-webrequest-1-7.0.0-1-2023.12.26[0](2024-03-08T03:44:46.007Z), logstash-default-1-7.0.0-1-2023.12.16[0](2024-03-08T03:44:46.003Z), logstash-syslog-1-7.0.0-1-2023.12.14[0](2024-03-08T03 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:43:09] PROBLEM - OpenSearch unassigned shard check - 9200 on logstash1010 is CRITICAL: CRITICAL - logstash-deploy-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.004Z), logstash-default-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.006Z), logstash-deploy-1-7.0.0-1-2023.12.27[0](2024-03-08T03:44:46.004Z), logstash-syslog-1-7.0.0-1-2023.12.27[0](2024-03-08T03:44:46.004Z), logstash-mediawiki-1-7.0.0-1-2024.01.02[0](2024-03-08T03:44:46.004Z), logstash-sys [05:43:09] 0.0-1-2023.12.28[0](2024-03-08T03:44:46.006Z), logstash-default-1-7.0.0-1-2023.12.25[0](2024-03-08T03:44:46.003Z), logstash-webrequest-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.002Z), logstash-syslog-1-7.0.0-1-2023.12.25[0](2024-03-08T03:44:46.002Z), logstash-webrequest-1-7.0.0-1-2023.12.25[0](2024-03-08T03:44:46.005Z), logstash-syslog-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.006Z), logstash-webrequest-1-7.0.0-1-2023.12.27[0](2024-03 [05:43:09] 4:46.007Z), logstash-webrequest-1-7.0.0-1-2023.12.18[0](2024-03-08T03:44:46.002Z), 
logstash-webrequest-1-7.0.0-1-2023.12.14[0](2024-03-08T03:44:46.005Z), logstash-default-1-7.0.0-1-2023.12.24[0](2024-03- https://wikitech.wikimedia.org/wiki/Search%23Administration [05:43:11] PROBLEM - OpenSearch unassigned shard check - 9200 on logstash1034 is CRITICAL: CRITICAL - logstash-mediawiki-1-7.0.0-1-2024.01.02[0](2024-03-08T03:44:46.004Z), logstash-webrequest-1-7.0.0-1-2023.12.27[0](2024-03-08T03:44:46.007Z), logstash-default-1-7.0.0-1-2023.12.23[0](2024-03-08T03:44:46.003Z), logstash-syslog-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.003Z), logstash-mediawiki-1-7.0.0-1-2023.12.27[0](2024-03-08T03:44:46.006Z), logst [05:43:11] og-1-7.0.0-1-2023.12.28[0](2024-03-08T03:44:46.006Z), logstash-syslog-1-7.0.0-1-2023.12.31[0](2024-03-08T03:44:46.004Z), logstash-k8s-1-7.0.0-1-2023.12.29[0](2024-03-08T03:44:46.003Z), logstash-default-1-7.0.0-1-2023.12.18[0](2024-03-08T03:44:46.006Z), logstash-syslog-1-7.0.0-1-2023.12.26[0](2024-03-08T03:44:46.004Z), logstash-syslog-1-7.0.0-1-2023.12.20[0](2024-03-08T03:44:46.005Z), logstash-syslog-1-7.0.0-1-2023.12.18[0](2024-03-08T03:4 [05:43:11] Z), logstash-webrequest-1-7.0.0-1-2023.12.29[0](2024-03-08T03:44:46.004Z), logstash-syslog-1-7.0.0-1-2023.12.27[0](2024-03-08T03:44:46.004Z), logstash-deploy-1-7.0.0-1-2023.12.22[0](2024-03-08T03:44:46.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:43:11] PROBLEM - OpenSearch unassigned shard check - 9200 on logstash1027 is CRITICAL: CRITICAL - logstash-webrequest-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.007Z), logstash-mediawiki-1-7.0.0-1-2023.12.26[0](2024-03-08T03:44:46.005Z), logstash-syslog-1-7.0.0-1-2023.12.18[0](2024-03-08T03:44:46.003Z), logstash-webrequest-1-7.0.0-1-2023.12.26[0](2024-03-08T03:44:46.007Z), logstash-k8s-1-7.0.0-1-2023.12.29[0](2024-03-08T03:44:46.003Z), logstash [05:43:11] 1-7.0.0-1-2023.12.15[0](2024-03-08T03:44:46.004Z), logstash-syslog-1-7.0.0-1-2023.12.28[0](2024-03-08T03:44:46.006Z), 
logstash-syslog-1-7.0.0-1-2023.12.12[0](2024-03-08T03:44:46.002Z), logstash-mediawiki-1-7.0.0-1-2023.12.24[0](2024-03-08T03:44:46.003Z), logstash-mediawiki-1-7.0.0-1-2023.12.14[0](2024-03-08T03:44:46.005Z), logstash-syslog-1-7.0.0-1-2023.12.31[0](2024-03-08T03:44:46.004Z), logstash-mediawiki-1-7.0.0-1-2023.12.25[0](2024-03 [05:43:12] 4:46.002Z), logstash-syslog-1-7.0.0-1-2023.12.20[0](2024-03-08T03:44:46.005Z), logstash-mediawiki-1-7.0.0-1-2023.12.15[0](2024-03-08T03:44:46.003Z), logstash-webrequest-1-7.0.0-1-2023.12.14[0](2024-03-08 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:46:11] PROBLEM - OpenSearch unassigned shard check - 9200 on logstash1026 is CRITICAL: CRITICAL - logstash-default-1-7.0.0-1-2023.12.23[0](2024-03-08T03:44:46.003Z), logstash-syslog-1-7.0.0-1-2023.12.28[0](2024-03-08T03:44:46.006Z), logstash-deploy-1-7.0.0-1-2023.12.24[0](2024-03-08T03:44:46.002Z), logstash-k8s-1-7.0.0-1-2023.12.21[0](2024-03-08T03:44:46.004Z), logstash-deploy-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.004Z), logstash-webreques [05:46:11] 0-1-2023.12.27[0](2024-03-08T03:44:46.007Z), logstash-mediawiki-1-7.0.0-1-2023.12.31[0](2024-03-08T03:44:46.004Z), logstash-syslog-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.003Z), logstash-mediawiki-1-7.0.0-1-2023.12.12[0](2024-03-08T03:44:46.005Z), logstash-syslog-1-7.0.0-1-2023.12.17[0](2024-03-08T03:44:46.003Z), logstash-deploy-1-7.0.0-1-2023.12.25[0](2024-03-08T03:44:46.002Z), logstash-webrequest-1-7.0.0-1-2023.12.25[0](2024-03-08T0 [05:46:11] 005Z), logstash-k8s-1-7.0.0-1-2023.12.23[0](2024-03-08T03:44:46.005Z), logstash-webrequest-1-7.0.0-1-2024.01.02[0](2024-03-08T03:44:46.005Z), logstash-mediawiki-1-7.0.0-1-2023.12.15[0](2024-03-08T03:44:4 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:46:11] PROBLEM - OpenSearch unassigned shard check - 9200 on logstash1033 is CRITICAL: CRITICAL - logstash-mediawiki-1-7.0.0-1-2023.12.30[0](2024-03-08T03:44:46.006Z), 
logstash-deploy-1-7.0.0-1-2023.12.20[0](2024-03-08T03:44:46.002Z), logstash-webrequest-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.002Z), logstash-webrequest-1-7.0.0-1-2023.12.30[0](2024-03-08T03:44:46.004Z), logstash-default-1-7.0.0-1-2023.12.26[0](2024-03-08T03:44:46.007Z), logs [05:46:11] -1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.006Z), logstash-k8s-1-7.0.0-1-2023.12.21[0](2024-03-08T03:44:46.004Z), logstash-k8s-1-7.0.0-1-2023.12.26[0](2024-03-08T03:44:46.006Z), logstash-deploy-1-7.0.0-1-2023.12.21[0](2024-03-08T03:44:46.004Z), logstash-mediawiki-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.003Z), logstash-mediawiki-1-7.0.0-1-2023.12.21[0](2024-03-08T03:44:46.003Z), logstash-syslog-1-7.0.0-1-2023.12.18[0](2024-03-08T03:4 [05:46:12] Z), logstash-webrequest-1-7.0.0-1-2023.12.14[0](2024-03-08T03:44:46.005Z), logstash-syslog-1-7.0.0-1-2023.12.31[0](2024-03-08T03:44:46.004Z), logstash-mediawiki-1-7.0.0-1-2023.12.22[0](2024-03-08T03:44:4 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:49:11] PROBLEM - OpenSearch unassigned shard check - 9200 on logstash1025 is CRITICAL: CRITICAL - logstash-k8s-1-7.0.0-1-2023.12.18[0](2024-03-08T03:44:46.006Z), logstash-syslog-1-7.0.0-1-2023.12.16[0](2024-03-08T03:44:46.007Z), logstash-webrequest-1-7.0.0-1-2023.12.30[0](2024-03-08T03:44:46.004Z), logstash-webrequest-1-7.0.0-1-2023.12.20[0](2024-03-08T03:44:46.007Z), logstash-deploy-1-7.0.0-1-2023.12.20[0](2024-03-08T03:44:46.002Z), logstash-k8 [05:49:11] 0-1-2023.12.26[0](2024-03-08T03:44:46.006Z), logstash-default-1-7.0.0-1-2023.12.18[0](2024-03-08T03:44:46.006Z), logstash-deploy-1-7.0.0-1-2023.12.22[0](2024-03-08T03:44:46.003Z), logstash-webrequest-1-7.0.0-1-2023.12.27[0](2024-03-08T03:44:46.007Z), logstash-k8s-1-7.0.0-1-2024.01.02[0](2024-03-08T03:44:46.007Z), logstash-default-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.003Z), logstash-deploy-1-7.0.0-1-2023.12.27[0](2024-03-08T03:44:46 [05:49:11] logstash-mediawiki-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.007Z), 
logstash-syslog-1-7.0.0-1-2023.12.12[0](2024-03-08T03:44:46.002Z), logstash-webrequest-1-7.0.0-1-2024.01.02[0](2024-03-08T03:44:46.00 https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:49:11] PROBLEM - OpenSearch unassigned shard check - 9200 on logstash1032 is CRITICAL: CRITICAL - logstash-k8s-1-7.0.0-1-2023.12.22[0](2024-03-08T03:44:46.002Z), logstash-syslog-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.003Z), logstash-mediawiki-1-7.0.0-1-2023.12.21[0](2024-03-08T03:44:46.003Z), logstash-webrequest-1-7.0.0-1-2023.12.13[0](2024-03-08T03:44:46.007Z), logstash-syslog-1-7.0.0-1-2023.12.27[0](2024-03-08T03:44:46.004Z), logstash-web
[05:49:11] 1-7.0.0-1-2023.12.17[0](2024-03-08T03:44:46.002Z), logstash-syslog-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.006Z), logstash-default-1-7.0.0-1-2023.12.29[0](2024-03-08T03:44:46.003Z), logstash-k8s-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.004Z), logstash-default-1-7.0.0-1-2023.12.19[0](2024-03-08T03:44:46.006Z), logstash-default-1-7.0.0-1-2023.12.27[0](2024-03-08T03:44:46.006Z), logstash-deploy-1-7.0.0-1-2023.12.21[0](2024-03-08T03:44
[05:49:12] ), logstash-default-1-7.0.0-1-2023.12.16[0](2024-03-08T03:44:46.003Z), logstash-mediawiki-1-7.0.0-1-2023.12.12[0](2024-03-08T03:44:46.005Z), logstash-mediawiki-1-7.0.0-1-2023.12.14[0](2024-03-08T03:44:46 https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:09:05] SRE, SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9618875 (KCVelaga_WMF) @cmooney all permissions and access for `kcvelaga` are working fine without any trouble, permissions/access for LDAP user `KCVe...
[06:22:11] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:03] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:35:10] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240310T0800)
[07:00:05] Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240311T0700)
[07:00:05] mo_abualruz: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:06:23] (CR) Mabualruz: [C: +1] Exclude non-functional pages from night mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1009790 (https://phabricator.wikimedia.org/T359183) (owner: Jdlrobson)
[07:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:12:28] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:27:26] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1009854 (https://phabricator.wikimedia.org/T357547) (owner: Kamila Součková)
[07:29:27] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:29:33] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:39:46] mo_abualruz: are you around?
[07:39:53] I am
[07:39:58] I can deploy your patch
[07:40:11] Thanks that would be lovely
[07:41:47] (PS3) Kosta Harlan: throttle: Allow for overriding temp account creation limits [mediawiki-config] - https://gerrit.wikimedia.org/r/1008112 (https://phabricator.wikimedia.org/T357777)
[07:42:20] (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1009790 (https://phabricator.wikimedia.org/T359183) (owner: Jdlrobson)
[07:43:35] (Merged) jenkins-bot: Exclude non-functional pages from night mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1009790 (https://phabricator.wikimedia.org/T359183) (owner: Jdlrobson)
[07:44:28] !log kharlan@deploy2002 Started scap: Backport for [[gerrit:1009790|Exclude non-functional pages from night mode (T359183)]]
[07:44:32] T359183: Exclude non-functional pages from night mode - https://phabricator.wikimedia.org/T359183
[07:44:36] (PS1) Cwhite: logstash: provision and commision logging-hd100[123] nodes [puppet] - https://gerrit.wikimedia.org/r/1009947 (https://phabricator.wikimedia.org/T352517)
[07:53:16] (CR) Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/1009947/1628/" [puppet] - https://gerrit.wikimedia.org/r/1009947 (https://phabricator.wikimedia.org/T352517) (owner: Cwhite)
[07:56:30] !log kharlan@deploy2002 kharlan and jdlrobson: Backport for [[gerrit:1009790|Exclude non-functional pages from night mode (T359183)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:56:32] effie / marostegui: should I be concerned about seeing T359787 during a scap backport just now?
[07:56:35] T359183: Exclude non-functional pages from night mode - https://phabricator.wikimedia.org/T359183
[07:56:35] T359787: ImportError: cannot import name 'where' from 'certifi' (unknown location) - https://phabricator.wikimedia.org/T359787
[07:56:52] mo_abualruz: please test your patch on mwdebug
[07:57:18] Thanks give me a minute
[07:57:44] kostajh: I have no context on what that really is about sorry
[07:58:05] ok
[07:58:13] it seems like `scap` is able to proceed...
[07:59:25] it means that https://gitlab.wikimedia.org/repos/releng/scap/-/blob/master/scap/main.py#L347 didn't run, it seems
[08:00:07] Seems it is working
[08:01:21] SRE, SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9618970 (cmooney) Thanks for confirming @KCVelaga_WMF. I’ll get that done over the next day or so; we have our annual SRE meet up this week but I sho...
[08:01:36] Amir1 / urbanecm, are either of you around?
[08:01:50] around
[08:01:58] (but waiting for Madalina to join a meeting)
[08:02:54] urbanecm: do you think the error for T359787 should halt deployment?
[08:02:55] T359787: ImportError: cannot import name 'where' from 'certifi' (unknown location) - https://phabricator.wikimedia.org/T359787
[08:03:33] it seems like not pulling master is a blocker, but I don't know the underlying mechanics well enough to say for sure.
[08:03:40] kostajh: it looks super weird. but i also can't reproduce it anywhere.
[08:03:51] urbanecm: seems ok to proceed with sync, then?
[08:04:18] personally, i'd stop until someone can take a look and verify what is happening
[08:04:26] alright
[08:04:35] seems safer
[08:04:46] !log kharlan@deploy2002 Sync cancelled.
[08:04:59] * urbanecm goes fully into the meeting now
[08:05:00] mo_abualruz: sorry, but we'll have to pick this up later, after T359787 is resolved.
[08:05:38] No worries I will document this in the ticket [08:06:18] (NELHigh) firing: (2) Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [08:16:17] (NELHigh) resolved: (2) Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [08:17:49] kostajh: we are at our offsite, but we will look at this shortly [08:25:41] !log bounce prometheus@aux-k8s - T343529 [08:29:41] (KubernetesAPINotScrapable) resolved: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [08:29:41] hah stashbot came back [08:29:46] !log bounce prometheus@aux-k8s - T343529 [08:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:51] T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529 [08:44:26] effie: it should be ok to leave for RelEng [08:55:03] !log jnuche@deploy2002 Installing scap version "4.70.1" for 376 hosts [08:55:44] !log jnuche@deploy2002 Installation of scap version "4.70.1" completed for 376 hosts [08:56:18] !log jnuche@deploy2002 Installing scap version "4.70.1" for 376 hosts [08:57:03] !log jnuche@deploy2002 Installation of scap version "4.70.1" completed for 376 hosts [09:01:19] kostajh, effie: scap should be working normally again -> https://phabricator.wikimedia.org/T359787#9619037 [09:02:29] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1220 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/1009949 (https://phabricator.wikimedia.org/T359790) [09:03:12] 
jouncebot: nowandnext [09:03:12] No deployments scheduled for the next 0 hour(s) and 56 minute(s) [09:03:12] In 0 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240311T1000) [09:03:39] you should be fine to go ahead now with the backports if you still have the time [09:25:09] kostajh: my guess is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1009790 got merged but hasn't been deployed due to the scap issue [09:26:00] hashar: ah, right. [09:26:11] mo_abualruz: are you still around? we can continue with the backport [09:26:34] I am [09:27:19] we can do it now, that looks straight forward to test [09:27:24] ok [09:27:30] hashar: do you want to do it, or should I? [09:27:52] I also had a patch for the window (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1008112) which is a no-op, and can be merged without verification [09:28:00] please do :) [09:28:03] I can do it cool let me [09:28:44] oh it was not addressed to me nvm [09:35:20] mo_abualruz: ok, hang on [09:35:45] !log kharlan@deploy2002 Started scap: Backport for [[gerrit:1009790|Exclude non-functional pages from night mode (T359183)]] [09:35:50] T359183: Exclude non-functional pages from night mode - https://phabricator.wikimedia.org/T359183 [09:38:00] !log kharlan@deploy2002 jdlrobson and kharlan: Backport for [[gerrit:1009790|Exclude non-functional pages from night mode (T359183)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:38:10] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:38:17] mo_abualruz: do you mind verifying again? [09:39:07] Sure [09:39:39] It is working [09:40:50] \o/ [09:40:55] jnuche: thanks for the scap fix! 
[09:41:17] 🥳 [09:41:42] !log kharlan@deploy2002 jdlrobson and kharlan: Continuing with sync [09:48:43] PROBLEM - Check whether ferm is active by checking the default input chain on mw2351 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:52:28] !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:1009790|Exclude non-functional pages from night mode (T359183)]] (duration: 16m 42s) [09:52:32] T359183: Exclude non-functional pages from night mode - https://phabricator.wikimedia.org/T359183 [09:56:23] (03CR) 10Majavah: [C: 03+2] P:puppetserver: git: mark /srv/git as safe [puppet] - 10https://gerrit.wikimedia.org/r/1009805 (owner: 10Majavah) [09:59:12] mo_abualruz: all done [09:59:25] thanks a lot [09:59:42] !log UTC morning deploys done [09:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:59] (I decided to leave my patch for next week) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240311T1000) [10:12:53] (03PS1) 10Dzahn: add 'kus' (Kusaal) language to project languages [dns] - 10https://gerrit.wikimedia.org/r/1010161 (https://phabricator.wikimedia.org/T359757) [10:14:43] (03CR) 10Dzahn: [C: 03+2] add 'kus' (Kusaal) language to project languages [dns] - 10https://gerrit.wikimedia.org/r/1010161 (https://phabricator.wikimedia.org/T359757) (owner: 10Dzahn) [10:14:47] (03PS2) 10Dzahn: add 'kus' (Kusaal) language to project languages [dns] - 10https://gerrit.wikimedia.org/r/1010161 (https://phabricator.wikimedia.org/T359757) [10:18:03] (03CR) 10Majavah: [V: 03+1] "Ok. I guess the cache eviction `curl` call is failing? If so, that a separate issue than this one that we should fix separately. 
The rest " [puppet] - 10https://gerrit.wikimedia.org/r/1007396 (owner: 10Majavah) [10:18:43] RECOVERY - Check whether ferm is active by checking the default input chain on mw2351 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:22:11] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:24:13] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1010161 (https://phabricator.wikimedia.org/T359757) (owner: 10Dzahn) [10:26:06] !log DNS - added new project language 'kus' - Kusaal is a Gur language spoken primarily in northern eastern Ghana, and Burkina Faso. It is spoken by about 121,000 people. T359757 [10:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:18] T359757: Create Wikipedia Kusaal - https://phabricator.wikimedia.org/T359757 [10:28:03] 06SRE, 10ops-eqiad: Degraded RAID on dumpsdata1007 - https://phabricator.wikimedia.org/T359702#9619268 (10Jclark-ctr) a:03Jclark-ctr [10:32:06] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:32:13] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:33:19] (03PS1) 10Mvolz: editcheckreferenceurl: don't error when aborting the lookupPromise [extensions/Citoid] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009740 (https://phabricator.wikimedia.org/T359601) [10:49:01] 06SRE, 10ops-eqiad: Degraded RAID on dumpsdata1007 - https://phabricator.wikimedia.org/T359702#9619344 (10Jclark-ctr) ticket submitted You have successfully submitted request SR186677718. 
@Marostegui lets catch up about eta for replacement [10:54:33] 06SRE, 10ops-eqiad, 06Data-Engineering: Degraded RAID on dumpsdata1007 - https://phabricator.wikimedia.org/T359702#9619351 (10Marostegui) [11:02:23] (03CR) 10Dzahn: [C: 03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009859 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [11:03:29] (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009859 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [11:12:28] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:21:15] (03CR) 10Andrew Bogott: "It looks like the cache eviction happens last, so the important bits are likely getting done even though the run returns failure. So maybe" [puppet] - 10https://gerrit.wikimedia.org/r/1007396 (owner: 10Majavah) [11:30:02] (03CR) 10Andrew Bogott: [C: 03+1] "Confirmed, this works on the second pass." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007396 (owner: 10Majavah) [11:37:41] (03PS2) 10Andrew Bogott: git-sync-upstream: on puppet7, deploy code after update [puppet] - 10https://gerrit.wikimedia.org/r/1009798 (https://phabricator.wikimedia.org/T351450) [11:37:42] (03PS2) 10Andrew Bogott: git-sync-upstream.py: run through black [puppet] - 10https://gerrit.wikimedia.org/r/1009799 [11:37:44] (03PS12) 10Andrew Bogott: wmf_sink: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455) [11:37:50] (03PS13) 10Andrew Bogott: wmcs-puppetcertleaks: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007444 (https://phabricator.wikimedia.org/T351455) [11:37:55] (SystemdUnitFailed) resolved: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:37:58] (03PS1) 10Andrew Bogott: P:puppetserver: git: mark repos dirs as safe [puppet] - 10https://gerrit.wikimedia.org/r/1010166 [11:40:17] (03CR) 10Majavah: [V: 03+1 C: 03+2] "If the" [puppet] - 10https://gerrit.wikimedia.org/r/1007396 (owner: 10Majavah) [11:41:25] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:42:56] (03CR) 10CI reject: [V: 04-1] P:puppetserver: git: mark repos dirs as safe [puppet] - 10https://gerrit.wikimedia.org/r/1010166 (owner: 10Andrew Bogott) [11:49:57] (03PS1) 10Majavah: hieradata: WMCS: try to evict Puppet cache after more operations [puppet] - 10https://gerrit.wikimedia.org/r/1010168 (https://phabricator.wikimedia.org/T351450) [11:50:57] (03CR) 10Majavah: "This is probably fine, but can you try 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1010168 instead first? We manually commit/reba" [puppet] - 10https://gerrit.wikimedia.org/r/1009798 (https://phabricator.wikimedia.org/T351450) (owner: 10Andrew Bogott) [11:51:28] (03CR) 10Majavah: [C: 04-1] P:puppetserver: git: mark repos dirs as safe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1010166 (owner: 10Andrew Bogott) [12:05:21] (03PS2) 10Andrew Bogott: P:puppetserver: git: mark repos dirs as safe [puppet] - 10https://gerrit.wikimedia.org/r/1010166 [12:05:22] (03PS13) 10Andrew Bogott: wmf_sink: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455) [12:05:24] (03PS14) 10Andrew Bogott: wmcs-puppetcertleaks: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007444 (https://phabricator.wikimedia.org/T351455) [12:06:31] (03CR) 10Andrew Bogott: [C: 03+1] hieradata: WMCS: try to evict Puppet cache after more operations [puppet] - 10https://gerrit.wikimedia.org/r/1010168 (https://phabricator.wikimedia.org/T351450) (owner: 10Majavah) [12:06:31] (03CR) 10Majavah: "I don't think we can merge this before we've upgraded the cloudwide puppetservers?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455) (owner: 10Andrew Bogott) [12:06:53] (03CR) 10Majavah: [V: 03+1 C: 03+2] hieradata: WMCS: try to evict Puppet cache after more operations [puppet] - 10https://gerrit.wikimedia.org/r/1010168 (https://phabricator.wikimedia.org/T351450) (owner: 10Majavah) [12:07:01] (03CR) 10Andrew Bogott: P:puppetserver: git: mark repos dirs as safe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1010166 (owner: 10Andrew Bogott) [12:07:08] (03PS1) 10KartikMistry: Update cxserver to 2024-03-11-120258-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1010169 (https://phabricator.wikimedia.org/T350773) [12:07:46] (03PS1) 10Majavah: hieradata: update striker to 2024-03-11-120408-production [puppet] - 10https://gerrit.wikimedia.org/r/1010171 [12:07:57] (03CR) 10Majavah: [C: 03+1] P:puppetserver: git: mark repos dirs as safe [puppet] - 10https://gerrit.wikimedia.org/r/1010166 (owner: 10Andrew Bogott) [12:09:16] (03CR) 10Majavah: [C: 03+2] hieradata: update striker to 2024-03-11-120408-production [puppet] - 10https://gerrit.wikimedia.org/r/1010171 (owner: 10Majavah) [12:14:11] (03CR) 10Andrew Bogott: [C: 03+1] "This is fine with me; I'd rather have it removed than installed and broken." 
[puppet] - 10https://gerrit.wikimedia.org/r/1009350 (owner: 10Majavah) [12:15:34] (03CR) 10Andrew Bogott: "hm, definitely didn't mean to +1 myself" [puppet] - 10https://gerrit.wikimedia.org/r/1010166 (owner: 10Andrew Bogott) [12:15:48] (03CR) 10Andrew Bogott: [C: 03+2] P:puppetserver: git: mark repos dirs as safe [puppet] - 10https://gerrit.wikimedia.org/r/1010166 (owner: 10Andrew Bogott) [12:17:58] (03CR) 10Majavah: [C: 03+2] Undeploy Striker from codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1009350 (owner: 10Majavah) [12:23:04] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009955 [12:37:36] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9619690 (10SBisson) I approve but I am not @GMikesell-WMF's manager. That woul probably be @Jrbranaa [12:39:11] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 92 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:44:11] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 68 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:50:40] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:50:46] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:52:40] !log Re-starting MediaModeration scanning script [12:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:53] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [12:57:22] !log dani@deploy2002 helmfile 
[staging] DONE helmfile.d/services/miscweb: apply [12:57:23] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [12:59:09] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [12:59:10] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [12:59:40] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240311T1300). [13:00:05] Superpes, Jhs, and mvolz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:29] o/ [13:00:56] Hi :) [13:01:13] (03PS2) 10Cyndywikime: Add account_conversion event streams. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989216 [13:04:14] (03PS3) 10Cyndywikime: Add account_conversion event streams. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989216 [13:04:54] (03CR) 10Cyndywikime: Add account_conversion event streams. (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989216 (owner: 10Cyndywikime) [13:06:13] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 102 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:08:08] Who is available to help deploy today? 
I think I can mostly do it on my own but I'd like someone to double check I've set everything up right before I go for it :) [13:11:13] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 79 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:14:57] In the meantime could you please deploy my patch because I’ve to go out in some minutes… [13:16:42] RoanKattouw, Lucas_WMDE, urbanecm, TheresNoTime - any of you around to help Superpes? [13:18:03] Superpes: I wouldn't feel confident doing your patch, I've never done a config patch before, sorry! [13:18:11] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 91 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:18:24] mvolz: give me 5 minutes and I'll be around [13:19:27] Thank TheresNoTime! In case I’m not around could you please check my patch? [13:19:57] Just need to go on special:block on itwiki and see if there’s the “block the user talk page” option [13:20:06] Superpes: ack, okay [13:20:35] tnx! 
[13:20:48] Superpes: mvolz: I'm going to start with 1009731 [13:21:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009731 (owner: 10Superpes15) [13:21:56] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:22:12] (03Merged) 10jenkins-bot: [itwiki] Set 'wgBlockAllowsUTEdit' to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009731 (owner: 10Superpes15) [13:22:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 16 hosts with reason: Primary switchover x1 T359790 [13:22:20] T359790: Switchover x1 master (db1179 -> db1220) - https://phabricator.wikimedia.org/T359790 [13:22:30] !log samtar@deploy2002 Started scap: Backport for [[gerrit:1009731|[itwiki] Set 'wgBlockAllowsUTEdit' to true]] [13:22:36] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:22:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 16 hosts with reason: Primary switchover x1 T359790 [13:22:43] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:23:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db1220 with weight 0 T359790', diff saved to https://phabricator.wikimedia.org/P58701 and previous config saved to /var/cache/conftool/dbconfig/20240311-132259-arnaudb.json [13:24:35] !log samtar@deploy2002 superpes and samtar: Backport for [[gerrit:1009731|[itwiki] Set 'wgBlockAllowsUTEdit' to true]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:24:40] Superpes: still around to test, or shall I? 
[13:25:15] * TheresNoTime tests [13:25:18] I’m going out in this moment :( please try the patch if you can [13:25:35] Thanks :3 [13:25:40] lgtm [13:25:43] !log samtar@deploy2002 superpes and samtar: Continuing with sync [13:25:52] (03CR) 10Elukey: [V: 03+2 C: 03+2] slo_definitions: remove prometheus label from ml-serve definitions [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009551 (owner: 10Elukey) [13:27:00] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:27:07] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:29:09] Jhs: your patch will be next — are you around? It's also marked WIP [13:29:24] mvolz: will you want to self-deploy? :) [13:29:52] TheresNoTime, i'm here, yeah [13:30:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 45.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:30:21] TheresNoTime: I'd like to give it a go. Would you double check that my patch looks okay, i.e. I've cherry picked it to the right branch etc? 
(After you're done with Jhs) [13:31:03] ack [13:31:11] (03PS1) 10Hashar: Merge tag 'v3.7.8' into wmf/stable-3.7 [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1010189 (https://phabricator.wikimedia.org/T359819) [13:31:49] (03PS1) 10Jon Harald Søby: nnwiki: Enable sandbox link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010155 (https://phabricator.wikimedia.org/T359788) [13:32:04] (03CR) 10Arnaudb: [C: 03+2] mariadb: Promote db1220 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/1009949 (https://phabricator.wikimedia.org/T359790) (owner: 10Gerrit maintenance bot) [13:32:15] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:32:21] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:32:36] (03PS2) 10Jon Harald Søby: nnwiki: Enable sandbox link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010155 (https://phabricator.wikimedia.org/T359788) [13:32:38] !log Starting x1 eqiad failover from db1179 to db1220 - T359790 [13:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:42] T359790: Switchover x1 master (db1179 -> db1220) - https://phabricator.wikimedia.org/T359790 [13:33:13] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 82 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:34:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db1220 to x1 primary T359790', diff saved to https://phabricator.wikimedia.org/P58702 and previous config saved to /var/cache/conftool/dbconfig/20240311-133405-arnaudb.json [13:35:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.59% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:35:43] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:1009731|[itwiki] Set 'wgBlockAllowsUTEdit' to true]] (duration: 13m 13s) [13:36:11] * TheresNoTime tests again [13:36:21] all looks good Superpes, deployed [13:36:26] !log Running `foreachwikiindblist group2.dblist extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-group2-sleep-30-no-render-now.txt` on a tmux session [13:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1179 T359790', diff saved to https://phabricator.wikimedia.org/P58703 and previous config saved to /var/cache/conftool/dbconfig/20240311-133631-arnaudb.json [13:36:43] Jhs: moving to your patch now [13:36:51] 👍 [13:37:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:37:34] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010155 (https://phabricator.wikimedia.org/T359788) (owner: 10Jon Harald Søby) [13:38:18] (03Merged) 10jenkins-bot: nnwiki: Enable sandbox link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010155 (https://phabricator.wikimedia.org/T359788) (owner: 10Jon Harald Søby) [13:38:33] !log samtar@deploy2002 Started scap: Backport for [[gerrit:1010155|nnwiki: Enable sandbox link 
(T359788)]] [13:38:38] T359788: Enable wmgUseSandboxLink on nnwiki - https://phabricator.wikimedia.org/T359788 [13:40:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1179.eqiad.wmnet with reason: Silence for upgrade [13:40:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1179.eqiad.wmnet with reason: Silence for upgrade [13:40:25] (03CR) 10Hashar: [C: 03+2] Merge tag 'v3.7.8' into wmf/stable-3.7 [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1010189 (https://phabricator.wikimedia.org/T359819) (owner: 10Hashar) [13:40:51] !log samtar@deploy2002 jhsoby and samtar: Backport for [[gerrit:1010155|nnwiki: Enable sandbox link (T359788)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:40:55] TheresNoTime, working as it should on mwdebug2002 👍 [13:40:55] (03Abandoned) 10Elukey: python-webapp: update mesh and base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/980904 (owner: 10Elukey) [13:41:01] ack [13:41:05] !log samtar@deploy2002 jhsoby and samtar: Continuing with sync [13:41:19] (03Abandoned) 10Elukey: profile::cache::kafka::webrequest: change the JSON format [puppet] - 10https://gerrit.wikimedia.org/r/980912 (https://phabricator.wikimedia.org/T346463) (owner: 10Elukey) [13:41:54] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:42:00] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:42:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1179.eqiad.wmnet with OS bookworm [13:42:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.26% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:45:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.02% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:45:31] (is there a reason for ^ or..?) [13:45:54] (03Merged) 10jenkins-bot: Merge tag 'v3.7.8' into wmf/stable-3.7 [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1010189 (https://phabricator.wikimedia.org/T359819) (owner: 10Hashar) [13:46:31] (03PS1) 10Hashar: Update Gerrit to v3.7.8 and update plugins [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1010192 (https://phabricator.wikimedia.org/T359819) [13:46:56] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:47:33] TheresNoTime: who is that directed at? [13:47:45] oh, just the channel, sorry [13:48:05] I put a note in -sre regardless :) [13:48:10] ok, I wasn't sure of the context :). [13:48:48] mvolz: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Citoid/+/1009740 looks perfect, so I'll ping you when I'm done with this patch and you can deploy. 
I'll be around if you need me :) [13:48:56] ok great [13:49:09] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:49:16] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:50:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.86% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:50:51] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:1010155|nnwiki: Enable sandbox link (T359788)]] (duration: 12m 18s) [13:50:56] T359788: Enable wmgUseSandboxLink on nnwiki - https://phabricator.wikimedia.org/T359788 [13:51:11] Jhs: deployed :) [13:51:16] mvolz: all yours! [13:51:25] TheresNoTime, thanks! lost my internet connection there for a bit, sorry [13:51:33] np! [13:52:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 47.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:52:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.05s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:53:03] thanks! 
about to start [13:53:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mvolz@deploy2002 using scap backport" [extensions/Citoid] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009740 (https://phabricator.wikimedia.org/T359601) (owner: 10Mvolz) [13:54:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1179.eqiad.wmnet with reason: host reimage [13:56:14] (03CR) 10Hashar: [C: 03+2] Update Gerrit to v3.7.8 and update plugins [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1010192 (https://phabricator.wikimedia.org/T359819) (owner: 10Hashar) [13:56:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1179.eqiad.wmnet with reason: host reimage [13:56:49] (03Merged) 10jenkins-bot: Update Gerrit to v3.7.8 and update plugins [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1010192 (https://phabricator.wikimedia.org/T359819) (owner: 10Hashar) [13:57:16] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.05s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:00:38] (03Merged) 10jenkins-bot: editcheckreferenceurl: don't error when aborting the lookupPromise [extensions/Citoid] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009740 (https://phabricator.wikimedia.org/T359601) (owner: 10Mvolz) [14:00:54] !log mvolz@deploy2002 Started scap: Backport for [[gerrit:1009740|editcheckreferenceurl: don't error when aborting the lookupPromise (T359601)]] [14:01:09] T359601: TypeError: Cannot read properties of undefined (reading 'abort') at ve.ui.CitoidInspector.performLookup - https://phabricator.wikimedia.org/T359601 [14:01:58] !log @deploy2002 
helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:02:05] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:02:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.62% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:02:58] !log mvolz@deploy2002 mvolz: Backport for [[gerrit:1009740|editcheckreferenceurl: don't error when aborting the lookupPromise (T359601)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:04:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 47.58% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:10:16] tested, looks like the patch fixed the wedging, so continuing. 
[14:10:21] !log mvolz@deploy2002 mvolz: Continuing with sync [14:10:30] :) [14:13:51] when you are done with the backport window, I will upgrade Gerrit [14:14:01] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1011 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:15:43] (03PS1) 10Elukey: Remove unecessary regexes from Lift Wing metrics [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1010193 [14:17:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1179.eqiad.wmnet with OS bookworm [14:19:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 1%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P58704 and previous config saved to /var/cache/conftool/dbconfig/20240311-141945-arnaudb.json [14:20:15] !log mvolz@deploy2002 Finished scap: Backport for [[gerrit:1009740|editcheckreferenceurl: don't error when aborting the lookupPromise (T359601)]] (duration: 19m 20s) [14:20:19] T359601: TypeError: Cannot read properties of undefined (reading 'abort') at ve.ui.CitoidInspector.performLookup - https://phabricator.wikimedia.org/T359601 [14:21:44] (03PS1) 10Elukey: Remove response_code label from totals in Lift Wing Availability SLOs [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1010196 [14:23:04] I am done! Thanks for the hand-holding, that was my first "solo" mediawiki backport :). [14:24:15] hashar: you're up! [14:24:28] great :) [14:24:39] mvolz: and congratulations for the backport deployment! 
[14:27:48] !log hashar@deploy2002 Started deploy [gerrit/gerrit@737c475]: Gerrit to 3.7.8 on gerrit2002 [14:27:51] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@737c475]: Gerrit to 3.7.8 on gerrit2002 (duration: 00m 03s) [14:28:57] I forgot to poke T359819 [14:28:57] T359819: Upgrade to Gerrit 3.7.8 - https://phabricator.wikimedia.org/T359819 [14:30:16] mvolz: congrats! ^^ [14:31:21] !log hashar@deploy2002 Started deploy [gerrit/gerrit@2150230]: Gerrit to 3.7.8 on gerrit2002 - T359819 [14:31:28] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@2150230]: Gerrit to 3.7.8 on gerrit2002 - T359819 (duration: 00m 07s) [14:31:37] * hashar whistles about forgetting `git rebase` on the deployment server [14:31:42] (PS2) Elukey: Remove response_code label from totals in Lift Wing Availability SLOs [grafana-grizzly] - https://gerrit.wikimedia.org/r/1010196 [14:33:07] (PS1) Arnaudb: mariadb: toggle notifications for db2205 [puppet] - https://gerrit.wikimedia.org/r/1009956 (https://phabricator.wikimedia.org/T355422) [14:33:42] * hashar whistles about forgetting `git rebase` on the deployment server [14:34:20] (PS2) Arnaudb: mariadb: toggle notifications for db2205/6/8 [puppet] - https://gerrit.wikimedia.org/r/1009956 (https://phabricator.wikimedia.org/T355422) [14:34:33] arnaudb: I am now upgrading Gerrit :D [14:34:46] (PS2) Elukey: Remove unecessary regexes from Lift Wing metrics [grafana-grizzly] - https://gerrit.wikimedia.org/r/1010193 [14:34:46] (PS3) Elukey: Remove response_code label from totals in Lift Wing Availability SLOs [grafana-grizzly] - https://gerrit.wikimedia.org/r/1010196 [14:34:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 2%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P58705 and previous config saved to /var/cache/conftool/dbconfig/20240311-143451-arnaudb.json [14:35:04] !log hashar@deploy2002 Started deploy [gerrit/gerrit@2150230]: Gerrit to 3.7.8 on gerrit1003 - 
T359819 [14:35:09] T359819: Upgrade to Gerrit 3.7.8 - https://phabricator.wikimedia.org/T359819 [14:35:14] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@2150230]: Gerrit to 3.7.8 on gerrit1003 - T359819 (duration: 00m 10s) [14:35:25] 🤞 [14:35:34] Aha, it's planned. [14:35:54] here is my monitoring assistant :) [14:36:01] * James_F was code-reviewing. :-P [14:36:16] Aka I was awake. [14:36:31] my bad, I should have announced it earlier today before my lunch [14:36:57] No worries. [14:37:14] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:55] I think it worked [14:38:08] It's back up. [14:38:17] Whether or not it works, we'll see. [14:38:31] (ProbeDown) firing: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:38:42] (ProbeDown) firing: (2) Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit1003:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:42:35] ^ lies [14:42:51] my guess is the probe is lagging [14:43:31] (ProbeDown) resolved: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:43:42] (ProbeDown) resolved: (2) Service gerrit1003:29418 has failed probes 
(tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit1003:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:44:01] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1011 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:49:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: provisionning db2209.codfw.wmnet - T355422 [14:49:31] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [14:49:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: provisionning db2209.codfw.wmnet - T355422 [14:49:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2209.codfw.wmnet with reason: provisionning db2209.codfw.wmnet - T355422 [14:49:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2209.codfw.wmnet with reason: provisionning db2209.codfw.wmnet - T355422 [14:50:11] (03PS7) 10SBassett: Remove X-Webkit-CSP-Report-Only response header from foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003108 (https://phabricator.wikimedia.org/T357479) (owner: 10TheDJ) [14:50:19] (03CR) 10Jgiannelos: "Indeed node17 introduced a change on how DNS resolution works (verbatim=True by default) [1]. 
This means that it might be the case ipv6 ge" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007959 (https://phabricator.wikimedia.org/T358017) (owner: 10Sbailey) [14:50:37] RECOVERY - Kafka broker TLS certificate validity on kafka-logging1003 is OK: SSL OK - Certificate kafka-logging1003.eqiad.wmnet valid until 2025-03-03 12:57:00 +0000 (expires in 356 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [14:51:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2109 in db2209 for T355422', diff saved to https://phabricator.wikimedia.org/P58706 and previous config saved to /var/cache/conftool/dbconfig/20240311-145102-arnaudb.json [14:51:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 4%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P58707 and previous config saved to /var/cache/conftool/dbconfig/20240311-145111-arnaudb.json [14:51:36] (03PS4) 10Elukey: Remove response_code label from totals in Lift Wing Availability SLOs [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1010196 [14:52:04] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2109.codfw.wmnet onto db2209.codfw.wmnet [14:52:54] (03PS1) 10KartikMistry: Enable Content/Section translation on some Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010226 (https://phabricator.wikimedia.org/T353510) [14:54:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.26% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:54:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: provisionning db2210.codfw.wmnet - T355422 [14:54:47] T355422: Productionize 
db2196-db2220 - https://phabricator.wikimedia.org/T355422 [14:54:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: provisionning db2210.codfw.wmnet - T355422 [14:55:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: provisionning db2210.codfw.wmnet - T355422 [14:55:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: provisionning db2210.codfw.wmnet - T355422 [14:56:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2110 in db2210 for T355422', diff saved to https://phabricator.wikimedia.org/P58708 and previous config saved to /var/cache/conftool/dbconfig/20240311-145604-arnaudb.json [14:57:02] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2110.codfw.wmnet onto db2210.codfw.wmnet [14:57:14] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:31] (03PS5) 10Elukey: Remove response_code label from totals in Lift Wing Availability SLOs [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1010196 [14:59:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: provisionning db2211.codfw.wmnet - T355422 [14:59:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: provisionning db2211.codfw.wmnet - T355422 [14:59:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2211.codfw.wmnet with reason: provisionning db2211.codfw.wmnet - T355422 [14:59:26] !log arnaudb@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2211.codfw.wmnet with reason: provisionning db2211.codfw.wmnet - T355422 [15:00:13] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:00:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2111 in db2211 for T355422', diff saved to https://phabricator.wikimedia.org/P58709 and previous config saved to /var/cache/conftool/dbconfig/20240311-150025-arnaudb.json [15:00:36] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [15:01:36] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2111.codfw.wmnet onto db2211.codfw.wmnet [15:03:44] (03PS1) 10Arnaudb: mariadb: toggle notifications for db2209/10/11 [puppet] - 10https://gerrit.wikimedia.org/r/1010246 (https://phabricator.wikimedia.org/T355422) [15:05:15] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:06:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 8%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P58710 and previous config saved to /var/cache/conftool/dbconfig/20240311-150617-arnaudb.json [15:21:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 16%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P58711 and previous config saved to /var/cache/conftool/dbconfig/20240311-152123-arnaudb.json [15:30:05] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240311T1530). 
[15:30:59] jouncebot: nowandnext [15:30:59] For the next 0 hour(s) and 29 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240311T1530) [15:30:59] In 1 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240311T1700) [15:30:59] In 1 hour(s) and 29 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240311T1700) [15:33:34] !log T357007 Running mwscript CampaignEvents:GenerateInvitationList --wiki=metawiki --listfile=/home/daimona/list.txt [15:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:39] T357007: Generate Invitation Lists for Event Organizers - https://phabricator.wikimedia.org/T357007 [15:35:57] !log jnuche@deploy2002 Installing scap version "4.71.0" for 376 hosts [15:36:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 32%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P58712 and previous config saved to /var/cache/conftool/dbconfig/20240311-153628-arnaudb.json [15:36:53] !log jnuche@deploy2002 Installation of scap version "4.71.0" completed for 376 hosts [15:39:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.92% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:41:25] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:42:25] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-wikifunctions_hourly.service on 
cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:44:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.92% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:45:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.62% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:46:05] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:49:30] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 47.95% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:49:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2111.codfw.wmnet onto db2211.codfw.wmnet [15:51:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P58713 and previous config saved to /var/cache/conftool/dbconfig/20240311-155134-arnaudb.json [16:06:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P58714 and previous config saved to 
/var/cache/conftool/dbconfig/20240311-160639-arnaudb.json [16:08:35] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Idle - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:11:17] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 100 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:16:17] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 84 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:21:21] 06SRE, 10SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9620985 (10Himejijo) [16:21:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P58715 and previous config saved to /var/cache/conftool/dbconfig/20240311-162145-arnaudb.json [16:26:05] (03PS6) 10MdsShakil: Add `suppressredirect` right to pagemover and filemover user groups in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009729 (https://phabricator.wikimedia.org/T359614) [16:33:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2109.codfw.wmnet onto db2209.codfw.wmnet [16:38:09] 06SRE, 10SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9621054 (10Himejijo) Can I just edit this ticket? 
[16:42:25] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:46:05] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:54:28] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:54:34] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240311T1700) [17:00:05] ryankemper: Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240311T1700). Please do the needful. 
[17:04:22] hashar: o/ if you have time (even tomorrow) - https://gerrit.wikimedia.org/r/c/integration/config/+/1009218 [17:11:32] (CR) Krinkle: Support cookies in XWikimediaDebug (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1000307 (https://phabricator.wikimedia.org/T350094) (owner: Gergő Tisza) [17:18:44] (CR) Krinkle: Support cookies in XWikimediaDebug (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1000307 (https://phabricator.wikimedia.org/T350094) (owner: Gergő Tisza) [17:20:49] (PS1) Ilias Sarantopoulos: WIP - httpbb: add ores-legacy tests [puppet] - https://gerrit.wikimedia.org/r/1010245 (https://phabricator.wikimedia.org/T359871) [17:25:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2110.codfw.wmnet onto db2210.codfw.wmnet [17:29:25] SRE, ops-eqiad, procurement: install (2) 1.92TB SSDs from decom into prometheus100[56] - https://phabricator.wikimedia.org/T359632#9621206 (lmata) thank you for all the help and care @Jclark-ctr and @RobH [17:35:31] elukey: I'll process it. [17:35:55] SRE, SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9621223 (Marostegui) [17:39:38] elukey: Done [17:40:18] SRE, SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9621231 (Marostegui) @thcipriani would you approve this request to mwmaint? [17:41:25] (SystemdUnitFailed) resolved: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:43:05] SRE, SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9621241 (thcipriani) >>! 
In T359490#9621230, @Marostegui wrote: > @thcipriani would you approve this request to mwmaint? This is for `restricted`, correct? Approved from me. [17:46:25] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:46:45] SRE, SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9621245 (Marostegui) [17:47:11] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:52:46] (PS1) Jforrester: Be able to disable MobileFrontend and drop the secondary domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1010268 (https://phabricator.wikimedia.org/T349408) [17:52:50] (PS1) Jforrester: [BETA CLUSTER] Disable MobileFrontend for Wikifunctions [mediawiki-config] - https://gerrit.wikimedia.org/r/1010269 (https://phabricator.wikimedia.org/T358329) [17:52:58] (PS1) Jforrester: [wikifunctionswiki] Disable MobileFrontend in production [mediawiki-config] - https://gerrit.wikimedia.org/r/1010270 (https://phabricator.wikimedia.org/T349408) [17:53:18] SRE, SRE-Access-Requests, Data-Platform-SRE (2024.03.04 - 2024.03.24): Requesting access to kubernetes deployment for tjones - https://phabricator.wikimedia.org/T359092#9621295 (thcipriani) >>! In T359092#9599307, @Marostegui wrote: > @thcipriani can you approve this request for the deployment group?... 
[18:23:31] 06SRE, 10SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9621355 (10Himejijo) [18:26:54] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:27:00] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:38:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:39:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:40:33] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:44:07] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:44:13] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:45:33] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:45:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.231 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:45:59] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51594 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:48:44] (PS1) Jdlrobson: Disable special pages on a per name basis [mediawiki-config] - https://gerrit.wikimedia.org/r/1010286 (https://phabricator.wikimedia.org/T359183) [18:50:05] (PS2) Jdlrobson: Disable special pages on a per name basis [mediawiki-config] - https://gerrit.wikimedia.org/r/1010286 (https://phabricator.wikimedia.org/T359183) [18:51:49] (PS1) Jdlrobson: Interaction to Next Paint (INP) Core Web Vital Improvement [skins/Vector] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1010215 (https://phabricator.wikimedia.org/T358380) [18:56:02] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:56:08] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:57:14] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:34:23] (CR) Marostegui: [C: +1] mariadb: toggle notifications for db2209/10/11 [puppet] - https://gerrit.wikimedia.org/r/1010246 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb) [19:34:47] (CR) Marostegui: [C: +1] mariadb: toggle notifications for db2205/6/8 [puppet] - https://gerrit.wikimedia.org/r/1009956 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb) [19:35:06] (CR) Mabualruz: [C: +1] Disable special pages on a per name basis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1010286 (https://phabricator.wikimedia.org/T359183) (owner: 10Jdlrobson) [19:37:37] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:37:44] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:58:13] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:58:19] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240311T2000) [20:00:05] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:16] oh, the late window is early this time [20:00:19] i can deploy today :) [20:00:26] Jdlrobson: around? [20:00:35] (03PS3) 10Jdlrobson: Disable special pages on a per name basis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010286 (https://phabricator.wikimedia.org/T359183) [20:00:37] urbanecm: yep [20:00:45] (03CR) 10Urbanecm: [C: 03+2] Interaction to Next Paint (INP) Core Web Vital Improvement [skins/Vector] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1010215 (https://phabricator.wikimedia.org/T358380) (owner: 10Jdlrobson) [20:00:48] YAY CLOCK CHANGES [20:01:41] fortunately, i use an electronic calendar. it would be a nightmare to keep track of this via a paper one. [20:02:06] Jdlrobson: should i wait for the backport with the config? or is it ok to deploy the config in the meantime? 
[20:03:06] neither blocks each other urbanecm [20:03:09] okay [20:03:12] you can do them in whatever order makes sense [20:03:19] (03CR) 10Urbanecm: [C: 03+2] Disable special pages on a per name basis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010286 (https://phabricator.wikimedia.org/T359183) (owner: 10Jdlrobson) [20:03:32] thanks for clarifying. just wanted to double check :) [20:03:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010286 (https://phabricator.wikimedia.org/T359183) (owner: 10Jdlrobson) [20:04:05] (03Merged) 10jenkins-bot: Disable special pages on a per name basis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010286 (https://phabricator.wikimedia.org/T359183) (owner: 10Jdlrobson) [20:04:23] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:1010286|Disable special pages on a per name basis (T359183)]] [20:04:27] T359183: Exclude non-functional pages from night mode - https://phabricator.wikimedia.org/T359183 [20:06:36] !log urbanecm@deploy2002 jdlrobson and urbanecm: Backport for [[gerrit:1010286|Disable special pages on a per name basis (T359183)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:06:50] Jdlrobson: the config's at mwdebug. can you test, please? 
:) [20:06:59] yep on it [20:07:34] urbanecm: lgtm please sync [20:07:38] !log urbanecm@deploy2002 jdlrobson and urbanecm: Continuing with sync [20:07:41] proceeding, thank you [20:09:17] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:09:23] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:14:27] PROBLEM - Check whether ferm is active by checking the default input chain on mw1367 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:16:33] PROBLEM - Disk space on centrallog1002 is CRITICAL: DISK CRITICAL - free space: /srv 52954 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops [20:18:06] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:1010286|Disable special pages on a per name basis (T359183)]] (duration: 13m 43s) [20:18:10] T359183: Exclude non-functional pages from night mode - https://phabricator.wikimedia.org/T359183 [20:18:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1010215 (https://phabricator.wikimedia.org/T358380) (owner: 10Jdlrobson) [20:19:47] (03Merged) 10jenkins-bot: Interaction to Next Paint (INP) Core Web Vital Improvement [skins/Vector] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1010215 (https://phabricator.wikimedia.org/T358380) (owner: 10Jdlrobson) [20:20:01] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:1010215|Interaction to Next Paint (INP) Core Web Vital Improvement (T358380)]] [20:20:08] T358380: [3 days] Interaction to Next Paint (INP) Core Web Vital is scored as "Needs Improvement" or "Poor" for Mobile users on Desktop - 
https://phabricator.wikimedia.org/T358380 [20:22:14] !log urbanecm@deploy2002 urbanecm and jdlrobson: Backport for [[gerrit:1010215|Interaction to Next Paint (INP) Core Web Vital Improvement (T358380)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:22:36] Jdlrobson: can you test the backport as well, please? :) [20:23:23] urbanecm: yep lgtm [20:23:26] !log urbanecm@deploy2002 urbanecm and jdlrobson: Continuing with sync [20:23:30] that was quick, syncing :) [20:33:42] thanks urbanecm :) [20:33:59] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:1010215|Interaction to Next Paint (INP) Core Web Vital Improvement (T358380)]] (duration: 13m 57s) [20:34:09] T358380: [3 days] Interaction to Next Paint (INP) Core Web Vital is scored as "Needs Improvement" or "Poor" for Mobile users on Desktop - https://phabricator.wikimedia.org/T358380 [20:36:43] and all done :) [20:36:45] no problem [20:37:16] urbanecm: I have a late addition if you are finished [20:37:36] tgr: no problem. do you want to self-serve, or do you want me to deploy for you? 
[20:38:08] (PS3) Gergő Tisza: Move checkuser grant configuration to CheckUser extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1009865 (https://phabricator.wikimedia.org/T359537) [20:38:17] thx, I can do it [20:38:24] ack, feel free to go ahead then :) [20:40:05] on second thought it needs to wait one more train cycle [20:41:05] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:41:25] (SystemdUnitFailed) firing: puppet-agent-timer.service on poolcounter2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:44:27] RECOVERY - Check whether ferm is active by checking the default input chain on mw1367 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:46:25] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:48:21] (PS1) Dwisehaupt: Update lp.email cname and validation domain [dns] - https://gerrit.wikimedia.org/r/1010315 (https://phabricator.wikimedia.org/T336000) [20:49:20] (CR) CI reject: [V: -1] Update lp.email cname and validation domain [dns] - https://gerrit.wikimedia.org/r/1010315 (https://phabricator.wikimedia.org/T336000) (owner: Dwisehaupt) [20:51:15] I'm "live testing" one step of the switchdc cookbook -- it'll only touch eqiad (currently the read-only DC) so no production impact [20:51:28] !log rzl@cumin2002 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [20:51:44] !log rzl@cumin2002 END 
(PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) [20:52:20] done 👍 [20:55:11] (CR) Dwisehaupt: "recheck" [dns] - https://gerrit.wikimedia.org/r/1010315 (https://phabricator.wikimedia.org/T336000) (owner: Dwisehaupt) [20:56:05] (CR) CI reject: [V: -1] Update lp.email cname and validation domain [dns] - https://gerrit.wikimedia.org/r/1010315 (https://phabricator.wikimedia.org/T336000) (owner: Dwisehaupt) [21:00:05] Reedy, sbassett, Maryum, and manfredi: Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240311T2100). Please do the needful. [21:01:19] (PS2) Dwisehaupt: Update lp.email cname and validation domain [dns] - https://gerrit.wikimedia.org/r/1010315 (https://phabricator.wikimedia.org/T336000) [21:01:25] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:08:51] (PS3) Dwisehaupt: Update lp.email cname and validation domain [dns] - https://gerrit.wikimedia.org/r/1010315 (https://phabricator.wikimedia.org/T336000) [21:11:45] (CR) Jgreen: [C: +2] Update lp.email cname and validation domain [dns] - https://gerrit.wikimedia.org/r/1010315 (https://phabricator.wikimedia.org/T336000) (owner: Dwisehaupt) [21:12:05] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:12:09] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:13:57] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.328 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:13:59] RECOVERY 
- mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51594 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:22:00] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:22:07] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:41:05] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:41:25] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:47:10] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:47:11] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:31:33] (CR) Krinkle: [C: +1] "Untested but LGTM. 
I suggest whoever merges it, perhaps runs it first and/or shortly afterwards to confirm just in case that the dums stil" [puppet] - https://gerrit.wikimedia.org/r/1009784 (https://phabricator.wikimedia.org/T99268) (owner: Ahmon Dancy) [22:47:29] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:47:36] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:53:26] (PS3) Gergő Tisza: Support cookies in XWikimediaDebug [mediawiki-config] - https://gerrit.wikimedia.org/r/1000307 (https://phabricator.wikimedia.org/T350094) [22:53:31] (CR) Gergő Tisza: Support cookies in XWikimediaDebug (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1000307 (https://phabricator.wikimedia.org/T350094) (owner: Gergő Tisza) [22:57:29] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:14:47] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:14:54] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:17:57] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:18:03] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:25:50] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:25:56] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:37:24] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:37:30] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:46:25] (SystemdUnitFailed) resolved: 
rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:50:25] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:52:24] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:52:30] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply