[00:01:10] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:07:00] win 14 [00:12:18] !log aaron@deploy1002 Started deploy [performance/arc-lamp@40cb764]: T315056 [00:12:23] T315056: arclamp_generate_svgs OOMs - https://phabricator.wikimedia.org/T315056 [00:12:25] !log aaron@deploy1002 Finished deploy [performance/arc-lamp@40cb764]: T315056 (duration: 00m 07s) [00:15:22] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [00:18:08] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:22:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-drop-eventlogging-legacy-raw-partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:16] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [00:32:55] (03PS3) 10Tim Starling: Apply scaling_governor=performance to MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/826405 (https://phabricator.wikimedia.org/T315398) [00:33:17] (03CR) 10Tim Starling: Apply scaling_governor=performance to MediaWiki servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826405 (https://phabricator.wikimedia.org/T315398) (owner: 10Tim Starling) [00:41:56] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [00:43:04] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:44:12] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:54:02] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [00:55:10] RECOVERY - nova instance creation test on cloudcontrol1005 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [00:58:54] (03CR) 10Tim Starling: "New PCC result: https://puppet-compiler.wmflabs.org/pcc-worker1003/37034/" [puppet] - 10https://gerrit.wikimedia.org/r/826405 (https://phabricator.wikimedia.org/T315398) (owner: 10Tim Starling) [01:03:02] (03CR) 10Tim Starling: [C: 03+2] Apply scaling_governor=performance to MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/826405 (https://phabricator.wikimedia.org/T315398) (owner: 10Tim Starling) [01:04:18] !log setting scaling_governor=performance on all mediawiki servers, via puppet gerrit 826405 [01:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:12] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:10:20] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.02471 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [01:11:04] PROBLEM - 
etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:15:59] (03PS3) 10Andrew Bogott: wikimediacloud.org: do not use CNAMEs for nsX addresses [dns] - 10https://gerrit.wikimedia.org/r/827446 (https://phabricator.wikimedia.org/T315955) (owner: 10Majavah) [01:18:10] (03CR) 10Andrew Bogott: [C: 03+2] wikimediacloud.org: do not use CNAMEs for nsX addresses [dns] - 10https://gerrit.wikimedia.org/r/827446 (https://phabricator.wikimedia.org/T315955) (owner: 10Majavah) [01:28:06] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:35:26] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:36:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:40:16] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:45] (JobUnavailable) firing: (10) Reduced 
availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:06] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T0200) [02:03:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:04:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:04:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:06:45] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:07:06] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:07:50] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.39.0-wmf.27 [core] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827612 (https://phabricator.wikimedia.org/T314188) [02:07:56] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.39.0-wmf.27 [core] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827612 (https://phabricator.wikimedia.org/T314188) (owner: 10TrainBranchBot) [02:08:50] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.00436 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [02:09:06] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:21:44] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:24:29] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/1.39.0-wmf.27 [core] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827612 (https://phabricator.wikimedia.org/T314188) (owner: 10TrainBranchBot) [02:26:36] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:26:50] 
(03CR) 10Ahmon Dancy: [C: 03+2] Branch commit for wmf/1.39.0-wmf.27 [core] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827612 (https://phabricator.wikimedia.org/T314188) (owner: 10TrainBranchBot) [02:39:39] 10SRE, 10Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (10tstarling) 05Open→03Resolved a:03tstarling {F35496065} [02:39:43] 10SRE, 10Cloud-VPS, 10Performance-Team (Radar), 10cloud-services-team (Kanban): CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10tstarling) [02:44:37] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/1.39.0-wmf.27 [core] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827612 (https://phabricator.wikimedia.org/T314188) (owner: 10TrainBranchBot) [02:50:58] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:51:49] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) [02:55:50] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T0300) [03:01:46] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) Stage 3 (traffic percentage) is useful for capacity modelling, but it's not expected to be optimal for data store consistency, since the 
stability... [03:06:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:07:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:08:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:08:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:09:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:09:33] (03PS1) 10Tim Starling: Multi-DC stage 3: send 2% of traffic to appservers-ro [puppet] - 10https://gerrit.wikimedia.org/r/827616 (https://phabricator.wikimedia.org/T279664) [03:09:35] (03PS1) 10Tim Starling: Multi-DC stage 4: send all traffic to appservers-ro [puppet] - 10https://gerrit.wikimedia.org/r/827617 (https://phabricator.wikimedia.org/T279664) [03:10:26] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:14:54] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:22:20] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:32:24] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:00:20] RECOVERY - Check systemd state on dse-k8s-worker1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:07:38] PROBLEM - Check systemd state on dse-k8s-worker1005 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:21:06] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:31:44] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) Planning for stage 3/4 capacity monitoring. > Observe cross-DC database connection rate, analyse sources In the DBPerformance logs, we see a da... 
[04:35:46] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:35:52] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.115`. Pre-deploy tests passing on canary `wdqs1003` [04:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:36:04] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@2d34f5c]: 0.3.115 [04:37:33] !log [WDQS Deploy] Tests passing following deploy of `0.3.115` on canary `wdqs1003`; proceeding to rest of fleet [04:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:43:06] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:45:05] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@2d34f5c]: 0.3.115 (duration: 09m 01s) [04:56:24] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T315229 (10Marostegui) And it worked fine indeed: ` root@db2110:~# megacli -LDInfo -Lall -aALL Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1... 
[04:57:58] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [04:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:58:02] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [04:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:58:07] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [04:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:58:25] 10SRE, 10ops-codfw, 10DBA: db2149 is sad after reboot - https://phabricator.wikimedia.org/T316494 (10Marostegui) MySQL started. [04:58:46] !log [WCQS Deploy] Gearing up for deploy of wcqs `0.3.115` [04:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:58:57] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@2d34f5c] (wcqs): Deploy 0.3.115 to WCQS [04:59:08] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2149 - https://phabricator.wikimedia.org/T316565 (10Marostegui) 05Open→03Declined It is being handled at T316494 [05:00:58] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@2d34f5c] (wcqs): Deploy 0.3.115 to WCQS (duration: 02m 00s) [05:01:02] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:01:45] !log [WCQS Deploy] Restarted `wcqs-updater` across all hosts: `sudo -E cumin 'A:wcqs-public' 'systemctl restart wcqs-updater'` [05:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:14] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate 
mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) [05:03:23] 10SRE, 10DBA: Replication stopped on db1143 - https://phabricator.wikimedia.org/T315742 (10Marostegui) 05Open→03Resolved a:03Marostegui The host is back in sync with the master. Not repooling it yet as all 10.6 hosts are depooled. I have also moved it under the current master, db1160. [05:03:45] !log T306899 T316496 Deployed WCQS `0.3.115`. That should (hopefully) resolve these tickets. [05:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:50] T306899: WCQS 500 errors - https://phabricator.wikimedia.org/T306899 [05:03:51] T316496: WCQS does not report proper lag information - https://phabricator.wikimedia.org/T316496 [05:06:31] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) It looks like the bug was introduced at https://jira.mariadb.org/browse/MDEV-27058 and the reason is described at: http... 
[05:10:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s6 T316110 [05:10:36] T316110: Switchover s6 master (db1173 -> db1131) - https://phabricator.wikimedia.org/T316110 [05:10:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s6 T316110 [05:11:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1131 with weight 0 T316110', diff saved to https://phabricator.wikimedia.org/P33636 and previous config saved to /var/cache/conftool/dbconfig/20220830-051106-ladsgroup.json [05:12:24] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:14:46] 10SRE, 10ops-codfw, 10DBA: db2149 broken storage after reboot - https://phabricator.wikimedia.org/T316494 (10Marostegui) [05:17:18] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:22:10] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:29:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [05:29:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [05:29:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T316186)', diff saved to https://phabricator.wikimedia.org/P33637 and previous config saved to 
/var/cache/conftool/dbconfig/20220830-052930-ladsgroup.json [05:30:13] (03PS2) 10Ladsgroup: mariadb: Promote db1131 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/826281 (https://phabricator.wikimedia.org/T316110) (owner: 10Gerrit maintenance bot) [05:30:24] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db1131 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/826281 (https://phabricator.wikimedia.org/T316110) (owner: 10Gerrit maintenance bot) [05:30:34] just got a "wiki is in readonly mode" on wikitech (next save attempt worked). Is that supposed to happen? [05:31:54] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:33:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [05:33:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [05:35:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T316186)', diff saved to https://phabricator.wikimedia.org/P33638 and previous config saved to /var/cache/conftool/dbconfig/20220830-053529-ladsgroup.json [05:35:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [05:35:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [05:35:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [05:35:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [05:36:01] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T316186)', diff saved to https://phabricator.wikimedia.org/P33639 and previous config saved to /var/cache/conftool/dbconfig/20220830-053559-ladsgroup.json [05:42:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T316186)', diff saved to https://phabricator.wikimedia.org/P33640 and previous config saved to /var/cache/conftool/dbconfig/20220830-054217-ladsgroup.json [05:42:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [05:42:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [05:42:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T316186)', diff saved to https://phabricator.wikimedia.org/P33641 and previous config saved to /var/cache/conftool/dbconfig/20220830-054242-ladsgroup.json [05:44:54] 10SRE, 10Wikimedia-Etherpad, 10serviceops: Upgrade etherpad.wikimedia.org to (more) recent Etherpad version with more rich end-user features - https://phabricator.wikimedia.org/T316421 (10Spinster) FWIW, I filed this ticket because I was using the public https://board.net/ Etherpad instance, and saw all thos... 
[05:46:32] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:49:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T316186)', diff saved to https://phabricator.wikimedia.org/P33642 and previous config saved to /var/cache/conftool/dbconfig/20220830-054859-ladsgroup.json [05:49:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [05:49:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [05:49:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T316186)', diff saved to https://phabricator.wikimedia.org/P33643 and previous config saved to /var/cache/conftool/dbconfig/20220830-054924-ladsgroup.json [05:55:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T316186)', diff saved to https://phabricator.wikimedia.org/P33644 and previous config saved to /var/cache/conftool/dbconfig/20220830-055555-ladsgroup.json [05:56:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [05:56:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [05:59:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [05:59:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [05:59:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T316186)', diff saved to 
https://phabricator.wikimedia.org/P33645 and previous config saved to /var/cache/conftool/dbconfig/20220830-055948-ladsgroup.json [06:00:05] kormat, marostegui, and Amir1: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T0600). [06:00:11] starting [06:00:18] !log Starting s6 eqiad failover from db1173 to db1131 - T316110 [06:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:22] T316110: Switchover s6 master (db1173 -> db1131) - https://phabricator.wikimedia.org/T316110 [06:00:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s6 eqiad as read-only for maintenance - T316110', diff saved to https://phabricator.wikimedia.org/P33646 and previous config saved to /var/cache/conftool/dbconfig/20220830-060026-ladsgroup.json [06:00:29] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [06:01:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1131 to s6 primary and set section read-write T316110', diff saved to https://phabricator.wikimedia.org/P33647 and previous config saved to /var/cache/conftool/dbconfig/20220830-060109-ladsgroup.json [06:03:16] (03PS2) 10Ladsgroup: wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/826282 (https://phabricator.wikimedia.org/T316110) (owner: 10Gerrit maintenance bot) [06:03:26] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/826282 (https://phabricator.wikimedia.org/T316110) (owner: 10Gerrit maintenance bot) [06:05:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T316186)', diff saved to https://phabricator.wikimedia.org/P33648 and previous config saved to /var/cache/conftool/dbconfig/20220830-060509-ladsgroup.json [06:05:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with 
reason: Maintenance [06:05:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [06:05:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1173 T316110 T312984 T312863 T316186', diff saved to https://phabricator.wikimedia.org/P33649 and previous config saved to /var/cache/conftool/dbconfig/20220830-060543-ladsgroup.json [06:05:54] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [06:05:54] T316110: Switchover s6 master (db1173 -> db1131) - https://phabricator.wikimedia.org/T316110 [06:05:55] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [06:05:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2112 (T316186)', diff saved to https://phabricator.wikimedia.org/P33650 and previous config saved to /var/cache/conftool/dbconfig/20220830-060554-ladsgroup.json [06:07:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [06:07:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [06:08:36] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [06:12:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T316186)', diff saved to https://phabricator.wikimedia.org/P33651 and previous config saved to /var/cache/conftool/dbconfig/20220830-061218-ladsgroup.json [06:12:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [06:12:37] !log ladsgroup@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [06:12:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T316186)', diff saved to https://phabricator.wikimedia.org/P33652 and previous config saved to /var/cache/conftool/dbconfig/20220830-061243-ladsgroup.json [06:16:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1173.eqiad.wmnet with reason: Maintenance [06:16:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1173.eqiad.wmnet with reason: Maintenance [06:19:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T316186)', diff saved to https://phabricator.wikimedia.org/P33653 and previous config saved to /var/cache/conftool/dbconfig/20220830-061901-ladsgroup.json [06:19:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [06:19:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [06:19:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T316186)', diff saved to https://phabricator.wikimedia.org/P33654 and previous config saved to /var/cache/conftool/dbconfig/20220830-061926-ladsgroup.json [06:22:36] (03PS1) 10Marostegui: site.pp: Add db1196-db1203 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/827859 (https://phabricator.wikimedia.org/T306848) [06:23:16] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [06:24:05] (03CR) 10Marostegui: [C: 03+2] site.pp: Add db1196-db1203 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/827859 (https://phabricator.wikimedia.org/T306848) 
(owner: Marostegui) [06:25:30] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (Marostegui) @Cmjohnson @Jclark-ctr I have added the insetup role for these hosts and the partitioning schema in puppet for y... [06:25:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T316186)', diff saved to https://phabricator.wikimedia.org/P33655 and previous config saved to /var/cache/conftool/dbconfig/20220830-062547-ladsgroup.json [06:25:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [06:26:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [06:26:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T316186)', diff saved to https://phabricator.wikimedia.org/P33656 and previous config saved to /var/cache/conftool/dbconfig/20220830-062613-ladsgroup.json [06:26:26] (PS1) Gerrit maintenance bot: mariadb: Promote db1123 to s3 master [puppet] - https://gerrit.wikimedia.org/r/827860 (https://phabricator.wikimedia.org/T316622) [06:26:30] (PS1) Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/827861 (https://phabricator.wikimedia.org/T316622) [06:26:57] (PS1) Gerrit maintenance bot: mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/827862 (https://phabricator.wikimedia.org/T316623) [06:27:01] (PS1) Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/827863 (https://phabricator.wikimedia.org/T316623) [06:28:33] SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106
(Marostegui) [06:32:14] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:33:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T316186)', diff saved to https://phabricator.wikimedia.org/P33657 and previous config saved to /var/cache/conftool/dbconfig/20220830-063332-ladsgroup.json [06:37:08] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:38:19] SRE, Infrastructure-Foundations, fundraising-tech-ops, netops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (ayounsi) @Dwisehaupt that will be too soon for us (SRE summit + routers upgrades planned this month). Is the following maintenance week known?
[06:43:33] (03PS1) 10Marostegui: mariadb: Promote db2115 to x1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/827864 (https://phabricator.wikimedia.org/T316522) [06:45:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:50:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:50:46] <_joe_> I don't understand these probes tbh [06:51:15] <_joe_> but I'd say the videoscalers are still overloaded [06:51:44] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:53:41] <_joe_> !log running scap pull on parse1* T316611 [06:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:46] T316611: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php - https://phabricator.wikimedia.org/T316611 [06:59:30] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.963 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:59:36] RECOVERY - Check systemd state on dse-k8s-worker1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:04] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T0700). [07:00:05] tgr: A patch you scheduled for UTC morning backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:35] o/ will deploy in a sec [07:01:42] <_joe_> tgr: hold on please [07:01:51] <_joe_> I just noticed train-presync failed tonight [07:02:00] <_joe_> so I'm not sure what is the status of mediawiki-staging [07:02:17] <_joe_> we need someone from releng to take a look I guess [07:04:02] <_joe_> uhhh, nevermind, the problem seems to be internal to the new undeployed branch (.27) [07:04:05] <_joe_> you can go on [07:05:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:06:10] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:06:30] PROBLEM - Check systemd state on dse-k8s-worker1005 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:08:06] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [07:09:15] (03PS6) 10Gergő Tisza: Declare mediawiki.accountcreation_block stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822686 (https://phabricator.wikimedia.org/T306018) (owner: 10Sergio Gimeno) [07:10:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:10:51] (03CR) 10Gergő Tisza: [C: 03+2] Declare 
mediawiki.accountcreation_block stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822686 (https://phabricator.wikimedia.org/T306018) (owner: 10Sergio Gimeno) [07:11:18] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [07:11:51] (03Merged) 10jenkins-bot: Declare mediawiki.accountcreation_block stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822686 (https://phabricator.wikimedia.org/T306018) (owner: 10Sergio Gimeno) [07:13:26] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.491 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:13:36] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [07:15:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:15:22] (03CR) 10Muehlenhoff: [C: 03+2] profile::maps::tlsproxy: Unconditionally disable ssl_ecdhe_curve [puppet] - 10https://gerrit.wikimedia.org/r/826847 (owner: 10Muehlenhoff) [07:17:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:17:22] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822686|Declare mediawiki.accountcreation_block stream (T306018)]] (duration: 04m 11s) [07:17:26] T306018: Instrument blocked account registration - https://phabricator.wikimedia.org/T306018 [07:17:50] !log UTC morning deploy window done [07:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:18:12] !log mwdebug-deploy@deploy1002 
helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:19:18] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:20:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:22:56] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:23:26] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [07:25:40] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.727 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [07:26:12] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [07:28:36] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.285 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [07:28:51] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (10phaultfinder) [07:30:19] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:30:36] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [07:35:12] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:35:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes 
(http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:36:54] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [07:37:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [07:37:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [07:37:30] (03PS2) 10Jelto: RelEng Access Requests [puppet] - 10https://gerrit.wikimedia.org/r/827494 (https://phabricator.wikimedia.org/T316528) (owner: 10Thcipriani) [07:37:40] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [07:38:12] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [07:38:24] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:38:36] (03CR) 10Muehlenhoff: rancid: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826867 (owner: 10Muehlenhoff) [07:39:48] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:39:56] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.484 second 
response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:41:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:41:44] (03CR) 10Ayounsi: [C: 03+1] rancid: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826867 (owner: 10Muehlenhoff) [07:42:40] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [07:43:02] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.943 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [07:43:47] (03CR) 10Ayounsi: "Removing my +1 as there are users/perm discussions in https://phabricator.wikimedia.org/T316569" [puppet] - 10https://gerrit.wikimedia.org/r/826867 (owner: 10Muehlenhoff) [07:44:22] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:44:56] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.211 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [07:45:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:45:46] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:46:06] (03CR) 10Muehlenhoff: [C: 03+2] xenon: Switch to systemd::sysuser [puppet] - 
10https://gerrit.wikimedia.org/r/804546 (owner: 10Muehlenhoff) [07:46:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:46:34] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [07:46:38] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.164 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:48:00] (03PS1) 10DCausse: Revert "deployment-prep: change ES version from 6 to 7" [puppet] - 10https://gerrit.wikimedia.org/r/827567 [07:48:12] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.902 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:49:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:52:20] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [07:53:31] (03CR) 10Gehel: [C: 03+2] Revert "deployment-prep: change ES version from 6 to 7" [puppet] - 10https://gerrit.wikimedia.org/r/827567 (owner: 10DCausse) [07:58:38] (03CR) 10Ladsgroup: [C: 03+1] "The switchmaster should be able to do this for codfw too. Let's give it a try for the next ones." 
[puppet] - 10https://gerrit.wikimedia.org/r/827864 (https://phabricator.wikimedia.org/T316522) (owner: 10Marostegui) [07:58:53] (03PS1) 10Ayounsi: BGP: remove local-as 14907 loops 2 for anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/827950 [08:02:18] (03CR) 10Ayounsi: "Not sure what's blocking here but would it be possible to merge the child commit in that chain? I see that CI is happy now." [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond) [08:04:14] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:04:48] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:05:04] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:05:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:05:49] (03CR) 10Ayounsi: [C: 03+2] Spicerack: add prettytable for peering cookbook [puppet] - 10https://gerrit.wikimedia.org/r/816824 (owner: 10Ayounsi) [08:09:41] (03PS1) 10Vgutierrez: trafficserver: Enforce per request timeout globally [puppet] - 10https://gerrit.wikimedia.org/r/827952 (https://phabricator.wikimedia.org/T315533) [08:09:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:10:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:11:30] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.030 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [08:12:15] (03PS1) 10Ladsgroup: Stop writing to old templatelinks fields in s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827953 (https://phabricator.wikimedia.org/T312865) [08:12:20] (03CR) 10JMeybohm: [C: 03+1] kubestage: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826828 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [08:12:33] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:12:36] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:48] (03CR) 10JMeybohm: [C: 03+1] kubernetes: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826840 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [08:13:23] (03CR) 10JMeybohm: [C: 03+1] ml-staging: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826833 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [08:13:28] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [08:13:49] (03CR) 10JMeybohm: [C: 03+1] 
ml-serve: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826842 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [08:13:51] (03PS2) 10Ladsgroup: Stop writing to old templatelinks fields in s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827953 (https://phabricator.wikimedia.org/T312865) [08:13:52] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.551 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:14:24] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [08:14:28] (03CR) 10JMeybohm: [C: 03+1] deployment-server: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826849 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [08:14:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:14:52] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 1.445 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:15:13] (03CR) 10JMeybohm: [C: 03+1] releases: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826852 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [08:15:22] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:15:38] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.025 second response time 
https://wikitech.wikimedia.org/wiki/Jobrunner [08:15:53] (03CR) 10JMeybohm: [C: 03+1] builder: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826853 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [08:16:40] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.127 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [08:19:37] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deploy-phabricator for dancy - https://phabricator.wikimedia.org/T316524 (10Jelto) 05Open→03Resolved The requested access should be available now. I'm closing this task, feel free to re-open if there are problems with the access. [08:19:42] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deploy-phabricator, gerrit-root for dduvall - https://phabricator.wikimedia.org/T316526 (10Jelto) 05Open→03Resolved The requested access should be available now. I'm closing this task, feel free to re-open if there are problems with... [08:19:44] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deploy-phabricator for hashar - https://phabricator.wikimedia.org/T316527 (10Jelto) 05Open→03Resolved The requested access should be available now. I'm closing this task, feel free to re-open if there are problems with the access. [08:19:49] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deploy-phabricator, gerrit-root, phabricator-roots for jhuneidi - https://phabricator.wikimedia.org/T316521 (10Jelto) 05Open→03Resolved The requested access should be available now. I'm closing this task, feel free to re-open if ther... [08:19:52] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deploy-phabricator for jnuche - https://phabricator.wikimedia.org/T316528 (10Jelto) 05Open→03Resolved The requested access should be available now. I'm closing this task, feel free to re-open if there are problems with the access. 
[08:20:43] (03PS2) 10Vgutierrez: trafficserver: Enforce per request timeout globally [puppet] - 10https://gerrit.wikimedia.org/r/827952 (https://phabricator.wikimedia.org/T315533) [08:21:11] (03PS4) 10KartikMistry: Enable SectionTranslation on 10 more WPs where ContentTranslation is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827174 (https://phabricator.wikimedia.org/T313300) [08:22:11] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37036/console" [puppet] - 10https://gerrit.wikimedia.org/r/827952 (https://phabricator.wikimedia.org/T315533) (owner: 10Vgutierrez) [08:22:33] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:23:52] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Enforce per request timeout globally [puppet] - 10https://gerrit.wikimedia.org/r/827952 (https://phabricator.wikimedia.org/T315533) (owner: 10Vgutierrez) [08:24:42] !log ATS: enforce per-request timeout globally (205 secs) - T315533 [08:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:24:49] T315533: Set per-request timeout on ATS-BE - https://phabricator.wikimedia.org/T315533 [08:25:49] 10SRE, 10Traffic, 10serviceops: Set per-request timeout on ATS-BE - https://phabricator.wikimedia.org/T315533 (10Vgutierrez) 05Open→03Resolved [08:26:28] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 
operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:27:15] (03PS8) 10Ladsgroup: Schedule image suggestions notifications [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) (owner: 10Matthias Mullie) [08:27:22] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Schedule image suggestions notifications [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) (owner: 10Matthias Mullie) [08:29:22] (03PS1) 10ArielGlenn: switch snapshot hosts to use php7.4 [puppet] - 10https://gerrit.wikimedia.org/r/827954 (https://phabricator.wikimedia.org/T271736) [08:29:22] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:29:38] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:29:42] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [08:29:48] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:30:04] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [08:30:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: DC switchover x1 T316522 [08:30:37] T316522: Switchover x1 codfw master (db2096 -> db2115) - https://phabricator.wikimedia.org/T316522 [08:30:48] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:30:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: DC switchover x1 T316522 [08:31:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2115 with weight 0 T316522', diff saved to https://phabricator.wikimedia.org/P33658 and previous config saved to /var/cache/conftool/dbconfig/20220830-083103-root.json [08:31:34] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.375 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:32:18] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.878 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [08:32:22] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2115 to x1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/827864 (https://phabricator.wikimedia.org/T316522) (owner: 10Marostegui) [08:32:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:34:10] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.736 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [08:35:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:36:07] !log Starting x1 codfw failover from db2096 to db2115 - T316522 [08:36:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:12] T316522: Switchover x1 codfw master (db2096 -> db2115) - https://phabricator.wikimedia.org/T316522 [08:36:22] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.632 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:36:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2115 to x1 codfw primary T316522', diff saved to https://phabricator.wikimedia.org/P33659 and previous config saved to /var/cache/conftool/dbconfig/20220830-083654-root.json [08:37:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:37:45] (03PS2) 10Ayounsi: Remove 185.15.56.0/24 from network::external [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) [08:38:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2096 T316522', diff saved to https://phabricator.wikimedia.org/P33660 and previous config saved to /var/cache/conftool/dbconfig/20220830-083845-root.json [08:40:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:41:18] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [08:41:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2096.codfw.wmnet with reason: Maintenance [08:41:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2096.codfw.wmnet with reason: Maintenance [08:41:39] (03PS1) 10Marostegui: db2096: Disable 
notifications [puppet] - 10https://gerrit.wikimedia.org/r/827955 [08:41:52] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:42:10] (03PS1) 10Kosta Harlan: Temporarily disable change tag test [extensions/GrowthExperiments] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827569 (https://phabricator.wikimedia.org/T316596) [08:42:29] (03CR) 10Marostegui: [C: 03+2] db2096: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/827955 (owner: 10Marostegui) [08:42:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:42:40] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [08:43:12] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [08:43:30] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.191 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [08:44:08] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.343 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:44:46] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:45:02] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:45:20] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.606 second 
response time https://wikitech.wikimedia.org/wiki/Jobrunner [08:45:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:47:04] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.833 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [08:47:06] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 1.025 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:48:02] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [08:49:06] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:49:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give some weight to current x1 codfw master', diff saved to https://phabricator.wikimedia.org/P33661 and previous config saved to /var/cache/conftool/dbconfig/20220830-084945-root.json [08:51:21] (03CR) 10Clément Goubert: [C: 03+2] kubestage: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826828 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [08:52:24] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [08:53:14] !log failover Ganeti master in codfw to ganeti2020 T311686 [08:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:20] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [08:53:50] 10ops-eqiad: PDU sensor over limit - 
https://phabricator.wikimedia.org/T316121 (10phaultfinder) [08:55:14] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:55:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:55:55] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: fix fresh install [puppet] - 10https://gerrit.wikimedia.org/r/827507 (owner: 10Giuseppe Lavagetto) [08:56:12] (03PS1) 10Marostegui: dbproxy1016,dbproxy1020: Add db1159 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/827958 (https://phabricator.wikimedia.org/T316506) [08:56:18] (03CR) 10JMeybohm: [C: 03+1] R:profile::docker::engine::version removal and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/826856 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [08:57:04] (03CR) 10Clément Goubert: [C: 03+2] ml-staging: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826833 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [08:57:16] (03CR) 10Clément Goubert: [C: 03+2] kubernetes: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826840 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [08:57:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:57:42] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.883 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:58:24] !log upgrading 
ganeti2010,ganeti2012,ganeti2024 to 3.0.2 T312637 [08:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:29] T312637: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 [08:58:54] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [08:58:56] PROBLEM - ganeti-wconfd running on ganeti2022 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [08:58:58] <_joe_> !log powercycling parse1002, blank console [08:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:10] (03PS14) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) [08:59:20] (03PS3) 10David Caro: tox: use the default python3 for the system [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/826782 [09:00:28] (03CR) 10Marostegui: [C: 03+2] dbproxy1016,dbproxy1020: Add db1159 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/827958 (https://phabricator.wikimedia.org/T316506) (owner: 10Marostegui) [09:00:45] (03PS4) 10Clément Goubert: ml-staging: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826833 (https://phabricator.wikimedia.org/T316341) [09:00:47] (03CR) 10Clément Goubert: [V: 03+2] ml-staging: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826833 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [09:02:24] (03CR) 10Clément Goubert: [C: 03+2] ml-serve: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826842 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [09:02:41] (03CR) 10CI reject: [V: 04-1] tox: use the default python3 for the system [cookbooks] 
(wmcs) - 10https://gerrit.wikimedia.org/r/826782 (owner: 10David Caro) [09:03:23] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.711 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [09:04:07] (03CR) 10Clément Goubert: [C: 03+2] deployment-server: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826849 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [09:05:24] (03CR) 10CI reject: [V: 04-1] ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:05:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:07:41] (03CR) 10FNegri: [C: 03+2] Add cloudcephosd1030 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/826843 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [09:07:51] (03PS2) 10FNegri: Add cloudcephosd1030 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/826843 (https://phabricator.wikimedia.org/T314870) [09:07:59] !log restart dbprov* hosts [09:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:04] (03CR) 10Clément Goubert: [C: 03+2] releases: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826852 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [09:10:03] (03CR) 10Filippo Giunchedi: [C: 03+1] netmon: Add the wikidev group for the rancid directory. 
[puppet] - 10https://gerrit.wikimedia.org/r/827553 (https://phabricator.wikimedia.org/T316569) (owner: 10Andrea Denisse) [09:10:13] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add tcpircbot logging tests [puppet] - 10https://gerrit.wikimedia.org/r/824317 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [09:10:48] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:11:26] (03CR) 10Klausman: [C: 03+1] api-gateway: custom host overrides in discovery services. [deployment-charts] - 10https://gerrit.wikimedia.org/r/825729 (owner: 10Hnowlan) [09:12:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:12:54] !log upgrading ganeti2027,ganeti2028 to 3.0.2 T312637 [09:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:58] T312637: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 [09:16:03] (03PS15) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) [09:16:23] (03CR) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:16:48] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:17:05] (03CR) 10Clément 
Goubert: [C: 03+2] R:profile::docker::engine::version removal and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/826856 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [09:18:44] !log installing perf updates on Bullseye hosts [09:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:17] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [09:20:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:20:31] Again.... [09:20:45] I guess the same thing as yesterday? [09:21:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:22:07] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.448 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [09:24:59] (03PS1) 10Ayounsi: Squid: permit production networks instead of aggregate_networks [puppet] - 10https://gerrit.wikimedia.org/r/827964 (https://phabricator.wikimedia.org/T265864) [09:25:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:26:12] (03PS1) 10Marostegui: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/827965 [09:26:15] (03PS1) 10Marostegui: pc1014: Promote it to pc1 master [puppet] - 10https://gerrit.wikimedia.org/r/827966 [09:26:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:26:20] (03CR) 10CI reject: [V: 04-1] ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:26:36] (03PS2) 10Ayounsi: Squid: permit production networks instead of aggregate_networks [puppet] - 10https://gerrit.wikimedia.org/r/827964 (https://phabricator.wikimedia.org/T265864) [09:27:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:28:53] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827965 (owner: 10Marostegui) [09:28:58] (03CR) 10Marostegui: [C: 03+2] pc1014: Promote it to pc1 master [puppet] - 10https://gerrit.wikimedia.org/r/827966 (owner: 10Marostegui) [09:29:14] (03PS16) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) [09:29:42] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827965 (owner: 10Marostegui) [09:30:11] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:30:30] (03CR) 10Hashar: "Great thanks Daniel. I am quite happy to see the httpbb test helps confirm the configuration change works. I will finish the migration and" [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541) (owner: 10Hashar) [09:30:41] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:31:17] jouncebot: nowandnext [09:31:17] No deployments scheduled for the next 3 hour(s) and 28 minute(s) [09:31:17] In 3 hour(s) and 28 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T1300) [09:31:17] In 3 hour(s) and 28 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T1300) [09:31:42] !log draining ganeti2022 for eventual reimage T311686 [09:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:48] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [09:32:10] (03CR) 10Ladsgroup: [C: 03+2] Stop writing to old templatelinks fields in s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827953 (https://phabricator.wikimedia.org/T312865) (owner: 10Ladsgroup) [09:32:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:32:56] (03Merged) 10jenkins-bot: Stop writing to old templatelinks fields in s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827953 (https://phabricator.wikimedia.org/T312865) (owner: 10Ladsgroup) [09:33:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:33:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply [09:34:31] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc1014 to pc1 master (duration: 03m 50s) [09:34:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:36:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:36:22] (03CR) 10Michael Große: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827968 (https://phabricator.wikimedia.org/T316637) (owner: 10Michael Große) [09:37:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:39:19] PROBLEM - Disk space on netboxdb2002 is CRITICAL: DISK CRITICAL - free space: / 555 MB (3% inode=92%): /tmp 555 MB (3% inode=92%): /var/tmp 555 MB (3% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=netboxdb2002&var-datasource=codfw+prometheus/ops [09:39:31] (03PS1) 10Marostegui: Revert "pc1014: Promote it to pc1 master" [puppet] - 10https://gerrit.wikimedia.org/r/827570 [09:39:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:39:40] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827571 [09:40:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:40:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:41:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mwdebug: apply [09:41:38] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:827953|Stop writing to old templatelinks fields in s6 (T312865)]] (duration: 03m 57s) [09:41:44] T312865: Turn off writing to the old columns of templatelinks in beta and production - https://phabricator.wikimedia.org/T312865 [09:46:27] PROBLEM - Check systemd state on db1117 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@m3.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:11] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you again for the review! I'll go ahead with this and see what sort of results we're getting" [alerts] - 10https://gerrit.wikimedia.org/r/825741 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:47:41] (03PS5) 10Filippo Giunchedi: sre: alert on appserver unavailability [alerts] - 10https://gerrit.wikimedia.org/r/825741 (https://phabricator.wikimedia.org/T305847) [09:48:19] RECOVERY - Check systemd state on db1117 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:49:02] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:51:09] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2001.codfw.wmnet with reason: Switch instance to DRBD, T311686 [09:51:14] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [09:51:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2001.codfw.wmnet with reason: Switch instance to DRBD, T311686 [09:51:36] !log 
dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@ff76338]: Add sd-alerts notifications to image_suggestions_weekly [09:52:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:53:16] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet [09:53:30] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host centrallog2002.codfw.wmnet [09:53:42] !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@ff76338]: Add sd-alerts notifications to image_suggestions_weekly (duration: 02m 05s) [09:55:09] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet [09:58:36] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827571 (owner: 10Marostegui) [09:58:38] (03CR) 10Marostegui: [C: 03+2] Revert "pc1014: Promote it to pc1 master" [puppet] - 10https://gerrit.wikimedia.org/r/827570 (owner: 10Marostegui) [09:58:46] (03PS3) 10Volans: pylint: add timeouts to requests.* calls [cookbooks] - 10https://gerrit.wikimedia.org/r/827508 (owner: 10David Caro) [09:59:20] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827571 (owner: 10Marostegui) [09:59:46] dcaro: ^^^ I've updated your patch, if you want to have a look [10:01:08] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet [10:02:08] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite2003.codfw.wmnet [10:02:33] (03PS1) 10Marostegui: pc1014: Move it to pc3 [puppet] - 
10https://gerrit.wikimedia.org/r/827973 [10:03:08] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org [10:03:14] (03CR) 10Marostegui: [C: 03+2] pc1014: Move it to pc3 [puppet] - 10https://gerrit.wikimedia.org/r/827973 (owner: 10Marostegui) [10:03:36] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc1011 to pc1 master (duration: 03m 44s) [10:03:45] jouncebot: next [10:03:45] In 2 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T1300) [10:03:45] In 2 hour(s) and 56 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T1300) [10:04:46] (03CR) 10David Caro: pylint: add timeouts to requests.* calls (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/827508 (owner: 10David Caro) [10:06:16] (03CR) 10David Caro: [C: 03+1] "Just a nit, otherwise LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/827508 (owner: 10David Caro) [10:06:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:07:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:07:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:08:42] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org [10:08:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:08:48] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite2003.codfw.wmnet [10:11:44] (03PS4) 10Volans: pylint: add timeouts to requests.* calls [cookbooks] - 10https://gerrit.wikimedia.org/r/827508 (owner: 10David Caro) [10:11:52] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet [10:12:06] (03CR) 10Volans: 
"addressed comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/827508 (owner: 10David Caro) [10:12:08] RECOVERY - Disk space on netboxdb2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=netboxdb2002&var-datasource=codfw+prometheus/ops [10:15:14] (03CR) 10Btullis: [C: 03+2] Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826579 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis) [10:15:45] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/826579 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis) [10:15:49] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2001.codfw.wmnet with reason: Switch instance to plain disks, T311686 [10:15:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2001.codfw.wmnet with reason: Switch instance to plain disks, T311686 [10:15:54] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [10:15:56] (03Merged) 10jenkins-bot: Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826579 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis) [10:16:58] PROBLEM - cassandra-a SSL 10.64.48.234:7001 on restbase1030 is CRITICAL: SSL CRITICAL - Certificate restbase1030-a valid until 2022-09-29 10:16:53 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:17:40] PROBLEM - cassandra-c SSL 10.64.48.236:7001 on restbase1030 is CRITICAL: SSL CRITICAL - Certificate restbase1030-c valid until 2022-09-29 10:16:58 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:17:46] PROBLEM - cassandra-b SSL 10.64.16.181:7001 on restbase1029 is CRITICAL: SSL 
CRITICAL - Certificate restbase1029-b valid until 2022-09-29 10:16:48 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:18:12] PROBLEM - cassandra-c SSL 10.64.0.211:7001 on restbase1028 is CRITICAL: SSL CRITICAL - Certificate restbase1028-c valid until 2022-09-29 10:16:42 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:18:16] PROBLEM - cassandra-a SSL 10.64.0.209:7001 on restbase1028 is CRITICAL: SSL CRITICAL - Certificate restbase1028-a valid until 2022-09-29 10:16:37 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:18:26] PROBLEM - cassandra-b SSL 10.64.48.235:7001 on restbase1030 is CRITICAL: SSL CRITICAL - Certificate restbase1030-b valid until 2022-09-29 10:16:56 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:18:32] PROBLEM - cassandra-b SSL 10.64.0.210:7001 on restbase1028 is CRITICAL: SSL CRITICAL - Certificate restbase1028-b valid until 2022-09-29 10:16:40 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:18:32] PROBLEM - cassandra-c SSL 10.64.16.182:7001 on restbase1029 is CRITICAL: SSL CRITICAL - Certificate restbase1029-c valid until 2022-09-29 10:16:51 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:18:37] (03PS2) 10Btullis: Upgrade datahub to version 0.8.43 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826844 (https://phabricator.wikimedia.org/T316336) [10:21:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [10:21:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [10:21:51] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet [10:21:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [10:21:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:22:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:22:16] (ThanosSidecarPrometheusDown) firing: Thanos Sidecar cannot connect to Prometheus - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarPrometheusDown [10:22:16] (ThanosSidecarUnhealthy) firing: Thanos Sidecar is unhealthy. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarUnhealthy [10:22:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T316186)', diff saved to https://phabricator.wikimedia.org/P33663 and previous config saved to /var/cache/conftool/dbconfig/20220830-102220-ladsgroup.json [10:23:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling db1167 (T316186)', diff saved to https://phabricator.wikimedia.org/P33664 and previous config saved to /var/cache/conftool/dbconfig/20220830-102342-ladsgroup.json [10:23:46] the thanos alerts are expected [10:24:18] PROBLEM - cassandra-a SSL 10.64.16.180:7001 on restbase1029 is CRITICAL: SSL CRITICAL - Certificate restbase1029-a valid until 2022-09-29 10:16:45 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:24:18] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) [10:25:46] (03CR) 10David Caro: [C: 03+1] pylint: add timeouts to requests.* calls [cookbooks] - 10https://gerrit.wikimedia.org/r/827508 (owner: 10David Caro) [10:26:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [10:27:16] (ThanosSidecarPrometheusDown) resolved: Thanos Sidecar cannot connect to Prometheus - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarPrometheusDown [10:27:16] (ThanosSidecarUnhealthy) resolved: Thanos Sidecar is unhealthy. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarUnhealthy [10:29:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:29:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [10:30:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [10:30:11] (03PS1) 10Slyngshede: profile::prometheus::ganeti::clusters Enable scraping of Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/827977 [10:30:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T316186)', diff saved to https://phabricator.wikimedia.org/P33665 and previous config saved to /var/cache/conftool/dbconfig/20220830-103012-ladsgroup.json [10:34:19] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:35:19] (03CR) 10Volans: [C: 03+2] pylint: add timeouts to requests.* calls [cookbooks] - 10https://gerrit.wikimedia.org/r/827508 (owner: 10David Caro) [10:35:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T316186)', diff saved to https://phabricator.wikimedia.org/P33666 and previous config saved to /var/cache/conftool/dbconfig/20220830-103530-ladsgroup.json [10:38:27] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37038/console" [puppet] - 10https://gerrit.wikimedia.org/r/827977 (owner: 10Slyngshede) [10:38:54] (03Merged) 10jenkins-bot: pylint: add timeouts to requests.* calls [cookbooks] - 10https://gerrit.wikimedia.org/r/827508 (owner: 10David Caro) [10:39:32] !log committing updated switch configuration https://gerrit.wikimedia.org/r/c/operations/homer/public/+/826579 [10:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2096.codfw.wmnet with reason: Maintenance [10:41:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2096.codfw.wmnet with reason: Maintenance [10:46:14] (03CR) 10Btullis: Add a helmfile configuration for the dse-k8s-eqiad cluster (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [10:46:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling db2096 after maint', diff saved to https://phabricator.wikimedia.org/P33667 and previous config saved to /var/cache/conftool/dbconfig/20220830-104616-ladsgroup.json [10:47:44] (03PS3) 10Btullis: Add a helmfile configuration for the dse-k8s-eqiad 
cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) [10:49:14] RECOVERY - Check systemd state on dse-k8s-worker1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:50:12] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [10:50:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P33668 and previous config saved to /var/cache/conftool/dbconfig/20220830-105036-ladsgroup.json [10:52:51] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37041/console" [puppet] - 10https://gerrit.wikimedia.org/r/827977 (owner: 10Slyngshede) [10:55:08] PROBLEM - Check systemd state on dse-k8s-worker1008 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:55:31] (03CR) 10Slyngshede: "I think it's safe to start collecting metrics from a few more Ganeti clusters." 
[puppet] - 10https://gerrit.wikimedia.org/r/827977 (owner: 10Slyngshede) [10:59:35] (03CR) 10Volans: "recheck" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/819049 (owner: 10Jbond) [11:03:19] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:04:11] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [11:05:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P33669 and previous config saved to /var/cache/conftool/dbconfig/20220830-110542-ladsgroup.json [11:06:16] (03PS1) 10Btullis: Fix policy name discrepancy between dse_k8s and kubedse [homer/public] - 10https://gerrit.wikimedia.org/r/827979 (https://phabricator.wikimedia.org/T310174) [11:07:47] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [11:08:14] (03PS1) 10Jelto: admin: add tsepothoabala to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/827980 (https://phabricator.wikimedia.org/T315409) [11:08:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:08:51] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations, 10Patch-For-Review: Access request to analytics system(s) for TThoabala - https://phabricator.wikimedia.org/T315409 (10Jelto) [11:16:50] jouncebot: nowandnext [11:16:50] No 
deployments scheduled for the next 1 hour(s) and 43 minute(s) [11:16:50] In 1 hour(s) and 43 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T1300) [11:16:50] In 1 hour(s) and 43 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T1300) [11:18:08] (03PS4) 10Volans: flake8: move all flake8 config to setup.cfg [software/pywmflib] - 10https://gerrit.wikimedia.org/r/819049 (owner: 10Jbond) [11:18:10] (03PS1) 10Volans: doc: update URL to requests library timeouts [software/pywmflib] - 10https://gerrit.wikimedia.org/r/827983 [11:20:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T316186)', diff saved to https://phabricator.wikimedia.org/P33670 and previous config saved to /var/cache/conftool/dbconfig/20220830-112048-ladsgroup.json [11:20:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [11:21:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [11:21:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T316186)', diff saved to https://phabricator.wikimedia.org/P33671 and previous config saved to /var/cache/conftool/dbconfig/20220830-112117-ladsgroup.json [11:22:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp1002.wikimedia.org [11:24:04] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster [11:27:08] (03PS1) 10Majavah: selenium: Use php-fpm version from PHP_VERSION environment [extensions/GrowthExperiments] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827572 (https://phabricator.wikimedia.org/T316596) [11:27:32] (03CR) 10Majavah: [C: 03+2] selenium: Use php-fpm version from PHP_VERSION environment 
[extensions/GrowthExperiments] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827572 (https://phabricator.wikimedia.org/T316596) (owner: 10Majavah) [11:27:33] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/827977 (owner: 10Slyngshede) [11:28:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T316186)', diff saved to https://phabricator.wikimedia.org/P33672 and previous config saved to /var/cache/conftool/dbconfig/20220830-112838-ladsgroup.json [11:29:18] (03PS1) 10Muehlenhoff: Failover IDP to idp1002 [dns] - 10https://gerrit.wikimedia.org/r/827985 [11:30:03] (03PS1) 10Majavah: Make phpfpm restart, php version agnostic [extensions/ProofreadPage] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827573 (https://phabricator.wikimedia.org/T316596) [11:30:16] (03CR) 10Majavah: [C: 03+2] Make phpfpm restart, php version agnostic [extensions/ProofreadPage] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827573 (https://phabricator.wikimedia.org/T316596) (owner: 10Majavah) [11:30:34] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations, 10Patch-For-Review: Access request to analytics system(s) for TThoabala - https://phabricator.wikimedia.org/T315409 (10Jelto) Thanks for the feedback! It seems that `analytics-privatedata-users` is the right group. >>! In T... 
[11:31:06] (03CR) 10Slyngshede: [C: 03+2] profile::prometheus::ganeti::clusters Enable scraping of Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/827977 (owner: 10Slyngshede) [11:31:34] (03CR) 10Btullis: [C: 03+2] Upgrade datahub to version 0.8.43 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826844 (https://phabricator.wikimedia.org/T316336) (owner: 10Btullis) [11:32:22] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host idp1002.wikimedia.org [11:35:16] (03Merged) 10jenkins-bot: Upgrade datahub to version 0.8.43 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826844 (https://phabricator.wikimedia.org/T316336) (owner: 10Btullis) [11:36:20] !log uploaded libxslt 1.1.29-2.1+deb9u2+wmf1 to apt.wikimedia.org [11:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:43:10] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:43:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P33673 and previous config saved to /var/cache/conftool/dbconfig/20220830-114345-ladsgroup.json [11:44:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:45:45] (03CR) 10Volans: [C: 03+2] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/819049 (owner: 10Jbond) [11:46:02] (03CR) 10Volans: [C: 
03+2] "trivial fix of URL in docs, self-merging." [software/pywmflib] - 10https://gerrit.wikimedia.org/r/827983 (owner: 10Volans) [11:46:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:47:32] (03CR) 10CI reject: [V: 04-1] selenium: Use php-fpm version from PHP_VERSION environment [extensions/GrowthExperiments] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827572 (https://phabricator.wikimedia.org/T316596) (owner: 10Majavah) [11:48:15] sigh.. GE failure is fixed by the ProofreadPage patch [11:48:50] (03PS2) 10Majavah: selenium: Use php-fpm version from PHP_VERSION environment [extensions/GrowthExperiments] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827572 (https://phabricator.wikimedia.org/T316596) [11:48:54] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! The added reference in policy-options is needed to allow the ranges be accepted on CR routers from the spine switches." 
[homer/public] - 10https://gerrit.wikimedia.org/r/827979 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis) [11:49:00] (03CR) 10Majavah: selenium: Use php-fpm version from PHP_VERSION environment [extensions/GrowthExperiments] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827572 (https://phabricator.wikimedia.org/T316596) (owner: 10Majavah) [11:49:43] (03Merged) 10jenkins-bot: flake8: move all flake8 config to setup.cfg [software/pywmflib] - 10https://gerrit.wikimedia.org/r/819049 (owner: 10Jbond) [11:49:45] (03Merged) 10jenkins-bot: doc: update URL to requests library timeouts [software/pywmflib] - 10https://gerrit.wikimedia.org/r/827983 (owner: 10Volans) [11:50:13] and apparently the other way around [11:51:18] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:51:22] jouncebot: next [11:51:23] In 1 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T1300) [11:51:23] In 1 hour(s) and 8 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T1300) [11:52:04] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1004.eqiad.wmnet [11:52:13] (03CR) 10Btullis: [C: 03+2] Fix policy name discrepancy between dse_k8s and kubedse [homer/public] - 10https://gerrit.wikimedia.org/r/827979 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis) [11:52:20] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet [11:52:53] (03Merged) 10jenkins-bot: Fix policy name discrepancy between dse_k8s and kubedse [homer/public] - 10https://gerrit.wikimedia.org/r/827979 (https://phabricator.wikimedia.org/T310174) 
(owner: 10Btullis) [11:53:47] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:53:50] (03CR) 10Nikerabbit: [C: 03+1] Enable SectionTranslation on 10 more WPs where ContentTranslation is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827174 (https://phabricator.wikimedia.org/T313300) (owner: 10KartikMistry) [11:54:37] (03CR) 10CI reject: [V: 04-1] Make phpfpm restart, php version agnostic [extensions/ProofreadPage] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827573 (https://phabricator.wikimedia.org/T316596) (owner: 10Majavah) [11:54:41] (03CR) 10CI reject: [V: 04-1] selenium: Use php-fpm version from PHP_VERSION environment [extensions/GrowthExperiments] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827572 (https://phabricator.wikimedia.org/T316596) (owner: 10Majavah) [11:54:53] zabe: .... I think I'll need to just force merge those then? [11:55:13] yeah [11:55:35] (03CR) 10Majavah: [V: 03+2 C: 03+2] "force merging due to an annoying CI circular dependency that I can't think of how to avoid" [extensions/ProofreadPage] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827573 (https://phabricator.wikimedia.org/T316596) (owner: 10Majavah) [11:55:50] (03CR) 10Majavah: [V: 03+2 C: 03+2] "force merging due to an annoying CI circular dependency that I can't think of how to avoid" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827572 (https://phabricator.wikimedia.org/T316596) (owner: 10Majavah) [11:58:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P33674 and previous config saved to /var/cache/conftool/dbconfig/20220830-115851-ladsgroup.json [11:59:51] (03Abandoned) 10Kosta Harlan: Temporarily disable change tag test [extensions/GrowthExperiments] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827569 (https://phabricator.wikimedia.org/T316596) (owner: 
10Kosta Harlan) [12:00:58] Hi team - I'm planning on deploying now, as my evening will be busy - if you have stuff you'd like me to deploy, now is the time (I'll start in 10 minutes) [12:01:14] (03PS2) 10Majavah: Branch commit for wmf/1.39.0-wmf.27 [core] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827612 (https://phabricator.wikimedia.org/T314188) (owner: 10TrainBranchBot) [12:01:39] (03CR) 10Majavah: [C: 03+2] "trying again with updated GrowthExperiments/ProofreadPage" [core] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827612 (https://phabricator.wikimedia.org/T314188) (owner: 10TrainBranchBot) [12:01:44] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet [12:03:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [12:03:46] And obviously I posted my deployment message to the wrong chan - apologies, folks [12:04:35] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host graphite1004.eqiad.wmnet [12:08:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules.
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [12:13:19] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:13:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T316186)', diff saved to https://phabricator.wikimedia.org/P33675 and previous config saved to /var/cache/conftool/dbconfig/20220830-121357-ladsgroup.json [12:14:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [12:14:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [12:14:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T316186)', diff saved to https://phabricator.wikimedia.org/P33676 and previous config saved to /var/cache/conftool/dbconfig/20220830-121421-ladsgroup.json [12:17:45] (03Merged) 10jenkins-bot: Branch commit for wmf/1.39.0-wmf.27 [core] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/827612 (https://phabricator.wikimedia.org/T314188) (owner: 10TrainBranchBot) [12:18:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:18:20] mmhh I see graphite1004 not behaving with the new kernel as I would expect, or at least I see some lag in datapoints [12:18:36] I'll revert to the old kernel and reboot as a test [12:19:18] (ProbeDown) firing: 
Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:19:39] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1004.eqiad.wmnet [12:19:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T316186)', diff saved to https://phabricator.wikimedia.org/P33677 and previous config saved to /var/cache/conftool/dbconfig/20220830-121938-ladsgroup.json [12:20:05] !log rollback and reboot graphite1004 with linux-image-5.10.0-16-amd64 [12:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:33] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:09] !log committing updated switch configuration https://gerrit.wikimedia.org/r/c/operations/homer/public/+/827979 [12:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:25:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:26:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:26:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:27:26] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:28:33] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:31:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:31:49] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host graphite1004.eqiad.wmnet [12:34:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P33678 and previous config saved to /var/cache/conftool/dbconfig/20220830-123445-ladsgroup.json [12:36:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:38:56] (03PS1) 10DCausse: Revert "Revert "deployment-prep: change ES version from 6 to 7"" [puppet] - 10https://gerrit.wikimedia.org/r/827574 [12:49:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P33679 and previous config saved to /var/cache/conftool/dbconfig/20220830-124951-ladsgroup.json [12:52:11] (03PS1) 10Btullis: Fix the datahub 0.8.43 deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/828001 (https://phabricator.wikimedia.org/T316336) [12:54:08] (03CR) 10Btullis: [C: 03+2] Add btullis to users to allow for router configuration 
[homer/public] - 10https://gerrit.wikimedia.org/r/826572 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis) [12:55:02] (03Merged) 10jenkins-bot: Add btullis to users to allow for router configuration [homer/public] - 10https://gerrit.wikimedia.org/r/826572 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis) [12:58:05] (03PS3) 10Snwachukwu: Update Puppet files for Airflow Upgrade to 2.3.2 [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) [12:58:25] (03CR) 10Btullis: Add a helmfile configuration for the dse-k8s-eqiad cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [12:58:41] (03CR) 10CI reject: [V: 04-1] Update Puppet files for Airflow Upgrade to 2.3.2 [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) (owner: 10Snwachukwu) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T1300). [13:00:05] kart_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T1300) [13:00:15] o/ [13:00:17] * kart_ is here. [13:00:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:00:22] kart_: do you want to self-serve, or should i deploy for you? 
[13:00:26] urbanecm: o/ [13:00:33] urbanecm: I can self deploy. [13:00:38] go ahead then [13:00:54] (03PS5) 10KartikMistry: Enable SectionTranslation on 10 more WPs where ContentTranslation is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827174 (https://phabricator.wikimedia.org/T313300) [13:01:06] (03PS1) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) [13:01:14] hi [13:01:23] i had some patches too, they disappeared from the page somehow [13:01:51] https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=next&oldid=2007999&diffmode=source edit conflict i guess? [13:02:40] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:02:50] (03PS2) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) [13:03:00] heh, looks like it MatmaRex [13:03:01] (03CR) 10KartikMistry: [C: 03+2] Enable SectionTranslation on 10 more WPs where ContentTranslation is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827174 (https://phabricator.wikimedia.org/T313300) (owner: 10KartikMistry) [13:03:17] we should have time for them too [13:03:48] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:04:00] (03Merged) 10jenkins-bot: Enable SectionTranslation on 10 more WPs where ContentTranslation is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827174 (https://phabricator.wikimedia.org/T313300) (owner: 10KartikMistry) [13:04:02] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:04:31] (03PS2) 10KartikMistry: testwiki: Fix language code for Bhojpuri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827500 (https://phabricator.wikimedia.org/T313296) [13:04:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T316186)', diff saved to https://phabricator.wikimedia.org/P33680 and previous config saved to /var/cache/conftool/dbconfig/20220830-130457-ladsgroup.json [13:05:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [13:05:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [13:05:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:05:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T316186)', diff saved to https://phabricator.wikimedia.org/P33681 and previous config saved to /var/cache/conftool/dbconfig/20220830-130521-ladsgroup.json [13:05:53] (03CR) 10Klausman: [C: 03+1] Add a helmfile configuration for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [13:07:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:08:26] (03CR) 10Klausman: [C: 03+1] Add a helmfile configuration for the dse-k8s-eqiad cluster (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [13:09:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:09:03] Deploying first patch.. 
[13:09:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:09:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:11:03] !log installing libxslt security updates for stretch [13:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T316186)', diff saved to https://phabricator.wikimedia.org/P33682 and previous config saved to /var/cache/conftool/dbconfig/20220830-131140-ladsgroup.json [13:12:29] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:827174|Enable SectionTranslation on 10 more WPs where ContentTranslation is default (T313300)]] (duration: 03m 56s) [13:12:33] T313300: Enable Section Translation on 10 more Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T313300 [13:12:53] (03PS1) 10Marostegui: orchestrator.conf: Add Amir to orchestrator powerusers [puppet] - 10https://gerrit.wikimedia.org/r/828004 [13:13:12] (03CR) 10KartikMistry: [C: 03+2] testwiki: Fix language code for Bhojpuri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827500 (https://phabricator.wikimedia.org/T313296) (owner: 10KartikMistry) [13:14:36] (03Merged) 10jenkins-bot: testwiki: Fix language code for Bhojpuri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827500 (https://phabricator.wikimedia.org/T313296) (owner: 10KartikMistry) [13:14:45] (03PS1) 10Ayounsi: Netbox: only keep 2 days of hourly DB dumps [puppet] - 10https://gerrit.wikimedia.org/r/828005 (https://phabricator.wikimedia.org/T262677) [13:14:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (10Jclark-ctr) [13:14:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - 
https://phabricator.wikimedia.org/T313963 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson mc-wf1001 B8 U25 Port 27 Cableid 3286 mc-wf1002 D8 U26 Port 30 Cableid 2013339101803 [13:15:30] (03CR) 10Ladsgroup: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/828004 (owner: 10Marostegui) [13:15:38] (03CR) 10Marostegui: [C: 03+2] orchestrator.conf: Add Amir to orchestrator powerusers [puppet] - 10https://gerrit.wikimedia.org/r/828004 (owner: 10Marostegui) [13:17:26] (03PS4) 10Btullis: Add a helmfile configuration for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) [13:17:30] Deploying 2nd patch.. [13:17:44] urbanecm: I should be done in a few minutes.. [13:17:49] ack [13:17:55] (03PS5) 10Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) [13:18:16] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/828005 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi) [13:19:26] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, see nit inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) (owner: 10Vgutierrez) [13:19:53] (03CR) 10Btullis: Add a helmfile configuration for the dse-k8s-eqiad cluster (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [13:20:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:20:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:21:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:21:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:21:12] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:827500|testwiki: Fix language code for Bhojpuri (T313296)]] (duration: 03m 53s) [13:21:16] T313296: Enable Content and Section translation on wikipedias with new MT support from Google - https://phabricator.wikimedia.org/T313296 [13:21:28] urbanecm: done :) [13:21:34] (03PS1) 10Marostegui: mariadb: Promote db1159 to m3 master [puppet] - 10https://gerrit.wikimedia.org/r/828009 (https://phabricator.wikimedia.org/T316506) [13:21:37] great! 
[13:21:41] MatmaRex: i can deploy your patches now [13:21:50] (03CR) 10Marostegui: [C: 04-2] "Wait for failover day" [puppet] - 10https://gerrit.wikimedia.org/r/828009 (https://phabricator.wikimedia.org/T316506) (owner: 10Marostegui) [13:21:52] (03PS3) 10Urbanecm: Enable reply tool by default on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748765 (https://phabricator.wikimedia.org/T297533) (owner: 10Esanders) [13:22:02] (03CR) 10Urbanecm: [C: 03+2] Enable reply tool by default on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748765 (https://phabricator.wikimedia.org/T297533) (owner: 10Esanders) [13:22:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:22:31] (03PS2) 10Urbanecm: Make DiscussionTools topicsubscription, autotopicsub opt-out on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827548 (https://phabricator.wikimedia.org/T315714) (owner: 10Bartosz Dziewoński) [13:22:50] (03CR) 10Urbanecm: [C: 03+2] Make DiscussionTools topicsubscription, autotopicsub opt-out on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827548 (https://phabricator.wikimedia.org/T315714) (owner: 10Bartosz Dziewoński) [13:22:51] (03Merged) 10jenkins-bot: Enable reply tool by default on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748765 (https://phabricator.wikimedia.org/T297533) (owner: 10Esanders) [13:23:19] MatmaRex: first patch is at mwdebug1001, can you test please? 
[13:23:55] (03Merged) 10jenkins-bot: Make DiscussionTools topicsubscription, autotopicsub opt-out on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827548 (https://phabricator.wikimedia.org/T315714) (owner: 10Bartosz Dziewoński) [13:24:23] looking [13:24:53] seems okay [13:24:58] thanks, syncing [13:25:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:26:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P33683 and previous config saved to /var/cache/conftool/dbconfig/20220830-132646-ladsgroup.json [13:26:55] (03PS3) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) [13:27:05] (03CR) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) (owner: 10Vgutierrez) [13:27:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:27:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:28:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:28:16] (03CR) 10Vgutierrez: [C: 03+2] Increase query-sorting to 100%, remove sampling code [puppet] - 10https://gerrit.wikimedia.org/r/826997 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [13:28:31] !log Increase roll-out of query-sorting to 100% - T314868 [13:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:35] T314868: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 [13:28:56] 
!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:29:22] (03CR) 10Ayounsi: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/37042/netboxdb2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/828005 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi) [13:29:22] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ea3228b739614e376b01f0575264fb5964a28b90: Enable reply tool by default on fiwiki (T297533) (duration: 04m 01s) [13:29:28] T297533: Config Change: Deploy Reply Tool as Opt-Out at fi.wiki - https://phabricator.wikimedia.org/T297533 [13:29:45] MatmaRex: second patch is at mwdebug1001, can you check? [13:30:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:30:32] urbanecm: looks good as well! [13:30:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:30:37] that was quick, syncing! 
[13:33:40] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [13:33:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:34:34] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 25500e503954bdd61d991ef18d45a514e5591be4: Make DiscussionTools topicsubscription, autotopicsub opt-out on all wikis (T315714) (duration: 03m 56s) [13:34:39] T315714: [Config change] Offer Topic Subscriptions (desktop) as opt-out feature at all projects - https://phabricator.wikimedia.org/T315714 [13:34:41] MatmaRex: and, live [13:34:43] anything else? [13:34:47] thanks [13:34:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:34:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:35:08] mw1440 : ffmpeg overload, again [13:35:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:35:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:36:05] * urbanecm is going to do a backport too [13:36:45] (03PS1) 10Urbanecm: RenderTranslationPageJob: Add patrol status for translation page [extensions/Translate] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827577 (https://phabricator.wikimedia.org/T315708) [13:36:52] (03CR) 10Urbanecm: [C: 03+2] RenderTranslationPageJob: Add patrol status for translation page [extensions/Translate] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827577 (https://phabricator.wikimedia.org/T315708) (owner: 10Urbanecm) [13:37:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:41] (03CR) 10Muehlenhoff: "Some initial comments." [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [13:40:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:41:20] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37043/install1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/827964 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [13:41:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P33684 and previous config saved to /var/cache/conftool/dbconfig/20220830-134152-ladsgroup.json [13:42:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:42:38] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:43:24] ^^ same pattern with these alerts today. mw1437/1438/1439 maxed cpu due to ffmpeg jobs. [13:44:02] has resolved now, I'm inclined not to take any action (wikitech suggests dedicating hosts to jobrunner as that is more important, but seems like we are just on the edge of triggering this occasionally and it's not having much wider impact) [13:44:16] I'll keep an eye on it. 
cc herron, cdanis [13:45:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:45:26] sgtm thanks topranks [13:47:04] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "ci issues" [extensions/Translate] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827577 (https://phabricator.wikimedia.org/T315708) (owner: 10Urbanecm) [13:47:14] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [13:47:24] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:48:52] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:49:12] PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:49:42] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 1.274 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:50:22] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [13:50:34] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [13:50:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:51:28] RECOVERY - PHP7 
rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 1.828 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:51:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:52:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:52:18] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.396 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:52:20] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.26/extensions/Translate/src/PageTranslation/RenderTranslationPageJob.php: 75d8e6cba30e20f3ee91bb04f1b59423c96244b6: RenderTranslationPageJob: Add patrol status for translation page (T315708) (duration: 03m 59s) [13:52:24] T315708: Edits to translated pages left unpatrolled - https://phabricator.wikimedia.org/T315708 [13:52:26] B&C done [13:52:44] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.131 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [13:52:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:54:32] CannotCreateActorException: Cannot create an actor for a usable name that is not an existing user: user_name="Contact Form on Meta" [13:54:35] :| [13:54:39] :/ [13:54:45] is that relevant to anything i did zabe? 
[13:54:49] (today, i mean) [13:54:54] no [13:55:01] it's CheckUser actor migration [13:55:27] but it's very low frequency [13:55:36] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [13:55:37] probably people don't use special:Contact that often [13:55:40] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:56:09] (03PS1) 10Ayounsi: Exclude cloud-eqiad prefix from lists trusted networks [puppet] - 10https://gerrit.wikimedia.org/r/828016 (https://phabricator.wikimedia.org/T265864) [13:56:52] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.909 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [13:56:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T316186)', diff saved to https://phabricator.wikimedia.org/P33685 and previous config saved to /var/cache/conftool/dbconfig/20220830-135658-ladsgroup.json [13:57:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [13:57:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [13:57:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T316186)', diff saved to https://phabricator.wikimedia.org/P33686 and previous config saved to /var/cache/conftool/dbconfig/20220830-135733-ladsgroup.json [13:58:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:00:18] (ProbeDown) firing: 
Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:00:25] (03PS2) 10Ayounsi: Exclude cloud-eqiad prefix from lists trusted networks [puppet] - 10https://gerrit.wikimedia.org/r/828016 (https://phabricator.wikimedia.org/T265864) [14:01:49] (03PS3) 10Ayounsi: Exclude cloud-eqiad prefix from lists trusted networks [puppet] - 10https://gerrit.wikimedia.org/r/828016 (https://phabricator.wikimedia.org/T265864) [14:02:10] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [14:02:52] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.214 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [14:02:52] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.168 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:03:15] (03PS1) 10Volans: tox: add --no-external-config to prospector [software/pywmflib] - 10https://gerrit.wikimedia.org/r/828017 [14:03:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:04:26] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [14:04:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T316186)', diff saved to https://phabricator.wikimedia.org/P33687 and previous config saved to 
/var/cache/conftool/dbconfig/20220830-140452-ladsgroup.json [14:05:00] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [14:05:09] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37045/lists1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/828016 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [14:05:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:07:14] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [14:07:34] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:08:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:09:59] (03PS1) 10Ayounsi: Exclude cloud-eqiad prefix from MXs trusted networks [puppet] - 10https://gerrit.wikimedia.org/r/828019 (https://phabricator.wikimedia.org/T265864) [14:10:03] (03PS1) 10Andrew Bogott: Horizon: put into maintenance mode for Xena upgrade [puppet] - 10https://gerrit.wikimedia.org/r/828020 (https://phabricator.wikimedia.org/T296561) [14:10:07] (03PS1) 10Andrew Bogott: Revert "Horizon: put into maintenance mode for Xena upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/828021 [14:10:14] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Jobrunner [14:10:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:10:56] (03CR) 10Volans: [C: 03+2] "trivial, self-merging" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/828017 (owner: 10Volans) [14:11:18] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:11:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:12:22] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:12:24] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 3.558 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:12:53] (03PS3) 10Ayounsi: Squid: permit production networks instead of aggregate_networks [puppet] - 10https://gerrit.wikimedia.org/r/827964 (https://phabricator.wikimedia.org/T265864) [14:12:55] (03PS4) 10Ayounsi: Exclude cloud-eqiad prefix from lists trusted networks [puppet] - 10https://gerrit.wikimedia.org/r/828016 (https://phabricator.wikimedia.org/T265864) [14:12:58] (03PS2) 10Ayounsi: Exclude cloud-eqiad prefix from MXs trusted networks [puppet] - 10https://gerrit.wikimedia.org/r/828019 (https://phabricator.wikimedia.org/T265864) [14:13:00] (03PS3) 10Ayounsi: Remove 185.15.56.0/24 from network::external [puppet] - 
10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) [14:13:40] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.361 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:14:06] PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:14:42] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.685 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:14:49] topranks cdanis should we consider dividing the jobrunner cluster? [14:15:13] (03Merged) 10jenkins-bot: tox: add --no-external-config to prospector [software/pywmflib] - 10https://gerrit.wikimedia.org/r/828017 (owner: 10Volans) [14:15:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:15:35] (03PS1) 10Volans: tests: remove unnecessary pylint disable [software/pywmflib] - 10https://gerrit.wikimedia.org/r/828022 [14:16:15] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/37046/mx1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/828019 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [14:16:30] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [14:17:33] herron: the logic makes sense - I'm not really familiar with these systems though, so unsure at what point it gets to the level we need to do it [14:17:56] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:18:11] (03PS5) 10Volans: tox: add --no-external-config to prospector [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond) [14:18:14] (03PS2) 10Andrew Bogott: Revert "Horizon: put into maintenance mode for Xena upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/828021 [14:18:16] (03PS1) 10Andrew Bogott: Upgrade eqiad1 to openstack Xena [puppet] - 10https://gerrit.wikimedia.org/r/828024 (https://phabricator.wikimedia.org/T296561) [14:18:16] I guess the other thing we should maybe do is assess why the videoscaler is running so hot; it calmed down yesterday evening but came back last night [14:18:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:18:24] hmm [14:18:52] (03CR) 10Volans: "comment inline, updated CR" [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond) [14:19:10] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [14:19:58] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.947 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [14:19:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P33688 and previous config saved to /var/cache/conftool/dbconfig/20220830-141959-ladsgroup.json [14:20:34] (03PS4) 10Ayounsi: Remove 185.15.56.0/24 from network::external [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) [14:20:36] (03PS1) 10Ayounsi: Exclude cloud-eqiad prefix from VRT trusted networks [puppet] - 10https://gerrit.wikimedia.org/r/828025 
(https://phabricator.wikimedia.org/T265864) [14:20:36] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:20:44] herron: I notice that jobrunner is enabled for more hosts than videoscaler, so potentially the hosts it's running on that aren't doing video scaling are providing enough resources for the overall cluster to work [14:21:10] The PHP rendering logs I assume are from the same thing, that may be a bigger worry? [14:21:18] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.989 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [14:21:26] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [14:22:02] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:22:58] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.350 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:23:12] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:23:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:24:26] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.738 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:26:28] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 
10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [14:26:54] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [14:27:42] !log herron@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1438 [14:27:56] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 3.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:28:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:28:44] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [14:28:53] !log herron@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1438.eqiad.wmnet [14:29:54] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (10ayounsi) >>! In T265864#6995696, @Legoktm wrote: > It would be nice if we could deploy this change for services in... 
[14:30:05] (03CR) 10Volans: [C: 03+1] "LGTM if you have tested it" [software/homer] - 10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [14:30:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:31:02] RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.721 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:31:46] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:32:06] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [14:32:32] (03CR) 10Ayounsi: [C: 03+1] "fine by me" [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond) [14:32:34] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:32:46] (03CR) 10Volans: [C: 03+2] tox: add --no-external-config to prospector [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond) [14:33:32] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [14:33:46] !log herron@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1437.eqiad.wmnet [14:33:46] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:34:22] RECOVERY - PHP7 jobrunner on 
mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [14:34:56] 10SRE, 10MediaWiki-General, 10Traffic, 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), 10Patch-For-Review: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 (10ori) 05In progress→03Resolved p:05Triage→03Medium [14:35:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P33689 and previous config saved to /var/cache/conftool/dbconfig/20220830-143505-ladsgroup.json [14:35:07] 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10MW-1.39-notes (1.39.0-wmf.25; 2022-08-15), 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10ori) [14:35:07] !log herron@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1439.eqiad.wmnet [14:35:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:35:56] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.388 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [14:36:30] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [14:37:11] (03Merged) 10jenkins-bot: tox: add --no-external-config to prospector [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond) [14:37:40] (03CR) 10Klausman: [C: 03+1] Add a helmfile configuration for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) 
(owner: 10Btullis) [14:38:50] !log de-pooling mw1437/mw1439/mw1440 from jobrunner cluster as those hosts are busy running videoscaler tasks [14:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:39:20] !log cmooney@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1437.eqiad.wmnet [14:39:38] !log cmooney@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1439.eqiad.wmnet [14:39:55] !log cmooney@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1440.eqiad.wmnet [14:41:00] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [14:41:05] (03CR) 10Btullis: [C: 03+2] Add a helmfile configuration for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [14:41:10] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:42:28] (03PS6) 10Ayounsi: Bump pynetbox to ~= 6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) [14:42:30] (03PS1) 10Ayounsi: Enable pynetbox threading [software/homer] - 10https://gerrit.wikimedia.org/r/828031 (https://phabricator.wikimedia.org/T311486) [14:44:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:45:18] !log cmooney@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1437.eqiad.wmnet [14:45:27] !log cmooney@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1439.eqiad.wmnet [14:46:00] (03Merged) 10jenkins-bot: Add a helmfile configuration for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [14:47:24] (03CR) 10CI reject: [V: 04-1] Enable pynetbox threading [software/homer] - 10https://gerrit.wikimedia.org/r/828031 (https://phabricator.wikimedia.org/T311486) (owner: 10Ayounsi) [14:48:08] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [14:50:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T316186)', diff saved to https://phabricator.wikimedia.org/P33690 and previous config saved to /var/cache/conftool/dbconfig/20220830-145011-ladsgroup.json [14:50:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [14:50:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [14:50:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T316186)', diff saved to https://phabricator.wikimedia.org/P33691 and previous config saved to /var/cache/conftool/dbconfig/20220830-145035-ladsgroup.json [14:50:46] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:51:44] 10SRE, 10MediaWiki-General, 10Traffic, 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), 10Patch-For-Review: Roll out query parameter 
normalization - https://phabricator.wikimedia.org/T314868 (10ori) This is now complete. Many thanks to @Vgutierrez for partnering with me to get this rolled out. [14:52:14] (03PS2) 10Btullis: Fix the datahub 0.8.43 deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/828001 (https://phabricator.wikimedia.org/T316336) [14:56:59] (03CR) 10Btullis: [C: 03+2] Fix the datahub 0.8.43 deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/828001 (https://phabricator.wikimedia.org/T316336) (owner: 10Btullis) [14:57:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T316186)', diff saved to https://phabricator.wikimedia.org/P33693 and previous config saved to /var/cache/conftool/dbconfig/20220830-145755-ladsgroup.json [14:58:00] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [14:59:16] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:00:26] (03Merged) 10jenkins-bot: Fix the datahub 0.8.43 deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/828001 (https://phabricator.wikimedia.org/T316336) (owner: 10Btullis) [15:01:37] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T315229 (10Papaul) [15:01:39] (03CR) 10Ayounsi: [C: 03+2] Bump pynetbox to ~= 6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [15:02:07] (03CR) 10FNegri: [C: 03+2] Horizon: put into maintenance mode for Xena upgrade [puppet] - 10https://gerrit.wikimedia.org/r/828020 (https://phabricator.wikimedia.org/T296561) (owner: 10Andrew Bogott) [15:02:22] (03CR) 10Hashar: "This is the very first iteration for sending Gerrit events to our EventGate platform. 
I wrote a little bit of doc at https://wikitech.wiki" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [15:05:05] (03CR) 10Hashar: "This change actually POST the events. It definitely would need some configuration variables to tweak the destination URL, timeout value an" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar) [15:05:11] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:06:06] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [15:06:19] (03Merged) 10jenkins-bot: Bump pynetbox to ~= 6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [15:06:22] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [15:06:41] (03PS2) 10Ayounsi: Enable pynetbox threading [software/homer] - 10https://gerrit.wikimedia.org/r/828031 (https://phabricator.wikimedia.org/T311486) [15:06:50] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [15:07:14] (03CR) 10Hashar: "When Antoine discovers the beauty of java concurrent work queue. 
The parent patch used org.wikimedia.eventutilities.core.http.BasicHttpCl" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/816115 (owner: 10Hashar) [15:07:46] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [15:08:56] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:09:06] (03CR) 10Hashar: "This is similar to a composer/npm lock file, else I would have to invoke a maven_jar() for each of the dependency in the tree which is def" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/816172 (owner: 10Hashar) [15:09:58] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [15:10:18] (ProbeDown) firing: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:10:19] (ProbeDown) firing: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:10:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard2002.codfw.wmnet [15:10:53] (03CR) 10CI reject: [V: 04-1] Enable pynetbox threading [software/homer] - 10https://gerrit.wikimedia.org/r/828031 (https://phabricator.wikimedia.org/T311486) (owner: 10Ayounsi) [15:10:53] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [15:12:43] (03PS1) 10Ayounsi: Enable pynetbox threading for DNS/Ganeti/Mgmt scripts [software/netbox-extras] 
- 10https://gerrit.wikimedia.org/r/828035 (https://phabricator.wikimedia.org/T311486) [15:13:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P33694 and previous config saved to /var/cache/conftool/dbconfig/20220830-151301-ladsgroup.json [15:13:12] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [15:14:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2002.codfw.wmnet [15:14:15] (03CR) 10FNegri: [C: 03+1] Upgrade eqiad1 to openstack Xena [puppet] - 10https://gerrit.wikimedia.org/r/828024 (https://phabricator.wikimedia.org/T296561) (owner: 10Andrew Bogott) [15:14:42] (03CR) 10Andrew Bogott: [C: 03+2] Upgrade eqiad1 to openstack Xena [puppet] - 10https://gerrit.wikimedia.org/r/828024 (https://phabricator.wikimedia.org/T296561) (owner: 10Andrew Bogott) [15:14:52] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:16:28] (03PS3) 10Ayounsi: Enable pynetbox threading [software/homer] - 10https://gerrit.wikimedia.org/r/828031 (https://phabricator.wikimedia.org/T311486) [15:16:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host stat1009.eqiad.wmnet [15:17:32] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on ganeti2022.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [15:17:37] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [15:17:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2022.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [15:17:58] (03CR) 10DCausse: [C: 03+1] "I believe that wgTranslateTranslationServices in 
CommonSettings.php should be adapted too to use the ES6Compat transport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [15:19:14] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:19:32] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (10phaultfinder) [15:19:32] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:19:44] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [15:23:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1009.eqiad.wmnet [15:25:06] !log restarting ats in cp6007 [15:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:13] damn typo :) [15:25:18] !log restarting ats in cp6008 [15:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:46] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:27:56] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [15:28:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P33695 and previous config saved to /var/cache/conftool/dbconfig/20220830-152807-ladsgroup.json [15:29:48] (03PS1) 10Clément Goubert: P:gitlab::runner Remove docker_version parameter [puppet] - 10https://gerrit.wikimedia.org/r/828041 (https://phabricator.wikimedia.org/T316341) [15:30:54] (03CR) 10Clément Goubert: [V: 
03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37047/console" [puppet] - 10https://gerrit.wikimedia.org/r/828041 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [15:32:56] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [15:32:57] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [15:33:13] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [15:33:14] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [15:39:30] (03CR) 10Giuseppe Lavagetto: [C: 03+1] P:gitlab::runner Remove docker_version parameter [puppet] - 10https://gerrit.wikimedia.org/r/828041 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [15:40:09] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] P:gitlab::runner Remove docker_version parameter [puppet] - 10https://gerrit.wikimedia.org/r/828041 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [15:40:29] <_joe_> btullis: :)) [15:40:49] Uh oh, what have I done? :-) [15:41:37] I've been following https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Apply_RBAC_rules_and_PSPs to add the RBACs and pod security policies. [15:42:49] btullis: the right thing ;-) [15:43:10] Phew! 
[15:43:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T316186)', diff saved to https://phabricator.wikimedia.org/P33696 and previous config saved to /var/cache/conftool/dbconfig/20220830-154314-ladsgroup.json [15:43:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [15:43:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [15:43:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T316186)', diff saved to https://phabricator.wikimedia.org/P33697 and previous config saved to /var/cache/conftool/dbconfig/20220830-154337-ladsgroup.json [15:47:30] (03PS4) 10Hnowlan: deployment-prep: Drop deployment-restbase03, no longer to be used [puppet] - 10https://gerrit.wikimedia.org/r/790424 (https://phabricator.wikimedia.org/T306052) (owner: 10Jforrester) [15:48:54] RECOVERY - Check systemd state on dse-k8s-worker1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:34] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:49:44] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [15:50:23] (03CR) 10Hnowlan: [C: 03+2] deployment-prep: Drop deployment-restbase03, no longer to be used [puppet] - 10https://gerrit.wikimedia.org/r/790424 (https://phabricator.wikimedia.org/T306052) (owner: 10Jforrester) [15:51:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T316186)', diff saved to https://phabricator.wikimedia.org/P33698 and previous config saved to 
/var/cache/conftool/dbconfig/20220830-155101-ladsgroup.json [15:52:51] (03CR) 10Volans: [C: 03+1] "Would be nice to know if it helps or not. Beside that I'm ok to try it." [software/homer] - 10https://gerrit.wikimedia.org/r/828031 (https://phabricator.wikimedia.org/T311486) (owner: 10Ayounsi) [15:52:54] 10SRE, 10MediaWiki-General, 10Traffic, 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), 10Patch-For-Review: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 (10ori) [15:55:19] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:56:14] PROBLEM - Check systemd state on dse-k8s-worker1008 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:52] (03PS1) 10Btullis: Add a kublet node_label to each master of the dse-k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/828049 (https://phabricator.wikimedia.org/T310172) [15:59:33] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10RobH) [16:00:05] jbond and rzl: Dear deployers, time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:02:14] (03PS2) 10Clément Goubert: admin: set shell to undef if user is removed [puppet] - 10https://gerrit.wikimedia.org/r/825755 [16:04:56] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37048/console" [puppet] - 10https://gerrit.wikimedia.org/r/825755 (owner: 10Clément Goubert) [16:06:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P33699 and previous config saved to /var/cache/conftool/dbconfig/20220830-160607-ladsgroup.json [16:07:00] (03PS1) 10Btullis: Label the first four of the dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/828052 (https://phabricator.wikimedia.org/T310177) [16:48:19] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:48:31] (03PS2) 10Btullis: Label the eight dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/828052 (https://phabricator.wikimedia.org/T310177) [16:49:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2022.codfw.wmnet with reason: host reimage [16:50:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2005.codfw.wmnet [16:51:08] (03PS1) 10Chad: codesearch: configure ports for design and discovery [puppet] - 10https://gerrit.wikimedia.org/r/828057 [16:52:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
ganeti2022.codfw.wmnet with reason: host reimage [16:53:51] (03CR) 10Muehlenhoff: routinator: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826869 (owner: 10Muehlenhoff) [16:53:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2005.codfw.wmnet [16:55:10] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [16:55:11] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [16:56:04] -^ That was syncing namespaces. [16:56:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host theemin.codfw.wmnet [16:58:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:58:50] (03CR) 10Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 (owner: 10Muehlenhoff) [17:01:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host theemin.codfw.wmnet [17:03:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:03:36] (03PS1) 10Samtar: InitialiseSettings.php: Enable Realtime Preview on Group 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828059 (https://phabricator.wikimedia.org/T314828) [17:07:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:08:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2022.codfw.wmnet with OS bullseye [17:08:17] (03CR) 10MusikAnimal: InitialiseSettings.php: Enable Realtime Preview on 
Group 2 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828059 (https://phabricator.wikimedia.org/T314828) (owner: 10Samtar) [17:08:17] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2022.codfw.wmnet with OS bullseye completed: - ganeti2022 (**PASS**) - Downtimed on... [17:08:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:08:23] !log dduvall@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.27 refs T314188 (duration: 39m 07s) [17:08:28] T314188: 1.39.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T314188 [17:08:46] (03PS1) 10Ori: Link to Wikitech doc from query-normalization VCL [puppet] - 10https://gerrit.wikimedia.org/r/828060 [17:08:55] (03PS2) 10Ryan Kemper: Revert "Revert "deployment-prep: change ES version from 6 to 7"" [puppet] - 10https://gerrit.wikimedia.org/r/827574 (owner: 10DCausse) [17:09:04] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "Revert "deployment-prep: change ES version from 6 to 7"" [puppet] - 10https://gerrit.wikimedia.org/r/827574 (owner: 10DCausse) [17:11:29] !log joal@deploy1002 Started deploy [analytics/refinery@aa8f88f]: Regular analytics weekly train [analytics/refinery@aa8f88f] [17:12:03] !log installing logrotate security updates on Bullseye [17:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:17] (03CR) 10David Caro: "Got a question, looks ok though" [puppet] - 10https://gerrit.wikimedia.org/r/826986 (https://phabricator.wikimedia.org/T316463) (owner: 10Majavah) [17:13:45] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:13:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:17:27] (03CR) 10Majavah: dynamicproxy: improve /zones API (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/826986 (https://phabricator.wikimedia.org/T316463) (owner: 10Majavah) [17:17:31] (03PS2) 10Majavah: dynamicproxy: improve /zones API [puppet] - 10https://gerrit.wikimedia.org/r/826986 (https://phabricator.wikimedia.org/T316463) [17:18:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:18:41] (03CR) 10MusikAnimal: InitialiseSettings.php: Enable Realtime Preview on Group 2 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828059 (https://phabricator.wikimedia.org/T314828) (owner: 10Samtar) [17:20:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:23:18] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:23:30] (03CR) 10MusikAnimal: [C: 04-1] InitialiseSettings.php: Enable Realtime Preview on Group 2 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828059 (https://phabricator.wikimedia.org/T314828) (owner: 10Samtar) [17:28:05] (03CR) 10MusikAnimal: [C: 04-1] "Probably not a big deal, but config settings for Beta Features are supposed to go between the "BetaFeatures start" and "BetaFeatures end" " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828059 (https://phabricator.wikimedia.org/T314828) (owner: 
10Samtar) [17:32:18] (03PS1) 10Stang: dewiki: Trun off patrolling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828061 (https://phabricator.wikimedia.org/T316393) [17:32:35] (03CR) 10Dzahn: "sorry, I don't know about this to be able to review it. maybe Amir or Andrew would have more information" [puppet] - 10https://gerrit.wikimedia.org/r/828016 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [17:33:15] (03CR) 10Dzahn: [C: 03+1] "this fixes the git clone issue (just makes it work again as documented). +1 after reading response from Arzhel as well" [puppet] - 10https://gerrit.wikimedia.org/r/827553 (https://phabricator.wikimedia.org/T316569) (owner: 10Andrea Denisse) [17:35:35] (03CR) 10Herron: [C: 03+1] netmon: Add the wikidev group for the rancid directory. [puppet] - 10https://gerrit.wikimedia.org/r/827553 (https://phabricator.wikimedia.org/T316569) (owner: 10Andrea Denisse) [17:37:25] (03CR) 10Cwhite: [C: 03+1] netmon: Add the wikidev group for the rancid directory. [puppet] - 10https://gerrit.wikimedia.org/r/827553 (https://phabricator.wikimedia.org/T316569) (owner: 10Andrea Denisse) [17:37:39] !log joal@deploy1002 Finished deploy [analytics/refinery@aa8f88f]: Regular analytics weekly train [analytics/refinery@aa8f88f] (duration: 26m 10s) [17:37:59] !log joal@deploy1002 Started deploy [analytics/refinery@aa8f88f] (thin): Regular analytics weekly train THIN [analytics/refinery@aa8f88f] [17:38:08] !log joal@deploy1002 Finished deploy [analytics/refinery@aa8f88f] (thin): Regular analytics weekly train THIN [analytics/refinery@aa8f88f] (duration: 00m 08s) [17:38:23] !log joal@deploy1002 Started deploy [analytics/refinery@aa8f88f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@aa8f88f] [17:42:49] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Add the wikidev group for the rancid directory. 
[puppet] - 10https://gerrit.wikimedia.org/r/827553 (https://phabricator.wikimedia.org/T316569) (owner: 10Andrea Denisse) [17:46:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:46:42] !log joal@deploy1002 Finished deploy [analytics/refinery@aa8f88f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@aa8f88f] (duration: 08m 19s) [17:47:38] (03PS2) 10Samtar: InitialiseSettings.php: Enable Realtime Preview on Group 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828059 (https://phabricator.wikimedia.org/T314828) [17:48:24] (03CR) 10Samtar: InitialiseSettings.php: Enable Realtime Preview on Group 2 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828059 (https://phabricator.wikimedia.org/T314828) (owner: 10Samtar) [17:49:03] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. 
- https://phabricator.wikimedia.org/T315388 (10andrea.denisse) a:03andrea.denisse [17:50:44] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Promote db1159 to m3 master [puppet] - 10https://gerrit.wikimedia.org/r/828009 (https://phabricator.wikimedia.org/T316506) (owner: 10Marostegui) [17:51:00] (03CR) 10Ryan Kemper: [C: 03+2] apifeatureusage: Temporarily remove index template during 6->7 transition [puppet] - 10https://gerrit.wikimedia.org/r/815783 (https://phabricator.wikimedia.org/T313434) (owner: 10Ebernhardson) [17:51:18] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:53:24] (03CR) 10MusikAnimal: [C: 03+1] InitialiseSettings.php: Enable Realtime Preview on Group 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828059 (https://phabricator.wikimedia.org/T314828) (owner: 10Samtar) [17:59:54] (03CR) 10Btullis: [C: 03+2] Add linktarget to sqooped tables [puppet] - 10https://gerrit.wikimedia.org/r/826564 (https://phabricator.wikimedia.org/T314666) (owner: 10Joal) [18:00:04] dduvall and hashar: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T1800). 
[18:09:19] (03PS1) 10TrainBranchBot: group0 wikis to 1.39.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828063 (https://phabricator.wikimedia.org/T314188) [18:09:22] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.39.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828063 (https://phabricator.wikimedia.org/T314188) (owner: 10TrainBranchBot) [18:10:05] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828063 (https://phabricator.wikimedia.org/T314188) (owner: 10TrainBranchBot) [18:12:38] (03CR) 10Ryan Kemper: "We need to merge this after everything is on es 7, so a few weeks from today" [puppet] - 10https://gerrit.wikimedia.org/r/815784 (https://phabricator.wikimedia.org/T313434) (owner: 10Ebernhardson) [18:15:01] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.27 refs T314188 [18:15:06] T314188: 1.39.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T314188 [18:15:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:16:20] (03PS2) 10Herron: WIP: victorps.py: add print_weekly_schedule command [software/klaxon] - 10https://gerrit.wikimedia.org/r/827562 (https://phabricator.wikimedia.org/T309115) [18:16:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:16:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:17:21] (03CR) 10CI reject: [V: 04-1] WIP: victorps.py: add print_weekly_schedule command [software/klaxon] - 10https://gerrit.wikimedia.org/r/827562 (https://phabricator.wikimedia.org/T309115) (owner: 10Herron) [18:17:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:18:59] (03PS3) 10Herron: WIP: victorps.py: add print_weekly_schedule command [software/klaxon] - 10https://gerrit.wikimedia.org/r/827562 
(https://phabricator.wikimedia.org/T309115) [18:19:31] (03CR) 10BBlack: [C: 03+1] Link to Wikitech doc from query-normalization VCL [puppet] - 10https://gerrit.wikimedia.org/r/828060 (owner: 10Ori) [18:31:04] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.0.209:7001 on restbase1028 is CRITICAL: SSL CRITICAL - Certificate restbase1028-a valid until 2022-09-29 10:16:37 +0000 (expires in 29 days) eevans Certificates nearing expiration (T316697) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:31:04] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.0.210:7001 on restbase1028 is CRITICAL: SSL CRITICAL - Certificate restbase1028-b valid until 2022-09-29 10:16:40 +0000 (expires in 29 days) eevans Certificates nearing expiration (T316697) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:31:04] ACKNOWLEDGEMENT - cassandra-c SSL 10.64.0.211:7001 on restbase1028 is CRITICAL: SSL CRITICAL - Certificate restbase1028-c valid until 2022-09-29 10:16:42 +0000 (expires in 29 days) eevans Certificates nearing expiration (T316697) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:31:04] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.16.180:7001 on restbase1029 is CRITICAL: SSL CRITICAL - Certificate restbase1029-a valid until 2022-09-29 10:16:45 +0000 (expires in 29 days) eevans Certificates nearing expiration (T316697) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:31:04] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.16.181:7001 on restbase1029 is CRITICAL: SSL CRITICAL - Certificate restbase1029-b valid until 2022-09-29 10:16:48 +0000 (expires in 29 days) eevans Certificates nearing expiration (T316697) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:31:04] ACKNOWLEDGEMENT - cassandra-c SSL 10.64.16.182:7001 on restbase1029 is CRITICAL: SSL CRITICAL - Certificate restbase1029-c valid until 2022-09-29 
10:16:51 +0000 (expires in 29 days) eevans Certificates nearing expiration (T316697) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:31:04] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.48.234:7001 on restbase1030 is CRITICAL: SSL CRITICAL - Certificate restbase1030-a valid until 2022-09-29 10:16:53 +0000 (expires in 29 days) eevans Certificates nearing expiration (T316697) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:31:05] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.48.235:7001 on restbase1030 is CRITICAL: SSL CRITICAL - Certificate restbase1030-b valid until 2022-09-29 10:16:56 +0000 (expires in 29 days) eevans Certificates nearing expiration (T316697) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:31:05] ACKNOWLEDGEMENT - cassandra-c SSL 10.64.48.236:7001 on restbase1030 is CRITICAL: SSL CRITICAL - Certificate restbase1030-c valid until 2022-09-29 10:16:58 +0000 (expires in 29 days) eevans Certificates nearing expiration (T316697) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:31:41] (03CR) 10Ori: [C: 03+2] Link to Wikitech doc from query-normalization VCL [puppet] - 10https://gerrit.wikimedia.org/r/828060 (owner: 10Ori) [18:33:05] (03PS1) 10Andrew Bogott: Neutron: define regular_user rule for policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/828064 (https://phabricator.wikimedia.org/T316685) [18:42:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:43:05] (03PS2) 10Andrew Bogott: Neutron: define regular_user and admin_only rules for policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/828064 (https://phabricator.wikimedia.org/T316685) [18:43:31] (03PS1) 10Ryan Kemper: git-sync-upstream: fix inconsequential typo [puppet] - 10https://gerrit.wikimedia.org/r/828068 [18:45:11] (03CR) 10Andrew Bogott: [C: 03+2] Neutron: define regular_user and 
admin_only rules for policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/828064 (https://phabricator.wikimedia.org/T316685) (owner: 10Andrew Bogott) [18:46:14] (03CR) 10Ryan Kemper: [C: 03+2] "Self-merging because this just contains a change to a comment" [puppet] - 10https://gerrit.wikimedia.org/r/828068 (owner: 10Ryan Kemper) [18:49:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:49:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:52:20] (03PS1) 10Andrew Bogott: Neutron: remove policies that refer to nonexistent rules [puppet] - 10https://gerrit.wikimedia.org/r/828070 (https://phabricator.wikimedia.org/T316685) [18:54:36] (03PS1) 10Andrew Bogott: Revert "don't rsync to clouddumps1001,2 while they are still being set up" [puppet] - 10https://gerrit.wikimedia.org/r/828071 (https://phabricator.wikimedia.org/T302981) [18:55:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:56:38] (03CR) 10Andrew Bogott: "I'm going to merge this shortly, please let me know if it causes chaos!" 
[puppet] - 10https://gerrit.wikimedia.org/r/828071 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott) [19:01:31] (03CR) 10Andrew Bogott: [C: 03+2] Neutron: remove policies that refer to nonexistent rules [puppet] - 10https://gerrit.wikimedia.org/r/828070 (https://phabricator.wikimedia.org/T316685) (owner: 10Andrew Bogott) [19:09:29] (03PS1) 10Aaron Schulz: Set "max lag" for all x2 servers to INF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828072 (https://phabricator.wikimedia.org/T312809) [19:09:31] (03CR) 10Andrew Bogott: [C: 03+2] Revert "don't rsync to clouddumps1001,2 while they are still being set up" [puppet] - 10https://gerrit.wikimedia.org/r/828071 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott) [19:12:00] (03CR) 10Andrew Bogott: [C: 03+2] Exclude /mnt from systemd-logind restrictions on Bullseye and later [puppet] - 10https://gerrit.wikimedia.org/r/826806 (https://phabricator.wikimedia.org/T316123) (owner: 10Muehlenhoff) [19:15:49] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Andrew) [19:15:54] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Auth extremely slow on clouddumps100[12] - https://phabricator.wikimedia.org/T316123 (10Andrew) 05Open→03Resolved All better! Thanks Moritz. [19:23:35] (03PS1) 10Ryan Kemper: elastic: expand deployment-prep heap to 3G [puppet] - 10https://gerrit.wikimedia.org/r/828077 (https://phabricator.wikimedia.org/T316240) [19:24:24] (03CR) 10Ebernhardson: [C: 03+1] "instances are refusing writes, saying heap is too full. 
3G seems appropriate" [puppet] - 10https://gerrit.wikimedia.org/r/828077 (https://phabricator.wikimedia.org/T316240) (owner: 10Ryan Kemper) [19:25:00] (03CR) 10Ryan Kemper: "Search backend error during sending 1 documents to the enwikinews_general index(s) after 30: circuit_breaking_exception: [parent] Data too" [puppet] - 10https://gerrit.wikimedia.org/r/828077 (https://phabricator.wikimedia.org/T316240) (owner: 10Ryan Kemper) [19:25:06] (03CR) 10Ryan Kemper: [C: 03+2] elastic: expand deployment-prep heap to 3G [puppet] - 10https://gerrit.wikimedia.org/r/828077 (https://phabricator.wikimedia.org/T316240) (owner: 10Ryan Kemper) [19:36:16] (03PS4) 10AOkoth: gitlab: copy ssh host keys for failover [puppet] - 10https://gerrit.wikimedia.org/r/820163 (https://phabricator.wikimedia.org/T296713) [19:36:18] (03PS1) 10AOkoth: vrts: create /opt/otrs folder [puppet] - 10https://gerrit.wikimedia.org/r/828078 [19:36:58] (03PS2) 10AOkoth: vrts: create /opt/otrs folder [puppet] - 10https://gerrit.wikimedia.org/r/828078 [19:44:34] PROBLEM - puppet last run on relforge1003 is CRITICAL: CRITICAL: Puppet has been disabled for 605184 seconds, message: es 7 upgrade - ryankemper, last run 7 days ago with 2 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:44:56] ^ oops, enabling puppet on relforge* [19:45:43] !log [Relforge] `ryankemper@cumin1001:~$ sudo -E cumin '*relforge*' 'run-puppet-agent --force'` [19:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:58] RECOVERY - puppet last run on relforge1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:00:05] RoanKattouw, Urbanecm, cjming, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220830T2000). 
[20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:00:17] indeed. nothing to do. [20:02:40] * TheresNoTime is here [20:02:43] oh [20:02:44] :P [20:06:19] (03CR) 10Urbanecm: [C: 04-1] "applying -1 per T316601 to avoid accidental deployment, as i see it scheduled for deployment in ~11 hours." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823677 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [20:07:19] i'll do a PS change given nothing's scheduled [20:11:41] !log urbanecm@deploy1002 Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 03m 43s) [20:11:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:11:49] * urbanecm done [20:14:22] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:18:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:18:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:21:42] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:24:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:31:20] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [20:32:04] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:37:48] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 243.73 ms [20:42:56] (03CR) 10Jforrester: cirrus: Handle transition to elasticsearch 7.10 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [20:43:06] !log 
ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719 [20:43:11] T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719 [20:43:19] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719 [20:45:17] * ryankemper ctrl+c'd out; forgot to merge the puppet patch that actually bumps the version [20:51:06] (03CR) 10Ryan Kemper: [C: 03+1] "Thanks. PCC looks like a no-op as expected." [puppet] - 10https://gerrit.wikimedia.org/r/826841 (owner: 10Muehlenhoff) [20:51:26] (03CR) 10Ryan Kemper: [C: 03+2] elasticsearch::tlsproxy: Unconditionally disable ssl_ecdhe_curve [puppet] - 10https://gerrit.wikimedia.org/r/826841 (owner: 10Muehlenhoff) [20:52:49] (03CR) 10JHathaway: [C: 03+1] "From grepping the logs, I don't see anything too concerning that would break" [puppet] - 10https://gerrit.wikimedia.org/r/828019 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [20:53:41] (03CR) 10Ryan Kemper: "Looks like Brian and I accidentally did the same thing already in Ic76dfce2084ba9e6d0f77510d40d074fba2b88f6 without noticing this patch, s" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/803321 (https://phabricator.wikimedia.org/T309720) (owner: 10Ebernhardson) [20:54:07] (03Abandoned) 10Ryan Kemper: Revert "Revert "Upgrade to elasticsearch 7.10.2"" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/803321 (https://phabricator.wikimedia.org/T309720) (owner: 10Ebernhardson) [20:56:20] (03PS1) 10Ryan Kemper: elastic: upgrade codfw elasticsearch to 7.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/828092 (https://phabricator.wikimedia.org/T316719) [21:02:09] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [21:02:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [21:02:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T314041)', diff saved to https://phabricator.wikimedia.org/P33703 and previous config saved to /var/cache/conftool/dbconfig/20220830-210218-ladsgroup.json [21:02:22] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [21:04:09] (03PS2) 10Ryan Kemper: elastic: upgrade codfw elasticsearch to 7.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/828092 (https://phabricator.wikimedia.org/T316719) [21:04:23] (03PS3) 10Ryan Kemper: elastic: upgrade codfw elasticsearch to 7.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/828092 (https://phabricator.wikimedia.org/T316719) [21:05:07] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/828092 (https://phabricator.wikimedia.org/T316719) (owner: 10Ryan Kemper) [21:40:46] (03PS4) 10Ryan Kemper: elastic: upgrade codfw elasticsearch to 7.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/828092 (https://phabricator.wikimedia.org/T316719) [21:40:52] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/828092 (https://phabricator.wikimedia.org/T316719) (owner: 10Ryan Kemper) [22:01:17] (03PS1) 10Andrew Bogott: Dumps: switch to using clouddumps hosts rather than the old labstores. 
[puppet] - 10https://gerrit.wikimedia.org/r/828102 (https://phabricator.wikimedia.org/T309346) [22:01:25] (03PS1) 10Andrew Bogott: Dumps: stop mounting the old labstore100x servers on VMs [puppet] - 10https://gerrit.wikimedia.org/r/828103 (https://phabricator.wikimedia.org/T309346) [22:02:18] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Andrew) Useful context for this at https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps#A_labstore_host_dies_(web_or_nfs_... [22:03:23] (03CR) 10Andrew Bogott: [C: 04-1] "do not merge until we're sure there aren't remaining reverences to the labstore1006 and 1007 mounts on toolforge." [puppet] - 10https://gerrit.wikimedia.org/r/828103 (https://phabricator.wikimedia.org/T309346) (owner: 10Andrew Bogott) [22:40:36] PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:00:26] (03PS1) 10Cwhite: hiera: map logstash.wm.o to kibana7.codfw [puppet] - 10https://gerrit.wikimedia.org/r/828109 (https://phabricator.wikimedia.org/T304440) [23:00:27] (03PS1) 10Cwhite: hiera: upgrade codfw to opensearch v2 [puppet] - 10https://gerrit.wikimedia.org/r/828110 (https://phabricator.wikimedia.org/T304440) [23:00:29] (03PS1) 10Cwhite: hiera: all eqiad and codfw logging clusters to opensearch v2 [puppet] - 10https://gerrit.wikimedia.org/r/828111 (https://phabricator.wikimedia.org/T304440) [23:00:31] (03PS1) 10Cwhite: hiera: map logstash.wm.o to kibana7.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/828112 (https://phabricator.wikimedia.org/T304440) [23:27:18] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:28:28] (03PS2) 10Aaron Schulz: Set "max lag" for all x2 servers to INF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828072 (https://phabricator.wikimedia.org/T312809) [23:28:53] (03PS3) 10Aaron Schulz: Set "max lag" for all x2 servers to INF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828072 (https://phabricator.wikimedia.org/T312809) [23:41:54] RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:50:02] (03CR) 10Ryan Kemper: [C: 03+2] elastic: upgrade codfw elasticsearch to 7.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/828092 (https://phabricator.wikimedia.org/T316719) (owner: 10Ryan Kemper) [23:50:21] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719 [23:50:26] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719 [23:50:28] T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719 [23:55:06] !log T316719 Merged https://phabricator.wikimedia.org/T316719; running puppet across codfw fleet: `ryankemper@cumin2002:~$ sudo -E cumin -b 6 'A:elastic-codfw' 'run-puppet-agent'` [23:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log