[00:00:09] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1001.wikimedia.org with reason: service restart
[00:00:50] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit.wikimedia.org with reason: service restart
[00:01:05] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit.wikimedia.org with reason: service restart
[00:01:18] RECOVERY - etcd request latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[00:02:47] (Device rebooted) resolved: Device ps1-b7-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[00:03:14] RECOVERY - ps1-b7-codfw-infeed-load-tower-A-phase-X on ps1-b7-codfw is OK: SNMP OK - ps1-b7-codfw-infeed-load-tower-A-phase-X 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:03:28] !log gerrit - service restart to deploy config change to add second replica T313250
[00:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:31] T313250: Bring up gerrit2002 - https://phabricator.wikimedia.org/T313250
[00:03:38] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:05:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P32271 and previous config saved to /var/cache/conftool/dbconfig/20220804-000536-marostegui.json
[00:06:25] !log gerrit - [2022-08-04 00:05:33,173] Replication to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/analytics/geowiki.git started...
[CONTEXT pushOneId="83ad5008" ]
[00:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:37] !log gerrit - [2022-08-04 00:05:33,173] Replication to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/analytics/geowiki.git started.. T313250
[00:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:30] SRE, ops-codfw, DBA, Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (Papaul)
[00:12:54] (Device rebooted) firing: Alert for device ps1-c1-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[00:14:52] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:15:53] (PS1) Tim Starling: Explicitly declare replaceable settings [mediawiki-config] - https://gerrit.wikimedia.org/r/820247
[00:16:32] (PS2) Tim Starling: Explicitly declare replaceable settings [mediawiki-config] - https://gerrit.wikimedia.org/r/820247
[00:17:48] (CR) CI reject: [V: -1] Explicitly declare replaceable settings [mediawiki-config] - https://gerrit.wikimedia.org/r/820247 (owner: Tim Starling)
[00:17:54] (Device rebooted) resolved: Device ps1-c1-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[00:18:04] SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (Platonides) Changing the ordering (perhaps coupled with varnish redirecting all '?title=X&action=history' to the new '?action=history&ti...
[00:18:47] SRE, Gerrit, serviceops, serviceops-collab, and 2 others: replacement for gerrit2001, decom gerrit2001 - https://phabricator.wikimedia.org/T243027 (Dzahn)
[00:20:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P32272 and previous config saved to /var/cache/conftool/dbconfig/20220804-002043-marostegui.json
[00:27:27] SRE, SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (sgrabarczuk) It's working. Thank you!
[00:31:40] (PS3) Tim Starling: Explicitly declare replaceable settings [mediawiki-config] - https://gerrit.wikimedia.org/r/820247
[00:33:43] (PS1) Dzahn: gerrit: decom gerrit2001 [puppet] - https://gerrit.wikimedia.org/r/820248 (https://phabricator.wikimedia.org/T243027)
[00:35:22] (PS1) Dzahn: gerrit: remove hiera data for old replica [puppet] - https://gerrit.wikimedia.org/r/820249 (https://phabricator.wikimedia.org/T243027)
[00:35:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32273 and previous config saved to /var/cache/conftool/dbconfig/20220804-003549-marostegui.json
[00:35:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[00:35:54] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[00:36:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[00:36:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32274 and previous config saved to /var/cache/conftool/dbconfig/20220804-003611-marostegui.json
[00:36:31] (PS1) Dzahn: site: remove gerrit2001, merge gerrit1001/2002 regex [puppet] - https://gerrit.wikimedia.org/r/820250 (https://phabricator.wikimedia.org/T243027)
[00:37:53] (CR) Dzahn: "On gerrit2002 we merged the config change and a bit later I did the gerrit service restart and then it started replicating to gerrit2002! " [puppet] - https://gerrit.wikimedia.org/r/820249 (https://phabricator.wikimedia.org/T243027) (owner: Dzahn)
[00:38:16] (CR) Dzahn: "(service restart and log on gerrit1001)" [puppet] - https://gerrit.wikimedia.org/r/820249 (https://phabricator.wikimedia.org/T243027) (owner: Dzahn)
[00:38:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32275 and previous config saved to /var/cache/conftool/dbconfig/20220804-003822-marostegui.json
[00:39:30] (CR) Dzahn: "All this stuff comes after we removed it from gerrit config but before we run the decom cookbook on the machine. will need puppet ron on a" [puppet] - https://gerrit.wikimedia.org/r/820248 (https://phabricator.wikimedia.org/T243027) (owner: Dzahn)
[00:40:08] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[00:40:30] (CR) Dzahn: "And then finally this will be the last merge after the decom cookbook ran. After this we should be able to call it a decom mission success" [puppet] - https://gerrit.wikimedia.org/r/820250 (https://phabricator.wikimedia.org/T243027) (owner: Dzahn)
[00:42:35] (CR) Dzahn: [C: +2] "after deploying this also needed a gerrit service restart.
after that was done it started..log lines like:" [puppet] - https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn)
[00:43:07] (CR) Dzahn: [C: +2] "see /var/log/gerrit/replication_log" [puppet] - https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn)
[00:44:04] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:08] (PS2) Dzahn: gerrit: remove hiera data for old replica [puppet] - https://gerrit.wikimedia.org/r/820249 (https://phabricator.wikimedia.org/T243027)
[00:53:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P32276 and previous config saved to /var/cache/conftool/dbconfig/20220804-005328-marostegui.json
[00:56:12] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:08:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P32277 and previous config saved to /var/cache/conftool/dbconfig/20220804-010834-marostegui.json
[01:18:18] (CR) Dzahn: [C: +2] devtools: Allow for scap deployment of scap [puppet] - https://gerrit.wikimedia.org/r/820220 (https://phabricator.wikimedia.org/T314195) (owner: Dduvall)
[01:23:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32278 and previous config saved to /var/cache/conftool/dbconfig/20220804-012341-marostegui.json
[01:23:46] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[01:25:02] (CR) Dzahn: "yes, this is obviously not the real secret prod key but a "fake key" but it's not as fake as the SNAKEOIL string which meant things would " [labs/private] - https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: Dduvall)
[01:26:00] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:31:06] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:35:56] (CR) Dzahn: [C: +2] phabricator: Provide script for ensuring correct config file ownership [puppet] - https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[01:36:22] (CR) Dzahn: [C: +2] phabricator: Stop managing /srv/phab/repos [puppet] - https://gerrit.wikimedia.org/r/820213 (https://phabricator.wikimedia.org/T313953) (owner: Dduvall)
[01:38:41] (CR) Dzahn: [C: +2] "noop on phab1001 and others" [puppet] - https://gerrit.wikimedia.org/r/820213 (https://phabricator.wikimedia.org/T313953) (owner: Dduvall)
[01:38:50] (PS4) Dzahn: phabricator: Provide script for ensuring correct config file ownership [puppet] - https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[01:40:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:45:51] (CR) Dzahn: "/usr/local/sbin/phab_deploy_ensure_config_ownership has been created" [puppet] - https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[01:46:07] (PS1) Reedy: CommonSettings-labs: Fix usage of $wgSFSValidateIPListLocationMD5 [mediawiki-config] - https://gerrit.wikimedia.org/r/820254
[01:48:13] (PS1) Reedy: wikitech: Remove old LDAP config vars [mediawiki-config] - https://gerrit.wikimedia.org/r/820255
[01:50:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:58:05] SRE, SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (Dzahn) In progress→Resolved Thanks for confirming!:) Closing as resolved.
[01:58:24] SRE, SRE-swift-storage, ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (Dzahn)
[02:20:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:32:41] (PS2) KartikMistry: Update cxserver to 2022-08-04-022612-production [deployment-charts] - https://gerrit.wikimedia.org/r/820075 (https://phabricator.wikimedia.org/T313296)
[02:56:13] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:17:49] SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) >>! In T279664#8122731, @Joe wrote: > Do we expect that to happen regularly on a high percentage of requests? If 17% of all requests need to make...
[03:19:15] (PS1) KartikMistry: Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default [mediawiki-config] - https://gerrit.wikimedia.org/r/820261 (https://phabricator.wikimedia.org/T308829)
[03:47:14] SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) >>! In T279664#8123041, @MatthewVernon wrote: > Without that, I'm not sure what we can do to work around the fact that MW doesn't reliably write/d...
[04:09:02] !log krinkle@mwmaint1002 pull aborted: (duration: 00m 05s)
[04:17:48] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:29:26] !log on mw2377 fiddling with CPU frequency control and doing benchmarks
[04:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:31:42] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:40:38] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:51:16] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:58:28] (PS3) KartikMistry: Update cxserver to 2022-08-04-022612-production [deployment-charts] - https://gerrit.wikimedia.org/r/820075 (https://phabricator.wikimedia.org/T313296)
[05:10:46] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:16:56] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2030.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[05:17:02] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[05:17:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2030.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[05:22:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2030.codfw.wmnet with OS bullseye
[05:22:24] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2030.codfw.wmnet with OS bullseye
[05:23:57] (PS1) Marostegui: Revert "mariadb: Disable notifications pdu C rows" [puppet] - https://gerrit.wikimedia.org/r/820266
[05:23:59] (PS1) Marostegui: Revert "mariadb: Disable notifications on codfw racks" [puppet] - https://gerrit.wikimedia.org/r/820267
[05:26:16] (CR) Marostegui: [C: +2] Revert "mariadb: Disable notifications pdu C rows" [puppet] - https://gerrit.wikimedia.org/r/820266 (owner: Marostegui)
[05:26:25] (CR) Marostegui: [C: +2] Revert "mariadb: Disable notifications on codfw racks" [puppet] - https://gerrit.wikimedia.org/r/820267 (owner: Marostegui)
[05:26:42] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:30:13] (PS1) Marostegui: production-m5.sql: Remove grants for labweb1001/labweb1002 [puppet] - https://gerrit.wikimedia.org/r/820286 (https://phabricator.wikimedia.org/T314528)
[05:32:43] * kart_ updating cxserver..
[05:32:49] (CR) KartikMistry: [C: +2] Update cxserver to 2022-08-04-022612-production [deployment-charts] - https://gerrit.wikimedia.org/r/820075 (https://phabricator.wikimedia.org/T313296) (owner: KartikMistry)
[05:35:40] (Abandoned) Muehlenhoff: librenms: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810323 (owner: Muehlenhoff)
[05:36:44] (Merged) jenkins-bot: Update cxserver to 2022-08-04-022612-production [deployment-charts] - https://gerrit.wikimedia.org/r/820075 (https://phabricator.wikimedia.org/T313296) (owner: KartikMistry)
[05:36:59] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2030.codfw.wmnet with reason: host reimage
[05:38:34] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:39:04] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:40:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2030.codfw.wmnet with reason: host reimage
[05:41:41] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:42:35] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:43:21] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:43:22] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:43:42] RECOVERY - etcd request latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:44:14] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:49:40] !log Updated cxserver to 2022-08-04-022612-production (T313296, T308248)
[05:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:44] T313296: Enable Content and Section translation on wikipedias with new MT support from Google - https://phabricator.wikimedia.org/T313296
[05:49:45] T308248: Newly supported languages in Google Translate - https://phabricator.wikimedia.org/T308248
[05:54:52] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:57:22] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:57:56] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:58:16] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:59:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2030.codfw.wmnet with OS bullseye
[05:59:34] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2030.codfw.wmnet with OS bullseye completed: - ganeti2030 (**PASS**) - Downtimed on...
[06:00:05] kormat, marostegui, and Amir1: Time to snap out of that daydream and deploy Primary database switchover. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T0600).
[06:01:28] Puppet, Infrastructure-Foundations, Patch-For-Review, Sustainability: Enable bracketed-paste-mode for production shells (e.g. deployment, mwmaint) - https://phabricator.wikimedia.org/T293614 (MoritzMuehlenhoff) >>! In T293614#8127446, @Lucas_Werkmeister_WMDE wrote: > That’s great, thanks! In that...
[06:02:28] SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff)
[06:05:14] (CR) DCausse: "I believe that the version of the corresponding Chart.yaml must be changed for this change to be deployed as a new chart." [deployment-charts] - https://gerrit.wikimedia.org/r/819752 (https://phabricator.wikimedia.org/T314426) (owner: Ebernhardson)
[06:06:06] <_joe_> !log restarted memcached on mc2038 to pick up the actual production configuration
[06:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:20] RECOVERY - Memcached on mc2038 is OK: TCP OK - 0.032 second response time on 10.192.0.191 port 11214 https://wikitech.wikimedia.org/wiki/Memcached
[06:21:00] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:25:49] SRE, Cloud-VPS, cloud-services-team (Kanban): CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (tstarling) I was doing comparative benchmarks of eqiad and codfw. @ori suggested that I look at CPU scaling as a possible reason for the discrepancy. The performance impact of setting...
[06:31:05] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:35:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:38:44] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:48:56] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:54:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet
[06:58:20] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[06:58:27] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[06:58:42] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[07:00:05] Amir1, apergos, and jnuche: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T0700).
[07:00:12] morning! there is one trainee signed up today but there are no patches in the queue.
[07:00:42] I am in the google meet in case our trainee turns up, in which case I'll let them know to reschedule
[07:02:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:02:25] 👍
[07:02:36] (and good morning!)
[07:02:37] (CR) Ayounsi: "recheck" [dns] - https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: Ayounsi)
[07:03:44] (CR) CI reject: [V: -1] add include for 2620:0:862:fe08::/64 PTR [dns] - https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: Ayounsi)
[07:03:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet
[07:05:12] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[07:07:57] (PS1) Marostegui: mariadb: Downtime D3 databases [puppet] - https://gerrit.wikimedia.org/r/820369 (https://phabricator.wikimedia.org/T310146)
[07:08:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es[2023-2025].codfw.wmnet with reason: codfw pdu maintenance
[07:09:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es[2023-2025].codfw.wmnet with reason: codfw pdu maintenance
[07:09:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2030.codfw.wmnet to cluster codfw and group A
[07:09:36] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2030.codfw.wmnet to cluster codfw and group A
[07:09:42] (CR) Marostegui: [C: +2] mariadb: Downtime D3 databases [puppet] - https://gerrit.wikimedia.org/r/820369 (https://phabricator.wikimedia.org/T310146) (owner: Marostegui)
[07:10:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet
[07:11:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:11:48] (CR) Ayounsi: "recheck" [dns] - https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: Ayounsi)
[07:12:04] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:12:41] (PS1) Marostegui: mariadb: Disable notifications DBs in C5 [puppet] - https://gerrit.wikimedia.org/r/820370 (https://phabricator.wikimedia.org/T310145)
[07:13:41] (CR) Marostegui: [C: +2] mariadb: Disable notifications DBs in C5 [puppet] - https://gerrit.wikimedia.org/r/820370 (https://phabricator.wikimedia.org/T310145) (owner: Marostegui)
[07:14:04] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:14:18] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2135,2160].codfw.wmnet with reason: codfw pdu maintenance
[07:16:13] (CR) Ayounsi: [C: +2] add include for 2620:0:862:fe08::/64 PTR [dns] - https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: Ayounsi)
[07:16:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2135,2160].codfw.wmnet with reason: codfw pdu maintenance
[07:16:20] (PS2) Ayounsi: add include for 2620:0:862:fe08::/64 PTR [dns] - https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221)
[07:17:26] (PS1) Marostegui: mariadb: Disable notifications DBs in C6 [puppet] - https://gerrit.wikimedia.org/r/820371 (https://phabricator.wikimedia.org/T310145)
[07:18:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet
[07:19:19] (PS2) Marostegui: mariadb: Disable notifications DBs in C6 [puppet] - https://gerrit.wikimedia.org/r/820371 (https://phabricator.wikimedia.org/T310145)
[07:20:19] (CR) Marostegui: [C: +2] mariadb: Disable notifications DBs in C6 [puppet] - https://gerrit.wikimedia.org/r/820371 (https://phabricator.wikimedia.org/T310145) (owner: Marostegui)
[07:21:16] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:23:18] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:23:30] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:25:33] (PS1) Slyngshede: Initial check in [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820372
[07:28:15] SRE, SRE-swift-storage, ops-codfw, DBA, Patch-For-Review: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (Marostegui) All db*, es* hosts powered off.
[07:29:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2030.codfw.wmnet to cluster codfw and group A
[07:30:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2030.codfw.wmnet to cluster codfw and group A
[07:30:57] SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (Marostegui) All db* hosts powered off
[07:31:03] (PS1) Ayounsi: Revert "geodns: Map out African countries by DC latency" [dns] - https://gerrit.wikimedia.org/r/820272
[07:31:13] (PS2) Ayounsi: Revert "geodns: Map out African countries by DC latency" [dns] - https://gerrit.wikimedia.org/r/820272
[07:32:19] (CR) Ayounsi: [C: +2] Revert "geodns: Map out African countries by DC latency" [dns] - https://gerrit.wikimedia.org/r/820272 (owner: Ayounsi)
[07:35:45] (JobUnavailable) firing: (5) Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:36:21] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[07:36:35] PROBLEM - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:40:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:40:45] (JobUnavailable) firing: (6) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:42:25] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (MoritzMuehlenhoff)
[07:42:39] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (MoritzMuehlenhoff)
[07:44:40] (PS1) Marostegui: instances.yaml: Remove db2089 from dbctl [puppet] - https://gerrit.wikimedia.org/r/820374 (https://phabricator.wikimedia.org/T313799)
[07:45:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:46:30] SRE, ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (ayounsi) p:Medium→High There are currently 3 Icinga alerts for servers with a failed PSU: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=kafka-main2002&...
[07:46:57] !log grow sda/sdb 3 by 100G on thanos-be1003 - T314275 [07:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:00] T314275: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275 [07:47:05] !log grow sda/sdb 3 by 100G on thanos-be2002 - T314275 [07:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:46] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db2089 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/820374 (https://phabricator.wikimedia.org/T313799) (owner: 10Marostegui) [07:49:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2089 from dbctl T313799', diff saved to https://phabricator.wikimedia.org/P32280 and previous config saved to /var/cache/conftool/dbconfig/20220804-074957-marostegui.json [07:50:02] T313799: decommission db2089 - https://phabricator.wikimedia.org/T313799 [07:50:10] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: route k8s messages to k8s partition [puppet] - 10https://gerrit.wikimedia.org/r/820182 (https://phabricator.wikimedia.org/T314381) (owner: 10Cwhite) [07:50:16] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: alert on Icinga max check latency [alerts] - 10https://gerrit.wikimedia.org/r/820072 (https://phabricator.wikimedia.org/T314353) (owner: 10Filippo Giunchedi) [07:50:47] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [07:50:52] (03CR) 10Ladsgroup: [C: 03+1] production-m5.sql: Remove grants for labweb1001/labweb1002 [puppet] - 10https://gerrit.wikimedia.org/r/820286 (https://phabricator.wikimedia.org/T314528) (owner: 
10Marostegui) [07:52:34] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10Marostegui) @Papaul we have some doubts about whether C1 was done or not. Can you update the list of racks that were done yesterday? Thanks! [07:54:28] (03CR) 10Marostegui: [C: 03+2] production-m5.sql: Remove grants for labweb1001/labweb1002 [puppet] - 10https://gerrit.wikimedia.org/r/820286 (https://phabricator.wikimedia.org/T314528) (owner: 10Marostegui) [07:55:11] !log Remove grants for 208.80.154.160/208.80.155.109 T314528 [07:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:15] T314528: Revoke MariaDB grants for labweb1001/1002 - https://phabricator.wikimedia.org/T314528 [07:58:43] (03PS4) 10Ladsgroup: site.pp: Combine mariadb replicas and master in each section [puppet] - 10https://gerrit.wikimedia.org/r/820102 [07:59:32] 10SRE: consider hybrid caching options for ssd+disk - https://phabricator.wikimedia.org/T88992 (10fgiunchedi) a:05fgiunchedi→03None [07:59:46] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (10fgiunchedi) a:05fgiunchedi→03None [08:00:02] 10SRE, 10Security: Disable agent forwarding to important hosts - https://phabricator.wikimedia.org/T198138 (10fgiunchedi) a:05fgiunchedi→03None [08:00:09] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:00:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:01:15] 10SRE, 10Observability-Alerting, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (10fgiunchedi) [08:02:06] (03CR) 10Ladsgroup: [C: 03+2] site.pp: Combine mariadb replicas and master in each section [puppet] - 10https://gerrit.wikimedia.org/r/820102 (owner: 10Ladsgroup) [08:03:06] 10SRE, 10Observability-Alerting, 10User-fgiunchedi: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (10fgiunchedi) 05Open→03Stalled We are now alerting on elevated max check latency, I'm going to stall the task and re-evaluate in a couple of months if we need to deploy auto-remed... [08:04:11] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, T311686 [08:04:16] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [08:04:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, T311686 [08:04:55] (03PS1) 10Ladsgroup: Start reading from new templatelinks columns in commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820376 (https://phabricator.wikimedia.org/T306673) [08:05:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:07:51] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 162, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:12:45] (03PS2) 10Giuseppe Lavagetto: scap: remove configuration for deploy* [puppet] - 10https://gerrit.wikimedia.org/r/819630 [08:12:47] (03Abandoned) 10Slyngshede: Initial check in [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820372 (owner: 10Slyngshede) [08:12:58] (03CR) 10Giuseppe Lavagetto: scap: remove configuration for deploy* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819630 (owner: 10Giuseppe Lavagetto) [08:13:01] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:16:24] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 10:45:00 on mc[2047-2048].codfw.wmnet with reason: PDU swap [08:16:38] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:45:00 on mc[2047-2048].codfw.wmnet with reason: PDU swap [08:16:43] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.2.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [08:18:14] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Marostegui) Any rough ETA about when these hosts will be ready? 
We are also seeing some mgmt alerts for db1186, db1187 and db1188 regarding their... [08:18:35] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [08:19:15] !log power off mc2047 and mc2048 [08:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap: remove configuration for deploy* [puppet] - 10https://gerrit.wikimedia.org/r/819630 (owner: 10Giuseppe Lavagetto) [08:19:48] (03CR) 10Ladsgroup: auto_schema: Start depooling codfw replicas (031 comment) [software] - 10https://gerrit.wikimedia.org/r/820142 (https://phabricator.wikimedia.org/T314486) (owner: 10Ladsgroup) [08:19:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132, db111, db1127, db1143', diff saved to https://phabricator.wikimedia.org/P32281 and previous config saved to /var/cache/conftool/dbconfig/20220804-081958-root.json [08:21:51] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:22:03] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) I am off starting tomorrow and back online on the 17th. 
As we are not fully sure about the situation with 10.6 hosts, I... [08:22:19] !log oblivian@mwmaint1002 pull aborted: (duration: 00m 11s) [08:23:53] (03CR) 10Ayounsi: "Thanks for the refactor!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [08:24:41] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 10:36:00 on mc[2049-2050].codfw.wmnet with reason: PDU swap [08:24:47] (03CR) 10Marostegui: "Good, so after merging this, we need to start doing codfw master manually or listing it, right?" [software] - 10https://gerrit.wikimedia.org/r/820142 (https://phabricator.wikimedia.org/T314486) (owner: 10Ladsgroup) [08:24:55] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:36:00 on mc[2049-2050].codfw.wmnet with reason: PDU swap [08:26:09] (03CR) 10Ayounsi: PeeringDB API: initial commit (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [08:26:20] !log power off mc2049 and mc2050 [08:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:33] (03CR) 10Ladsgroup: auto_schema: Start depooling codfw replicas (031 comment) [software] - 10https://gerrit.wikimedia.org/r/820142 (https://phabricator.wikimedia.org/T314486) (owner: 10Ladsgroup) [08:27:41] (03CR) 10Marostegui: [C: 03+1] auto_schema: Start depooling codfw replicas [software] - 10https://gerrit.wikimedia.org/r/820142 (https://phabricator.wikimedia.org/T314486) (owner: 10Ladsgroup) [08:28:58] !log imported gsasl 1.8.0-8+wmf1 to stretch-wikimedia [08:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:38] (03PS1) 10Giuseppe Lavagetto: maintenance: restart php-fpm if needed [puppet] - 10https://gerrit.wikimedia.org/r/820378 [08:32:29] !log kubectl cordon kubernetes2022.codfw.wmnet [08:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:10] (03PS2) 10Giuseppe Lavagetto: maintenance: restart php-fpm 
if needed [puppet] - 10https://gerrit.wikimedia.org/r/820378 [08:35:03] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:35:56] !log kubectl drain kubernetes2022.codfw.wmnet [08:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:54] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2022.codfw.wmnet [08:38:57] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 10:22:00 on kubernetes2022.codfw.wmnet with reason: PDU swap [08:39:10] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:22:00 on kubernetes2022.codfw.wmnet with reason: PDU swap [08:39:15] RECOVERY - etcd request latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:39:15] (03PS2) 10Giuseppe Lavagetto: scap: do not restart php on the mwmaint servers [puppet] - 10https://gerrit.wikimedia.org/r/819631 [08:39:25] (03CR) 10Giuseppe Lavagetto: scap: do not restart php on the mwmaint servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819631 (owner: 10Giuseppe Lavagetto) [08:39:31] (03PS3) 10Giuseppe Lavagetto: scap: do not restart php on the mwmaint servers [puppet] - 10https://gerrit.wikimedia.org/r/819631 [08:39:35] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:39:45] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:41:18] <_joe_> I have no idea of what's the reason for ^^ [08:41:26] <_joe_> and I don't have time to investigate tbh [08:43:44] !log oblivian@deploy1002 Synchronized README: testing new scap configuration (duration: 03m 18s) [08:45:17] !log power off kubernetes2022 [08:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap: do not restart php on the mwmaint servers [puppet] - 10https://gerrit.wikimedia.org/r/819631 (owner: 10Giuseppe Lavagetto) [08:48:27] !log draining ganeti2017 T311686 [08:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:32] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [08:48:32] (03CR) 10Btullis: [C: 03+2] Add an option to use the PKI for etcd intra-cluster certificates [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [08:49:56] <_joe_> btullis: ahem [08:50:13] <_joe_> I'd like to be consulted on patches that can change the behaviour of etcd [08:50:37] <_joe_> so, before I merge it with mine, did you stop puppet on all etcd nodes before merging this patch? [08:50:39] _joe_: Sincere apologies. It's a noop. [08:51:03] <_joe_> btullis: did you verify that for all etcd clusters in prod? 
if so it's ok [08:51:04] So no, I wasn't planning to touch any existing etcd clusters. [08:51:19] Yes, I ran a PCC check for all etcd roles. [08:51:27] <_joe_> ok then :) [08:51:27] Only a parameter change that defaults to false. [08:51:50] <_joe_> ok, merging :) [08:51:56] I would have added you for review, but I didn't know who would like to have been consulted. [08:51:59] Thanks. [08:52:31] <_joe_> btullis: no problems, that's why I was telling you [08:52:49] Gotcha, thanks. [08:52:55] (03PS1) 10Ayounsi: Interface description: don't add the z_side when disabled [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/820380 [08:53:51] PROBLEM - IPMI Sensor Status on es2021 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:53:53] <_joe_> usually "git blame" on the files you're changing is a good starting point to figure out who to ask for a review [08:54:13] <_joe_> {{merged}} btw :) [08:54:15] I will open a task for es2021 [08:55:07] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:55:20] Strangely, my PCC run has disappeared from the gerrit comments. 
[08:55:48] 10ops-codfw, 10DBA: es2021 (B3) now power supply redudancy - https://phabricator.wikimedia.org/T314559 (10Marostegui) [08:55:58] 10ops-codfw, 10DBA: es2021 (B3) now power supply redudancy - https://phabricator.wikimedia.org/T314559 (10Marostegui) p:05Triage→03Medium [08:56:11] But here was one: https://puppet-compiler.wmflabs.org/pcc-worker1001/1384/ [08:56:18] 10ops-codfw, 10DBA: es2021 (B3) lost power supply redundancy - https://phabricator.wikimedia.org/T314559 (10Marostegui) [08:56:25] (03PS1) 10Muehlenhoff: Add library hint for gsasl [puppet] - 10https://gerrit.wikimedia.org/r/820383 [08:56:34] jelto: ^ bgp status needs an ack again [08:56:54] s/ack/downtime/ [08:56:58] (KubernetesCalicoDown) firing: kubernetes2022.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:57:05] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:57:45] I'll do that, same for CalicoDown (I thought I created the correct silence in alertmanager :/) [08:57:52] !log oblivian@mwmaint1002 pull aborted: (duration: 00m 18s) [08:58:17] !log installing gsasl security updates [08:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:13] (03PS2) 10Ayounsi: Interface description: don't add the z_side when disabled [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/820380 [09:02:08] (03PS1) 10Urbanecm: SpecialEditGrowthConfigLogger: Update schema version [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820276 (https://phabricator.wikimedia.org/T314173) [09:03:09] !log oblivian@mwmaint1002 pull aborted: (duration: 00m 06s) [09:04:17] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:05:12] jouncebot: nowandnext [09:05:12] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [09:05:12] In 0 hour(s) and 54 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T1000) [09:05:19] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [09:05:29] (03CR) 10Urbanecm: [C: 03+2] SpecialEditGrowthConfigLogger: Update schema version [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820276 (https://phabricator.wikimedia.org/T314173) (owner: 10Urbanecm) [09:10:01] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [09:11:43] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:11:57] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:11:59] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:12:09] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: 
OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:12:55] !log jelto@cumin1001 conftool action : set/pooled=inactive; selector: name=kubernetes2022.codfw.wmnet [09:13:09] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:13:23] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:13:31] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 3/6 UP : OSPFv3: 3/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:13:47] (03PS1) 10Giuseppe Lavagetto: scap: do not use double quotes to define an empty value [puppet] - 10https://gerrit.wikimedia.org/r/820384 [09:14:39] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [09:14:55] (03PS1) 10Ayounsi: Don't add server name on disabled interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/820385 [09:15:29] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:15:43] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] scap: do not use double quotes to define an empty value [puppet] - 10https://gerrit.wikimedia.org/r/820384 (owner: 10Giuseppe Lavagetto) [09:15:43] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:15:51] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:15:58] (03PS4) 10Urbanecm: [beta] Growth: Switch to structured mentor list at all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808269 (https://phabricator.wikimedia.org/T310905) [09:16:06] (03CR) 10Urbanecm: [C: 03+2] [beta] Growth: Switch to structured mentor list at all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808269 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [09:16:18] (03PS2) 10Urbanecm: testwiki: Growth: Switch to structured mentor list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819053 (https://phabricator.wikimedia.org/T310905) [09:16:23] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:16:26] (03CR) 10Urbanecm: [C: 03+2] testwiki: Growth: Switch to structured mentor list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819053 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [09:16:35] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:16:39] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:16:47] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:17:17] (03PS1) 10Marostegui: mariadb: Decommission db2089 [puppet] - 10https://gerrit.wikimedia.org/r/820386 (https://phabricator.wikimedia.org/T313799) [09:17:24] (03Merged) 10jenkins-bot: [beta] Growth: Switch to structured mentor list at all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808269 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [09:17:44] (03Merged) 10jenkins-bot: testwiki: Growth: Switch to structured mentor list [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/819053 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [09:18:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2089.codfw.wmnet [09:20:14] (03PS1) 10Phuedx: beta: Remove $wgMediaViewerNetworkPerformanceSamplingFactor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820389 (https://phabricator.wikimedia.org/T310890) [09:21:03] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01132 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:21:13] RECOVERY - Check systemd state on search-loader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:21:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:21:55] (03CR) 10Cathal Mooney: [C: 03+1] "+1. I can see how it'd be a nice to have but agree it's not worth adding steps for other SREs." 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/820380 (owner: 10Ayounsi) [09:22:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:22:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:22:21] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Interface description: don't add the z_side when disabled [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/820380 (owner: 10Ayounsi) [09:23:03] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [09:23:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:23:59] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [09:25:29] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/820385 (owner: 10Ayounsi) [09:25:47] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: wmf-netbox.py update - ayounsi@cumin1001 [09:26:10] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2089 [puppet] - 10https://gerrit.wikimedia.org/r/820386 (https://phabricator.wikimedia.org/T313799) (owner: 10Marostegui) [09:26:18] (03PS1) 10Jbond: P:etcd::v3: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/820390 [09:26:35] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0614a39bf15252c95a96565dd7c986237f3d3323: testwiki: Growth: Switch to structured mentor list (T310905) (duration: 03m 38s) [09:26:39] btullis: FYI your change is causing puppet failures, I think the above fixes it, about to merge [09:26:40] T310905: Deploy structured wikitext mentor list to Wikimedia wikis - https://phabricator.wikimedia.org/T310905 [09:26:41] (03CR) 10Ayounsi: [C: 03+2] Don't add server name on disabled interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/820385 (owner: 10Ayounsi) [09:26:44] 10ops-codfw, 10decommission-hardware: decommission db2089 - https://phabricator.wikimedia.org/T313799 (10Marostegui) a:03Papaul [09:26:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:26:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2089.codfw.wmnet [09:26:48] (03Merged) 10jenkins-bot: SpecialEditGrowthConfigLogger: Update schema version [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820276 (https://phabricator.wikimedia.org/T314173) (owner: 10Urbanecm) [09:26:49] 10ops-codfw, 10decommission-hardware: decommission db2089 - https://phabricator.wikimedia.org/T313799 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2089.codfw.wmnet` - db2089.codfw.wmnet
(**PASS**) - Downtimed host on Icinga/Alertmanager - Found phys... [09:26:56] 10ops-codfw, 10decommission-hardware: decommission db2089 - https://phabricator.wikimedia.org/T313799 (10Marostegui) @Papaul this is ready for you [09:27:24] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: wmf-netbox.py update - ayounsi@cumin1001 [09:28:05] (03CR) 10Jbond: [C: 03+2] P:etcd::v3: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/820390 (owner: 10Jbond) [09:28:19] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:etcd::v3: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/820390 (owner: 10Jbond) [09:28:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:29:15] (03PS1) 10Marostegui: db2177: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/820391 (https://phabricator.wikimedia.org/T311494) [09:29:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:29:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:29:35] btullis: https://gerrit.wikimedia.org/r/c/operations/puppet/+/820390 has fixed the issue [09:29:39] (03PS1) 10Urbanecm: testwiki: Growth: Assign enrollasmentor to * [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820392 (https://phabricator.wikimedia.org/T310905) [09:29:41] (03CR) 10Urbanecm: [C: 03+2] testwiki: Growth: Assign enrollasmentor to * [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820392 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [09:30:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:30:34] (03CR) 10Marostegui: [C: 03+2] db2177: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/820391 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui) [09:30:50] (03Merged) 10jenkins-bot: Don't add server name on disabled 
interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/820385 (owner: 10Ayounsi) [09:31:14] (03Merged) 10jenkins-bot: testwiki: Growth: Assign enrollasmentor to * [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820392 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [09:31:34] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 9:30:00 on 9 hosts with reason: PDU swap [09:31:41] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 9:30:00 on 9 hosts with reason: PDU swap [09:32:29] !log set/pooled=inactive mw22[71-79].codfw.wmnet [09:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:00] (03PS1) 10Marostegui: instances.yaml: Add db2177 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/820393 (https://phabricator.wikimedia.org/T311494) [09:35:09] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2177 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/820393 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui) [09:35:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:35:31] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ddcd333015bb58a98709a5005a5db7e8519dd0a5: testwiki: Growth: Assign enrollasmentor to * (T310905) (duration: 03m 41s) [09:35:34] T310905: Deploy structured wikitext mentor list to Wikimedia wikis - https://phabricator.wikimedia.org/T310905 [09:36:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:36:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:37:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2177 to s3 T311494', diff saved to https://phabricator.wikimedia.org/P32282 and previous config saved to /var/cache/conftool/dbconfig/20220804-093704-marostegui.json [09:37:08] T311494: Productionize db2175.codfw.wmnet - db2182.codfw.wmnet - 
https://phabricator.wikimedia.org/T311494 [09:37:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:37:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] jwt_authorizer: Provide microservice for JSON Web Token authorization [puppet] - 10https://gerrit.wikimedia.org/r/816018 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [09:37:56] (03PS1) 10Lucas Werkmeister (WMDE): Remove unused $wgWBCSEnableDispatchingQueryBuilder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820397 [09:37:58] (03PS1) 10Lucas Werkmeister (WMDE): Remove unused SearchSettingsForSDC.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820398 [09:38:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:38:38] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0004921 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:38:50] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.23/extensions/GrowthExperiments/includes/EventLogging/SpecialEditGrowthConfigLogger.php: ba67dd940217e9f786f4349b4da0fe088475fde9: SpecialEditGrowthConfigLogger: Update schema version (T314173, T312148) (duration: 03m 18s) [09:38:56] * urbanecm done [09:38:56] T314173: editgrowthconfig schema: '' should NOT have additional properties, - https://phabricator.wikimedia.org/T314173 [09:38:56] T312148: Add instrumentation to Special:EditGrowthConfig - https://phabricator.wikimedia.org/T312148 [09:39:53] !log power off mw22[71-79].codfw.wmnet [09:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:14] PROBLEM - puppet last run on mwdebug2001 is CRITICAL: 
CRITICAL: Puppet has been disabled for 604998 seconds, message: joe messing with php-fpm - oblivian, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:41:01] (03PS7) 10Thiemo Kreuz (WMDE): Streamline/modernize code in MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737857 [09:41:39] <_joe_> sogj O [09:42:01] <_joe_> err. off by one. Sigh I'll fix mwdebug2001 [09:42:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:43:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:43:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:43:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:44:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:45:40] RECOVERY - puppet last run on mwdebug2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:47:39] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Update requirements [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/820399 (owner: 10Ayounsi) [09:47:59] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update requirements - ayounsi@cumin1001 [09:49:28] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [09:49:35] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update requirements - ayounsi@cumin1001 [09:50:01] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10Sustainability: Enable bracketed-paste-mode for production shells (e.g. deployment, mwmaint) - https://phabricator.wikimedia.org/T293614 (10Lucas_Werkmeister_WMDE) 05Open→03Stalled [09:54:44] (03PS1) 10Lucas Werkmeister (WMDE): Remove unused $wgIncludejQueryMigrate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820404 (https://phabricator.wikimedia.org/T280944) [09:55:42] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:56:41] yay, stashbot is back [09:57:05] should we manually repeat the logmsgbot !logs since 9:42 UTC? [10:00:04] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T1000). 
[10:00:36] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update requirements + wmf-netbox - ayounsi@cumin1001 [10:00:44] !log stop db2099 T310145 [10:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:48] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 [10:02:14] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update requirements + wmf-netbox - ayounsi@cumin1001 [10:03:17] !log stashbot temporarily parted and lost several logs between 9:42 UTC and 9:49 UTC; mainly mwdebug helmfile start/done, also ayounsi sre.deploy.python-code cookbook to cumin1001, cumin2002; see IRC logs [10:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:16] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:05:00] (03PS2) 10Giuseppe Lavagetto: trafficserver: allow x-wikimedia-debug to pick a php backend [puppet] - 10https://gerrit.wikimedia.org/r/819510 (https://phabricator.wikimedia.org/T312653) [10:05:06] (03CR) 10Giuseppe Lavagetto: trafficserver: allow x-wikimedia-debug to pick a php backend (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/819510 (https://phabricator.wikimedia.org/T312653) (owner: 10Giuseppe Lavagetto) [10:11:20] PROBLEM - Host parse2014 is DOWN: PING CRITICAL - Packet loss = 100% [10:12:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:12:46] PROBLEM - Host parse2011 is 
DOWN: PING CRITICAL - Packet loss = 100% [10:12:58] PROBLEM - Host parse2012 is DOWN: PING CRITICAL - Packet loss = 100% [10:13:00] PROBLEM - Host parse2013 is DOWN: PING CRITICAL - Packet loss = 100% [10:13:00] PROBLEM - Host parse2015 is DOWN: PING CRITICAL - Packet loss = 100% [10:15:16] (03PS14) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 [10:16:46] PROBLEM - Host mw2352 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:17:22] PROBLEM - Host mw2350 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:22] PROBLEM - Host mw2351 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:22] PROBLEM - Host mw2353 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:22] PROBLEM - Host mw2354 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:22] PROBLEM - Host mw2355 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:23] PROBLEM - Host mw2356 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:23] PROBLEM - Host mw2357 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:24] PROBLEM - Host mw2358 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:24] PROBLEM - Host mw2360 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:26] PROBLEM - Host mw2359 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:18] (03CR) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE)) [10:18:24] PROBLEM - Host mw2363 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:28] PROBLEM - Host mw2376 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:28] PROBLEM - Host mw2361 is 
DOWN: PING CRITICAL - Packet loss = 100% [10:18:28] PROBLEM - Host mw2362 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:28] PROBLEM - Host mw2364 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:28] PROBLEM - Host mw2365 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:48] PROBLEM - Host mw2368 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:48] PROBLEM - Host mw2369 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:56] PROBLEM - Host mw2367 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:56] PROBLEM - Host mw2370 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:56] PROBLEM - Host mw2371 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:56] PROBLEM - Host mw2372 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:56] PROBLEM - Host mw2373 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:57] PROBLEM - Host mw2374 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:57] PROBLEM - Host mw2375 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:58] PROBLEM - Host mw2366 is DOWN: PING CRITICAL - Packet loss = 100% [10:19:32] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 9:00:00 on 32 hosts with reason: PDU swap [10:19:54] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 9:00:00 on 32 hosts with reason: PDU swap [10:20:07] (03PS3) 10Thiemo Kreuz (WMDE): Remove unused code from StaticSiteConfiguration class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737858 [10:20:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:22:54] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:25:11] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for gsasl [puppet] - 10https://gerrit.wikimedia.org/r/820383 (owner: 10Muehlenhoff) [10:25:18] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:27:16] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2017.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [10:27:22] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [10:27:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2017.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [10:30:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2017.codfw.wmnet with OS bullseye [10:30:16] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2017.codfw.wmnet with OS bullseye [10:35:50] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:44:52] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: 
cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:45:10] PROBLEM - Host backup2006 is DOWN: PING CRITICAL - Packet loss = 100% [10:49:18] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2017.codfw.wmnet with reason: host reimage [10:49:22] PROBLEM - Host db2126 is DOWN: PING CRITICAL - Packet loss = 100% [10:49:56] em, that is not supposed to happen [10:50:17] 10SRE, 10LDAP-Access-Requests: LDAP access for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10Aklapper) @Siko_WMDE: Please make any potential internal docs point to https://phabricator.wikimedia.org/tag/ldap-access-requests/ which has canonical instructions for such requests, for future refer... [10:51:55] 10SRE, 10LDAP-Access-Requests: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10Aklapper) [10:51:55] backup2006 seems down on mgmt interface too, so power, not network [10:52:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2017.codfw.wmnet with reason: host reimage [10:53:43] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2022.codfw.wmnet [10:53:58] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2015.codfw.wmnet [10:55:09] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:55:39] ah, I see now, the server was not put back up, and the alert just expired [11:01:10] (03PS1) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) [11:01:12] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10ayounsi) [11:01:19] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10ayounsi) [11:01:49] (03PS1) 10Marostegui: db2126: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/820417 [11:02:36] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [11:02:54] (03CR) 10Marostegui: [C: 03+2] db2126: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/820417 (owner: 10Marostegui) [11:04:54] (03PS2) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) [11:05:11] 10Puppet, 10SRE, 10Infrastructure-Foundations: remove old puppet certificates fom puppet master - https://phabricator.wikimedia.org/T314564 (10taavi) [11:08:06] 10Puppet, 10SRE, 10Infrastructure-Foundations: remove old puppet certificates fom puppet master - https://phabricator.wikimedia.org/T314564 (10jbond) from the scripot above we get the following list db2051.codfw.wmnet db2057.codfw.wmnet db2063.codfw.wmnet kafka1001.eqiad.wmnet kafka1002.eqiad.wmnet kafka10... 
[11:09:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2017.codfw.wmnet with OS bullseye [11:09:58] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2017.codfw.wmnet with OS bullseye completed: - ganeti2017 (**PASS**) - Downtimed on... [11:10:35] 10Puppet, 10SRE, 10Infrastructure-Foundations: remove old puppet certificates fom puppet master - https://phabricator.wikimedia.org/T314564 (10jcrespo) > This is likely host that where decommissioned before the current decommissioning scripts which force a puppet clean. Based on logs at T220002#5574262 that... [11:12:56] (03PS3) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) [11:14:42] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:44] 10Puppet, 10SRE, 10Infrastructure-Foundations: remove old puppet certificates fom puppet master - https://phabricator.wikimedia.org/T314564 (10jbond) 05Open→03Resolved a:03jbond > the decom script "lied" to us and had a bug and didn't delete the certs, even if it told us that it did This i think is the... 
[11:17:03] 10Puppet, 10SRE, 10Infrastructure-Foundations: remove old puppet certificates fom puppet master - https://phabricator.wikimedia.org/T314564 (10jbond) [11:17:07] 10SRE: Puppet certificate discrepancies - https://phabricator.wikimedia.org/T250483 (10jbond) [11:18:54] RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate [11:24:52] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:26:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2017.codfw.wmnet [11:34:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2017.codfw.wmnet [11:35:15] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-herron: Puppet hosts with signed certificate present on agent but not master - https://phabricator.wikimedia.org/T185239 (10jbond) [11:36:03] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-herron: Puppet hosts with signed certificate present on agent but not master - https://phabricator.wikimedia.org/T185239 (10jbond) 05Open→03Resolved a:03jbond Closing all servers listed have been decomissioned [11:36:27] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Puppet hosts with their cert revoked can still run puppet - https://phabricator.wikimedia.org/T184444 (10jbond) [11:36:45] (03PS4) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) [11:37:10] !log jmm@cumin2002 START - Cookbook 
sre.ganeti.addnode for new host ganeti2017.codfw.wmnet to cluster codfw and group D [11:40:34] 10Puppet, 10SRE, 10Infrastructure-Foundations: remove old puppet certificates from puppet master - https://phabricator.wikimedia.org/T314564 (10Aklapper) [11:41:00] (JobUnavailable) firing: (6) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:41:12] !log installing libpgjava security updates [11:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2017.codfw.wmnet to cluster codfw and group D [11:45:40] (03PS1) 10Jbond: P:puppetmasters: Convert 004 puppetmasteres to canaries [puppet] - 10https://gerrit.wikimedia.org/r/820428 (https://phabricator.wikimedia.org/T314136) [11:46:43] !log installing Linux 5.10.127-2 kernels on Bullseye hosts [11:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:43] (03PS2) 10Jbond: P:puppetmasters: Convert 004 puppetmasteres to canaries [puppet] - 10https://gerrit.wikimedia.org/r/820428 (https://phabricator.wikimedia.org/T314136) [11:50:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 7 DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36620/console" [puppet] - 10https://gerrit.wikimedia.org/r/820428 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [11:52:19] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:53:26] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:puppetmasters: Convert 004 puppetmasteres to canaries [puppet] - 10https://gerrit.wikimedia.org/r/820428 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [11:55:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:01:16] (03PS1) 10Jbond: hieradata: migrate idp-test2002 to canaries [puppet] - 10https://gerrit.wikimedia.org/r/820431 [12:02:16] (03CR) 10Jbond: [C: 03+2] hieradata: migrate idp-test2002 to canaries [puppet] - 10https://gerrit.wikimedia.org/r/820431 (owner: 10Jbond) [12:03:01] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:03:23] !log send sretest100[12] and idp-test2001 to the new puppetmaster[12]004 servers to test [12:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:47] (03PS5) 10Eigyan: [config]: Add click event logging for mobile and desktop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812391 (https://phabricator.wikimedia.org/T310852) [12:10:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown 
[12:19:11] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:26:05] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:31:50] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [12:32:07] 10SRE, 10Data-Engineering: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10CDanis) @BTullis are you still interested in this? 
[12:36:09] (03CR) 10Andrew Bogott: [C: 03+2] hieradata: switch traffic to cloudrabbit1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah) [12:43:18] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:45:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (10Andrew) https://gerrit.wikimedia.org/r/c/operations/puppet/+/816818 merged [12:45:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (10Andrew) Before this is closed out I want to review the firewall rules and see if we need to limit port access to VM + prod networks [12:46:38] (03PS4) 10Xcollazo: airflow - Configure new platform_eng instance and rename old one as legacy. [puppet] - 10https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858) [12:47:52] 10SRE, 10LDAP-Access-Requests: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10Dzahn) Hello @Siko_WMDE please create a user on the Wikitech wiki ( https://wikitech.wikimedia.org/wiki/Special:CreateAccount) and let us know the user name you picked once done. A... 
[12:48:08] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:48:42] !log installing Linux 4.19.249 kernels on Buster hosts [12:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:02] RECOVERY - Disk space on gitlab2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab2002&var-datasource=codfw+prometheus/ops [12:50:54] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:19] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Papaul) @ayounsi see @Marostegui comment above. Thnaks [12:54:40] (03PS1) 10CDanis: Print VO API response when we do escalate [software/klaxon] - 10https://gerrit.wikimedia.org/r/820439 (https://phabricator.wikimedia.org/T313603) [12:58:49] (03CR) 10Muehlenhoff: [C: 03+2] sysfs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811226 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T1300). 
[13:00:05] danisztls and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:45] (JobUnavailable) firing: (6) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:01:02] o/ I have a patch of my own coming in a sec [13:01:05] o/ [13:01:14] o/ [13:01:23] (I need 10 more minutes or so, if anyone else wants to start deploying first) [13:02:38] i'll start from danisztls's patch then [13:02:54] (03PS3) 10Majavah: QuickSurveys: Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819175 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza) [13:03:14] (03CR) 10Majavah: [C: 03+2] QuickSurveys: Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819175 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza) [13:04:13] (03PS1) 10Majavah: Remove unused CA P3P config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820441 [13:04:44] (03Merged) 10jenkins-bot: QuickSurveys: Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819175 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza) [13:05:05] ok [13:05:24] danisztls: can you test on mwdebug1001 please? 
[13:05:29] taavi: yes [13:06:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:06:54] that alert seems to be for codfw, probably due to the dc maintenance, ignoring [13:07:06] !log installing jetty9 security updates [13:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:08] taavi: looks good [13:08:23] thanks, syncing [13:09:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (10Andrew) >>! In T314522#8131078, @Andrew wrote: > Before this is closed out I want to review the firewall rules and see if we need to limit port access to VM + p... [13:11:16] (03PS1) 10Jbond: O:puppetmaster: introduce new puppetmaster[12]004 backends [puppet] - 10https://gerrit.wikimedia.org/r/820442 (https://phabricator.wikimedia.org/T314136) [13:11:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:11:26] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [13:11:52] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: 
[[gerrit:819175|QuickSurveys: Deploy research incentive survey to Bengali wiki (T314333)]] (duration: 03m 26s) [13:11:56] T314333: Deploy Research Incentive Survey on Bengali Wikipedia - https://phabricator.wikimedia.org/T314333 [13:11:58] danisztls: and it's live! [13:12:07] (03PS2) 10Majavah: Remove unused CA P3P config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820441 [13:12:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:12:08] taavi: thanks [13:12:15] (03CR) 10Majavah: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820441 (owner: 10Majavah) [13:13:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:13:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:13:36] (03Merged) 10jenkins-bot: Remove unused CA P3P config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820441 (owner: 10Majavah) [13:14:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:14:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36622/console" [puppet] - 10https://gerrit.wikimedia.org/r/820442 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [13:14:17] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2065.codfw.wmnet with reason: T310145 [13:14:20] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 [13:14:31] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2065.codfw.wmnet with reason: T310145 [13:14:33] !log intorudce new puppetmaster backends puppetmaster[12]004 [13:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:51] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:puppetmaster: introduce new 
puppetmaster[12]004 backends [puppet] - 10https://gerrit.wikimedia.org/r/820442 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [13:15:26] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10bking) [13:15:48] PROBLEM - Check systemd state on mw2386 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:17:09] taavi: should I continue with my config changes? [13:17:23] Lucas_WMDE: still syncing mine, just a sec [13:17:26] ok [13:17:55] !log taavi@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:820441|Remove unused CA P3P config]] (duration: 03m 09s) [13:18:03] Lucas_WMDE: all done [13:18:28] thanks [13:19:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:20:19] (03PS2) 10Lucas Werkmeister (WMDE): Remove unused $wgWBCSEnableDispatchingQueryBuilder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820397 [13:20:33] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove unused $wgWBCSEnableDispatchingQueryBuilder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820397 (owner: 10Lucas Werkmeister (WMDE)) [13:21:23] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (10diego) 05Resolved→03Open [13:21:35] (03Merged) 10jenkins-bot: Remove unused $wgWBCSEnableDispatchingQueryBuilder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820397 (owner: 10Lucas Werkmeister (WMDE)) [13:22:19] pulled to mwdebug1001, testing a bit [13:23:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:23:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:24:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:25:06] scap only 
restarting php-fpm on ~260 instead of ~300 hosts, I assume due to the codfw stuff [13:26:40] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/SearchSettingsForSDC.php: Config: [[gerrit:820397|Remove unused $wgWBCSEnableDispatchingQueryBuilder]] (duration: 03m 01s) [13:26:59] (03CR) 10Lucas Werkmeister (WMDE): "CCing other people who edited this file… is it okay to remove, or do you want to keep it around for future convenience?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820398 (owner: 10Lucas Werkmeister (WMDE)) [13:27:12] I’ll skip ^ that change for now and do the other two removals first [13:27:56] (03PS2) 10Lucas Werkmeister (WMDE): Remove unused $wgLegacyJavaScriptGlobals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820402 (https://phabricator.wikimedia.org/T72470) [13:29:05] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove unused $wgLegacyJavaScriptGlobals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820402 (https://phabricator.wikimedia.org/T72470) (owner: 10Lucas Werkmeister (WMDE)) [13:29:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:30:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:30:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:30:26] (03Merged) 10jenkins-bot: Remove unused $wgLegacyJavaScriptGlobals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820402 (https://phabricator.wikimedia.org/T72470) (owner: 10Lucas Werkmeister (WMDE)) [13:31:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:31:38] syncing [13:34:19] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[software/klaxon] - 10https://gerrit.wikimedia.org/r/820439 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis) [13:34:24] (03Abandoned) 10Jbond: P:adduser: apply adduser before any packages are installed [puppet] - 10https://gerrit.wikimedia.org/r/819541 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond) [13:34:34] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820402|Remove unused $wgLegacyJavaScriptGlobals (T72470)]] (1/2) (duration: 02m 58s) [13:34:38] T72470: Remove legacy javascript globals - https://phabricator.wikimedia.org/T72470 [13:35:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:36:15] (03Abandoned) 10Jbond: P:sretest: import blackbox to sretest to check if its just genrally slow [puppet] - 10https://gerrit.wikimedia.org/r/817789 (owner: 10Jbond) [13:36:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:37:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:37:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:37:54] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:820402|Remove unused $wgLegacyJavaScriptGlobals (T72470)]] (2/2) (duration: 02m 59s) [13:38:12] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (10diego) Hi! @MunizaA had a problem with her laptop, and she needs to add a new ssh public key to access the cluster. The new key is here P32283 @CDanis could you help us with this plea... 
[13:38:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:38:46] (03PS2) 10Lucas Werkmeister (WMDE): Remove unused $wgIncludejQueryMigrate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820404 (https://phabricator.wikimedia.org/T280944) [13:39:48] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2066.codfw.wmnet with reason: T310145 [13:39:48] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove unused $wgIncludejQueryMigrate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820404 (https://phabricator.wikimedia.org/T280944) (owner: 10Lucas Werkmeister (WMDE)) [13:39:54] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 [13:40:02] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2066.codfw.wmnet with reason: T310145 [13:40:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:16] (03PS3) 10Jbond: cli: Add ability to override th amount of retries and backoffs [software/debmonitor] - 10https://gerrit.wikimedia.org/r/812556 [13:40:18] (03CR) 10Jbond: cli: Add ability to override th amount of retries and backoffs (033 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/812556 (owner: 10Jbond) [13:40:41] (03Merged) 10jenkins-bot: Remove unused $wgIncludejQueryMigrate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820404 (https://phabricator.wikimedia.org/T280944) (owner: 10Lucas Werkmeister (WMDE)) [13:40:46] (03CR) 10Jbond: cli: Add ability to override th amount of retries and backoffs (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/812556 (owner: 10Jbond) [13:41:06] (03PS4) 10Jbond: cli: Add ability to override the amount of retries and backoffs [software/debmonitor] - 10https://gerrit.wikimedia.org/r/812556 [13:43:27] 
!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:44:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:44:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:44:42] (03PS1) 10Andrew Bogott: Trove: fix copy/paste user with trove_guest_rabbit_pass [puppet] - 10https://gerrit.wikimedia.org/r/820452 [13:45:03] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820404|Remove unused $wgIncludejQueryMigrate (T280944)]] (1/2) (duration: 02m 58s) [13:45:03] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:45:06] T280944: Phase out jQuery Migrate v3 - https://phabricator.wikimedia.org/T280944 [13:45:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:45:34] (03PS2) 10Andrew Bogott: Trove: fix copy/paste error with trove_guest_rabbit_pass [puppet] - 10https://gerrit.wikimedia.org/r/820452 [13:47:48] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (10RhinosF1) a:05MunizaA→03None Hi @Diego, the SRE on duty changes weekly. It is now @mutante. I'll make sure they see this. 
[13:48:18] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:820404|Remove unused $wgIncludejQueryMigrate (T280944)]] (2/2) (duration: 03m 03s) [13:48:42] (03CR) 10Andrew Bogott: [C: 03+2] Trove: fix copy/paste error with trove_guest_rabbit_pass [puppet] - 10https://gerrit.wikimedia.org/r/820452 (owner: 10Andrew Bogott) [13:49:45] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10bking) [13:49:55] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (10diego) Thanks @RhinosF1 ! [13:49:56] anything else to deploy? [13:50:06] otherwise I might do another one for MathUseRestBase [13:50:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:51:08] Lucas_WMDE: I have a couple of mw-config patches from the unused config thing if you want :P [13:51:18] sure ^^ [13:51:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:51:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:51:52] (03PS1) 10Lucas Werkmeister (WMDE): Remove unused $wgMathUseRestBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820454 (https://phabricator.wikimedia.org/T274436) [13:52:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:52:41] Reedy: like https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/820255 ? 
[13:53:02] yeah, that one and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/820254 [13:53:28] alright, grepping for the names from the first one [13:53:37] (03PS2) 10Lucas Werkmeister (WMDE): wikitech: Remove old LDAP config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820255 (owner: 10Reedy) [13:53:57] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "No occurrences in all of deploy1002:/srv/mediawiki-staging outside of wikitech.php 👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820255 (owner: 10Reedy) [13:54:11] 10SRE, 10LDAP-Access-Requests: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10WMDE-leszek) I confirm @Siko_WMDE's identity, and approve the request. Thank you! [13:54:55] (03Merged) 10jenkins-bot: wikitech: Remove old LDAP config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820255 (owner: 10Reedy) [13:55:26] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "That is indeed the correct variable name in php-1.39.0-wmf.23/extensions/StopForumSpam/extension.json / php-1.39.0-wmf.23/extensions/StopF" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820254 (owner: 10Reedy) [13:55:42] Reedy: want to test them on mwdebug? [13:55:43] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10MatthewVernon) All the ms-* nodes in `C4` & `C7` must be back and properly in service before we can start on `D2`, I'm afraid. I'll be on IRC, but please don't star... [13:55:46] otherwise I’m happy to sync them directly [13:56:27] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10MatthewVernon) Having moved `C2` to today, it needs to wait until all the ms-* nodes in `D2` are fully back up before starting. 
[13:56:36] I don't see much point testing them either :) [13:56:38] Feel free to sync <3 [13:56:41] sounds good :) [13:56:43] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10MatthewVernon) [13:56:50] (03PS2) 10Lucas Werkmeister (WMDE): CommonSettings-labs: Fix usage of $wgSFSValidateIPListLocationMD5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820254 (owner: 10Reedy) [13:56:53] syncing [13:56:54] (03CR) 10Elukey: "Ben, the change seems to fail for the new hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [13:57:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:58:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:58:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:58:37] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2058,2064].codfw.wmnet with reason: PDU work [13:58:52] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2058,2064].codfw.wmnet with reason: PDU work [13:58:58] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ec46c9c7-d251-4875-87f9-040b391ea22a) set by mvernon@cumin1001 for 1 day, 0:00:00 on 2 host(s) and... 
[13:59:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:59:36] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/wikitech.php: Config: [[gerrit:820255|wikitech: Remove old LDAP config vars]] (duration: 02m 54s) [13:59:52] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] CommonSettings-labs: Fix usage of $wgSFSValidateIPListLocationMD5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820254 (owner: 10Reedy) [14:01:13] jouncebot: now [14:01:14] No deployments scheduled for the next 1 hour(s) and 58 minute(s) [14:01:16] ok [14:01:50] (03Merged) 10jenkins-bot: CommonSettings-labs: Fix usage of $wgSFSValidateIPListLocationMD5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820254 (owner: 10Reedy) [14:03:19] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ayounsi) That's something for #ops-eqiad, I think there was some confusion during the provisioning of those hosts: For example, [[ https://netbox... [14:03:48] (03CR) 10Jforrester: "Oops, thanks for spotting this!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/820402 (https://phabricator.wikimedia.org/T72470) (owner: 10Lucas Werkmeister (WMDE)) [14:03:52] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 103, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:04:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:04:26] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove unused $wgLegacyJavaScriptGlobals (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820402 (https://phabricator.wikimedia.org/T72470) (owner: 10Lucas Werkmeister (WMDE)) [14:04:51] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2033.codfw.wmnet with reason: T310145 [14:04:54] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 [14:05:05] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2033.codfw.wmnet with reason: T310145 [14:05:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:05:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:05:58] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:820254|CommonSettings-labs: Fix usage of $wgSFSValidateIPListLocationMD5]] (duration: 02m 51s) [14:06:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:06:33] phuedx: should we deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/820389 too? 
(MediaViewer unused cleanup) [14:06:39] if you happen to be around [14:07:00] (03CR) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [14:07:04] (03PS2) 10Lucas Werkmeister (WMDE): Remove unused $wgMathUseRestBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820454 (https://phabricator.wikimedia.org/T274436) [14:07:20] in the meantime I’ll do the MathUseRestBase cleanup [14:08:45] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove unused $wgMathUseRestBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820454 (https://phabricator.wikimedia.org/T274436) (owner: 10Lucas Werkmeister (WMDE)) [14:09:57] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM, no references to this variable outside of IS-labs.php on deploy1002:/srv/mediawiki-staging." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820389 (https://phabricator.wikimedia.org/T310890) (owner: 10Phuedx) [14:10:04] (03Merged) 10jenkins-bot: Remove unused $wgMathUseRestBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820454 (https://phabricator.wikimedia.org/T274436) (owner: 10Lucas Werkmeister (WMDE)) [14:11:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:12:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:12:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:13:10] (03PS1) 10Jforrester: Wikifunctions: Drop two config items moved to docker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820459 [14:13:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:13:34] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:820454|Remove unused $wgMathUseRestBase (T274436)]] (duration: 03m 01s) [14:13:36] T274436: Enable 
RESTbaseless validation in wikibase - https://phabricator.wikimedia.org/T274436 [14:14:23] I think I’ll stop there for now :) [14:14:40] !log UTC afternoon backport+config window done [14:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:12] (03PS1) 10Samtar: DefaultConfig: add a (humorous) deployment message [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/820460 [14:16:01] (03CR) 10CI reject: [V: 04-1] DefaultConfig: add a (humorous) deployment message [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/820460 (owner: 10Samtar) [14:16:19] TheresNoTime: Computer says no. [14:16:29] noooooooo [14:17:07] There was a critical error during execution of Flake8: plugin code for `flake8-logging-format[logging-format]` does not match ^[A-Z]{1,3}[0-9]{0,3}$ [14:17:18] unrelated CI error? :> [14:17:58] TheresNoTime: lmao [14:18:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:18:34] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_uwsgi-striker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:34] computer says NO [14:18:54] 😭 [14:19:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:19:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:20:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:20:15] looks like we need https://github.com/globality-corp/flake8-logging-format/commit/f3cdb24468241ebe85e41b0bd2e8958c76b4dec6 [14:20:30] I guess flake8 got stricter about requirements for its plugins [14:21:21] !log shutdown ms-be20[58,64].codfw.wmnet for PDU swap T310145 [14:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:23] T310145: (Need By:TBD) rack/setup/install row C new PDUs - 
https://phabricator.wikimedia.org/T310145 [14:21:26] https://pypi.org/project/flake8-logging-format/#history doesn’t show a published version that could include this fix though :< [14:22:41] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on logstash2035.codfw.wmnet with reason: pdu [14:22:52] !log poweroff logstash2035 - T310145 [14:22:55] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on logstash2035.codfw.wmnet with reason: pdu [14:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:57] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2032.codfw.wmnet with reason: T310145 [14:23:11] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2032.codfw.wmnet with reason: T310145 [14:24:28] filed https://phabricator.wikimedia.org/T314576 [14:24:43] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 4:30:00 on gitlab-runner2003.codfw.wmnet with reason: PDU swap [14:24:47] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2001.codfw.wmnet with reason: T310145 [14:25:00] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2001.codfw.wmnet with reason: T310145 [14:25:07] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on gitlab-runner2003.codfw.wmnet with reason: PDU swap [14:25:16] !log power off gitlab-runner2003 [14:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:32] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:47] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10bking) [14:30:56] 10SRE, 10LDAP-Access-Requests: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10Siko_WMDE) Hi @Dzahn, I already created a user on Wikitech wiki, the name is: Siko_WMDE Thank you and best regards, Simon [14:31:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:31:52] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2011.codfw.wmnet with reason: T310145 [14:31:57] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 [14:32:18] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2011.codfw.wmnet with reason: T310145 [14:32:24] (03PS1) 10Samtar: requirements.txt: Pin flake8 to v4.0.1 [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/820462 (https://phabricator.wikimedia.org/T314576) [14:32:42] heh [14:32:53] TheresNoTime: Might be worth filing a task about doing that more widely... [14:33:41] I think taavi mentioned the flake8 updates end of last month caused quite a few issues [14:33:53] * TheresNoTime assumes there must already be a task.. 
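[note] The Flake8 failure quoted above at 14:17:07 says the plugin code `logging-format` does not match `^[A-Z]{1,3}[0-9]{0,3}$`. The following is a minimal sketch of that check, not Flake8's actual implementation. The pattern is copied from the error message; the helper name `is_valid_plugin_code` and the sample codes other than `logging-format` are illustrative assumptions.

```python
import re

# Pattern copied verbatim from the Flake8 error message in the log.
PLUGIN_CODE_RE = re.compile(r"^[A-Z]{1,3}[0-9]{0,3}$")


def is_valid_plugin_code(code: str) -> bool:
    """Return True if `code` has the shape the error message asks for:
    one to three uppercase letters followed by up to three digits.
    (Hypothetical helper, not part of Flake8.)"""
    return PLUGIN_CODE_RE.match(code) is not None


if __name__ == "__main__":
    # "logging-format" is the entry point name from the log; the other
    # values are made-up samples to show what passes and what fails.
    for code in ("logging-format", "G", "G001", "ABCD1"):
        print(f"{code!r}: {is_valid_plugin_code(code)}")
```

Under this pattern a lowercase, hyphenated name such as `logging-format` is rejected, while a short uppercase prefix such as `G` or `G001` is accepted. This fits the observation in the channel that Flake8 got stricter about its plugins and that pinning to an older release sidesteps the check.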
[14:35:09] (03CR) 10CDanis: [C: 03+1] requestctl: Add a reminder to "requestctl commit" after enable/disable [software/conftool] - 10https://gerrit.wikimedia.org/r/817351 (https://phabricator.wikimedia.org/T305580) (owner: 10RLazarus) [14:35:17] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2016.codfw.wmnet [14:35:22] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2020.codfw.wmnet [14:35:28] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2025.codfw.wmnet [14:35:49] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2007.codfw.wmnet [14:36:26] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2009.codfw.wmnet [14:37:15] (03PS1) 10Ayounsi: Netbox: add hourly postgres backups [puppet] - 10https://gerrit.wikimedia.org/r/820463 (https://phabricator.wikimedia.org/T262677) [14:37:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:38:38] Lucas_WMDE: I wasn't around :) I've scheduled that patch for deployment next week. I'd deploy it myself in the meantime but I haven't regenerated my keys yet [14:38:45] Thanks for the ping though [14:38:48] phuedx: ok, sounds good! 
[14:39:00] I already gave it a +1 ^^ [14:40:32] (03PS6) 10Eigyan: [config]: Add click event logging for mobile and desktop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812391 (https://phabricator.wikimedia.org/T310852) [14:40:40] PROBLEM - Host elastic2082.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:40:45] logged T314577 ^^ [14:40:45] T314577: Flake8 5.0.0 release breaking CI jobs - https://phabricator.wikimedia.org/T314577 [14:41:12] TheresNoTime: they not already a task? [14:41:19] i thought 5.0.2 fixed it [14:41:34] PROBLEM - Host elastic2081.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:41:36] PROBLEM - Host wdqs2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:41:48] RhinosF1: not that I could find, and I guess I should rename that to `5.0.0+` [14:42:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:42:56] TheresNoTime: 5.0.0 created too many issues. 
I know a lot got fixed by one of .1 .2 or .3 [14:43:14] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/36624/" [puppet] - 10https://gerrit.wikimedia.org/r/820463 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi) [14:43:34] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:43:56] the failing CI build installed flake8==5.0.4 so I guess it’s still broken in that version [14:43:58] PROBLEM - Host elastic2065.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:43:58] PROBLEM - Host elastic2066.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:44:38] PROBLEM - Host mc2047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:44:38] PROBLEM - Host mc2048.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:44:47] Lucas_WMDE: we're on .4 now? 
[14:44:58] PROBLEM - Host logstash2035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:03] apparently yes [14:45:12] PROBLEM - Host ms-backup2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:15] (03PS2) 10Samtar: DefaultConfig: add a (humorous) deployment message [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/820460 [14:45:42] PROBLEM - Host ms-be2058.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:50] PROBLEM - Host ms-be2064.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:46:40] seeee it is a good change (if CI gets fixed) :D [14:47:21] :D [14:47:53] TheresNoTime: ci went v+2, i can give you a +1 because i have a working mouse [14:48:03] (03CR) 10RhinosF1: [C: 03+1] DefaultConfig: add a (humorous) deployment message [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/820460 (owner: 10Samtar) [14:48:10] PROBLEM - Host backup2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:48:12] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:48:27] Lucas_WMDE: i'm waiting to bother bumping my projects until it settles [14:49:19] getting that merged won't count for https://twitter.com/TheresNoTimeFor/status/1534271845641469965 though :( [14:49:27] 10SRE, 10conftool: Annotate X-Analytics header with any matching actions - https://phabricator.wikimedia.org/T305582 (10CDanis) [14:49:29] I’m sure the flake8 maintainers have already copiously advised everyone to pin their dependencies and use pip-tools etc. 
[14:49:37] or is it only the Pallets folks that like to do that ^^ [14:49:54] TheresNoTime: your fault for being so specific with “MediaWiki core” [14:50:00] (03PS1) 10Andrew Bogott: Remove rabbitmq profile from cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/820465 (https://phabricator.wikimedia.org/T314522) [14:52:14] RECOVERY - Host ms-be2064.mgmt is UP: PING WARNING - Packet loss = 71%, RTA = 42.71 ms [14:54:09] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/pcc-worker1003/36625/cloudcontrol1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/820465 (https://phabricator.wikimedia.org/T314522) (owner: 10Andrew Bogott) [14:54:32] RECOVERY - Host backup2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.65 ms [14:55:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:56:07] !log draining codfw-ulsfo link - T310310 [14:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:22] RECOVERY - Host ms-be2058.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [14:56:22] RECOVERY - Host ms-backup2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [14:56:27] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mc[2030-2031].codfw.wmnet with reason: PDU swap [14:56:42] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc[2030-2031].codfw.wmnet with reason: PDU swap [14:56:48] RECOVERY - Host elastic2065.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [14:56:48] RECOVERY - Host elastic2066.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.60 ms [14:57:28] RECOVERY - Host mc2047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.51 ms [14:57:29] RECOVERY - Host 
mc2048.mgmt is UP: PING OK - Packet loss = 0%, RTA = 82.31 ms [14:57:48] RECOVERY - Host logstash2035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [14:58:09] !log power off mc20[30-31] [14:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:46] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:00:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:00:58] RECOVERY - Host elastic2082.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [15:00:58] RECOVERY - Host elastic2081.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.03 ms [15:01:00] RECOVERY - Host wdqs2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.83 ms [15:01:02] (03PS1) 10Milimetric: role::common::aqs: update mw history [puppet] - 10https://gerrit.wikimedia.org/r/820468 [15:01:25] (03CR) 10Andrew Bogott: [C: 03+2] Remove rabbitmq profile from cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/820465 (https://phabricator.wikimedia.org/T314522) (owner: 10Andrew Bogott) [15:01:54] (03CR) 10Btullis: [C: 03+2] role::common::aqs: update mw history [puppet] - 10https://gerrit.wikimedia.org/r/820468 (owner: 10Milimetric) [15:05:20] !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[15:05:24] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [15:06:49] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on logstash2002.codfw.wmnet with reason: pdu [15:07:03] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on logstash2002.codfw.wmnet with reason: pdu [15:07:38] <_joe_> !log powering down mc203{0,1} [15:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:52] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1048 site=eqiad tunnel=mc2030_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:09:22] !log poweroff logstash2002 - T310145 [15:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:25] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 [15:10:11] (03CR) 10Ahmon Dancy: scap: do not restart php on the mwmaint servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819631 (owner: 10Giuseppe Lavagetto) [15:11:12] RECOVERY - IPMI Sensor Status on kafka-main2002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:11:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool hosts for PDU maint (T310145)', diff saved to https://phabricator.wikimedia.org/P32284 and previous config saved to /var/cache/conftool/dbconfig/20220804-151121-ladsgroup.json [15:12:48] !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be[2058,2064].codfw.wmnet [15:12:48] !log mvernon@cumin1001 END (PASS) - Cookbook 
sre.hosts.remove-downtime (exit_code=0) for ms-be[2058,2064].codfw.wmnet [15:13:24] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp203[12]\.codfw\.wmnet,service=ats-tls [15:13:30] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp203[12]\.codfw\.wmnet,service=ats-be [15:13:36] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp203[12]\.codfw\.wmnet,service=varnish-fe [15:13:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db[2114,2126,2166].codfw.wmnet with reason: Maintenance (T310145) [15:13:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2114,2126,2166].codfw.wmnet with reason: Maintenance (T310145) [15:14:02] PROBLEM - Host db2126.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:15:10] PROBLEM - Host db2102.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:15:12] PROBLEM - Host db2114.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:15:22] PROBLEM - Host db2165.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:15:22] PROBLEM - Host db2166.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:16:12] PROBLEM - Host parse2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:16:12] PROBLEM - Host parse2013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:16:16] I was too late to shut it down but it's fine, it's depooled and downtimed [15:16:25] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on restbase[2016,2020,2025].codfw.wmnet with reason: PDU maintenance [15:16:40] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on restbase[2016,2020,2025].codfw.wmnet with reason: PDU maintenance [15:16:46] PROBLEM - Host logstash2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:16:55] !log btullis@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's 
nodejs daemons. [15:19:32] PROBLEM - Host wdqs2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:19:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool C6 for PDU maint (T310145)', diff saved to https://phabricator.wikimedia.org/P32285 and previous config saved to /var/cache/conftool/dbconfig/20220804-151958-ladsgroup.json [15:20:02] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 [15:20:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db[2116,2127,2167-2168].codfw.wmnet,es2022.codfw.wmnet with reason: Maintenance (T310145) [15:21:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2116,2127,2167-2168].codfw.wmnet,es2022.codfw.wmnet with reason: Maintenance (T310145) [15:21:35] !log un-drain codfw-ulsfo link - T310310 [15:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:49] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp203[78]\.codfw\.wmnet,service=ats-tls [15:23:56] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp203[78]\.codfw\.wmnet,service=ats-be [15:24:04] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp203[78]\.codfw\.wmnet,service=varnish-fe [15:24:30] PROBLEM - Host restbase2016.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:24:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on cp[2037-2038].codfw.wmnet with reason: shutdown for PDU upgrade [15:25:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp[2037-2038].codfw.wmnet with reason: shutdown for PDU upgrade [15:25:20] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:30:00 on phab2001.codfw.wmnet with reason: PDU swap [15:25:28] !log power off phab2001 [15:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:25:34] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:30:00 on phab2001.codfw.wmnet with reason: PDU swap [15:26:32] PROBLEM - Host mc2030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:27:09] !log power off cp2037,cp2038: PDU upgrade [15:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:54] PROBLEM - Host restbase2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:27:58] PROBLEM - Host elastic2033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:27:59] PROBLEM - Host elastic2032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:28:48] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [15:28:54] PROBLEM - Host parse2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:29:04] PROBLEM - Host mc2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:29:20] PROBLEM - Host gitlab-runner2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:30:34] PROBLEM - Host restbase2025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:32:06] PROBLEM - Host ores2006 is DOWN: PING CRITICAL - Packet loss = 100% [15:32:12] PROBLEM - PyBal backends health check on lvs2008 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:32:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:rack/setup/install new machine learning hosts - https://phabricator.wikimedia.org/T314587 (10RobH) [15:32:56] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers 
phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:34:15] (03CR) 10Cwhite: [C: 03+2] service::docker: Add SyslogIdentifier to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/820237 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [15:34:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:rack/setup/install new machine learning hosts - https://phabricator.wikimedia.org/T314587 (10RobH) [15:34:58] PROBLEM - Host ores2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:35:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:rack/setup/install new machine learning hosts - https://phabricator.wikimedia.org/T314587 (10BTullis) [15:35:44] PROBLEM - Host phab2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:36:22] PROBLEM - Host ml-serve2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:36:28] RECOVERY - IPMI Sensor Status on ml-serve2006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:36:30] PROBLEM - Host ps1-c5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:37:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:rack/setup/install new machine learning hosts - https://phabricator.wikimedia.org/T314587 (10RobH) [15:37:06] <_joe_> !log uncordoning ml-serve200{1,6} [15:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:rack/setup/install new machine learning hosts - https://phabricator.wikimedia.org/T314587 (10RobH) a:03Jclark-ctr [15:38:58] PROBLEM - Juniper alarms on asw-c-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:40:19] (03CR) 10Cwhite: [C: 03+2] striker: route syslog output to ELK cluster via kafka [puppet] - 
10https://gerrit.wikimedia.org/r/820238 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [15:41:44] PROBLEM - Host ganeti2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:44] PROBLEM - Host ganeti2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:42:08] PROBLEM - IPMI Sensor Status on ganeti2012 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:44:33] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10ssingh) [15:45:04] (03PS1) 10Ahmon Dancy: Update the known host key for gerrit2002.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/820474 (https://phabricator.wikimedia.org/T243027) [15:46:00] RECOVERY - Juniper alarms on asw-c-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:47:24] RECOVERY - Host elastic2033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.51 ms [15:47:24] RECOVERY - Host elastic2032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 44.62 ms [15:48:06] RECOVERY - Host ganeti2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.65 ms [15:48:06] RECOVERY - Host ganeti2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.50 ms [15:48:20] RECOVERY - Host parse2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 44.89 ms [15:48:38] RECOVERY - Host phab2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms [15:48:46] RECOVERY - Host gitlab-runner2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 47.35 ms [15:48:48] RECOVERY - Host parse2011 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [15:49:12] PROBLEM - Host wdqs2008 is DOWN: PING CRITICAL - Packet loss = 100% [15:49:18] RECOVERY - Host ml-serve2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms 
[15:49:23] (03PS1) 10BCornwall: Revert "Revert "geodns: Map out African countries by DC latency"" [dns] - 10https://gerrit.wikimedia.org/r/820486 [15:49:58] RECOVERY - Host restbase2025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [15:50:42] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2048.codfw.wmnet with reason: T310145 [15:50:46] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 [15:50:56] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2048.codfw.wmnet with reason: T310145 [15:50:58] RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:51:08] RECOVERY - Host db2165.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.80 ms [15:51:59] RECOVERY - Host wdqs2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.44 ms [15:52:30] (03PS2) 10Ahmon Dancy: Update the known host key for gerrit2002.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/820474 (https://phabricator.wikimedia.org/T243027) [15:52:32] RECOVERY - Host mc2030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms [15:52:36] RECOVERY - Host parse2012 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms [15:52:42] RECOVERY - Host parse2013 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [15:53:52] RECOVERY - Host restbase2020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 46.31 ms [15:54:14] RECOVERY - Host db2102.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.47 ms [15:54:16] RECOVERY - Host parse2013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.81 ms [15:54:20] RECOVERY - Host db2114.mgmt is UP: PING WARNING - Packet loss = 66%, RTA = 33.74 ms [15:54:24] RECOVERY - Host ores2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 40.39 ms [15:55:10] RECOVERY - Host 
parse2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [15:55:40] RECOVERY - Host logstash2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.98 ms [15:56:48] RECOVERY - Host restbase2016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [15:57:19] PROBLEM - Host kubetcd2005 is DOWN: PING CRITICAL - Packet loss = 100% [15:58:16] PROBLEM - Host ganeti2012 is DOWN: PING CRITICAL - Packet loss = 100% [15:58:18] PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:58:22] PROBLEM - Host build2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:59:32] RECOVERY - Host db2126.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.80 ms [16:00:00] RECOVERY - Host db2166.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.49 ms [16:00:04] RECOVERY - Host ganeti2012 is UP: PING OK - Packet loss = 0%, RTA = 33.26 ms [16:00:05] jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T1600). Please do the needful. [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:22] RECOVERY - Host ores2006 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [16:00:28] RECOVERY - Host mc2031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms [16:00:30] RECOVERY - Host kubetcd2005 is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms [16:00:36] RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 33.36 ms [16:00:52] jbond/rzl: Can you process https://gerrit.wikimedia.org/r/c/operations/puppet/+/820474 ? 
[16:00:54] PROBLEM - Check systemd state on ganeti2012 is CRITICAL: CRITICAL - degraded: The following units failed: nic-saturation-exporter.service,prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:02] RECOVERY - Host build2001 is UP: PING OK - Packet loss = 0%, RTA = 33.40 ms [16:01:44] RECOVERY - Host db2126 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [16:02:02] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [16:02:07] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [16:02:54] !log ebysans@deploy1002 Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] [16:02:58] RECOVERY - Check systemd state on ganeti2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:58] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2036,2049,2054].codfw.wmnet,thanos-be2003.codfw.wmnet with reason: PDU work [16:03:14] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2036,2049,2054].codfw.wmnet,thanos-be2003.codfw.wmnet with reason: PDU work [16:03:21] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10ops-monitoring-bot) Icinga downtime and Alertmanager 
silence (ID=0f30d2ec-1037-4449-b903-79ae6c2ccede) set by mvernon@cumin1001 for 1 day, 0:00:00 on 4 host(s) and... [16:03:30] PROBLEM - Host db2095.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:03:32] PROBLEM - Host db2115.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:04:10] PROBLEM - IPMI Sensor Status on ores2006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:04:12] PROBLEM - Host es2022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:04:28] PROBLEM - Cassandra instance data free space on restbase1016 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7103 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [16:05:01] (03PS5) 10Xcollazo: airflow - Configure new platform_eng instance and rename old one as legacy. [puppet] - 10https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858) [16:05:21] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10bking) [16:06:09] (KubernetesRsyslogDown) firing: (3) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:06:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:06:22] PROBLEM - Host db2127.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:06:29] !log shutdown ms-be20[39,49,54].codfw.wmnet,thanos-be2003 for PDU swap T310145 [16:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:32] T310145: (Need 
By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 [16:06:32] PROBLEM - Host mw2361.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:06:32] PROBLEM - Host mw2360.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:06:32] PROBLEM - Host mw2362.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:06:32] PROBLEM - Host mw2356.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:06:32] PROBLEM - Host mw2359.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:06:33] PROBLEM - Host mw2363.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:06:49] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [16:06:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:06:49] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [16:06:58] PROBLEM - Host db2167.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:06:58] PROBLEM - Host mw2350.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:06:58] PROBLEM - Host mw2351.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:06:58] PROBLEM - Host mw2352.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:06:58] PROBLEM - Host mw2353.mgmt is DOWN: PING CRITICAL - 
Packet loss = 100% [16:06:59] PROBLEM - Host mw2354.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:06:59] PROBLEM - Host mw2355.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:07:00] PROBLEM - Host mw2357.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:07:00] PROBLEM - Host mw2358.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:07:02] PROBLEM - Host ml-serve-ctrl2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:10] PROBLEM - Host webperf2003 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:16] PROBLEM - Host ganeti2014 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:18] PROBLEM - Host wdqs2008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:07:26] PROBLEM - Host dragonfly-supernode2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:40] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:07:46] PROBLEM - Host db2135.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:08:02] PROBLEM - Host dbproxy2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:08:12] RECOVERY - Host ganeti2014 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms [16:08:14] PROBLEM - Host mw2364.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:08:14] PROBLEM - Host mw2365.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:08:33] (03CR) 10Jbond: [C: 03+2] Update the known host key for gerrit2002.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/820474 (https://phabricator.wikimedia.org/T243027) (owner: 10Ahmon Dancy) [16:08:48] ACKNOWLEDGEMENT - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating Andrew Bogott side-effect of rabbitmq work https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:08:56] PROBLEM - Host db2099.mgmt is DOWN: PING CRITICAL 
- Packet loss = 100% [16:09:02] PROBLEM - Host db2116.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:09:16] PROBLEM - Host db2168.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:09:26] PROBLEM - Juniper alarms on asw-c-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:09:36] PROBLEM - Host db2179.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:09:36] PROBLEM - Host db2180.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:09:56] PROBLEM - Host ganeti2014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:09:58] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:10:02] PROBLEM - Host ganeti2013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:10:04] PROBLEM - Host parse2014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:10:06] PROBLEM - Host parse2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:10:38] RECOVERY - Host webperf2003 is UP: PING OK - Packet loss = 0%, RTA = 34.79 ms [16:10:58] RECOVERY - Host ml-serve-ctrl2001 is UP: PING OK - Packet loss = 0%, RTA = 35.62 ms [16:10:58] (KubernetesCalicoDown) firing: ml-serve-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:11:09] (03CR) 10BCornwall: [C: 03+2] Revert "Revert "geodns: Map out African countries by DC latency"" [dns] - 10https://gerrit.wikimedia.org/r/820486 (owner: 10BCornwall) [16:11:36] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:11:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in 
codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:12:12] !log poweroff logstash2028 - T310145 [16:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:16] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 [16:12:34] RECOVERY - IPMI Sensor Status on ganeti2012 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:12:40] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:14:40] PROBLEM - Host logstash2028 is DOWN: PING CRITICAL - Packet loss = 100% [16:15:25] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Start depooling codfw replicas [software] - 10https://gerrit.wikimedia.org/r/820142 (https://phabricator.wikimedia.org/T314486) (owner: 10Ladsgroup) [16:15:45] (JobUnavailable) firing: (5) Reduced availability for job dragonfly_supernode in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:15:58] (KubernetesCalicoDown) resolved: ml-serve-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:15:58] (KubernetesRsyslogDown) firing: (3) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:16:56] (Merged) jenkins-bot: auto_schema: Start depooling codfw replicas [software] - https://gerrit.wikimedia.org/r/820142 (https://phabricator.wikimedia.org/T314486) (owner: Ladsgroup) [16:17:55] !log deploying authdns - geodns: Map out African countries by DC latency (T311472) [16:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:59] T311472: DRMRS: Geodns Configuration -- Phase 2 - https://phabricator.wikimedia.org/T311472 [16:18:44] RECOVERY - k8s requests count to the API on ml-serve-ctrl2001 is OK: (C)100 ge (W)50 ge 1.192 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [16:19:10] RECOVERY - Host db2135.mgmt is UP: PING OK - Packet loss = 0%, RTA = 363.55 ms [16:19:28] RECOVERY - Host dbproxy2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [16:19:30] RECOVERY - Host wdqs2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.87 ms [16:19:59] Thanks jbond! 
[16:20:10] RECOVERY - Host ganeti2014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.67 ms [16:20:24] RECOVERY - Host db2095.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.58 ms [16:20:26] RECOVERY - Host db2115.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms [16:20:26] RECOVERY - Host db2099.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.30 ms [16:20:32] RECOVERY - Host db2116.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [16:20:34] jouncebot: nowandnext [16:20:34] For the next 0 hour(s) and 39 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T1600) [16:20:34] In 0 hour(s) and 39 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T1700) [16:20:58] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:21:06] RECOVERY - Host es2022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [16:21:34] RECOVERY - Host ganeti2013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.94 ms [16:22:01] (PS2) Ladsgroup: Start reading from new templatelinks columns in commons [mediawiki-config] - https://gerrit.wikimedia.org/r/820376 (https://phabricator.wikimedia.org/T306673) [16:22:05] (CR) Ladsgroup: [C: +2] Start reading from new templatelinks columns in commons [mediawiki-config] - https://gerrit.wikimedia.org/r/820376 (https://phabricator.wikimedia.org/T306673) (owner: Ladsgroup) [16:22:06] RECOVERY - Juniper alarms on asw-c-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:22:59] (Merged) jenkins-bot: Start reading from new templatelinks columns in commons [mediawiki-config] - 
https://gerrit.wikimedia.org/r/820376 (https://phabricator.wikimedia.org/T306673) (owner: Ladsgroup) [16:23:44] RECOVERY - Host mw2350 is UP: PING OK - Packet loss = 0%, RTA = 33.12 ms [16:23:46] RECOVERY - Host mw2353 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [16:23:48] RECOVERY - Host mw2354 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [16:23:52] RECOVERY - Host mw2351 is UP: PING OK - Packet loss = 0%, RTA = 33.65 ms [16:24:00] RECOVERY - Host mw2352 is UP: PING OK - Packet loss = 0%, RTA = 33.25 ms [16:24:00] RECOVERY - Host mw2355 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [16:24:04] RECOVERY - Host mw2356.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.80 ms [16:24:22] RECOVERY - Host wdqs2008 is UP: PING OK - Packet loss = 0%, RTA = 33.47 ms [16:24:30] RECOVERY - Host mw2351.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [16:24:30] RECOVERY - Host mw2350.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [16:24:30] RECOVERY - Host mw2352.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [16:24:30] RECOVERY - Host mw2353.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.62 ms [16:24:30] RECOVERY - Host mw2355.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.64 ms [16:24:30] RECOVERY - Host mw2354.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [16:24:38] PROBLEM - Query Service HTTP Port on wdqs2008 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service [16:25:42] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:25:58] PROBLEM - WDQS SPARQL on wdqs2008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 244 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:26:14] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:26:44] RECOVERY - Query Service HTTP Port on wdqs2008 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [16:26:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:27:44] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2008 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:27:46] RECOVERY - Host parse2014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.22 ms [16:27:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:27:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:28:00] RECOVERY - WDQS SPARQL on wdqs2008 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.231 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:28:20] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2008 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:28:39] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820376|Start reading from new templatelinks columns in commons (T306673)]] 
(duration: 03m 00s) [16:28:42] T306673: Turn on read new for templatelinks on beta and production - https://phabricator.wikimedia.org/T306673 [16:28:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:29:55] RECOVERY - Host mw2357 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms [16:29:57] RECOVERY - Host mw2356 is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms [16:29:57] RECOVERY - Host mw2360 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [16:29:59] RECOVERY - Host mw2362 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms [16:29:59] RECOVERY - Host mw2361 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [16:30:01] RECOVERY - Host mw2363 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms [16:30:07] RECOVERY - Host mw2358 is UP: PING OK - Packet loss = 0%, RTA = 33.29 ms [16:30:07] RECOVERY - Host mw2357.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [16:30:09] RECOVERY - Host mw2365 is UP: PING OK - Packet loss = 0%, RTA = 33.37 ms [16:30:11] RECOVERY - Host mw2358.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [16:30:11] RECOVERY - Host mw2359 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [16:30:13] RECOVERY - Host mw2364 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [16:30:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool D3 for PDU maint', diff saved to https://phabricator.wikimedia.org/P32286 and previous config saved to /var/cache/conftool/dbconfig/20220804-163037-ladsgroup.json [16:30:39] D3: test - ignore - https://phabricator.wikimedia.org/D3 [16:30:45] RECOVERY - Host db2127.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [16:30:53] RECOVERY - Host mw2359.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [16:30:53] RECOVERY - Host mw2360.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [16:30:53] RECOVERY - Host mw2361.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [16:30:53] RECOVERY - Host mw2362.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.91 
ms [16:30:53] RECOVERY - Host mw2363.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.26 ms [16:31:11] RECOVERY - Host mw2364.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [16:31:11] RECOVERY - Host mw2365.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [16:31:47] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:32:13] RECOVERY - Host db2168.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [16:32:25] RECOVERY - Host parse2015 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [16:32:25] RECOVERY - Host parse2014 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [16:32:33] RECOVERY - Host db2179.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.79 ms [16:32:33] RECOVERY - Host db2180.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.33 ms [16:32:54] !log ebysans@deploy1002 Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 29m 59s) [16:32:59] RECOVERY - Host parse2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms