[00:03:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2028.mgmt.codfw.wmnet with reboot policy FORCED [00:05:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2029.mgmt.codfw.wmnet with reboot policy FORCED [00:05:33] (DatasourceError) firing: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [00:06:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P51964 and previous config saved to /var/cache/conftool/dbconfig/20230830-000602-ladsgroup.json [00:08:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2030.mgmt.codfw.wmnet with reboot policy FORCED [00:10:33] (DatasourceError) resolved: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [00:10:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2031.mgmt.codfw.wmnet with reboot policy FORCED [00:10:39] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [00:12:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2032.mgmt.codfw.wmnet with reboot policy FORCED [00:14:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2028.mgmt.codfw.wmnet with reboot policy FORCED [00:15:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2033.mgmt.codfw.wmnet with reboot policy FORCED [00:17:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2029.mgmt.codfw.wmnet with reboot policy FORCED [00:19:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2030.mgmt.codfw.wmnet with reboot policy FORCED [00:21:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db2172 (T343718)', diff saved to https://phabricator.wikimedia.org/P51965 and previous config saved to /var/cache/conftool/dbconfig/20230830-002108-ladsgroup.json [00:21:18] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [00:21:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2031.mgmt.codfw.wmnet with reboot policy FORCED [00:25:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2032.mgmt.codfw.wmnet with reboot policy FORCED [00:26:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2033.mgmt.codfw.wmnet with reboot policy FORCED [00:29:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2034.mgmt.codfw.wmnet with reboot policy FORCED [00:29:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2035.mgmt.codfw.wmnet with reboot policy FORCED [00:29:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2036.mgmt.codfw.wmnet with reboot policy FORCED [00:29:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2037.mgmt.codfw.wmnet with reboot policy FORCED [00:30:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2038.mgmt.codfw.wmnet with reboot policy FORCED [00:30:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2039.mgmt.codfw.wmnet with reboot policy FORCED [00:30:39] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [00:35:19] RECOVERY - Check systemd state on dbstore1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:19] (PS1) TrainBranchBot: Branch commit 
for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/952985 [00:38:22] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/952985 (owner: TrainBranchBot) [00:41:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2036.mgmt.codfw.wmnet with reboot policy FORCED [00:41:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2038.mgmt.codfw.wmnet with reboot policy FORCED [00:41:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2039.mgmt.codfw.wmnet with reboot policy FORCED [00:41:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2035.mgmt.codfw.wmnet with reboot policy FORCED [00:41:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2037.mgmt.codfw.wmnet with reboot policy FORCED [00:44:51] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: adds-changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:53:13] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/952985 (owner: TrainBranchBot) [00:54:22] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Jhancock.wm) [01:00:05] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2034.mgmt.codfw.wmnet with reboot policy FORCED [01:00:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2034.mgmt.codfw.wmnet with reboot policy FORCED [01:01:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1199 (T343718)', diff saved to https://phabricator.wikimedia.org/P51966 and previous config saved to /var/cache/conftool/dbconfig/20230830-010144-ladsgroup.json [01:01:50] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [01:09:50] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2034.mgmt.codfw.wmnet with reboot policy FORCED [01:16:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P51967 and previous config saved to /var/cache/conftool/dbconfig/20230830-011650-ladsgroup.json [01:31:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [01:31:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P51968 and previous config saved to /var/cache/conftool/dbconfig/20230830-013156-ladsgroup.json [01:37:46] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [01:38:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:47:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T343718)', diff saved to https://phabricator.wikimedia.org/P51969 and previous config saved to /var/cache/conftool/dbconfig/20230830-014702-ladsgroup.json [01:47:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 
on db1221.eqiad.wmnet with reason: Maintenance [01:47:09] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [01:47:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [01:47:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [01:47:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [01:47:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1221 (T343718)', diff saved to https://phabricator.wikimedia.org/P51970 and previous config saved to /var/cache/conftool/dbconfig/20230830-014730-ladsgroup.json [01:48:56] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:01:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:01:20] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:08:56] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:58] 
(JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:48:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:53:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:30:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T343718)', diff saved to https://phabricator.wikimedia.org/P51971 and previous config saved to /var/cache/conftool/dbconfig/20230830-043024-ladsgroup.json [04:30:30] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [04:33:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:38:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:45:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to 
https://phabricator.wikimedia.org/P51972 and previous config saved to /var/cache/conftool/dbconfig/20230830-044530-ladsgroup.json [04:48:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:53:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:00:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P51973 and previous config saved to /var/cache/conftool/dbconfig/20230830-050036-ladsgroup.json [05:15:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T343718)', diff saved to https://phabricator.wikimedia.org/P51974 and previous config saved to /var/cache/conftool/dbconfig/20230830-051543-ladsgroup.json [05:15:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [05:15:49] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [05:15:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [05:18:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:23:38] (KubernetesAPILatency) resolved: High 
Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:37:46] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [05:40:41] SRE, Infrastructure-Foundations, netops: Configure bgp-error-tolerance on Juniper routers - https://phabricator.wikimedia.org/T340111 (akosiaris) FYI, same mitigation applies to https://supportportal.juniper.net/s/article/2023-08-29-Out-of-Cycle-Security-Bulletin-Junos-OS-and-Junos-OS-Evolved-A-craft... [05:42:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1165 upgrade to mariadb 10.6', diff saved to https://phabricator.wikimedia.org/P51975 and previous config saved to /var/cache/conftool/dbconfig/20230830-054248-root.json [05:43:26] (PS1) Marostegui: db1165: Upgrade to 10.6 [puppet] - https://gerrit.wikimedia.org/r/953361 [05:44:40] (CR) Marostegui: [C: +1] mariadb::packages_wmf: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/952881 (owner: Muehlenhoff) [05:44:52] (PS3) Marostegui: mariadb: Upgrade db1225 to mariadb 10.6 (and generate 10.6 backups) [puppet] - https://gerrit.wikimedia.org/r/942652 (https://phabricator.wikimedia.org/T334650) (owner: Jcrespo) [05:44:56] (CR) Marostegui: [C: +2] db1165: Upgrade to 10.6 [puppet] - https://gerrit.wikimedia.org/r/953361 (owner: Marostegui) [05:48:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:49:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 1%: Repooling after onsite upgrade', diff saved to https://phabricator.wikimedia.org/P51976 and previous config saved to /var/cache/conftool/dbconfig/20230830-054940-root.json [05:50:26] (PS1) Marostegui: db1173: Upgrade to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/953508 [05:50:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1173 upgrade to mariadb 10.6', diff saved to https://phabricator.wikimedia.org/P51977 and previous config saved to /var/cache/conftool/dbconfig/20230830-055034-root.json [05:51:28] (CR) Marostegui: [C: +2] db1173: Upgrade to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/953508 (owner: Marostegui) [05:52:13] (CR) Marostegui: [C: +2] mariadb: Upgrade db1225 to mariadb 10.6 (and generate 10.6 backups) [puppet] - https://gerrit.wikimedia.org/r/942652 (https://phabricator.wikimedia.org/T334650) (owner: Jcrespo) [05:53:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:56:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:57:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P51978 and previous config saved to /var/cache/conftool/dbconfig/20230830-055704-root.json [06:00:04] Deploy window MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230830T0600) [06:01:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:04:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 3%: Repooling after onsite upgrade', diff saved to https://phabricator.wikimedia.org/P51979 and previous config saved to /var/cache/conftool/dbconfig/20230830-060445-root.json [06:07:09] (PS1) Gerrit maintenance bot: mariadb: Promote db1173 to s6 master [puppet] - https://gerrit.wikimedia.org/r/953487 (https://phabricator.wikimedia.org/T345223) [06:07:14] (PS1) Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/953488 (https://phabricator.wikimedia.org/T345223) [06:07:56] (PS1) Ayounsi: cr: set bgp-error-tolerance on all sessions [homer/public] - https://gerrit.wikimedia.org/r/953509 (https://phabricator.wikimedia.org/T340111) [06:11:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:12:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P51980 and previous config saved to /var/cache/conftool/dbconfig/20230830-061209-root.json [06:15:02] (PS1) Ayounsi: sre.network.tls: use fqdn's hostname to store cert in config [cookbooks] - https://gerrit.wikimedia.org/r/953510 (https://phabricator.wikimedia.org/T334594) [06:16:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) 
on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:18:30] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1126.eqiad.wmnet with OS bullseye [06:19:39] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1127.eqiad.wmnet with OS bullseye [06:19:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 5%: Repooling after onsite upgrade', diff saved to https://phabricator.wikimedia.org/P51981 and previous config saved to /var/cache/conftool/dbconfig/20230830-061950-root.json [06:27:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P51982 and previous config saved to /var/cache/conftool/dbconfig/20230830-062714-root.json [06:31:54] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 33 [06:32:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 33 [06:32:21] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1126.eqiad.wmnet with reason: host reimage [06:33:21] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1127.eqiad.wmnet with reason: host reimage [06:33:31] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 33 [06:33:44] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 33 [06:33:49] SRE, Infrastructure-Foundations, netops: xe-3/2/1: down -> Transport: cr1-esams:xe-0/0/7 (Lumen, BDFS2448 80ms 10Gbps wave) {#2013} - https://phabricator.wikimedia.org/T345138 (ops-monitoring-bot) ===== Automated diagnostic for Netbox circuit ID 33 --- **Interface cr1-esams:xe-0/0/7** - admin-status... 
[06:34:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 10%: Repooling after onsite upgrade', diff saved to https://phabricator.wikimedia.org/P51983 and previous config saved to /var/cache/conftool/dbconfig/20230830-063455-root.json [06:35:00] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:35:27] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1126.eqiad.wmnet with reason: host reimage [06:35:48] SRE, Infrastructure-Foundations, netops: xe-3/2/1: down -> Transport: cr1-esams:xe-0/0/7 (Lumen, BDFS2448 80ms 10Gbps wave) {#2013} - https://phabricator.wikimedia.org/T345138 (ayounsi) Open→Resolved a:ayounsi RFO sent by email. [06:37:27] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1127.eqiad.wmnet with reason: host reimage [06:41:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [06:41:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [06:41:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T343718)', diff saved to https://phabricator.wikimedia.org/P51984 and previous config saved to /var/cache/conftool/dbconfig/20230830-064131-ladsgroup.json [06:41:37] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:41:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:42:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [06:42:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [06:42:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P51985 and previous config saved to /var/cache/conftool/dbconfig/20230830-064219-root.json [06:43:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T343718)', diff saved to https://phabricator.wikimedia.org/P51986 and previous config saved to /var/cache/conftool/dbconfig/20230830-064343-ladsgroup.json [06:44:47] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:45:53] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:46:44] (PS6) Elukey: LiftWing: add latency/availability SLO dashboards [grafana-grizzly] - https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: Klausman) [06:47:51] (CR) Elukey: LiftWing: add latency/availability SLO dashboards (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: Klausman) [06:50:00] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: Repooling after onsite upgrade', diff saved to https://phabricator.wikimedia.org/P51987 and previous config saved to /var/cache/conftool/dbconfig/20230830-064959-root.json [06:50:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1020.eqiad.wmnet [06:51:00] (PS1) Alexandros Kosiaris: modules: Add CHANGELOG for MariaDB egress template [deployment-charts] - https://gerrit.wikimedia.org/r/953550 [06:51:02] (PS1) Alexandros Kosiaris: linkrecommendation: TODO [deployment-charts] - https://gerrit.wikimedia.org/r/953551 (https://phabricator.wikimedia.org/T340843) [06:51:47] (PS7) Elukey: LiftWing: add latency/availability SLO dashboards [grafana-grizzly] - https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: Klausman) [06:51:49] (PS2) Alexandros Kosiaris: linkrecommendation: Update vendor modules for T340843 [deployment-charts] - https://gerrit.wikimedia.org/r/953551 (https://phabricator.wikimedia.org/T340843) [06:53:39] (CR) Jcrespo: [C: +1] mariadb::packages_wmf: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/952881 (owner: Muehlenhoff) [06:54:22] (CR) Elukey: LiftWing: add latency/availability SLO dashboards (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: Klausman) [06:56:13] (PS1) Slyngshede: Minor UI tweaks. 
[software/bitu] - https://gerrit.wikimedia.org/r/953552 [06:56:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:57:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1020.eqiad.wmnet [06:57:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P51988 and previous config saved to /var/cache/conftool/dbconfig/20230830-065723-root.json [06:57:29] (PS2) Slyngshede: Minor UI tweaks. [software/bitu] - https://gerrit.wikimedia.org/r/953552 [06:58:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P51989 and previous config saved to /var/cache/conftool/dbconfig/20230830-065849-ladsgroup.json [06:58:54] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1126.eqiad.wmnet with OS bullseye [06:59:01] (CR) Alexandros Kosiaris: [C: +2] modules: Add CHANGELOG for MariaDB egress template [deployment-charts] - https://gerrit.wikimedia.org/r/953550 (owner: Alexandros Kosiaris) [06:59:15] (PS8) Elukey: LiftWing: add latency/availability SLO dashboards [grafana-grizzly] - https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: Klausman) [06:59:29] (Merged) jenkins-bot: modules: Add CHANGELOG for MariaDB egress template [deployment-charts] - https://gerrit.wikimedia.org/r/953550 (owner: Alexandros Kosiaris) [07:00:03] (CR) Elukey: LiftWing: add latency/availability SLO dashboards (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 
Klausman) [07:00:05] Amir1, Urbanecm, and taavi: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230830T0700). [07:00:05] pfischer: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:44] pfischer: can you self-serve? [07:00:52] (CR) Jelto: [C: +2] Revert "trafficserver: switch all miscweb services to codfw cname" [puppet] - https://gerrit.wikimedia.org/r/950825 (owner: Jelto) [07:01:03] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1127.eqiad.wmnet with OS bullseye [07:01:06] Amir1: it’s my first patch. I can’t +2 in the config repo [07:01:28] okay, I deploy it then [07:01:33] (CR) Ladsgroup: [C: +2] Disable search result deduplication. [mediawiki-config] - https://gerrit.wikimedia.org/r/952346 (https://phabricator.wikimedia.org/T341227) (owner: Peter Fischer) [07:01:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:01:49] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/952346 (https://phabricator.wikimedia.org/T341227) (owner: Peter Fischer) [07:01:51] Amir1: thanks. [07:01:57] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2003.codfw.wmnet [07:02:21] (Merged) jenkins-bot: Disable search result deduplication. 
[mediawiki-config] - https://gerrit.wikimedia.org/r/952346 (https://phabricator.wikimedia.org/T341227) (owner: Peter Fischer) [07:03:10] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:952346|Disable search result deduplication. (T341227)]] [07:03:15] T341227: Make local_sites_with_dupe filter configurable and count duplicates - https://phabricator.wikimedia.org/T341227 [07:03:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:03:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1020.eqiad.wmnet [07:04:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1020.eqiad.wmnet [07:04:55] !log ladsgroup@deploy1002 ladsgroup and pfischer: Backport for [[gerrit:952346|Disable search result deduplication. (T341227)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:05:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: Repooling after onsite upgrade', diff saved to https://phabricator.wikimedia.org/P51990 and previous config saved to /var/cache/conftool/dbconfig/20230830-070504-root.json [07:06:00] pfischer: it's live in mwdebug, can you test it there? 
[07:06:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1021.eqiad.wmnet
[07:08:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2003.codfw.wmnet
[07:08:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:09:14] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1128.eqiad.wmnet with OS bullseye
[07:09:27] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1129.eqiad.wmnet with OS bullseye
[07:10:07] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2004.codfw.wmnet
[07:10:11] pfischer: do you know how to use mwdebug? https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_usage
[07:11:27] (PS3) Slyngshede: Minor UI tweaks.
[software/bitu] - 10https://gerrit.wikimedia.org/r/953552 [07:11:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [07:11:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:11:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [07:11:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T343718)', diff saved to https://phabricator.wikimedia.org/P51991 and previous config saved to /var/cache/conftool/dbconfig/20230830-071152-ladsgroup.json [07:12:01] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:12:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P51992 and previous config saved to /var/cache/conftool/dbconfig/20230830-071228-root.json [07:13:27] !log ladsgroup@deploy1002 ladsgroup and pfischer: Continuing with sync [07:13:37] I'm moving forward with this [07:13:47] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:13:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P51993 and previous config saved to /var/cache/conftool/dbconfig/20230830-071356-ladsgroup.json [07:14:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T343718)', diff saved to https://phabricator.wikimedia.org/P51994 and previous config 
saved to /var/cache/conftool/dbconfig/20230830-071416-ladsgroup.json [07:15:12] (03PS4) 10Slyngshede: Minor UI tweaks. [software/bitu] - 10https://gerrit.wikimedia.org/r/953552 [07:15:13] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:16:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2004.codfw.wmnet [07:16:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:16:54] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2005.codfw.wmnet [07:17:41] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1021.eqiad.wmnet [07:18:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1021.eqiad.wmnet [07:18:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:19:03] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:952346|Disable search result deduplication. 
(T341227)]] (duration: 15m 53s)
[07:19:08] T341227: Make local_sites_with_dupe filter configurable and count duplicates - https://phabricator.wikimedia.org/T341227
[07:20:05] (CR) Filippo Giunchedi: [C: +1] uwsgi: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/953273 (owner: Muehlenhoff)
[07:20:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Repooling after onsite upgrade', diff saved to https://phabricator.wikimedia.org/P51995 and previous config saved to /var/cache/conftool/dbconfig/20230830-072009-root.json
[07:20:29] Amir1: No, but I’m looking into mwdebug
[07:20:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1021.eqiad.wmnet
[07:22:31] Puppet, SRE, Infrastructure-Foundations, netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (ayounsi) > In other words what is its usage today? One use case that I'm aware of is Icinga not alerting for hosts unreach...
[07:22:46] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1129.eqiad.wmnet with reason: host reimage [07:22:55] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1128.eqiad.wmnet with reason: host reimage [07:23:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:23:38] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2005.codfw.wmnet [07:24:35] (03PS1) 10Alexandros Kosiaris: Update modules/README.md [deployment-charts] - 10https://gerrit.wikimedia.org/r/953553 [07:25:06] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2006.codfw.wmnet [07:25:33] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1129.eqiad.wmnet with reason: host reimage [07:26:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:26:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1021.eqiad.wmnet [07:26:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1021.eqiad.wmnet [07:27:06] (03CR) 10Muehlenhoff: [C: 03+2] uwsgi: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/953273 (owner: 10Muehlenhoff) [07:27:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P51996 and 
previous config saved to /var/cache/conftool/dbconfig/20230830-072733-root.json [07:27:45] (Processor usage over 85%) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [07:28:03] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1128.eqiad.wmnet with reason: host reimage [07:28:27] (03CR) 10Ayounsi: [C: 03+2] sre.network.tls: use fqdn's hostname to store cert in config [cookbooks] - 10https://gerrit.wikimedia.org/r/953510 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [07:29:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T343718)', diff saved to https://phabricator.wikimedia.org/P51997 and previous config saved to /var/cache/conftool/dbconfig/20230830-072902-ladsgroup.json [07:29:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [07:29:08] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:29:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [07:29:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P51998 and previous config saved to /var/cache/conftool/dbconfig/20230830-072922-ladsgroup.json [07:31:02] (03Merged) 10jenkins-bot: sre.network.tls: use fqdn's hostname to store cert in config [cookbooks] - 10https://gerrit.wikimedia.org/r/953510 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [07:31:22] (03CR) 10Muehlenhoff: [C: 03+2] openstack: Remove obsolete client classes [puppet] - 10https://gerrit.wikimedia.org/r/953274 (owner: 10Muehlenhoff) [07:31:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on 
k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:31:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1128 upgrade to mariadb 10.4.31', diff saved to https://phabricator.wikimedia.org/P51999 and previous config saved to /var/cache/conftool/dbconfig/20230830-073144-root.json [07:31:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2006.codfw.wmnet [07:32:46] (Processor usage over 85%) resolved: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [07:32:48] (03CR) 10Jelto: [C: 03+2] trafficserver: switch wikiworkshop.org and research.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/949842 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto) [07:33:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:33:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 1%: Repooling after upgrade 10.4.31 T344309', diff saved to https://phabricator.wikimedia.org/P52000 and previous config saved to /var/cache/conftool/dbconfig/20230830-073347-root.json [07:33:54] T344309: Compile and package MariaDB 11.0.3 10.6.15, 10.4.31 - https://phabricator.wikimedia.org/T344309 [07:35:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: Repooling after onsite upgrade', diff saved to https://phabricator.wikimedia.org/P52001 and previous config saved to /var/cache/conftool/dbconfig/20230830-073514-root.json [07:37:09] (03CR) 10Slyngshede: "A few UI updates 
based on feedback and review with designer. This should help to ensure that once the new design lands people will see it " [software/bitu] - 10https://gerrit.wikimedia.org/r/953552 (owner: 10Slyngshede) [07:37:46] (03PS3) 10Alexandros Kosiaris: linkrecommendation: Update vendor modules for T340843 [deployment-charts] - 10https://gerrit.wikimedia.org/r/953551 (https://phabricator.wikimedia.org/T340843) [07:37:48] (03PS1) 10Alexandros Kosiaris: linkrecommendation: Remove all hardcoded dbproxy networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/953554 (https://phabricator.wikimedia.org/T340843) [07:38:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:39:39] 10SRE, 10Bitu, 10Infrastructure-Foundations: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 (10SLyngshede-WMF) [07:39:49] 10SRE, 10Bitu, 10Infrastructure-Foundations: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 (10SLyngshede-WMF) [07:41:06] (03PS1) 10Marostegui: mariadb: Move db1119 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/953555 (https://phabricator.wikimedia.org/T339835) [07:41:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2028.codfw.wmnet with reason: Maintenance [07:41:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2028.codfw.wmnet with reason: Maintenance [07:42:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2028 (T344589)', diff saved to https://phabricator.wikimedia.org/P52002 and previous config saved to /var/cache/conftool/dbconfig/20230830-074202-ladsgroup.json [07:42:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 
(re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52003 and previous config saved to /var/cache/conftool/dbconfig/20230830-074238-root.json [07:42:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1022.eqiad.wmnet [07:44:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P52004 and previous config saved to /var/cache/conftool/dbconfig/20230830-074428-ladsgroup.json [07:45:38] 10SRE, 10Bitu, 10Infrastructure-Foundations: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 (10SLyngshede-WMF) [07:46:25] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:47:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2028 (T344589)', diff saved to https://phabricator.wikimedia.org/P52005 and previous config saved to /var/cache/conftool/dbconfig/20230830-074702-ladsgroup.json [07:47:51] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:48:12] Amir1: Thank you for deploying my patch. Out of curiosity: How long does it take for such config deployments to be fully rolled out? 
[07:48:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:48:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 3%: Repooling after upgrade 10.4.31 T344309', diff saved to https://phabricator.wikimedia.org/P52006 and previous config saved to /var/cache/conftool/dbconfig/20230830-074852-root.json
[07:48:58] T344309: Compile and package MariaDB 11.0.3 10.6.15, 10.4.31 - https://phabricator.wikimedia.org/T344309
[07:50:41] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1129.eqiad.wmnet with OS bullseye
[07:51:40] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1128.eqiad.wmnet with OS bullseye
[07:53:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:53:56] pfischer: I went to a meeting; it went live half an hour ago
[07:54:30] 09:19 to be exact
[07:54:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1022.eqiad.wmnet
[07:55:49] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:17] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on
k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:56:39] PROBLEM - Host ml-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [07:57:00] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2007.codfw.wmnet [07:57:13] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:57:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [07:57:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [07:57:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T343718)', diff saved to https://phabricator.wikimedia.org/P52007 and previous config saved to /var/cache/conftool/dbconfig/20230830-075736-ladsgroup.json [07:57:42] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:59:09] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:59:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T343718)', diff saved to https://phabricator.wikimedia.org/P52008 and previous config saved to /var/cache/conftool/dbconfig/20230830-075934-ladsgroup.json [07:59:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [07:59:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [07:59:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Depooling db2125 (T343718)', diff saved to https://phabricator.wikimedia.org/P52009 and previous config saved to /var/cache/conftool/dbconfig/20230830-075956-ladsgroup.json [08:00:12] (03CR) 10David Caro: "We are actually upgrading to antelope, @fnegri is taking care of that :)" [puppet] - 10https://gerrit.wikimedia.org/r/952848 (owner: 10Muehlenhoff) [08:00:31] RECOVERY - Host ml-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [08:00:59] (03CR) 10David Caro: [C: 03+2] cloudcephosd: Remove Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/944937 (owner: 10Muehlenhoff) [08:01:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1022.eqiad.wmnet [08:01:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1022.eqiad.wmnet [08:01:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:01:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:01:54] (03CR) 10David Caro: [C: 03+2] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/944937 (owner: 10Muehlenhoff) [08:02:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2028', diff saved to https://phabricator.wikimedia.org/P52010 and previous config saved to /var/cache/conftool/dbconfig/20230830-080208-ladsgroup.json [08:03:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:03:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 5%: Repooling after upgrade 10.4.31 T344309', diff saved to https://phabricator.wikimedia.org/P52011 and previous config saved to /var/cache/conftool/dbconfig/20230830-080356-root.json [08:04:03] T344309: Compile and package MariaDB 11.0.3 10.6.15, 10.4.31 - https://phabricator.wikimedia.org/T344309 [08:04:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2007.codfw.wmnet [08:05:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1023.eqiad.wmnet [08:05:24] (03CR) 10Elukey: "My 2c: I am seeing a pattern of revert/re-apply for wikifunctions, and the reason for this revert seems to indicate that the code is not 1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953212 (https://phabricator.wikimedia.org/T344998) (owner: 10Jforrester) [08:06:01] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2008.codfw.wmnet [08:08:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:10:56] (03PS9) 10Elukey: LiftWing: add 
latency/availability SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [08:11:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:13:14] I am going to check these alerts --^ in a bit [08:15:54] (03CR) 10Muehlenhoff: Minor UI tweaks. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/953552 (owner: 10Slyngshede) [08:16:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:17:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2028', diff saved to https://phabricator.wikimedia.org/P52012 and previous config saved to /var/cache/conftool/dbconfig/20230830-081714-ladsgroup.json [08:17:53] (03PS5) 10Slyngshede: Minor UI tweaks. [software/bitu] - 10https://gerrit.wikimedia.org/r/953552 [08:18:24] (03CR) 10Slyngshede: Minor UI tweaks. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/953552 (owner: 10Slyngshede) [08:18:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:18:57] PROBLEM - Check systemd state on ml-serve-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 10%: Repooling after upgrade 10.4.31 T344309', diff saved to https://phabricator.wikimedia.org/P52013 and previous config saved to /var/cache/conftool/dbconfig/20230830-081901-root.json [08:19:07] T344309: Compile and package MariaDB 11.0.3 10.6.15, 10.4.31 - https://phabricator.wikimedia.org/T344309 [08:21:51] (03CR) 10Muehlenhoff: [C: 03+2] etcd: Remove obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/952198 (owner: 10Muehlenhoff) [08:23:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:24:08] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:26:07] PROBLEM - Host ores2008 is DOWN: PING CRITICAL - Packet loss = 100% [08:26:12] (03CR) 10Elukey: LiftWing: add latency/availability SLO dashboards (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [08:26:45] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T343718)', diff saved to https://phabricator.wikimedia.org/P52014 and previous config saved to /var/cache/conftool/dbconfig/20230830-082645-ladsgroup.json [08:26:52] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:26:59] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/953552 (owner: 10Slyngshede) [08:27:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet [08:28:53] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:30:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T343718)', diff saved to https://phabricator.wikimedia.org/P52015 and previous config saved to /var/cache/conftool/dbconfig/20230830-083025-ladsgroup.json [08:31:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet [08:31:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1023.eqiad.wmnet [08:31:46] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/953509 (https://phabricator.wikimedia.org/T340111) (owner: 10Ayounsi) [08:31:47] RECOVERY - Check systemd state on ml-serve-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:32:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2028 (T344589)', diff saved to https://phabricator.wikimedia.org/P52016 and previous config saved to /var/cache/conftool/dbconfig/20230830-083220-ladsgroup.json [08:32:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2030.codfw.wmnet with reason: Maintenance [08:32:37] Checking ores2008 [08:32:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2030.codfw.wmnet with reason: Maintenance [08:32:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2030 (T344589)', diff saved to https://phabricator.wikimedia.org/P52017 and previous config saved to /var/cache/conftool/dbconfig/20230830-083246-ladsgroup.json [08:33:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1025.eqiad.wmnet [08:33:57] (03CR) 10Ayounsi: [C: 03+2] cr: set bgp-error-tolerance on all sessions [homer/public] - 10https://gerrit.wikimedia.org/r/953509 (https://phabricator.wikimedia.org/T340111) (owner: 10Ayounsi) [08:34:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 25%: Repooling after upgrade 10.4.31 T344309', diff saved to https://phabricator.wikimedia.org/P52018 and previous config saved to /var/cache/conftool/dbconfig/20230830-083406-root.json [08:34:12] T344309: Compile and package MariaDB 11.0.3 10.6.15, 10.4.31 - https://phabricator.wikimedia.org/T344309 [08:34:29] (03Merged) 10jenkins-bot: cr: set bgp-error-tolerance on all sessions [homer/public] - 10https://gerrit.wikimedia.org/r/953509 (https://phabricator.wikimedia.org/T340111) 
(owner: Ayounsi)
[08:34:36] dead DIMM
[08:34:41] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:34:42] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Configure bgp-error-tolerance on Juniper routers - https://phabricator.wikimedia.org/T340111 (cmooney) Agreed this seems to make sense, and Juniper are advising it: https://supportportal.juniper.net/s/article/2023-08-29-Out-of-Cycle-Secu...
[08:34:45] I'll open a DCops task
[08:35:34] Hmm, is someone already on it? moritzm?
[08:35:41] console: Serial Device 2 is currently in use
[08:35:53] (CR) Muehlenhoff: [C: +2] aphlict : Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/952461 (owner: Muehlenhoff)
[08:36:05] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:37:12] claime: I am rebooting it for kernel upgrades
[08:37:15] see SAL
[08:37:20] but of course the host doesn't like me
[08:37:26] Why did I not see it
[08:37:29] wtf
[08:37:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2030 (T344589)', diff saved to https://phabricator.wikimedia.org/P52019 and previous config saved to /var/cache/conftool/dbconfig/20230830-083737-ladsgroup.json
[08:37:42] (CR) Cathal Mooney: [C: +2] Allow MGMT_NETWORKS connect to apt server private server on 8080 [puppet] - https://gerrit.wikimedia.org/r/952478 (https://phabricator.wikimedia.org/T336485) (owner: Cathal Mooney)
[08:37:49] (CR) Slyngshede: [V: +2 C: +2] Minor UI tweaks.
[software/bitu] - 10https://gerrit.wikimedia.org/r/953552 (owner: 10Slyngshede) [08:38:51] !log set bgp-error-tolerance on all sessions - T340111 [08:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:56] T340111: Configure bgp-error-tolerance on Juniper routers - https://phabricator.wikimedia.org/T340111 [08:40:39] elukey, claime: given the server's age (May 2017!) we can also just decom it? ORES codfw should be fine with one server less and it's not for long anyway [08:41:51] (ProbeDown) firing: (2) Service idm1001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#idm1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:41:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P52020 and previous config saved to /var/cache/conftool/dbconfig/20230830-084151-ladsgroup.json [08:42:29] moritzm: If it doesn't come back, we probably can [08:42:35] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10MatthewVernon) Hi @Jhancock.wm this system looks to have been set up with individual RAID-0, rather than as non-RAID JBOD? 
[08:43:18] elukey: Apparently searching for ores in https://sal.toolforge.org/production doesn't match ores200X, I have to search ores* [08:43:21] Good to know lol [08:43:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet [08:44:45] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw-a2-codfw [08:45:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P52021 and previous config saved to /var/cache/conftool/dbconfig/20230830-084532-ladsgroup.json [08:46:17] (03CR) 10Muehlenhoff: [C: 03+2] mariadb::packages_wmf: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/952881 (owner: 10Muehlenhoff) [08:46:51] (ProbeDown) firing: (4) Service idm1001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:47:48] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw-a2-codfw [08:47:50] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw2-c2-eqiad [08:49:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 50%: Repooling after upgrade 10.4.31 T344309', diff saved to https://phabricator.wikimedia.org/P52023 and previous config saved to /var/cache/conftool/dbconfig/20230830-084911-root.json [08:49:17] T344309: Compile and package MariaDB 11.0.3 10.6.15, 10.4.31 - https://phabricator.wikimedia.org/T344309 [08:49:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet [08:50:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1025.eqiad.wmnet [08:50:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node 
ganeti1026.eqiad.wmnet [08:50:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw2-c2-eqiad [08:50:51] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw2-d7-eqiad [08:51:10] !log stopping puppet to fix broken drive labelling after disk swap thanos-be1003 T345079 [08:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:15] T345079: Broken disk on thanos-be1003 - https://phabricator.wikimedia.org/T345079 [08:52:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:52:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2030', diff saved to https://phabricator.wikimedia.org/P52024 and previous config saved to /var/cache/conftool/dbconfig/20230830-085243-ladsgroup.json [08:53:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw2-d7-eqiad [08:53:49] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw-0604-eqsin [08:54:25] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:55:23] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:56:49] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:56:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff 
saved to https://phabricator.wikimedia.org/P52025 and previous config saved to /var/cache/conftool/dbconfig/20230830-085657-ladsgroup.json [08:56:58] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw-0604-eqsin [08:57:00] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw-b2-codfw [08:57:19] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:57:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:58:06] (03PS1) 10Muehlenhoff: Remove now obsolete check given all Stretch VMs are gone [puppet] - 10https://gerrit.wikimedia.org/r/953562 [08:58:30] 10SRE, 10ops-eqiad: Broken disk on thanos-be1003 - https://phabricator.wikimedia.org/T345079 (10MatthewVernon) puppet had spotted the new drive was `/dev/sdm` so made a new filesystem labelled `swift-sdm1` on it (even though there was already an FS with that label mounted and in use). I fixed this up by hand,... 
[08:58:48] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Broken disk on thanos-be1003 - https://phabricator.wikimedia.org/T345079 (10MatthewVernon) [08:59:36] !log elukey@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ores2008.codfw.wmnet [08:59:50] jouncebot: nowandnext [08:59:50] No deployments scheduled for the next 1 hour(s) and 0 minute(s) [08:59:50] In 1 hour(s) and 0 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230830T1000) [08:59:57] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw-b2-codfw [08:59:59] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw-c2-codfw [09:00:01] (03CR) 10Ladsgroup: [C: 03+2] Allow setting configurations through rtl dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953328 (owner: 10Zabe) [09:00:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953328 (owner: 10Zabe) [09:00:24] (03CR) 10CI reject: [V: 04-1] Remove now obsolete check given all Stretch VMs are gone [puppet] - 10https://gerrit.wikimedia.org/r/953562 (owner: 10Muehlenhoff) [09:00:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P52026 and previous config saved to /var/cache/conftool/dbconfig/20230830-090038-ladsgroup.json [09:00:41] (03Merged) 10jenkins-bot: Allow setting configurations through rtl dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953328 (owner: 10Zabe) [09:01:12] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:953328|Allow setting configurations through rtl dblist]] [09:01:51] 10ops-codfw, 10Machine-Learning-Team: Check ores2008's cable - https://phabricator.wikimedia.org/T345233 (10elukey) [09:02:12] !log elukey@cumin1001 START - Cookbook 
sre.hosts.reboot-single for host ores2009.codfw.wmnet [09:02:45] !log ladsgroup@deploy1002 ladsgroup and zabe: Backport for [[gerrit:953328|Allow setting configurations through rtl dblist]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [09:03:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw-c2-codfw [09:03:02] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw-d2-codfw [09:04:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 75%: Repooling after upgrade 10.4.31 T344309', diff saved to https://phabricator.wikimedia.org/P52027 and previous config saved to /var/cache/conftool/dbconfig/20230830-090415-root.json [09:04:24] T344309: Compile and package MariaDB 11.0.3 10.6.15, 10.4.31 - https://phabricator.wikimedia.org/T344309 [09:04:25] !log ladsgroup@deploy1002 ladsgroup and zabe: Continuing with sync [09:05:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet [09:06:04] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw-d2-codfw [09:06:06] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device fasw-c8a-codfw [09:07:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2030', diff saved to https://phabricator.wikimedia.org/P52028 and previous config saved to /var/cache/conftool/dbconfig/20230830-090749-ladsgroup.json [09:07:57] PROBLEM - Host aux-k8s-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [09:08:56] (03PS1) 10Ladsgroup: admin: Add Mabualruz to analytics-private-data [puppet] - 10https://gerrit.wikimedia.org/r/953565 (https://phabricator.wikimedia.org/T342535) [09:09:01] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for 
host ores2009.codfw.wmnet [09:09:24] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1001.eqiad.wmnet [09:09:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw-c8a-codfw [09:09:30] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device fasw-c1a-eqiad [09:09:38] (03PS1) 10Muehlenhoff: Update toolsdb_replica_cnf spec test to run on default OS selection [puppet] - 10https://gerrit.wikimedia.org/r/953566 [09:10:05] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:953328|Allow setting configurations through rtl dblist]] (duration: 08m 52s) [09:10:25] RECOVERY - Host aux-k8s-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 3.02 ms [09:11:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:12:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet [09:12:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T343718)', diff saved to https://phabricator.wikimedia.org/P52029 and previous config saved to /var/cache/conftool/dbconfig/20230830-091203-ladsgroup.json [09:12:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet [09:12:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [09:12:10] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:12:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [09:12:21] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:12:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:12:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T343718)', diff saved to https://phabricator.wikimedia.org/P52030 and previous config saved to /var/cache/conftool/dbconfig/20230830-091242-ladsgroup.json [09:12:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet [09:12:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw-c1a-eqiad [09:15:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T343718)', diff saved to https://phabricator.wikimedia.org/P52031 and previous config saved to /var/cache/conftool/dbconfig/20230830-091544-ladsgroup.json [09:15:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [09:16:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [09:16:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [09:16:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [09:16:09] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1001.eqiad.wmnet [09:16:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T343718)', diff saved to https://phabricator.wikimedia.org/P52032 and previous config saved to /var/cache/conftool/dbconfig/20230830-091610-ladsgroup.json [09:16:34] 
(KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:17:52] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1002.eqiad.wmnet [09:18:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T343718)', diff saved to https://phabricator.wikimedia.org/P52033 and previous config saved to /var/cache/conftool/dbconfig/20230830-091833-ladsgroup.json [09:18:39] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:19:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 100%: Repooling after upgrade 10.4.31 T344309', diff saved to https://phabricator.wikimedia.org/P52034 and previous config saved to /var/cache/conftool/dbconfig/20230830-091922-root.json [09:19:27] T344309: Compile and package MariaDB 11.0.3 10.6.15, 10.4.31 - https://phabricator.wikimedia.org/T344309 [09:20:34] (03PS1) 10Slyngshede: P:idm fix blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/953567 [09:21:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2096.codfw.wmnet with reason: Maintenance [09:21:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2096.codfw.wmnet with reason: Maintenance [09:21:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2096 (T344589)', diff saved to https://phabricator.wikimedia.org/P52035 and previous config saved to /var/cache/conftool/dbconfig/20230830-092147-ladsgroup.json [09:22:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1137.eqiad.wmnet with reason: Maintenance [09:22:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 
0:00:00 on db1137.eqiad.wmnet with reason: Maintenance [09:22:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1137 (T344589)', diff saved to https://phabricator.wikimedia.org/P52036 and previous config saved to /var/cache/conftool/dbconfig/20230830-092228-ladsgroup.json [09:22:49] (03CR) 10CI reject: [V: 04-1] P:idm fix blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/953567 (owner: 10Slyngshede) [09:22:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2030 (T344589)', diff saved to https://phabricator.wikimedia.org/P52037 and previous config saved to /var/cache/conftool/dbconfig/20230830-092255-ladsgroup.json [09:24:29] (03PS1) 10Urbanecm: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/953569 [09:24:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1002.eqiad.wmnet [09:25:03] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1003.eqiad.wmnet [09:25:06] (03Abandoned) 10Slyngshede: P:idm fix blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/953567 (owner: 10Slyngshede) [09:25:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet [09:25:47] (03CR) 10Urbanecm: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/953569 (owner: 10Urbanecm) [09:26:15] (03PS1) 10Slyngshede: P:idm fix blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/953570 [09:26:34] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/953569 (owner: 10Urbanecm) [09:27:19] !log urbanecm@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [09:28:27] !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [09:28:32] (03CR) 10CI reject: [V: 04-1] P:idm fix blackbox monitoring [puppet] - 
10https://gerrit.wikimedia.org/r/953570 (owner: 10Slyngshede) [09:28:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2096 (T344589)', diff saved to https://phabricator.wikimedia.org/P52038 and previous config saved to /var/cache/conftool/dbconfig/20230830-092851-ladsgroup.json [09:28:58] !log urbanecm@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [09:30:59] !log urbanecm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [09:31:09] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1003.eqiad.wmnet [09:31:17] !log urbanecm@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [09:32:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet [09:32:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1027.eqiad.wmnet [09:32:39] !log urbanecm@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [09:33:04] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1004.eqiad.wmnet [09:33:17] (03PS2) 10Ayounsi: Enable gNMI on all devices [homer/public] - 10https://gerrit.wikimedia.org/r/948540 (https://phabricator.wikimedia.org/T326322) [09:33:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P52039 and previous config saved to /var/cache/conftool/dbconfig/20230830-093339-ladsgroup.json [09:34:36] (03PS3) 10Ayounsi: Enable gNMI on access switches [homer/public] - 10https://gerrit.wikimedia.org/r/948540 (https://phabricator.wikimedia.org/T326322) [09:37:15] (03PS4) 10Ayounsi: Enable gNMI on access and cloud switches [homer/public] - 10https://gerrit.wikimedia.org/r/948540 (https://phabricator.wikimedia.org/T326322) [09:37:46] (Not accepting/receiving prefixes from 
anycast BGP peer) firing: (2) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [09:39:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T343718)', diff saved to https://phabricator.wikimedia.org/P52040 and previous config saved to /var/cache/conftool/dbconfig/20230830-093913-ladsgroup.json [09:39:20] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:40:04] (03Abandoned) 10Filippo Giunchedi: mesh: add KUBERNETES_NODE (spec.nodeName) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953268 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:40:48] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1004.eqiad.wmnet [09:40:56] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1005.eqiad.wmnet [09:43:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2096', diff saved to https://phabricator.wikimedia.org/P52041 and previous config saved to /var/cache/conftool/dbconfig/20230830-094357-ladsgroup.json [09:45:44] (03CR) 10JMeybohm: "Did something change in wikifunctions since the last try?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/953212 (https://phabricator.wikimedia.org/T344998) (owner: 10Jforrester) [09:47:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1005.eqiad.wmnet [09:48:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P52042 and previous config saved to /var/cache/conftool/dbconfig/20230830-094845-ladsgroup.json [09:49:25] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1006.eqiad.wmnet [09:50:35] (03PS1) 10Elukey: admin_ng: raise knative's container-concurrency-target-percentage to 85 [deployment-charts] - 10https://gerrit.wikimedia.org/r/953574 (https://phabricator.wikimedia.org/T344058) [09:51:16] (03CR) 10Ayounsi: [C: 03+2] Enable gNMI on access and cloud switches [homer/public] - 10https://gerrit.wikimedia.org/r/948540 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [09:51:55] (03Merged) 10jenkins-bot: Enable gNMI on access and cloud switches [homer/public] - 10https://gerrit.wikimedia.org/r/948540 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [09:53:48] (03CR) 10JMeybohm: "Thank you!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/953553 (owner: 10Alexandros Kosiaris) [09:54:15] (03PS1) 10Filippo Giunchedi: mesh: new configuration version [deployment-charts] - 10https://gerrit.wikimedia.org/r/953575 [09:54:17] (03PS1) 10Filippo Giunchedi: mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) [09:54:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P52043 and previous config saved to /var/cache/conftool/dbconfig/20230830-095419-ladsgroup.json [09:54:48] (03CR) 10Filippo Giunchedi: "Just the new version, changes are at Ia88f7200cf" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953575 (owner: 10Filippo Giunchedi) [09:56:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1006.eqiad.wmnet [09:56:21] (03PS1) 10Majavah: P:wmcs::kubeadm: remove version defaults [puppet] - 10https://gerrit.wikimedia.org/r/953577 [09:56:21] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:36] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43058/console" [puppet] - 10https://gerrit.wikimedia.org/r/953577 (owner: 10Majavah) [09:58:07] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [09:58:15] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:16] !log jmm@cumin2002 START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet [09:58:37] (03CR) 10CI reject: [V: 04-1] P:wmcs::kubeadm: remove version defaults [puppet] - 10https://gerrit.wikimedia.org/r/953577 (owner: 10Majavah) [09:58:56] (03CR) 10JMeybohm: [C: 03+1] helmfile: add entries and namespace for media-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951544 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [09:59:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2096', diff saved to https://phabricator.wikimedia.org/P52044 and previous config saved to /var/cache/conftool/dbconfig/20230830-095903-ladsgroup.json [09:59:11] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:36] (03CR) 10Majavah: [C: 03+2] "merging to fix CI on unrelated patch" [puppet] - 10https://gerrit.wikimedia.org/r/953566 (owner: 10Muehlenhoff) [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230830T1000) [10:00:23] (03PS2) 10Majavah: P:wmcs::kubeadm: remove version defaults [puppet] - 10https://gerrit.wikimedia.org/r/953577 [10:00:41] (03PS1) 10Elukey: ml-services: tune knative's container concurrency settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/953578 (https://phabricator.wikimedia.org/T344058) [10:01:07] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:38] (03CR) 10JMeybohm: [C: 03+1] wmnet: add geo-analytics and media-analytics ingress records [dns] - 10https://gerrit.wikimedia.org/r/953311 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [10:03:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T343718)', diff saved to 
https://phabricator.wikimedia.org/P52045 and previous config saved to /var/cache/conftool/dbconfig/20230830-100351-ladsgroup.json [10:03:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [10:03:58] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:04:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [10:04:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T343718)', diff saved to https://phabricator.wikimedia.org/P52046 and previous config saved to /var/cache/conftool/dbconfig/20230830-100413-ladsgroup.json [10:06:52] !log Rolling reboot codfw wikikube k8s nodes [10:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:07] !log jiji@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-codfw [10:08:26] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322 (10ayounsi) @cmooney I came across https://www.juniper.net/documentation/us/en/software/junos/interfaces-telemetry/topics/ref/statement/... 
[10:08:30] (03CR) 10Hnowlan: [C: 03+2] helmfile: add entries and namespace for media-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951544 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [10:08:52] (03CR) 10JMeybohm: [C: 04-1] "This needs to be based off of configuration 1.4.0" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953575 (owner: 10Filippo Giunchedi) [10:09:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P52047 and previous config saved to /var/cache/conftool/dbconfig/20230830-100926-ladsgroup.json [10:09:49] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:10:18] 10SRE, 10Infrastructure-Foundations, 10netops: Configure bgp-error-tolerance on Juniper routers - https://phabricator.wikimedia.org/T340111 (10ayounsi) 05Open→03Resolved a:03ayounsi All done. 
[10:10:22] 10SRE, 10Infrastructure-Foundations, 10netops: Configure bgp-error-tolerance on Juniper routers - https://phabricator.wikimedia.org/T340111 (10ayounsi) Relevant: https://blog.benjojo.co.uk/post/bgp-path-attributes-grave-error-handling [10:11:03] (03Merged) 10jenkins-bot: helmfile: add entries and namespace for media-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951544 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [10:12:26] (03PS2) 10Filippo Giunchedi: mesh: new configuration version [deployment-charts] - 10https://gerrit.wikimedia.org/r/953575 (https://phabricator.wikimedia.org/T320563) [10:12:28] (03PS2) 10Filippo Giunchedi: mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) [10:12:32] urbanecm: hi, can you restart DT script so it could pick the config so I can reboot the db? [10:12:35] (03PS2) 10Elukey: ml-services: tune knative's container concurrency settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/953578 (https://phabricator.wikimedia.org/T344058) [10:12:46] !log hnowlan@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [10:12:55] (03CR) 10CI reject: [V: 04-1] mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [10:13:14] (03CR) 10Filippo Giunchedi: mesh: new configuration version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953575 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [10:13:23] !log hnowlan@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[10:14:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2096 (T344589)', diff saved to https://phabricator.wikimedia.org/P52048 and previous config saved to /var/cache/conftool/dbconfig/20230830-101410-ladsgroup.json [10:14:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: Maintenance [10:14:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: Maintenance [10:14:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2131 (T344589)', diff saved to https://phabricator.wikimedia.org/P52049 and previous config saved to /var/cache/conftool/dbconfig/20230830-101437-ladsgroup.json [10:14:49] !log hnowlan@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:15:18] Amir1: sure, done! [10:15:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet [10:15:25] thanks! [10:16:00] (03PS1) 10Muehlenhoff: Remove obsolete rolemap file [puppet] - 10https://gerrit.wikimedia.org/r/953579 [10:16:25] !log +50g to prometheus eqiad 'services' instance [10:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:34] !log hnowlan@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [10:16:51] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:59] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[10:17:01] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:17:44] (03PS2) 10Muehlenhoff: Remove now obsolete check given all Stretch VMs are gone [puppet] - 10https://gerrit.wikimedia.org/r/953562 [10:18:15] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:18:15] hmm, probably something else opened connection to x1 [10:18:31] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:20:02] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:20:36] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [10:21:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet [10:21:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet [10:22:39] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:22:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T344589)', diff saved to https://phabricator.wikimedia.org/P52050 and previous config saved to /var/cache/conftool/dbconfig/20230830-102241-ladsgroup.json [10:22:50] (03CR) 10MVernon: [C: 03+1] Remove obsolete rolemap file [puppet] - 10https://gerrit.wikimedia.org/r/953579 (owner: 10Muehlenhoff) [10:24:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T343718)', diff saved to https://phabricator.wikimedia.org/P52051 and previous config saved to /var/cache/conftool/dbconfig/20230830-102432-ladsgroup.json [10:24:34] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:24:38] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:24:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:24:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T343718)', diff saved to https://phabricator.wikimedia.org/P52052 and previous config saved to /var/cache/conftool/dbconfig/20230830-102452-ladsgroup.json [10:25:51] (03PS31) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [10:26:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137 (T344589)', diff saved to https://phabricator.wikimedia.org/P52053 and previous config saved to /var/cache/conftool/dbconfig/20230830-102559-ladsgroup.json [10:26:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet [10:27:46] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete rolemap file [puppet] - 10https://gerrit.wikimedia.org/r/953579 (owner: 10Muehlenhoff) [10:28:23] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T345239 (10phaultfinder) [10:28:43] (03PS32) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [10:28:47] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply [10:29:53] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:31:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 
(T343718)', diff saved to https://phabricator.wikimedia.org/P52054 and previous config saved to /var/cache/conftool/dbconfig/20230830-103140-ladsgroup.json [10:31:42] BGP errors related to kubernetes-codfw are expected [10:31:46] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:32:45] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [10:33:34] (03CR) 10Majavah: replica_cnf_api: add envvars backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [10:35:00] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:35:01] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1006 - aborrero@cumin1001" [10:35:48] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1006 - aborrero@cumin1001" [10:35:48] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:36:42] (03PS6) 10Arturo Borrero Gonzalez: cloudservices1006: prepare service [puppet] - 10https://gerrit.wikimedia.org/r/941383 (https://phabricator.wikimedia.org/T342161) [10:37:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:37:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P52055 and previous config saved to /var/cache/conftool/dbconfig/20230830-103747-ladsgroup.json 
[10:37:50] (03PS7) 10Arturo Borrero Gonzalez: cloudservices1006: prepare service [puppet] - 10https://gerrit.wikimedia.org/r/941383 (https://phabricator.wikimedia.org/T342161) [10:38:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Remove now obsolete check given all Stretch VMs are gone [puppet] - 10https://gerrit.wikimedia.org/r/953562 (owner: 10Muehlenhoff) [10:38:59] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [10:40:23] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:41:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137', diff saved to https://phabricator.wikimedia.org/P52056 and previous config saved to /var/cache/conftool/dbconfig/20230830-104105-ladsgroup.json [10:41:47] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:41:49] !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudservices1006 [10:42:17] !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudservices1006 [10:45:59] 10SRE, 10Infrastructure-Foundations, 10netops: Use mgmt_junos on all network devices - https://phabricator.wikimedia.org/T327862 (10ayounsi) 05Resolved→03Open Re-opening as the fasw got upgraded since, so we can enable `mgmt_junos` [10:46:09] (03PS8) 10Arturo Borrero Gonzalez: cloudservices1006: prepare service [puppet] - 10https://gerrit.wikimedia.org/r/941383 (https://phabricator.wikimedia.org/T342621) [10:46:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P52057 and previous config saved to 
/var/cache/conftool/dbconfig/20230830-104646-ladsgroup.json [10:46:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10aborrero) [10:47:03] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:48:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices1006: prepare service [puppet] - 10https://gerrit.wikimedia.org/r/941383 (https://phabricator.wikimedia.org/T342621) (owner: 10Arturo Borrero Gonzalez) [10:48:29] (03CR) 10Slyngshede: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/953570 (owner: 10Slyngshede) [10:48:32] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply [10:48:53] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [10:48:55] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1006.eqiad.wmnet with OS bullseye [10:50:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet [10:50:23] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:50:36] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/media-analytics: apply [10:51:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T343718)', diff saved to https://phabricator.wikimedia.org/P52058 and previous config saved to /var/cache/conftool/dbconfig/20230830-105138-ladsgroup.json [10:51:43] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:51:51] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, 
shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:52:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/953570 (owner: 10Slyngshede) [10:52:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P52059 and previous config saved to /var/cache/conftool/dbconfig/20230830-105254-ladsgroup.json [10:52:54] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [10:52:56] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [10:53:24] (03CR) 10Slyngshede: [C: 03+2] P:idm fix blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/953570 (owner: 10Slyngshede) [10:54:53] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:54:54] !log installing grub2 updates from bullseye point release [10:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:41] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001" [10:55:47] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:56:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137', diff saved to https://phabricator.wikimedia.org/P52060 and previous config saved to /var/cache/conftool/dbconfig/20230830-105612-ladsgroup.json [10:56:19] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[10:56:31] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001" [10:56:31] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:56:47] 10SRE, 10Traffic: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (10Fabfur) a:03Fabfur [10:57:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet [10:57:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1030.eqiad.wmnet [10:57:39] !log enable mgmt_junos on fasw-c-codfw - T327862 [10:57:39] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:44] T327862: Use mgmt_junos on all network devices - https://phabricator.wikimedia.org/T327862 [11:00:17] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:00:55] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [11:01:11] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [11:01:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P52061 and previous config saved to /var/cache/conftool/dbconfig/20230830-110152-ladsgroup.json [11:04:51] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:06:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P52062 and previous config saved to /var/cache/conftool/dbconfig/20230830-110644-ladsgroup.json [11:07:27] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:08:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T344589)', diff saved to https://phabricator.wikimedia.org/P52063 and previous config saved to /var/cache/conftool/dbconfig/20230830-110800-ladsgroup.json [11:09:35] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:11:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137 (T344589)', diff saved to https://phabricator.wikimedia.org/P52064 and previous config saved to /var/cache/conftool/dbconfig/20230830-111118-ladsgroup.json [11:11:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1220.eqiad.wmnet with reason: Maintenance [11:11:30] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [11:11:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1220.eqiad.wmnet with reason: Maintenance [11:11:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1220 (T344589)', diff saved to https://phabricator.wikimedia.org/P52065 and previous config saved to /var/cache/conftool/dbconfig/20230830-111143-ladsgroup.json [11:12:31] !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudservices1006.eqiad.wmnet with OS bullseye [11:12:57] !log 
aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1006.eqiad.wmnet with OS bullseye [11:13:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet [11:14:47] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:14:51] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:59] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:03] !log switch cumin to the puppetdb api micro service Gerrit:953203 [11:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:18] (03CR) 10Jbond: [V: 03+1 C: 03+2] cumin: update cumin host to use the puppetdb-micro service [puppet] - 10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) (owner: 10Jbond) [11:16:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T343718)', diff saved to https://phabricator.wikimedia.org/P52066 and previous config saved to /var/cache/conftool/dbconfig/20230830-111659-ladsgroup.json [11:17:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [11:17:05] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:17:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [11:17:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T343718)', diff saved to 
https://phabricator.wikimedia.org/P52067 and previous config saved to /var/cache/conftool/dbconfig/20230830-111720-ladsgroup.json [11:17:41] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:18:55] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:18:57] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1220 (T344589)', diff saved to https://phabricator.wikimedia.org/P52068 and previous config saved to /var/cache/conftool/dbconfig/20230830-111952-ladsgroup.json [11:20:11] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:55] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:11] (03CR) 10Cathal Mooney: [C: 03+2] Increase the number of retries for ZTP provision cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/942695 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [11:21:14] (03PS1) 10Slyngshede: Additional shell account name validation. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/953582 [11:21:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P52069 and previous config saved to /var/cache/conftool/dbconfig/20230830-112150-ladsgroup.json [11:21:51] (ProbeDown) firing: (4) Service idm1001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:22:13] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:52] (03Merged) 10jenkins-bot: Increase the number of retries for ZTP provision cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/942695 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [11:24:07] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf,ops for arnaudb - https://phabricator.wikimedia.org/T345241 (10ABran-WMF) [11:24:51] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:29:10] (03CR) 10Jbond: "thanks updated" [puppet] - 10https://gerrit.wikimedia.org/r/940384 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [11:30:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1031.eqiad.wmnet [11:31:17] (03CR) 10Muehlenhoff: [C: 03+2] Remove now obsolete check given all Stretch VMs are gone [puppet] - 10https://gerrit.wikimedia.org/r/953562 (owner: 10Muehlenhoff) [11:31:45] (03CR) 10Hnowlan: [C: 03+2] wmnet: add geo-analytics and media-analytics ingress records [dns] - 10https://gerrit.wikimedia.org/r/953311 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [11:31:49] PROBLEM - Host 
aux-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:31:59] PROBLEM - Host kubetcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [11:34:11] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [11:34:47] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:34:47] (03PS1) 10Muehlenhoff: php: Remove obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/953584 [11:34:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1220', diff saved to https://phabricator.wikimedia.org/P52070 and previous config saved to /var/cache/conftool/dbconfig/20230830-113459-ladsgroup.json [11:35:25] RECOVERY - Host aux-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 2.61 ms [11:35:51] RECOVERY - Host kubetcd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [11:36:05] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf,ops for arnaudb - https://phabricator.wikimedia.org/T345241 (10jcrespo) p:05Triage→03High [11:36:13] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf,ops for arnaudb - https://phabricator.wikimedia.org/T345241 (10jcrespo) a:03jcrespo [11:36:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1031.eqiad.wmnet [11:36:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1031.eqiad.wmnet [11:36:45] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf,ops for arnaudb - https://phabricator.wikimedia.org/T345241 (10jcrespo) Owning it as we will do it slowly for learning it purposes (Clinic duty person is aware). 
[11:36:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T343718)', diff saved to https://phabricator.wikimedia.org/P52071 and previous config saved to /var/cache/conftool/dbconfig/20230830-113656-ladsgroup.json [11:36:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:37:02] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:37:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:37:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T343718)', diff saved to https://phabricator.wikimedia.org/P52072 and previous config saved to /var/cache/conftool/dbconfig/20230830-113728-ladsgroup.json [11:37:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST customresourcedefinitions) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:40:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet [11:41:09] (03PS1) 10Slyngshede: P:IDM Enable addition validator. 
[puppet] - 10https://gerrit.wikimedia.org/r/953585 [11:42:21] (03CR) 10Slyngshede: "The additional validator also needs to be enabled using Puppet, see: https://gerrit.wikimedia.org/r/c/operations/puppet/+/953585" [software/bitu] - 10https://gerrit.wikimedia.org/r/953582 (owner: 10Slyngshede) [11:42:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST customresourcedefinitions) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:43:48] (03PS1) 10Vgutierrez: trafficserver: Allow configuring transaction_active_timeout_in [puppet] - 10https://gerrit.wikimedia.org/r/953587 (https://phabricator.wikimedia.org/T341755) [11:44:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T343718)', diff saved to https://phabricator.wikimedia.org/P52073 and previous config saved to /var/cache/conftool/dbconfig/20230830-114421-ladsgroup.json [11:44:27] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:44:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/953585 (owner: 10Slyngshede) [11:44:51] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:46:47] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:46:59] (03PS2) 10Vgutierrez: trafficserver: Allow configuring transaction_active_timeout_in [puppet] - 10https://gerrit.wikimedia.org/r/953587 (https://phabricator.wikimedia.org/T341755) [11:47:04] (03PS1) 10Slyngshede: P:IDM Update blackbox URL [puppet] - 10https://gerrit.wikimedia.org/r/953589 [11:47:43] !log cmooney@cumin1001 START - Cookbook 
sre.dns.netbox [11:48:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/953589 (owner: 10Slyngshede) [11:48:11] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:48:14] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/953589 (owner: 10Slyngshede) [11:48:38] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [11:49:17] (03CR) 10Slyngshede: [C: 03+2] P:IDM Update blackbox URL [puppet] - 10https://gerrit.wikimedia.org/r/953589 (owner: 10Slyngshede) [11:50:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet [11:50:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1220', diff saved to https://phabricator.wikimedia.org/P52074 and previous config saved to /var/cache/conftool/dbconfig/20230830-115005-ladsgroup.json [11:51:13] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - cmooney@cumin1001" [11:51:25] PROBLEM - Host dse-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:51:32] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [11:51:51] (ProbeDown) resolved: (2) Service idm1001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#idm1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:03] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management 
record for ssw1-a1-codfw - cmooney@cumin1001" [11:52:03] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:52:04] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a1-codfw.mgmt.codfw.wmnet [11:52:27] PROBLEM - Host kubestagetcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [11:52:29] PROBLEM - Host aux-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:52:59] (03PS2) 10Slyngshede: P:IDM Enable addition validator. [puppet] - 10https://gerrit.wikimedia.org/r/953585 [11:53:29] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:55:04] (03PS2) 10Samtar: IS: Enable Phonos on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951042 (https://phabricator.wikimedia.org/T336763) [11:55:33] RECOVERY - Host aux-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [11:55:39] RECOVERY - Host dse-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [11:55:53] RECOVERY - Host kubestagetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [11:56:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet [11:56:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet [11:56:49] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:57:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet [11:58:15] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:59:01] 
!log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudservices1006.eqiad.wmnet with OS bullseye [11:59:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P52076 and previous config saved to /var/cache/conftool/dbconfig/20230830-115927-ladsgroup.json [12:00:41] (03CR) 10Alexandros Kosiaris: [C: 03+2] linkrecommendation: Update vendor modules for T340843 [deployment-charts] - 10https://gerrit.wikimedia.org/r/953551 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:01:29] (03Merged) 10jenkins-bot: linkrecommendation: Update vendor modules for T340843 [deployment-charts] - 10https://gerrit.wikimedia.org/r/953551 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:03:07] 10SRE, 10Infrastructure-Foundations, 10netops: Use mgmt_junos on all network devices - https://phabricator.wikimedia.org/T327862 (10ayounsi) 05Open→03Resolved Nevermind, still doesn't work on the fasw. 
[12:03:18] (03PS1) 10Ilias Sarantopoulos: ores-extension: fix thresholds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953590 (https://phabricator.wikimedia.org/T343308) [12:03:33] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:04:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T343718)', diff saved to https://phabricator.wikimedia.org/P52077 and previous config saved to /var/cache/conftool/dbconfig/20230830-120415-ladsgroup.json [12:04:24] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:05:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1220 (T344589)', diff saved to https://phabricator.wikimedia.org/P52078 and previous config saved to /var/cache/conftool/dbconfig/20230830-120511-ladsgroup.json [12:05:30] (03PS1) 10Ayounsi: gNMI: don't use mgmt_junos on asw1-eqsin and fasw [homer/public] - 10https://gerrit.wikimedia.org/r/953591 (https://phabricator.wikimedia.org/T327862) [12:06:40] (03CR) 10Ladsgroup: [C: 03+1] ores-extension: fix thresholds (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953590 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [12:06:55] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:07:17] (03CR) 10Ayounsi: [C: 03+2] gNMI: don't use mgmt_junos on asw1-eqsin and fasw [homer/public] - 10https://gerrit.wikimedia.org/r/953591 (https://phabricator.wikimedia.org/T327862) (owner: 10Ayounsi) [12:07:41] jouncebot: nowandnext [12:07:41] No deployments scheduled for the next 0 hour(s) and 52 minute(s) [12:07:41] In 0 hour(s) and 52 minute(s): UTC 
afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230830T1300) [12:07:44] (03CR) 10Muehlenhoff: "Looks good, one comment inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/953582 (owner: 10Slyngshede) [12:07:48] isaranto: let's do it [12:07:53] (03Merged) 10jenkins-bot: gNMI: don't use mgmt_junos on asw1-eqsin and fasw [homer/public] - 10https://gerrit.wikimedia.org/r/953591 (https://phabricator.wikimedia.org/T327862) (owner: 10Ayounsi) [12:08:12] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/953585 (owner: 10Slyngshede) [12:08:16] isaranto: can you fix the verylikelybad I mentioned? [12:08:19] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:08:25] possibly just set it to null [12:08:36] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye [12:08:47] (03CR) 10Jbond: tox: add python 3.11 (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/953294 (owner: 10Jbond) [12:09:20] (03PS2) 10Alexandros Kosiaris: linkrecommendation: Remove all hardcoded dbproxy networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/953554 (https://phabricator.wikimedia.org/T340843) [12:09:23] Amir1: I'm looking into it [12:09:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1033.eqiad.wmnet [12:10:16] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1006.eqiad.wmnet with OS bullseye [12:10:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] linkrecommendation: Remove all hardcoded dbproxy networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/953554 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:10:34] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
cloudservices1006.eqiad.wmnet with OS bullseye [12:11:01] (03PS1) 10Muehlenhoff: Failover IDP to idp2002 [dns] - 10https://gerrit.wikimedia.org/r/953592 [12:11:03] (03Merged) 10jenkins-bot: linkrecommendation: Remove all hardcoded dbproxy networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/953554 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:12:11] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:12:36] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1006.eqiad.wmnet with OS bullseye [12:12:45] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10ayounsi) Before running homer, the cookbook needs to call the `sre.network.tls` cookbook with the device's name as parameter to add the TLS cert... [12:12:58] (03PS1) 10Jcrespo: Add abran to the list of privileged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/953593 (https://phabricator.wikimedia.org/T345241) [12:13:15] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices1006.eqiad.wmnet with reason: host reimage [12:13:35] (03CR) 10Abijeet Patro: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/953216 (owner: 10Abijeet Patro) [12:13:39] (03CR) 10CI reject: [V: 04-1] Add abran to the list of privileged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/953593 (https://phabricator.wikimedia.org/T345241) (owner: 10Jcrespo) [12:14:00] (03CR) 10CI reject: [V: 04-1] Enable MinT translation service for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953216 (owner: 10Abijeet Patro) [12:14:02] (03PS2) 10Jcrespo: admin: Add abran to the list of privileged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/953593 (https://phabricator.wikimedia.org/T345241) [12:14:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P52079 and previous config saved to /var/cache/conftool/dbconfig/20230830-121433-ladsgroup.json [12:14:35] (03CR) 10Ayounsi: [C: 03+2] Add gNMI based telemetry collection using gNMIc [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [12:14:49] (03CR) 10CI reject: [V: 04-1] admin: Add abran to the list of privileged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/953593 (https://phabricator.wikimedia.org/T345241) (owner: 10Jcrespo) [12:15:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1033.eqiad.wmnet [12:15:27] (03CR) 10Ilias Sarantopoulos: ores-extension: fix thresholds (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953590 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [12:15:38] (03PS3) 10Jcrespo: admin: Add abran to the list of privileged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/953593 (https://phabricator.wikimedia.org/T345241) [12:15:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1033.eqiad.wmnet [12:16:07] (03CR) 10Ladsgroup: [C: 03+2] 
ores-extension: fix thresholds (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953590 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [12:16:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953590 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [12:16:43] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1006.eqiad.wmnet with reason: host reimage [12:16:46] (03Merged) 10jenkins-bot: ores-extension: fix thresholds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953590 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [12:17:00] (03CR) 10Nikerabbit: Enable MinT translation service for testwiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953216 (owner: 10Abijeet Patro) [12:17:17] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:953590|ores-extension: fix thresholds (T343308)]] [12:17:23] T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 [12:17:24] (03PS3) 10Elukey: ml-services: tune knative's container concurrency settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/953578 (https://phabricator.wikimedia.org/T344058) [12:17:31] (03CR) 10Ladsgroup: [C: 03+1] admin: Add abran to the list of privileged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/953593 (https://phabricator.wikimedia.org/T345241) (owner: 10Jcrespo) [12:18:14] (03CR) 10Elukey: [C: 03+2] admin_ng: raise knative's container-concurrency-target-percentage to 85 [deployment-charts] - 10https://gerrit.wikimedia.org/r/953574 (https://phabricator.wikimedia.org/T344058) (owner: 10Elukey) [12:18:19] (03PS3) 10Abijeet Patro: Enable MinT translation service for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953216 
[12:18:22] (03CR) 10Abijeet Patro: Enable MinT translation service for testwiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953216 (owner: 10Abijeet Patro) [12:18:24] (03PS4) 10Elukey: ml-services: tune knative's container concurrency settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/953578 (https://phabricator.wikimedia.org/T344058) [12:18:33] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:19:01] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1007.eqiad.wmnet [12:19:08] !log ladsgroup@deploy1002 isaranto and ladsgroup: Backport for [[gerrit:953590|ores-extension: fix thresholds (T343308)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [12:19:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P52080 and previous config saved to /var/cache/conftool/dbconfig/20230830-121921-ladsgroup.json [12:19:33] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:19:53] !log ladsgroup@deploy1002 isaranto and ladsgroup: Continuing with sync [12:20:53] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:21:51] (03CR) 10Jbond: puppetdb: optimize query (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/953295 (owner: 10Jbond) [12:22:05] (03CR) 10Filippo Giunchedi: "recheck" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [12:22:45] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:23:39] (03PS1) 10Arturo Borrero Gonzalez: cloudservices: enable ns0-next.openstack.eqiad1.wikimediacloud.org [puppet] - 10https://gerrit.wikimedia.org/r/953595 (https://phabricator.wikimedia.org/T345240) [12:24:49] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:24:57] (03PS2) 10Arturo Borrero Gonzalez: cloudservices: enable ns0-next.openstack.eqiad1.wikimediacloud.org [puppet] - 10https://gerrit.wikimedia.org/r/953595 (https://phabricator.wikimedia.org/T345240) [12:25:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1007.eqiad.wmnet [12:25:12] (03CR) 10Arturo Borrero Gonzalez: "check-experimental" [puppet] - 10https://gerrit.wikimedia.org/r/953595 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [12:26:44] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:27:06] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:27:47] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:28:01] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/953595 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [12:28:09] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:28:57] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [12:29:06] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[12:29:27] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:29:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T343718)', diff saved to https://phabricator.wikimedia.org/P52081 and previous config saved to /var/cache/conftool/dbconfig/20230830-122940-ladsgroup.json [12:29:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [12:29:47] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:29:51] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1008.eqiad.wmnet [12:29:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [12:30:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T343718)', diff saved to https://phabricator.wikimedia.org/P52082 and previous config saved to /var/cache/conftool/dbconfig/20230830-123001-ladsgroup.json [12:31:03] (03PS2) 10Slyngshede: Additional shell account name validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/953582 [12:31:13] (03PS3) 10Arturo Borrero Gonzalez: cloudservices: enable ns0-next.openstack.eqiad1.wikimediacloud.org [puppet] - 10https://gerrit.wikimedia.org/r/953595 (https://phabricator.wikimedia.org/T345240) [12:32:00] (03CR) 10Muehlenhoff: [C: 03+2] Failover IDP to idp2002 [dns] - 10https://gerrit.wikimedia.org/r/953592 (owner: 10Muehlenhoff) [12:32:03] (03CR) 10Slyngshede: Additional shell account name validation. 
(1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/953582 (owner: Slyngshede) [12:32:29] PROBLEM - Check systemd state on netflow4002 is CRITICAL: CRITICAL - degraded: The following units failed: gnmic.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:35] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/953595 (https://phabricator.wikimedia.org/T345240) (owner: Arturo Borrero Gonzalez) [12:32:37] (CR) Muehlenhoff: [C: +1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/953582 (owner: Slyngshede) [12:32:45] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:33:01] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:14] (PS1) Jbond: gnmi: Add require on files [puppet] - https://gerrit.wikimedia.org/r/953598 [12:33:20] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/953593 (https://phabricator.wikimedia.org/T345241) (owner: Jcrespo) [12:33:37] (CR) CI reject: [V: -1] gnmi: Add require on files [puppet] - https://gerrit.wikimedia.org/r/953598 (owner: Jbond) [12:33:55] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:33:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet [12:34:04] (PS2) Jbond: gnmi: Add require on files [puppet] - https://gerrit.wikimedia.org/r/953598 [12:34:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to 
https://phabricator.wikimedia.org/P52083 and previous config saved to /var/cache/conftool/dbconfig/20230830-123427-ladsgroup.json [12:34:46] (03PS1) 10Joal: Fix hadoop-yarn log aggregation compression [puppet] - 10https://gerrit.wikimedia.org/r/953599 [12:34:53] stevemunene: --^ [12:35:20] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Additional shell account name validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/953582 (owner: 10Slyngshede) [12:35:33] (03CR) 10Slyngshede: [C: 03+2] P:IDM Enable addition validator. [puppet] - 10https://gerrit.wikimedia.org/r/953585 (owner: 10Slyngshede) [12:36:39] (03CR) 10Ayounsi: [C: 03+1] gnmi: Add require on files [puppet] - 10https://gerrit.wikimedia.org/r/953598 (owner: 10Jbond) [12:37:47] (03PS2) 10ArielGlenn: dumps: Advertise growthmentorship dumps from index.html [puppet] - 10https://gerrit.wikimedia.org/r/953348 (owner: 10Urbanecm) [12:37:50] (03CR) 10Ssingh: "This is ready to be merged. Let me know if I should find a window in the deployment calendar or if this can be merged as-is like the last " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704) (owner: 10Ssingh) [12:37:55] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1008.eqiad.wmnet [12:38:06] jouncebot: nowandnext [12:38:07] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [12:38:07] In 0 hour(s) and 21 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230830T1300) [12:38:11] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1009.eqiad.wmnet [12:38:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704) (owner: 10Ssingh) [12:38:33] (03CR) 10ArielGlenn: [C: 03+2] dumps: Advertise growthmentorship dumps from index.html [puppet] 
- 10https://gerrit.wikimedia.org/r/953348 (owner: 10Urbanecm) [12:38:37] (03CR) 10Ayounsi: [C: 03+2] gnmi: Add require on files [puppet] - 10https://gerrit.wikimedia.org/r/953598 (owner: 10Jbond) [12:39:02] (03PS1) 10Elukey: admin_ng: increase resource quotas for ml-serve's experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/953601 [12:39:13] (03CR) 10Clément Goubert: [C: 03+1] php: Remove obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/953584 (owner: 10Muehlenhoff) [12:39:15] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:39:15] (03Merged) 10jenkins-bot: wmf-config: remove public subnets from reverse-proxy.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704) (owner: 10Ssingh) [12:39:22] whoever is running puppet-merge, go ahead and merge my thing, thanks! [12:39:46] thx! [12:39:57] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:11] no, thank *you* :-) [12:40:47] (03PS1) 10Alexandros Kosiaris: linkrecommendation: Utilize the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/953603 [12:41:27] Amir1: oops, sorry, didn't realize you are deploying too. scap is just taking a while for you it seems? 
[12:41:33] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:41:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] linkrecommendation: Utilize the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/953603 (owner: 10Alexandros Kosiaris) [12:42:04] (03CR) 10Elukey: [C: 03+2] admin_ng: increase resource quotas for ml-serve's experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/953601 (owner: 10Elukey) [12:42:25] (03Merged) 10jenkins-bot: linkrecommendation: Utilize the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/953603 (owner: 10Alexandros Kosiaris) [12:42:43] taavi: yup [12:42:53] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:43:00] mostly done though [12:43:10] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:953590|ores-extension: fix thresholds (T343308)]] (duration: 25m 53s) [12:43:11] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:43:15] T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 [12:43:34] !log taavi@deploy1002 Started scap: Backport for [[gerrit:951591|wmf-config: remove public subnets from reverse-proxy.php (T344704 T329219)]] [12:43:41] T329219: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 [12:43:41] T344704: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 [12:43:50] taavi: thanks for merging this! 
[12:45:00] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1009.eqiad.wmnet [12:45:56] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/output/953595/43063/" [puppet] - 10https://gerrit.wikimedia.org/r/953595 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [12:46:28] (03PS3) 10Jbond: firewall: move conntrack logic to firewall module [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) [12:46:30] (03PS11) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) [12:46:32] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43064/console" [puppet] - 10https://gerrit.wikimedia.org/r/953587 (https://phabricator.wikimedia.org/T341755) (owner: 10Vgutierrez) [12:46:47] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [12:46:48] !log taavi@deploy1002 sukhe and taavi: Backport for [[gerrit:951591|wmf-config: remove public subnets from reverse-proxy.php (T344704 T329219)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [12:47:19] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:24] Special:MyContribs works as expected, syncing [12:47:28] !log taavi@deploy1002 sukhe and taavi: Continuing with sync [12:47:31] (03PS1) 10Ayounsi: gnmi: remove require for wmf-ca-certificates.crt [puppet] - 10https://gerrit.wikimedia.org/r/953604 [12:48:07] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:48:49] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:49:01] (03CR) 10Muehlenhoff: [C: 03+2] php: Remove obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/953584 (owner: 10Muehlenhoff) [12:49:28] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/953604 (owner: 10Ayounsi) [12:49:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T343718)', diff saved to https://phabricator.wikimedia.org/P52084 and previous config saved to /var/cache/conftool/dbconfig/20230830-124933-ladsgroup.json [12:49:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [12:49:39] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:49:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [12:49:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T343718)', diff saved to https://phabricator.wikimedia.org/P52085 and previous config saved to /var/cache/conftool/dbconfig/20230830-124954-ladsgroup.json [12:50:00] (03PS4) 10Arturo Borrero Gonzalez: cloudservices: enable ns0-next.openstack.eqiad1.wikimediacloud.org [puppet] - 10https://gerrit.wikimedia.org/r/953595 (https://phabricator.wikimedia.org/T345240) [12:50:02] (03PS1) 10Arturo Borrero Gonzalez: cloudservices1006: make it aware of ns0-next.openstack.eqiad1.wikimediacloud.org [puppet] - 10https://gerrit.wikimedia.org/r/953605 (https://phabricator.wikimedia.org/T345240) [12:50:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices1006: make it aware of 
ns0-next.openstack.eqiad1.wikimediacloud.org [puppet] - 10https://gerrit.wikimedia.org/r/953605 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [12:51:05] (03CR) 10Ayounsi: [C: 03+2] gnmi: remove require for wmf-ca-certificates.crt [puppet] - 10https://gerrit.wikimedia.org/r/953604 (owner: 10Ayounsi) [12:51:19] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:51:35] (03PS3) 10Filippo Giunchedi: mesh: new configuration version [deployment-charts] - 10https://gerrit.wikimedia.org/r/953575 (https://phabricator.wikimedia.org/T320563) [12:51:37] (03PS3) 10Filippo Giunchedi: mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) [12:52:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T343718)', diff saved to https://phabricator.wikimedia.org/P52086 and previous config saved to /var/cache/conftool/dbconfig/20230830-125206-ladsgroup.json [12:52:21] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:52:43] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:52:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:53:15] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[12:53:23] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:53:29] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: sre.ganeti.makevm: Create machine types - https://phabricator.wikimedia.org/T344972 (10Volans) To which types are you referring to? Do you mean to define some standard setup for VMs? Where would those be defined? [12:53:42] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:54:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1034.eqiad.wmnet [12:54:23] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:54:35] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [12:54:45] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [12:55:02] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:951591|wmf-config: remove public subnets from reverse-proxy.php (T344704 T329219)]] (duration: 11m 28s) [12:55:08] T329219: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 [12:55:09] T344704: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 [12:55:13] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:39] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf,ops for arnaudb - https://phabricator.wikimedia.org/T345241 (10jcrespo) Ldap groups added: ` root@mwmaint1002:~$ ldapsearch -x cn=wmf | grep arnaudb member: uid=arnaudb,ou=people,dc=wikimedia,dc=org ✔ root@mwmaint1002:~$ ldapsearch -x cn=... 
[12:56:47] RECOVERY - Check systemd state on netflow4002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:51] !log restart kubelet on ml-serve1001 to clear prometheus metrics [12:56:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T343718)', diff saved to https://phabricator.wikimedia.org/P52087 and previous config saved to /var/cache/conftool/dbconfig/20230830-125650-ladsgroup.json [12:56:52] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [12:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:00] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:57:20] (03CR) 10Jcrespo: [C: 03+2] admin: Add abran to the list of privileged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/953593 (https://phabricator.wikimedia.org/T345241) (owner: 10Jcrespo) [12:57:21] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:57:30] ^ arnaudb [12:57:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:57:50] ack [12:58:01] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: sre.ganeti.makevm: Create machine types - https://phabricator.wikimedia.org/T344972 (10MoritzMuehlenhoff) Basically when creating a new bastion one wouldn't need to look up the current config, but would be able to simply pass --type bastion which would be a s... 
[12:58:06] (the patch, not the router thingy) [12:58:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1034.eqiad.wmnet [12:58:40] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [12:58:41] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:41] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:59:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1034.eqiad.wmnet [12:59:53] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230830T1300). [13:00:05] TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:11] o/ [13:00:18] * TheresNoTime can self-deploy [13:00:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2034.mgmt.codfw.wmnet with reboot policy FORCED [13:00:21] \o/ [13:00:40] Lucas_WMDE: can you +1 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/951042 ? 
[13:00:49] 👀 [13:00:52] sorry wrong sensory organ [13:00:54] 👂 [13:01:17] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:01:23] (CR) Lucas Werkmeister (WMDE): [C: +1] IS: Enable Phonos on all projects [mediawiki-config] - https://gerrit.wikimedia.org/r/951042 (https://phabricator.wikimedia.org/T336763) (owner: Samtar) [13:01:33] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/951042 (https://phabricator.wikimedia.org/T336763) (owner: Samtar) [13:01:35] TheresNoTime: -1, please exclude 'lockeddown' [13:01:38] aaa [13:01:46] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001" [13:02:01] (PS1) Jbond: firewall: add conntrack require on the active firewall [puppet] - https://gerrit.wikimedia.org/r/953610 (https://phabricator.wikimedia.org/T336497) [13:02:01] (removed the +2) [13:02:08] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw2-22-ulsfo [13:02:08] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device asw2-22-ulsfo [13:02:20] (CR) Majavah: [C: -1] "This needs to exclude 'lockeddown' as we don't want this on loginwiki or votewiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/951042 (https://phabricator.wikimedia.org/T336763) (owner: Samtar) [13:02:34] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001" [13:02:34] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:02:41] what’s special about this setting compared to tons of other 
settings that use 'default' without 'lockeddown'? [13:02:50] is it because it loads an additional extension? [13:02:51] (03CR) 10Jbond: firewall: move conntrack logic to firewall module (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [13:03:03] yes [13:03:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2034.mgmt.codfw.wmnet with reboot policy FORCED [13:03:21] (03CR) 10Andrew Bogott: [C: 03+1] "The pcc output looks OK to me. Unfortunately, though, the designate worker pool is not maintained by puppet, only the template file that " [puppet] - 10https://gerrit.wikimedia.org/r/953595 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [13:03:57] (03PS3) 10Samtar: IS: Enable Phonos on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951042 (https://phabricator.wikimedia.org/T336763) [13:04:19] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:28] !log samtar@deploy1002 backport Cancelled [13:04:29] (03PS16) 10Ilias Sarantopoulos: ores-extension: replace thresholds with numeric values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) [13:04:44] taavi: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/951042 for review, thx [13:04:51] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:04:53] (03CR) 10Majavah: [C: 03+1] IS: Enable Phonos on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951042 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar) [13:05:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/951042 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar) [13:05:11] ta [13:05:27] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:05:31] * Lucas_WMDE tries to find documentation where this could be added [13:05:51] (03CR) 10Nikerabbit: [C: 03+1] Enable MinT translation service for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953216 (owner: 10Abijeet Patro) [13:05:53] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:05:56] (03Merged) 10jenkins-bot: IS: Enable Phonos on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951042 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar) [13:06:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.295 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:06:21] 10SRE, 10Bitu, 10Infrastructure-Foundations: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 (10SLyngshede-WMF) p:05Triage→03Medium [13:06:25] !log samtar@deploy1002 Started scap: Backport for [[gerrit:951042|IS: Enable Phonos on all projects (T336763)]] [13:06:35] T336763: Enable PhonosInlineAudioPlayerMode on all projects - https://phabricator.wikimedia.org/T336763 [13:06:43] (03PS3) 10Slyngshede: Disable user creation on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott) [13:06:43] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:07:09] RECOVERY - Check systemd state on 
cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P52088 and previous config saved to /var/cache/conftool/dbconfig/20230830-130712-ladsgroup.json [13:08:56] !log samtar@deploy1002 samtar: Backport for [[gerrit:951042|IS: Enable Phonos on all projects (T336763)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:08:59] * TheresNoTime testing [13:09:18] (03PS4) 10Slyngshede: Disable user creation on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott) [13:09:21] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:09:29] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2026'] [13:09:37] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2027'] [13:09:40] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2028'] [13:09:44] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2029'] [13:09:47] (03PS1) 10Muehlenhoff: Update to 6.6.11 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/953615 [13:09:53] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2030'] [13:09:57] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade 
firmware for hosts ['kubernetes2031'] [13:10:02] (03PS1) 10Majavah: Disable NearbyPages on lockeddown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953616 [13:10:04] (03PS1) 10Majavah: Disable Collection on lockeddown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953617 [13:10:06] (03PS1) 10Majavah: Disable FileExporter on lockeddown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953618 [13:10:10] !log samtar@deploy1002 samtar: Continuing with sync [13:10:20] ^ could use reviews if either of you have a bit of time [13:10:52] (03PS1) 10Ayounsi: gnmic: set systemd service to type=simple [puppet] - 10https://gerrit.wikimedia.org/r/953620 [13:11:07] (03CR) 10Samtar: [C: 03+1] Disable NearbyPages on lockeddown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953616 (owner: 10Majavah) [13:11:22] (03CR) 10Samtar: [C: 03+1] Disable Collection on lockeddown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953617 (owner: 10Majavah) [13:11:23] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:11:41] (03CR) 10Samtar: [C: 03+1] Disable FileExporter on lockeddown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953618 (owner: 10Majavah) [13:11:54] (03CR) 10Ayounsi: [C: 03+2] gnmic: set systemd service to type=simple [puppet] - 10https://gerrit.wikimedia.org/r/953620 (owner: 10Ayounsi) [13:11:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P52089 and previous config saved to /var/cache/conftool/dbconfig/20230830-131157-ladsgroup.json [13:12:00] (03CR) 10Majavah: [C: 04-1] "Probably need to grant `autocreateaccount`, at least on labswiki, so that new users can log in to wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 
10Andrew Bogott) [13:12:08] (03PS5) 10Slyngshede: Disable user creation on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott) [13:14:11] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:23] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:15:54] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:951042|IS: Enable Phonos on all projects (T336763)]] (duration: 09m 29s) [13:16:00] T336763: Enable PhonosInlineAudioPlayerMode on all projects - https://phabricator.wikimedia.org/T336763 [13:17:55] * TheresNoTime done deploying [13:18:03] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [13:18:25] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:20:12] (03PS6) 10Slyngshede: Disable user creation on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott) [13:20:51] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply security updates - bking@cumin1001 - T344587 [13:20:58] thanks, I'll push out mine then [13:21:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953616 (owner: 10Majavah) [13:21:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/953617 (owner: 10Majavah) [13:21:15] !log jiji@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-codfw [13:21:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953618 (owner: 10Majavah) [13:21:21] !log elukey@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka main-codfw cluster: Reboot kafka nodes [13:21:37] (03CR) 10Slyngshede: Disable user creation on wikitech (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott) [13:22:03] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:05] (03PS7) 10Slyngshede: Disable user creation on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott) [13:22:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P52090 and previous config saved to /var/cache/conftool/dbconfig/20230830-132218-ladsgroup.json [13:22:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:22:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2026'] [13:22:55] 
!log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2027'] [13:22:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2028'] [13:23:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2029'] [13:23:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2030'] [13:23:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2031'] [13:23:12] (03Merged) 10jenkins-bot: Disable NearbyPages on lockeddown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953616 (owner: 10Majavah) [13:23:15] (03CR) 10Muehlenhoff: firewall: move conntrack logic to firewall module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [13:23:52] (03Merged) 10jenkins-bot: Disable Collection on lockeddown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953617 (owner: 10Majavah) [13:24:02] (03Merged) 10jenkins-bot: Disable FileExporter on lockeddown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953618 (owner: 10Majavah) [13:24:03] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:15] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2032'] [13:24:19] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2033'] [13:24:23] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2034'] [13:24:28] !log jhancock@cumin2002 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2035'] [13:24:30] !log taavi@deploy1002 Started scap: Backport for [[gerrit:953616|Disable NearbyPages on lockeddown]], [[gerrit:953617|Disable Collection on lockeddown]], [[gerrit:953618|Disable FileExporter on lockeddown]] [13:24:32] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2036'] [13:24:36] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2037'] [13:25:18] taavi, TheresNoTime: I added some documentation here, idk how many people will see it but at least it theoretically exists now https://wikitech.wikimedia.org/w/index.php?title=Configuration_files&diff=prev&oldid=2106947 [13:25:49] * taavi almost clicks the rollback button when trying to thank the edit [13:25:55] :D [13:25:58] !log taavi@deploy1002 taavi: Backport for [[gerrit:953616|Disable NearbyPages on lockeddown]], [[gerrit:953617|Disable Collection on lockeddown]], [[gerrit:953618|Disable FileExporter on lockeddown]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:26:13] maybe “rollback” should ask for confirmation first [13:26:15] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:57] I think it does [13:27:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P52091 and previous config saved to /var/cache/conftool/dbconfig/20230830-132703-ladsgroup.json [13:27:04] (03PS33) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 
(https://phabricator.wikimedia.org/T265691) [13:27:09] anyways, login and votewiki still load, so syncing [13:27:10] !log taavi@deploy1002 taavi: Continuing with sync [13:27:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:28:08] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [13:28:19] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:07] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:39] (03CR) 10Elukey: [C: 03+1] Increase the kafka-jumbo maximum message size to 10 MB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: 10Btullis) [13:30:21] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [13:30:31] (03PS34) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [13:30:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:59] (03PS1) 10Alexandros Kosiaris: linkrecommendation: Use mesh.deployment 1.2 sextant module [deployment-charts] - 10https://gerrit.wikimedia.org/r/953629 (https://phabricator.wikimedia.org/T340843) [13:31:07] RECOVERY - Check systemd state on cloudweb1004 
is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:40] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:953616|Disable NearbyPages on lockeddown]], [[gerrit:953617|Disable Collection on lockeddown]], [[gerrit:953618|Disable FileExporter on lockeddown]] (duration: 08m 10s) [13:32:59] !log failover ganeti master in eqiad to ganeti1027 [13:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] linkrecommendation: Use mesh.deployment 1.2 sextant module [deployment-charts] - 10https://gerrit.wikimedia.org/r/953629 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [13:33:21] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:33:52] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [13:34:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2032'] [13:34:11] (03Merged) 10jenkins-bot: linkrecommendation: Use mesh.deployment 1.2 sextant module [deployment-charts] - 10https://gerrit.wikimedia.org/r/953629 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [13:34:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2036'] [13:34:18] !log 
jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2033'] [13:34:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2034'] [13:34:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2035'] [13:34:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2037'] [13:35:11] (03PS8) 10Slyngshede: Disable user creation on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott) [13:35:23] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:32] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2038'] [13:35:36] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2039'] [13:35:45] (03PS9) 10Slyngshede: Disable user creation on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott) [13:36:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host doh6001.wikimedia.org with OS bookworm [13:36:11] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:12] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host doh6001.wikimedia.org with OS bookworm 
[13:36:23] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2026'] [13:36:25] PROBLEM - ganeti-wconfd running on ganeti1024 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:36:28] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2027'] [13:36:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2027'] [13:36:54] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2026'] [13:36:57] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2027'] [13:37:03] (03CR) 10Slyngshede: Disable user creation on wikitech (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott) [13:37:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T343718)', diff saved to https://phabricator.wikimedia.org/P52092 and previous config saved to /var/cache/conftool/dbconfig/20230830-133724-ladsgroup.json [13:37:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [13:37:30] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [13:37:39] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2027'] [13:37:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [13:37:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T343718)', diff saved to 
https://phabricator.wikimedia.org/P52093 and previous config saved to /var/cache/conftool/dbconfig/20230830-133745-ladsgroup.json [13:37:46] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [13:38:13] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:14] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322 (10ayounsi) I rolled the certificate to all the cloudsw, cr, and asw devices. I enabled gnmic on all the cloudsw and asw devices. I conf... [13:38:25] (03CR) 10Majavah: Disable user creation on wikitech (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott) [13:38:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:39:56] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [13:40:07] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [13:40:09] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:40:15] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:40:23] 
PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:50] ^ expected, BGP/BFD in drmrs [13:41:09] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [13:41:28] oh wait, I am on on-call today, ha [13:41:55] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [13:41:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T343718)', diff saved to https://phabricator.wikimedia.org/P52094 and previous config saved to /var/cache/conftool/dbconfig/20230830-134157-ladsgroup.json [13:42:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T343718)', diff saved to https://phabricator.wikimedia.org/P52095 and previous config saved to /var/cache/conftool/dbconfig/20230830-134209-ladsgroup.json [13:42:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [13:42:25] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:42:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [13:42:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T343718)', diff saved to https://phabricator.wikimedia.org/P52096 and previous config saved to /var/cache/conftool/dbconfig/20230830-134232-ladsgroup.json [13:42:38] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [13:42:56] (03PS1) 10Alexandros Kosiaris: Depend mesh.configuration:1.4 on mesh.deployment:1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/953630 
[13:43:11] (03PS1) 10AOkoth: vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) [13:43:13] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:22] (03CR) 10CI reject: [V: 04-1] Depend mesh.configuration:1.4 on mesh.deployment:1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/953630 (owner: 10Alexandros Kosiaris) [13:43:54] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be100[34] - https://phabricator.wikimedia.org/T342675 (10Jclark-ctr) a:03Jclark-ctr [13:43:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:44:50] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Eevans) [13:44:52] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [13:44:58] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [13:45:02] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) p:05Triage→03Medium [13:45:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2038'] [13:45:11] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2039'] [13:45:15] RECOVERY - Check systemd state on cloudweb1004 is OK: 
OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:17] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2039'] [13:45:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2039'] [13:46:07] (03PS1) 10Ayounsi: gnmi: allow v6 connectivity [homer/public] - 10https://gerrit.wikimedia.org/r/953632 (https://phabricator.wikimedia.org/T326322) [13:46:21] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [13:47:31] (03PS1) 10Jforrester: Drop experimental mediawiki-dev chart, unused(?) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953633 [13:47:39] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply security updates - bking@cumin1001 - T344587 [13:47:57] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Jhancock.wm) [13:48:03] (03CR) 10Stevemunene: [C: 03+2] Fix hadoop-yarn log aggregation compression [puppet] - 10https://gerrit.wikimedia.org/r/953599 (owner: 10Joal) [13:48:27] (03CR) 10Jforrester: [C: 04-2] "DNM until RelEng confirm it's no longer needed." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/953633 (owner: 10Jforrester) [13:48:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:49:38] !log disabling DHCP snooping on mr1-codfw to test ztp operation [13:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:15] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:41] 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: wdqs1010 unreachable from SSH or DRAC - https://phabricator.wikimedia.org/T344518 (10bking) [13:52:17] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:03] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:55] (03CR) 10Ssingh: dnsrecursor: use validate_cmd for pdns-recursor config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937139 (owner: 10Ssingh) [13:54:27] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T345239 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm restarted the idrac. waiting for okay in T344110 to update firmware and reboot. 
[13:55:07] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:58] (03PS1) 10Ayounsi: gnmi: remove mgmt_junos restriction [homer/public] - 10https://gerrit.wikimedia.org/r/953636 (https://phabricator.wikimedia.org/T326322) [13:56:35] (03CR) 10Ayounsi: [C: 03+2] gnmi: allow v6 connectivity [homer/public] - 10https://gerrit.wikimedia.org/r/953632 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [13:56:49] (03CR) 10Ayounsi: [C: 03+2] gnmi: remove mgmt_junos restriction [homer/public] - 10https://gerrit.wikimedia.org/r/953636 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [13:57:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P52097 and previous config saved to /var/cache/conftool/dbconfig/20230830-135704-ladsgroup.json [13:57:08] (03Merged) 10jenkins-bot: gnmi: allow v6 connectivity [homer/public] - 10https://gerrit.wikimedia.org/r/953632 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [13:57:17] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:21] (03PS2) 10Alexandros Kosiaris: Depend mesh.configuration:1.4 on mesh.deployment:1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/953630 [13:57:23] (03Merged) 10jenkins-bot: gnmi: remove mgmt_junos restriction [homer/public] - 10https://gerrit.wikimedia.org/r/953636 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [13:57:44] (03CR) 10CI reject: [V: 04-1] Depend mesh.configuration:1.4 on mesh.deployment:1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/953630 (owner: 10Alexandros Kosiaris) [13:57:54] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
doh6001.wikimedia.org with reason: host reimage [13:58:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (10VRiley-WMF) titan1002 - racked in D2 U12 [13:59:05] RECOVERY - Host ores2008 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms [14:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230830T1400) [14:00:37] (03PS3) 10Alexandros Kosiaris: Depend mesh.configuration:1.4 on mesh.deployment:1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/953630 [14:02:00] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh6001.wikimedia.org with reason: host reimage [14:02:39] (03PS1) 10Vgutierrez: trafficserver: Set active timeouts to 1h [puppet] - 10https://gerrit.wikimedia.org/r/953638 (https://phabricator.wikimedia.org/T341755) [14:03:41] 10SRE, 10ops-codfw, 10Machine-Learning-Team: Check ores2008's cable - https://phabricator.wikimedia.org/T345233 (10Jhancock.wm) @elukey We reseated the server and switch side of the patch. Looks like it might be the SFP. I've swapped it and the server's pinging. I'm gonna close this for now but please reopen... 
[14:04:15] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:24] 10SRE, 10ops-codfw, 10Machine-Learning-Team: Check ores2008's cable - https://phabricator.wikimedia.org/T345233 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:04:29] PROBLEM - ores_workers_running on ores2008 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [14:06:19] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:20] (03PS1) 10Jbond: puppetserver: prepare to migrate to new infrastructre [puppet] - 10https://gerrit.wikimedia.org/r/953640 (https://phabricator.wikimedia.org/T340739) [14:07:33] RECOVERY - ores_workers_running on ores2008 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [14:07:45] (03CR) 10Jbond: [C: 03+2] puppetserver: prepare to migrate to new infrastructre [puppet] - 10https://gerrit.wikimedia.org/r/953640 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [14:08:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:06] (03CR) 10Jaime Nuche: "Added Jeena to the review, maybe she'll know more about this chart" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953633 (owner: 10Jforrester) [14:09:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T343718)', diff saved to https://phabricator.wikimedia.org/P52098 and previous config saved to /var/cache/conftool/dbconfig/20230830-140938-ladsgroup.json [14:09:45] 
T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:09:51] (03PS1) 10Majavah: hieradata: Remove cloudcumin wm-bot proxy rule [puppet] - 10https://gerrit.wikimedia.org/r/953642 [14:11:17] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43066/console" [puppet] - 10https://gerrit.wikimedia.org/r/953642 (owner: 10Majavah) [14:12:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P52099 and previous config saved to /var/cache/conftool/dbconfig/20230830-141210-ladsgroup.json [14:14:45] 10SRE, 10Traffic: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (10Fabfur) Hi, thanks for reporting this! Can you provide us a full list of sites that shows this behavior? It would be extremely helpful to us to patch that regex(es)... Thanks a lot! [14:14:51] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:10] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10MoritzMuehlenhoff) From a high level view that seems perfectly fine. We initiate non-wiki offboardings from the production networks in a similar manne... 
[14:18:03] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/953643 [14:18:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:20:31] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/953643 (owner: 10Muehlenhoff) [14:21:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P52100 and previous config saved to /var/cache/conftool/dbconfig/20230830-142155-ladsgroup.json [14:22:27] (03PS1) 10Ayounsi: Revert "gnmi: remove mgmt_junos restriction" [homer/public] - 10https://gerrit.wikimedia.org/r/953646 [14:22:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet [14:23:05] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:23:07] (03PS1) 10Ssingh: P:wikidough: add a require on the acmechief setup [puppet] - 10https://gerrit.wikimedia.org/r/953644 [14:23:34] (03CR) 10Jforrester: [C: 04-1] Re-apply "Fix wikifunctions orchestrator not using the service mesh" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953212 (https://phabricator.wikimedia.org/T344998) (owner: 10Jforrester) [14:23:59] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:24:03] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh6001.wikimedia.org with OS bookworm [14:24:09] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43067/console" [puppet] - 10https://gerrit.wikimedia.org/r/953644 (owner: 10Ssingh) [14:24:15] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host doh6001.wikimedia.org with OS bookworm completed: - doh6001 (**PASS**) - Downtimed on Icinga/Al... [14:24:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P52101 and previous config saved to /var/cache/conftool/dbconfig/20230830-142444-ladsgroup.json [14:24:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:25:18] (03PS1) 10Jbond: stie.pp: move server back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/953645 (https://phabricator.wikimedia.org/T340739) [14:25:24] (03CR) 10Ayounsi: [C: 03+2] Revert "gnmi: remove mgmt_junos restriction" [homer/public] - 10https://gerrit.wikimedia.org/r/953646 (owner: 10Ayounsi) [14:25:59] (03Merged) 10jenkins-bot: Revert "gnmi: remove mgmt_junos restriction" [homer/public] - 10https://gerrit.wikimedia.org/r/953646 (owner: 10Ayounsi) [14:26:38] (03CR) 10Jbond: [C: 03+2] stie.pp: move server back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/953645 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [14:27:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T343718)', diff saved to https://phabricator.wikimedia.org/P52102 and previous config saved to /var/cache/conftool/dbconfig/20230830-142716-ladsgroup.json [14:27:18] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance [14:27:20] (03CR) 10JMeybohm: mesh: add tracing support (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:27:22] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:27:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance [14:27:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1222 (T343718)', diff saved to https://phabricator.wikimedia.org/P52103 and previous config saved to /var/cache/conftool/dbconfig/20230830-142737-ladsgroup.json [14:27:55] (03CR) 10JMeybohm: "Please also add a line to modules/mesh/CHANGELOG.md" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:27:58] (03PS1) 10Hnowlan: cassandra-http-gateway: use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/953666 (https://phabricator.wikimedia.org/T300033) [14:29:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:30:51] (03CR) 10David Caro: replica_cnf_api: add envvars backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [14:30:53] !log fab@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [14:33:20] !log fab@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:34:00] !log cmooney@cumin1001 END (PASS) - Cookbook 
sre.network.provision (exit_code=0) for device ssw1-a1-codfw.mgmt.codfw.wmnet [14:34:47] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): spicerack: update spicerack to work with the newer puppet infrastructure - https://phabricator.wikimedia.org/T341496 (10jbond) [14:35:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet [14:37:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P52104 and previous config saved to /var/cache/conftool/dbconfig/20230830-143700-ladsgroup.json [14:37:12] !log fab@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [14:37:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:37:52] !log dbmaint on s4@codfw (T207253) [14:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:57] T207253: Automatically compare a few tables per section between hosts and DC - https://phabricator.wikimedia.org/T207253 [14:38:33] (03PS1) 10Hnowlan: thumbor: Update dependencies to be ready for cert manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/953667 (https://phabricator.wikimedia.org/T300033) [14:39:10] !log fab@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [14:39:18] !log fab@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [14:39:45] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db2175', diff saved to https://phabricator.wikimedia.org/P52105 and previous config saved to /var/cache/conftool/dbconfig/20230830-143950-ladsgroup.json [14:40:20] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10Jclark-ctr) [14:40:47] !log disable bacula backup1002, backup2002 jobs [14:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:55] (03PS2) 10Hnowlan: service: add media-analytics service entry [puppet] - 10https://gerrit.wikimedia.org/r/951901 (https://phabricator.wikimedia.org/T336380) [14:41:09] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:58] !log fab@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [14:42:00] (03CR) 10Muehlenhoff: firewall: move conntrack logic to firewall module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [14:42:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet [14:42:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1024.eqiad.wmnet [14:42:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:42:44] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10Jclark-ctr) moss-be1003 F 3 , U 3. 
Port 3 cableid 202220224 [14:43:00] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10Jclark-ctr) [14:48:17] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster search_codfw: apply security updates - bking@cumin1001 - T344587 [14:49:07] !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster search_codfw: apply security updates - bking@cumin1001 - T344587 [14:49:15] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply security updates - bking@cumin1001 - T344587 [14:50:40] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10kamila) [14:51:18] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [14:51:19] (03PS2) 10Muehlenhoff: Update to 6.6.11 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/953615 [14:51:20] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [14:51:35] (03PS1) 10Ayounsi: POP L3 switches: use mgmt IP as primary [homer/public] - 10https://gerrit.wikimedia.org/r/953671 [14:51:37] (03PS1) 10Jbond: puppetserver1002: host yaml should be host not fqdn [puppet] - 10https://gerrit.wikimedia.org/r/953670 [14:52:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P52106 and previous config saved to /var/cache/conftool/dbconfig/20230830-145205-ladsgroup.json [14:52:09] (03CR) 10Jbond: [C: 03+2] puppetserver1002: host yaml should be host not fqdn [puppet] - 10https://gerrit.wikimedia.org/r/953670 
(owner: 10Jbond) [14:52:47] (03CR) 10Cathal Mooney: [C: 03+1] POP L3 switches: use mgmt IP as primary [homer/public] - 10https://gerrit.wikimedia.org/r/953671 (owner: 10Ayounsi) [14:52:51] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:53:03] (03CR) 10Ayounsi: [C: 03+2] POP L3 switches: use mgmt IP as primary [homer/public] - 10https://gerrit.wikimedia.org/r/953671 (owner: 10Ayounsi) [14:53:51] (03Merged) 10jenkins-bot: POP L3 switches: use mgmt IP as primary [homer/public] - 10https://gerrit.wikimedia.org/r/953671 (owner: 10Ayounsi) [14:54:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T343718)', diff saved to https://phabricator.wikimedia.org/P52107 and previous config saved to /var/cache/conftool/dbconfig/20230830-145457-ladsgroup.json [14:55:08] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:55:37] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "POP switches - ayounsi@cumin1001" [14:56:33] (03PS2) 10Vgutierrez: trafficserver: Set active timeouts to 1h in upload [puppet] - 10https://gerrit.wikimedia.org/r/953638 (https://phabricator.wikimedia.org/T341755) [14:56:42] (03PS1) 10Muehlenhoff: Update links to create an account and password reset to point to Bitu [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/953673 (https://phabricator.wikimedia.org/T338008) [14:56:45] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "POP switches - ayounsi@cumin1001" [14:57:30] (03CR) 10Jelto: "looks mostly good but we have vrts1001 and vrts1002 as active_hosts with this change and the puppet code was not intended for two active h" [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [14:58:11] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/cas-overlay-template] - 
10https://gerrit.wikimedia.org/r/953615 (owner: 10Muehlenhoff) [14:59:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T343718)', diff saved to https://phabricator.wikimedia.org/P52108 and previous config saved to /var/cache/conftool/dbconfig/20230830-145918-ladsgroup.json [14:59:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (10Jclark-ctr) titan1001 F 1 , U 25 Port 7 Cableid 20220255 [14:59:27] 10SRE, 10ops-codfw, 10Machine-Learning-Team: Check ores2008's cable - https://phabricator.wikimedia.org/T345233 (10elukey) @Jhancock.wm thanks a lot! [15:00:18] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device ssw1-a1-codfw.mgmt.codfw.wmnet [15:00:35] (03CR) 10Elukey: LiftWing: add latency/availability SLO dashboards (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [15:04:04] (03PS2) 10Ssingh: P:wikidough: add a require on the acmechief setup [puppet] - 10https://gerrit.wikimedia.org/r/953644 [15:04:31] (03CR) 10Vgutierrez: [C: 03+1] P:wikidough: add a require on the acmechief setup [puppet] - 10https://gerrit.wikimedia.org/r/953644 (owner: 10Ssingh) [15:04:59] (03PS3) 10Ssingh: P:wikidough: add a require on the acmechief setup [puppet] - 10https://gerrit.wikimedia.org/r/953644 [15:05:08] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Update to 6.6.11 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/953615 (owner: 10Muehlenhoff) [15:05:22] 10SRE, 10serviceops, 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10kamila) [15:05:52] (03PS1) 10Cathal Mooney: Modify Juniper ZTP script used during initial provision [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) [15:06:29] 
(03PS1) 10JMeybohm: jaeger: Configure ingress using istio CRD [deployment-charts] - 10https://gerrit.wikimedia.org/r/953675 (https://phabricator.wikimedia.org/T344253) [15:06:31] (03PS1) 10Majavah: icinga: remove PAWS check [puppet] - 10https://gerrit.wikimedia.org/r/953676 [15:06:42] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10kamila) [15:06:52] (03PS1) 10Ayounsi: sflow: use loopback IP explicitely [homer/public] - 10https://gerrit.wikimedia.org/r/953677 [15:07:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P52109 and previous config saved to /var/cache/conftool/dbconfig/20230830-150709-ladsgroup.json [15:07:21] (03CR) 10Ayounsi: [C: 03+2] sflow: use loopback IP explicitely [homer/public] - 10https://gerrit.wikimedia.org/r/953677 (owner: 10Ayounsi) [15:07:23] (03CR) 10Ssingh: [C: 03+2] P:wikidough: add a require on the acmechief setup [puppet] - 10https://gerrit.wikimedia.org/r/953644 (owner: 10Ssingh) [15:07:59] (03Merged) 10jenkins-bot: sflow: use loopback IP explicitely [homer/public] - 10https://gerrit.wikimedia.org/r/953677 (owner: 10Ayounsi) [15:08:06] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/953673 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [15:08:11] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host doh4001.wikimedia.org with OS bookworm [15:08:22] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host doh4001.wikimedia.org with OS bookworm [15:08:49] 10ops-codfw: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T345266 (10phaultfinder) [15:09:50] PROBLEM - BGP status on 
cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:10:44] (03PS3) 10Vgutierrez: trafficserver: Set active timeouts to 1h in upload [puppet] - 10https://gerrit.wikimedia.org/r/953638 (https://phabricator.wikimedia.org/T341755) [15:10:46] (03PS1) 10Vgutierrez: varnish: Increase send_timeout in upload [puppet] - 10https://gerrit.wikimedia.org/r/953678 (https://phabricator.wikimedia.org/T341755) [15:11:52] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw1-by27-esams [15:12:08] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-by27-esams [15:12:37] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43068/console" [puppet] - 10https://gerrit.wikimedia.org/r/953678 (https://phabricator.wikimedia.org/T341755) (owner: 10Vgutierrez) [15:13:00] (03PS2) 10JMeybohm: jaeger: Configure ingress using istio CRD [deployment-charts] - 10https://gerrit.wikimedia.org/r/953675 (https://phabricator.wikimedia.org/T344253) [15:14:12] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43069/console" [puppet] - 10https://gerrit.wikimedia.org/r/953638 (https://phabricator.wikimedia.org/T341755) (owner: 10Vgutierrez) [15:14:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P52110 and previous config saved to /var/cache/conftool/dbconfig/20230830-151424-ladsgroup.json [15:15:53] (03PS1) 10BCornwall: Remove most knams references/comments [dns] - 10https://gerrit.wikimedia.org/r/953681 [15:16:15] (03PS1) 10Ayounsi: sre.network.tls: wrong cert field in log [cookbooks] - 10https://gerrit.wikimedia.org/r/953682 [15:17:40] !log cmooney@cumin1001 START 
- Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [15:17:42] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:17:44] (03CR) 10Arturo Borrero Gonzalez: firewall: move conntrack logic to firewall module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [15:18:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:19:11] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:19:23] (03CR) 10Ayounsi: [C: 03+2] sre.network.tls: wrong cert field in log [cookbooks] - 10https://gerrit.wikimedia.org/r/953682 (owner: 10Ayounsi) [15:21:18] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10thcipriani) >>! In T342535#9123141, @Mabualruz wrote: >>>! In T342535#9097588, @thcipriani wrote: >> @Mabualruz I can't remember... 
[15:21:50] (03Merged) 10jenkins-bot: sre.network.tls: wrong cert field in log [cookbooks] - 10https://gerrit.wikimedia.org/r/953682 (owner: 10Ayounsi) [15:23:02] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw1-bw27-esams [15:23:18] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-bw27-esams [15:23:25] (03CR) 10BCornwall: [C: 03+1] varnish: Increase send_timeout in upload [puppet] - 10https://gerrit.wikimedia.org/r/953678 (https://phabricator.wikimedia.org/T341755) (owner: 10Vgutierrez) [15:23:34] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw1-b12-drmrs [15:23:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b12-drmrs [15:24:01] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw1-b13-drmrs [15:24:19] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b13-drmrs [15:24:38] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka main-codfw cluster: Reboot kafka nodes [15:25:44] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:26:42] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:26:56] (03PS3) 10Hnowlan: service: add media-analytics service entry [puppet] - 10https://gerrit.wikimedia.org/r/951901 (https://phabricator.wikimedia.org/T336380) [15:27:38] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:27:40] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh4001.wikimedia.org with reason: host reimage [15:29:20] (03PS1) 10Jbond: cluster::managment: add ssh fingerprints 
for new puppetservers [puppet] - 10https://gerrit.wikimedia.org/r/953683 (https://phabricator.wikimedia.org/T340739) [15:29:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P52111 and previous config saved to /var/cache/conftool/dbconfig/20230830-152931-ladsgroup.json [15:29:42] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "I like the idea! I am not sure about the targets so we can revisit (As discussed, these will most probably be the result of load testing)." [deployment-charts] - 10https://gerrit.wikimedia.org/r/953578 (https://phabricator.wikimedia.org/T344058) (owner: 10Elukey) [15:30:13] (03CR) 10BCornwall: [C: 03+1] trafficserver: Allow configuring transaction_active_timeout_in [puppet] - 10https://gerrit.wikimedia.org/r/953587 (https://phabricator.wikimedia.org/T341755) (owner: 10Vgutierrez) [15:30:40] (03CR) 10BBlack: [C: 03+1] trafficserver: Allow configuring transaction_active_timeout_in [puppet] - 10https://gerrit.wikimedia.org/r/953587 (https://phabricator.wikimedia.org/T341755) (owner: 10Vgutierrez) [15:31:00] (03CR) 10BBlack: [C: 03+1] trafficserver: Set active timeouts to 1h in upload [puppet] - 10https://gerrit.wikimedia.org/r/953638 (https://phabricator.wikimedia.org/T341755) (owner: 10Vgutierrez) [15:31:09] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10kamila) p:05Triage→03Medium [15:31:19] (03CR) 10BBlack: [C: 03+2] varnish: Increase send_timeout in upload [puppet] - 10https://gerrit.wikimedia.org/r/953678 (https://phabricator.wikimedia.org/T341755) (owner: 10Vgutierrez) [15:31:36] (03CR) 10BBlack: [C: 03+1] varnish: Increase send_timeout in upload [puppet] - 10https://gerrit.wikimedia.org/r/953678 (https://phabricator.wikimedia.org/T341755) (owner: 10Vgutierrez) [15:31:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on doh4001.wikimedia.org with reason: host reimage [15:32:06] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster2002.codfw.wmnet [15:33:24] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: use modern recursor setting for cloudservices1006 [puppet] - 10https://gerrit.wikimedia.org/r/953685 (https://phabricator.wikimedia.org/T345240) [15:33:57] (03PS4) 10Jbond: firewall: move conntrack logic to firewall module [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) [15:33:59] (03PS12) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) [15:34:01] (03PS2) 10Jbond: firewall: add conntrack require on the active firewall [puppet] - 10https://gerrit.wikimedia.org/r/953610 (https://phabricator.wikimedia.org/T336497) [15:34:29] (03CR) 10FNegri: [C: 03+1] "Yes this can be removed now!" [puppet] - 10https://gerrit.wikimedia.org/r/953642 (owner: 10Majavah) [15:34:47] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ssingh) [15:35:10] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Allow configuring transaction_active_timeout_in [puppet] - 10https://gerrit.wikimedia.org/r/953587 (https://phabricator.wikimedia.org/T341755) (owner: 10Vgutierrez) [15:35:40] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:36:33] (03CR) 10Alexandros Kosiaris: Update modules/README.md (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953553 (owner: 10Alexandros Kosiaris) [15:36:41] (03PS2) 10Alexandros Kosiaris: Update modules/README.md [deployment-charts] - 10https://gerrit.wikimedia.org/r/953553 [15:36:58] (03PS4) 10Ebernhardson: Draft: 
cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 [15:37:02] (03CR) 10Majavah: [V: 03+1 C: 03+2] hieradata: Remove cloudcumin wm-bot proxy rule [puppet] - 10https://gerrit.wikimedia.org/r/953642 (owner: 10Majavah) [15:37:40] (03CR) 10CI reject: [V: 04-1] Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (owner: 10Ebernhardson) [15:37:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10Jclark-ctr) dbstore1008 E 2. U 41. Port 38 Cableid 230304500161 dbstore1009. F 2. U 40. Port. 39 Cableid 230304500156 [15:38:42] PROBLEM - Host db1201 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:38:49] woot [15:38:54] hi [15:38:54] ouch [15:39:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1201', diff saved to https://phabricator.wikimedia.org/P52112 and previous config saved to /var/cache/conftool/dbconfig/20230830-153915-root.json [15:39:18] Depooled [15:39:20] marostegui: I can depool [15:39:20] ok [15:39:21] thanks [15:39:28] need me? in an interview [15:39:45] Nah all fine [15:39:49] I will take care of it [15:40:14] I can fill in for c.laime if that changes - feel free to ping me please [15:40:20] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10Mabualruz) >>! In T342535#9131286, @thcipriani wrote: > Yes please! Fill out the form here to make a task, and I'll get you on th... 
[15:40:36] marostegui: hth if I can (on on-call) [15:40:43] I see nothing in getsel [15:40:58] PROBLEM - Check systemd state on elastic2065 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:24] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:32] marostegui: want me to file a task? [15:41:40] sukhe: already done [15:41:42] (SystemdUnitFailed) firing: elasticsearch-disable-readahead.service Failed on elastic2065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:41:46] ok :) [15:41:55] (03PS1) 10Marostegui: db1201: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/953686 [15:42:48] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:54] (03CR) 10Jbond: "thanks see inline suggest we move discussion to https://gerrit.wikimedia.org/r/c/operations/puppet/+/953610" [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [15:42:56] (03CR) 10Marostegui: [C: 03+2] db1201: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/953686 (owner: 10Marostegui) [15:43:21] (03CR) 10Jbond: [C: 03+2] cluster::managment: add ssh fingerprints for new puppetservers [puppet] - 10https://gerrit.wikimedia.org/r/953683 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [15:43:47] (03PS1) 10Bking: rdf-streaming-updater: Update egress rules for ZK [deployment-charts] - 10https://gerrit.wikimedia.org/r/953688 (https://phabricator.wikimedia.org/T344614) 
[15:43:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:44:17] (03CR) 10FNegri: [C: 04-1] Openstack: remove support for Stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952848 (owner: 10Muehlenhoff) [15:44:20] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host kubemaster2002.codfw.wmnet [15:44:30] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: Update egress rules for ZK [deployment-charts] - 10https://gerrit.wikimedia.org/r/953688 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [15:44:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T343718)', diff saved to https://phabricator.wikimedia.org/P52113 and previous config saved to /var/cache/conftool/dbconfig/20230830-154437-ladsgroup.json [15:44:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [15:44:43] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [15:44:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [15:44:56] PROBLEM - Check systemd state on kubemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:03] 10ops-eqiad, 10DBA: db1201 crashed - https://phabricator.wikimedia.org/T345271 (10Marostegui) The host looks reachable from the idrac, but not from anywhere else. Looks like a networking issue: ` [167720.577345] tg3 0000:04:00.0 eno1: Link is down root@db1201:~# ethtool eno1 | grep Link Link detected: no `... 
[15:47:06] (03PS35) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [15:47:10] (03CR) 10Peter Fischer: rdf-streaming-updater: Update egress rules for ZK (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953688 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [15:47:13] 10ops-eqiad, 10DBA: db1201 network down - https://phabricator.wikimedia.org/T345271 (10Marostegui) [15:47:38] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:48:15] (03PS2) 10Bking: rdf-streaming-updater: Update egress rules for ZK [deployment-charts] - 10https://gerrit.wikimedia.org/r/953688 (https://phabricator.wikimedia.org/T344614) [15:48:28] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:48:57] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: Update egress rules for ZK [deployment-charts] - 10https://gerrit.wikimedia.org/r/953688 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [15:49:06] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2038-production-search-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:49:12] RECOVERY - Check systemd state on kubemaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:18] PROBLEM - Check systemd state on elastic2038 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:29] !log 
sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh4001.wikimedia.org with OS bookworm [15:49:30] (03PS3) 10Bking: rdf-streaming-updater: Update egress rules for ZK [deployment-charts] - 10https://gerrit.wikimedia.org/r/953688 (https://phabricator.wikimedia.org/T344614) [15:49:39] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host doh4001.wikimedia.org with OS bookworm completed: - doh4001 (**PASS**) - Downtimed on Icinga/Al... [15:49:56] (03CR) 10Peter Fischer: [C: 03+1] rdf-streaming-updater: Update egress rules for ZK (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953688 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [15:50:10] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster2001.codfw.wmnet [15:50:50] 10ops-codfw: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T345266 (10phaultfinder) [15:51:42] (SystemdUnitFailed) firing: (2) elasticsearch-disable-readahead.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:51:43] 10SRE, 10Infrastructure-Foundations, 10netops: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (10cmooney) p:05Triage→03Medium [15:52:02] 10SRE, 10Infrastructure-Foundations, 10netops: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (10cmooney) [15:52:07] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10cmooney) [15:52:44] 
PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:53:18] 10SRE, 10Infrastructure-Foundations, 10netops: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (10cmooney) [15:53:28] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: Update egress rules for ZK [deployment-charts] - 10https://gerrit.wikimedia.org/r/953688 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [15:54:06] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2038-production-search-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:54:13] (03Merged) 10jenkins-bot: rdf-streaming-updater: Update egress rules for ZK [deployment-charts] - 10https://gerrit.wikimedia.org/r/953688 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [15:54:30] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2001.codfw.wmnet [15:54:52] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10kamila) [15:55:30] 10SRE, 10MW-on-K8s, 10Observability-Logging, 10serviceops: Keep calculating latencies for MediaWiki requests in the WikiKube environment - https://phabricator.wikimedia.org/T276095 (10kamila) 05Open→03Resolved The remaining Benthos errors are due to T340935, other than that this is working. (I still n... 
[15:56:00] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:56:37] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:56:46] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:56:50] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:57:21] SRE, ChangeProp, EventStreams, Image-Suggestion-API, and 5 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (hnowlan) [15:58:41] (PS1) Andrea Denisse: pontoon: Enroll the netmon-03 host (Bookworm) with the netmon role [puppet] - https://gerrit.wikimedia.org/r/953691 (https://phabricator.wikimedia.org/T344136) [15:58:42] SRE, ops-codfw, Machine-Learning-Team: Check ores2008's cable - https://phabricator.wikimedia.org/T345233 (calbon) @Jhancock.wm Thanks!
[15:59:00] SRE, Infrastructure-Foundations, netops: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (cmooney) [15:59:41] (CR) Filippo Giunchedi: [C: +1] pontoon: Enroll the netmon-03 host (Bookworm) with the netmon role [puppet] - https://gerrit.wikimedia.org/r/953691 (https://phabricator.wikimedia.org/T344136) (owner: Andrea Denisse) [16:00:49] (CR) Andrea Denisse: [C: +2] pontoon: Enroll the netmon-03 host (Bookworm) with the netmon role [puppet] - https://gerrit.wikimedia.org/r/953691 (https://phabricator.wikimedia.org/T344136) (owner: Andrea Denisse) [16:02:46] SRE, Infrastructure-Foundations, netops: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (cmooney) [16:07:56] RECOVERY - Check systemd state on elastic2065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:08:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:08:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:09:09] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:11:42] (SystemdUnitFailed) firing: (2) elasticsearch-disable-readahead.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:12:13] SRE, Infrastructure-Foundations, netops: Juniper
ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (10cmooney) [16:19:09] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:19:57] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply security updates - bking@cumin1001 - T344587 [16:20:34] (03PS1) 10Cathal Mooney: Change DHCP relay function on management routers to 'forward-only' [homer/public] - 10https://gerrit.wikimedia.org/r/953695 (https://phabricator.wikimedia.org/T345273) [16:24:19] jouncebot: nowandnext [16:24:19] No deployments scheduled for the next 0 hour(s) and 35 minute(s) [16:24:19] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10kamila) [16:24:19] In 0 hour(s) and 35 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230830T1700) [16:24:38] (03CR) 10Ladsgroup: [C: 03+2] ores-extension: replace thresholds with numeric values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [16:25:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [16:26:04] (03Merged) 10jenkins-bot: ores-extension: replace thresholds with numeric values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias 
Sarantopoulos) [16:26:30] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:948542|ores-extension: replace thresholds with numeric values (T343308)]] [16:26:36] T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 [16:28:06] !log ladsgroup@deploy1002 ladsgroup and isaranto: Backport for [[gerrit:948542|ores-extension: replace thresholds with numeric values (T343308)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [16:30:25] !log ladsgroup@deploy1002 ladsgroup and isaranto: Continuing with sync [16:36:39] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:948542|ores-extension: replace thresholds with numeric values (T343308)]] (duration: 10m 09s) [16:36:46] T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 [16:37:34] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:38:50] (03CR) 10FNegri: replica_cnf_api: add envvars backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [16:42:34] (KubernetesAPILatency) resolved: (11) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:54:15] (03CR) 10Ayounsi: [C: 03+1] Change DHCP relay function on management routers to 'forward-only' [homer/public] - 10https://gerrit.wikimedia.org/r/953695 
(https://phabricator.wikimedia.org/T345273) (owner: 10Cathal Mooney) [16:56:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:57:04] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (10ayounsi) Could we use `forward-only` everywhere once we move to DHCP option 97 with {T304677} ? [16:59:40] (03PS1) 10BBlack: Fix cache_upload timeouts in single-backend sites [puppet] - 10https://gerrit.wikimedia.org/r/953700 (https://phabricator.wikimedia.org/T288106) [17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230830T1700) [17:01:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:01:39] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (10cmooney) >>! In T345273#9131609, @ayounsi wrote: > Could we use `forward-only` everywhere once we move to DHCP opti... 
[17:12:10] SRE, ops-codfw, serviceops: Decommission thumbor200[34] - https://phabricator.wikimedia.org/T344597 (wiki_willy) a:Jhancock.wm [17:13:52] (CR) Volans: [C: -1] "small thing missing inline" [puppet] - https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: Cathal Mooney) [17:19:56] SRE, Movement-Insights, Traffic: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (nshahquinn-wmf) a:Fabfur→nshahquinn-wmf Yes, definitely. It might take me a few days since I accidentally deleted the code that I used to get the list 😅, but it won't be... [17:20:23] SRE, Movement-Insights, Traffic: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (nshahquinn-wmf) p:Triage→Medium [17:22:12] SRE, Movement-Insights, Traffic: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (Fabfur) If you want in the meantime we can start with this first list of domains and then add the others [17:29:31] SRE, Movement-Insights, Traffic: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (nshahquinn-wmf) @Fabfur I don't think there's any reason to do that. It will be easier for you to do it all at once, and it's already been like this for years without causing any...
[17:37:29] (03CR) 10Vivian Rook: [C: 03+1] icinga: remove PAWS check [puppet] - 10https://gerrit.wikimedia.org/r/953676 (owner: 10Majavah) [17:37:46] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [17:39:56] oh man [17:40:06] (03PS2) 10HMonroy: wikidiff2: set maxSplitSize = 10 by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952940 (https://phabricator.wikimedia.org/T341754) [17:40:06] this doh hosts thing [17:46:00] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host doh2001.wikimedia.org with OS bookworm [17:46:10] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host doh2001.wikimedia.org with OS bookworm [17:47:00] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:47:09] ^ expected [17:47:59] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ssingh) [17:50:06] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:50:06] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:50:16] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:52:02] (03CR) 10Majavah: [C: 03+2] icinga: remove PAWS check [puppet] - 10https://gerrit.wikimedia.org/r/953676 (owner: 
Majavah) [17:53:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:56:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:59:22] (CR) Cathal Mooney: Modify Juniper ZTP script used during initial provision (1 comment) [puppet] - https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: Cathal Mooney) [18:00:05] jeena and dduvall: Dear deployers, time to do the Train log triage with CPT deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230830T1800). [18:00:05] jeena and dduvall: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230830T1800).
[18:01:04] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953708 (https://phabricator.wikimedia.org/T343726) [18:01:06] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953708 (https://phabricator.wikimedia.org/T343726) (owner: 10TrainBranchBot) [18:01:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:01:50] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953708 (https://phabricator.wikimedia.org/T343726) (owner: 10TrainBranchBot) [18:01:59] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh2001.wikimedia.org with reason: host reimage [18:03:15] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster search_eqiad: apply security updates - bking@cumin1001 - T344587 [18:03:15] !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster search_eqiad: apply security updates - bking@cumin1001 - T344587 [18:03:24] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply security updates - bking@cumin1001 - T344587 [18:04:32] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh2001.wikimedia.org with reason: host reimage [18:06:53] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:08:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:08:30] hmmm [18:08:34] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [18:08:50] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.24 refs T343726 [18:08:56] T343726: 1.41.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T343726 [18:09:01] hm, I just merged a icinga change so looking at that warning [18:09:03] taavi: ^ [18:09:09] paws failure [18:11:08] (03PS1) 10Majavah: icinga: Remove PAWS certificate check [puppet] - 10https://gerrit.wikimedia.org/r/953711 [18:11:53] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:12:56] sukhe: looks like I missed a separate check in icinga::certs, fixed in https://gerrit.wikimedia.org/r/953711 [18:13:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:13:27] taavi: looking [18:13:58] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:14:02] (03CR) 10Ssingh: [C: 03+1] 
icinga: Remove PAWS certificate check [puppet] - 10https://gerrit.wikimedia.org/r/953711 (owner: 10Majavah) [18:14:13] (03CR) 10Majavah: [C: 03+2] icinga: Remove PAWS certificate check [puppet] - 10https://gerrit.wikimedia.org/r/953711 (owner: 10Majavah) [18:15:04] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.24 refs T343726 (duration: 06m 13s) [18:15:10] T343726: 1.41.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T343726 [18:16:53] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:17:08] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:18:20] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:18:20] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:18:26] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:18:50] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [18:19:39] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh2001.wikimedia.org with OS bookworm [18:19:49] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host doh2001.wikimedia.org with OS 
bookworm completed: - doh2001 (**PASS**) - Downtimed on Icinga/Al... [18:20:08] there we go, sorry about that [18:22:17] taavi: all good, thanks for resolving [18:25:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: wdqs1010 unreachable from SSH or DRAC - https://phabricator.wikimedia.org/T344518 (10RKemper) a:03Papaul [18:26:15] (03PS2) 10Cathal Mooney: Modify Juniper ZTP script used during initial provision [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) [18:26:39] (03CR) 10CI reject: [V: 04-1] Modify Juniper ZTP script used during initial provision [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [18:27:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: hw troubleshooting: wdqs1010 unreachable from SSH or DRAC - https://phabricator.wikimedia.org/T344518 (10RKemper) [18:28:19] (03PS3) 10Cathal Mooney: Modify Juniper ZTP script used during initial provision [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) [18:28:45] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts wdqs1005.eqiad.wmnet [18:29:40] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:00] PROBLEM - Check systemd state on elastic1088 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:14] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:31:42] (SystemdUnitFailed) firing: (2) elasticsearch-disable-readahead.service Failed on 
elastic1088:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:35:09] (03CR) 10Cathal Mooney: [C: 03+2] Change DHCP relay function on management routers to 'forward-only' [homer/public] - 10https://gerrit.wikimedia.org/r/953695 (https://phabricator.wikimedia.org/T345273) (owner: 10Cathal Mooney) [18:35:44] PROBLEM - MariaDB memory on clouddb1017 is CRITICAL: CRIT Memory 98% used. Largest process: mysqld (1667042) = 60.2% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [18:35:46] (03Merged) 10jenkins-bot: Change DHCP relay function on management routers to 'forward-only' [homer/public] - 10https://gerrit.wikimedia.org/r/953695 (https://phabricator.wikimedia.org/T345273) (owner: 10Cathal Mooney) [18:37:03] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [18:37:04] PROBLEM - Host rdb1010 is DOWN: PING CRITICAL - Packet loss = 100% [18:37:56] RECOVERY - Host rdb1010 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [18:39:29] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - cmooney@cumin1001" [18:41:19] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - cmooney@cumin1001" [18:41:19] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:41:19] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a1-codfw.mgmt.codfw.wmnet [18:49:38] (03PS4) 10Cathal Mooney: Modify Juniper ZTP script used during initial provision [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) [18:57:38] RECOVERY - Check systemd state on elastic1088 
is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:42] (SystemdUnitFailed) firing: (2) elasticsearch-disable-readahead.service Failed on elastic1088:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:12:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:17:01] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts wdqs1005.eqiad.wmnet [19:17:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:19:26] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply security updates - bking@cumin1001 - T344587 [19:27:20] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:29:10] (03PS1) 10Krinkle: clientError: Investigate when mw.util is compromised by third-party script [extensions/WikimediaEvents] (wmf/1.41.0-wmf.24) - 
10https://gerrit.wikimedia.org/r/953647 (https://phabricator.wikimedia.org/T343944) [19:32:36] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:44:52] (03PS1) 10BCornwall: mtail: Use "bad" requests for varnish SLI metrics [puppet] - 10https://gerrit.wikimedia.org/r/953725 (https://phabricator.wikimedia.org/T341606) [19:47:17] (03CR) 10CI reject: [V: 04-1] mtail: Use "bad" requests for varnish SLI metrics [puppet] - 10https://gerrit.wikimedia.org/r/953725 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [19:48:19] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " (due to missing physical file for old image e... - https://phabricator.wikimedia.org/T244567 [19:55:07] (03PS1) 10Majavah: team-wmcs: Add Galera checks [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) [19:58:44] (03PS3) 10HMonroy: Add comment about mirroring of wgMobileUrlTemplate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952925 (https://phabricator.wikimedia.org/T344185) (owner: 10Neil Shah-Quinn (WMF)) [20:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230830T2000). [20:00:06] nsq64 and hmonroy: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:51] I can deploy :) [20:01:28] woo :D [20:01:31] i'd also want to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/953715 , if we have the time (waiting for CI to pass right now) [20:02:44] (03PS1) 10Bartosz Dziewoński: Omit 'target' in the body of review REST API requests [extensions/FlaggedRevs] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/953648 [20:03:08] MatmaRex: let's try it [20:03:19] I'm going to deploy my config change for now [20:03:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hmonroy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952940 (https://phabricator.wikimedia.org/T341754) (owner: 10HMonroy) [20:04:32] (03Merged) 10jenkins-bot: wikidiff2: set maxSplitSize = 10 by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952940 (https://phabricator.wikimedia.org/T341754) (owner: 10HMonroy) [20:05:00] !log hmonroy@deploy1002 Started scap: Backport for [[gerrit:952940|wikidiff2: set maxSplitSize = 10 by default (T341754)]] [20:05:12] T341754: Deploy wikidiff2 paragraph split detection - https://phabricator.wikimedia.org/T341754 [20:06:38] !log hmonroy@deploy1002 hmonroy: Backport for [[gerrit:952940|wikidiff2: set maxSplitSize = 10 by default (T341754)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:07:23] hmonroy: I would suggest +2'ing the FlaggedRevs backport now, to speed up the 'wait for CI' phase [20:08:38] !log hmonroy@deploy1002 hmonroy: Continuing with sync [20:09:31] taavi: first time deploying alone, what is FlaggedRevs? 
[20:10:03] that's the extension MatmaRex's patch is modifying [20:10:22] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Jclark-ctr) Replaced Failed Drive sdc [20:10:32] PROBLEM - cassandra-c SSL 10.64.48.236:7000 on restbase1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:10:32] it lets you "approve" revisions of articles. a few wikipedias have it enabled [20:10:38] PROBLEM - cassandra-b CQL 10.64.48.235:9042 on restbase1030 is CRITICAL: connect to address 10.64.48.235 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [20:11:10] PROBLEM - cassandra-c service on restbase1030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:11:20] PROBLEM - cassandra-b service on restbase1030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:11:31] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/953648 is this the patch that we want to backport? 
[20:11:36] PROBLEM - cassandra-b SSL 10.64.48.235:7000 on restbase1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:11:42] PROBLEM - cassandra-c CQL 10.64.48.236:9042 on restbase1030 is CRITICAL: connect to address 10.64.48.236 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [20:11:52] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-b.service,cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:00] PROBLEM - MD RAID on restbase1030 is CRITICAL: CRITICAL: State: degraded, Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:12:31] RECOVERY - Host db1201 #page is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [20:12:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:13:39] 10SRE, 10ops-eqiad, 10DBA: db1201 network down - https://phabricator.wikimedia.org/T345271 (10Marostegui) Host back up! 
Thanks John [20:13:47] 10SRE, 10ops-eqiad, 10DBA: db1201 network down - https://phabricator.wikimedia.org/T345271 (10Jclark-ctr) Replaced SFP-t Link returned [20:13:59] 10SRE, 10ops-eqiad, 10DBA: db1201 network down - https://phabricator.wikimedia.org/T345271 (10Jclark-ctr) a:03Jclark-ctr [20:14:13] !log hmonroy@deploy1002 Finished scap: Backport for [[gerrit:952940|wikidiff2: set maxSplitSize = 10 by default (T341754)]] (duration: 09m 13s) [20:14:16] 10SRE, 10ops-eqiad, 10DBA: db1201 network down - https://phabricator.wikimedia.org/T345271 (10Jclark-ctr) 05Open→03Resolved [20:14:20] T341754: Deploy wikidiff2 paragraph split detection - https://phabricator.wikimedia.org/T341754 [20:15:55] MatmaRex: is this the patch we want to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/953648? [20:16:16] yes [20:17:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:17:55] kk, i'm going to proceed with your patch. [20:18:06] and fyi, the bug is: on a page like https://test2.wikipedia.org/wiki/Another_Page_Title , the "Accept revision" button at the bottom does not work [20:18:39] i don't think it has been reported yet, i was just lucky to notice it today when reviewing another patch before the buggy change reached any non-test wikis [20:20:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hmonroy@deploy1002 using scap backport" [extensions/FlaggedRevs] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/953648 (owner: 10Bartosz Dziewoński) [20:21:33] Hi. Sorry if this is the wrong place for this. I cannot download exported files (e.g. epub, mobi, pdf) from Wikisource at the moment. I receive a 504 Bad Gateway error. [20:22:25] MatmaRex: since this is going on wmf.24, it'll be available on group2 till tomorrow. 
Is that okay? [20:22:43] yes [20:23:11] kk [20:24:20] (03Merged) 10jenkins-bot: Omit 'target' in the body of review REST API requests [extensions/FlaggedRevs] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/953648 (owner: 10Bartosz Dziewoński) [20:24:49] !log hmonroy@deploy1002 Started scap: Backport for [[gerrit:953648|Omit 'target' in the body of review REST API requests]] [20:25:26] RECOVERY - cassandra-c service on restbase1030 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:25:29] Solemn: does your issue sound the same as https://phabricator.wikimedia.org/T345025 or https://phabricator.wikimedia.org/T335553 ? i'm not familiar with that feature [20:26:10] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:26:25] !log hmonroy@deploy1002 matmarex and hmonroy: Backport for [[gerrit:953648|Omit 'target' in the body of review REST API requests]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:26:42] MatmaRex: Definitely same as the first one. Not sure how to view uptime for this feature (trying to find that is what led me here). [20:26:50] MatmaRex: Please take a look and let me know if it's okay to proceed :) [20:27:13] hmonroy: yup, looks good [20:27:18] !log hmonroy@deploy1002 matmarex and hmonroy: Continuing with sync [20:27:26] I thought it might be an issue with the size of the file, but I tried some smaller works and they are also broken. [20:27:42] Solemn: i don't know anything about it, sorry. i guess you could ask on the task? [20:29:59] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10RobH) [20:32:31] Thanks MatmaRex. 
[20:33:08] !log hmonroy@deploy1002 Finished scap: Backport for [[gerrit:953648|Omit 'target' in the body of review REST API requests]] (duration: 08m 18s) [20:33:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:34:21] MatmaRex: your change has been backported [20:34:29] thanks! [20:34:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hmonroy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952925 (https://phabricator.wikimedia.org/T344185) (owner: 10Neil Shah-Quinn (WMF)) [20:35:08] (03PS4) 10HMonroy: Add comment about mirroring of wgMobileUrlTemplate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952925 (https://phabricator.wikimedia.org/T344185) (owner: 10Neil Shah-Quinn (WMF)) [20:35:18] you're very welcome! [20:35:42] hmonroy: thank you for running the window! [20:35:57] (03CR) 10TrainBranchBot: "Approved by hmonroy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952925 (https://phabricator.wikimedia.org/T344185) (owner: 10Neil Shah-Quinn (WMF)) [20:36:19] urbanecm: thank you for your support and guidance!! [20:36:38] (03Merged) 10jenkins-bot: Add comment about mirroring of wgMobileUrlTemplate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952925 (https://phabricator.wikimedia.org/T344185) (owner: 10Neil Shah-Quinn (WMF)) [20:36:40] np! 
[20:37:07] !log hmonroy@deploy1002 Started scap: Backport for [[gerrit:952925|Add comment about mirroring of wgMobileUrlTemplate (T344185)]] [20:37:13] T344185: Add a comment to wgMobileUrlTemplate stating that downstream users should be notified of updates - https://phabricator.wikimedia.org/T344185 [20:38:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:44:18] !log hmonroy@deploy1002 Finished scap: Backport for [[gerrit:952925|Add comment about mirroring of wgMobileUrlTemplate (T344185)]] (duration: 07m 11s) [20:44:25] T344185: Add a comment to wgMobileUrlTemplate stating that downstream users should be notified of updates - https://phabricator.wikimedia.org/T344185 [20:45:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:50:20] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:00:06] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230830T2100) [21:09:38] (03PS1) 10Krinkle: mediawiki.util: Investigate when mw.util is compromised by third-party script [core] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/953649 
(https://phabricator.wikimedia.org/T343944) [21:09:50] (03PS3) 10Krinkle: mediawiki.util: Investigate when mw.util is compromised by third-party script [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947913 [21:09:54] (03Abandoned) 10Krinkle: mediawiki.util: Investigate when mw.util is compromised by third-party script [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947913 (owner: 10Krinkle) [21:11:12] (03CR) 10Krinkle: [C: 03+2] mediawiki.util: Investigate when mw.util is compromised by third-party script [core] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/953649 (https://phabricator.wikimedia.org/T343944) (owner: 10Krinkle) [21:11:25] (03CR) 10Krinkle: [C: 03+2] clientError: Investigate when mw.util is compromised by third-party script [extensions/WikimediaEvents] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/953647 (https://phabricator.wikimedia.org/T343944) (owner: 10Krinkle) [21:15:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: hw troubleshooting: wdqs1010 unreachable from SSH or DRAC - https://phabricator.wikimedia.org/T344518 (10Papaul) 05Open→03Resolved @bking IDRAC and BIOS updated. All yours. 
As of 10/03/2023 the latest IDRAC version for R430 is iDRAC 2.84.84.84 [21:19:22] (03PS5) 10Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 [21:24:57] (03Merged) 10jenkins-bot: mediawiki.util: Investigate when mw.util is compromised by third-party script [core] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/953649 (https://phabricator.wikimedia.org/T343944) (owner: 10Krinkle) [21:31:13] !log krinkle@deploy1002: running `sudo /usr/local/sbin/fix-staging-perms` to fix permissions under /srv/patches/1.41.0-wmf.24 where 2 of the 3 patch files are read-only by jnuche:deployment [21:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:08] !log krinkle@deploy1002 Started scap: Backport for [[gerrit:953649|mediawiki.util: Investigate when mw.util is compromised by third-party script (T343944)]] [21:34:14] T343944: JavaScript Error on Wikipedia Mobile Sites and Safari: TypeError: $('[accesskey]').updateTooltipAccessKeys is not a function - https://phabricator.wikimedia.org/T343944 [21:37:46] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [21:55:26] !log krinkle@deploy1002 krinkle: Backport for [[gerrit:953649|mediawiki.util: Investigate when mw.util is compromised by third-party script (T343944)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [21:55:33] T343944: JavaScript Error on Wikipedia Mobile Sites and Safari: TypeError: $('[accesskey]').updateTooltipAccessKeys is not a function - https://phabricator.wikimedia.org/T343944 [21:57:55] !log krinkle@deploy1002 krinkle: Continuing with 
sync [21:58:40] 10SRE, 10Data-Engineering-Icebox, 10Traffic, 10WMF-General-or-Unknown, 10Developer Productivity: Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests - https://phabricator.wikimedia.org/T261803 (10Krinkle) [21:59:03] 10SRE, 10Data-Engineering-Icebox, 10Traffic, 10WMF-General-or-Unknown, and 2 others: Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests - https://phabricator.wikimedia.org/T261803 (10Krinkle) [22:03:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:08:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:09:17] !log krinkle@deploy1002 Finished scap: Backport for [[gerrit:953649|mediawiki.util: Investigate when mw.util is compromised by third-party script (T343944)]] (duration: 35m 08s) [22:09:23] T343944: JavaScript Error on Wikipedia Mobile Sites and Safari: TypeError: $('[accesskey]').updateTooltipAccessKeys is not a function - https://phabricator.wikimedia.org/T343944 [22:15:01] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:27:09] (03PS6) 10Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 [22:28:19] !log krinkle@deploy1002 Synchronized php-1.41.0-wmf.24/extensions/WikimediaEvents/: 697ab03ae9a5d5ddb6 (duration: 06m 26s) [23:01:05] 
(SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [23:01:42] (SystemdUnitFailed) firing: elasticsearch-disable-readahead.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:31:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads