[00:40:29] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1228259
[00:40:29] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1228259 (owner: TrainBranchBot)
[00:55:39] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1228259 (owner: TrainBranchBot)
[01:00:59] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:10:34] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1228261
[01:10:34] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1228261 (owner: TrainBranchBot)
[01:14:31] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 31s)
[01:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:34:09] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1228261 (owner: TrainBranchBot)
[02:16:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87701 and previous config saved to /var/cache/conftool/dbconfig/20260118-021615-marostegui.json
[02:16:23] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[02:16:23] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[02:26:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P87702 and previous config saved to /var/cache/conftool/dbconfig/20260118-022624-marostegui.json
[02:29:33] ops-eqsin, SRE: Unresponsive management for cp5022.mgmt:22 - https://phabricator.wikimedia.org/T414879 (phaultfinder) NEW
[02:36:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P87703 and previous config saved to /var/cache/conftool/dbconfig/20260118-023632-marostegui.json
[02:46:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87704 and previous config saved to /var/cache/conftool/dbconfig/20260118-024640-marostegui.json
[02:46:49] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[02:46:49] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[02:46:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance
[02:47:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2173 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87705 and previous config saved to /var/cache/conftool/dbconfig/20260118-024705-marostegui.json
[03:11:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T413525)', diff saved to https://phabricator.wikimedia.org/P87706 and previous config saved to /var/cache/conftool/dbconfig/20260118-031149-marostegui.json
[03:11:54] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[03:21:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P87707 and previous config saved to /var/cache/conftool/dbconfig/20260118-032157-marostegui.json
[03:32:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P87708 and previous config saved to /var/cache/conftool/dbconfig/20260118-033205-marostegui.json
[03:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:42:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T413525)', diff saved to https://phabricator.wikimedia.org/P87709 and previous config saved to /var/cache/conftool/dbconfig/20260118-034214-marostegui.json
[03:42:19] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[03:42:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1263.eqiad.wmnet with reason: Maintenance
[03:42:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1263 (T413525)', diff saved to https://phabricator.wikimedia.org/P87710 and previous config saved to /var/cache/conftool/dbconfig/20260118-034228-marostegui.json
[05:09:12] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:34:12] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:34:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T413525)', diff saved to https://phabricator.wikimedia.org/P87711 and previous config saved to /var/cache/conftool/dbconfig/20260118-073445-marostegui.json
[07:34:50] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[07:44:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P87712 and previous config saved to /var/cache/conftool/dbconfig/20260118-074453-marostegui.json
[07:55:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P87713 and previous config saved to /var/cache/conftool/dbconfig/20260118-075502-marostegui.json
[07:59:12] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wdqs-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260118T0800)
[08:05:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T413525)', diff saved to https://phabricator.wikimedia.org/P87714 and previous config saved to /var/cache/conftool/dbconfig/20260118-080510-marostegui.json
[08:05:16] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[08:05:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[09:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:59:12] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wdqs-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:28:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87715 and previous config saved to /var/cache/conftool/dbconfig/20260118-122804-marostegui.json
[12:28:10] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[12:28:11] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[12:38:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P87716 and previous config saved to /var/cache/conftool/dbconfig/20260118-123812-marostegui.json
[12:48:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P87717 and previous config saved to /var/cache/conftool/dbconfig/20260118-124820-marostegui.json
[12:58:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87718 and previous config saved to /var/cache/conftool/dbconfig/20260118-125829-marostegui.json
[12:58:35] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[12:58:36] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[12:58:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[12:59:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[12:59:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1196 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87719 and previous config saved to /var/cache/conftool/dbconfig/20260118-125913-marostegui.json
[13:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:12] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:19:12] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:34:12] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:51:32] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on cp5022.eqsin.wmnet with reason: host down
[15:51:38] !log downtime cp5022 as host is down: T414411
[15:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:42] T414411: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411
[16:35:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87720 and previous config saved to /var/cache/conftool/dbconfig/20260118-163514-marostegui.json
[16:35:21] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[16:35:21] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[16:45:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P87721 and previous config saved to /var/cache/conftool/dbconfig/20260118-164523-marostegui.json
[16:55:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P87722 and previous config saved to /var/cache/conftool/dbconfig/20260118-165531-marostegui.json
[17:05:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87723 and previous config saved to /var/cache/conftool/dbconfig/20260118-170540-marostegui.json
[17:05:46] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[17:05:46] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[17:05:56] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance
[17:06:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2174 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87724 and previous config saved to /var/cache/conftool/dbconfig/20260118-170604-marostegui.json
[17:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[17:48:55] (CR) Daniel Kinzler: rest gateway: add tests for chart rendering (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1225085 (owner: Daniel Kinzler)
[19:19:12] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:09:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:19:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[23:19:12] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable