[00:38:27] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956013
[00:38:33] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956013 (owner: TrainBranchBot)
[00:42:47] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:57] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:30] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956013 (owner: TrainBranchBot)
[01:15:54] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[01:18:46] !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye
[01:35:07] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage
[01:35:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:10] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage
[02:03:16] (CR) TTO: "Thanks Martin.
Scheduled for deployment on Tuesday: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0700" [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:49] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye
[02:29:05] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:29:23] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:36:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:36:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:09] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:39:27] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:24:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T343198)', diff saved to https://phabricator.wikimedia.org/P52346 and previous config saved to /var/cache/conftool/dbconfig/20230909-032407-arnaudb.json
[03:24:11] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[03:29:09] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:01] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:36:19] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:39:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to 
https://phabricator.wikimedia.org/P52347 and previous config saved to /var/cache/conftool/dbconfig/20230909-033913-arnaudb.json
[03:42:03] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:54:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P52348 and previous config saved to /var/cache/conftool/dbconfig/20230909-035419-arnaudb.json
[04:09:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T343198)', diff saved to https://phabricator.wikimedia.org/P52349 and previous config saved to /var/cache/conftool/dbconfig/20230909-040925-arnaudb.json
[04:09:28] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[04:09:29] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[04:09:41] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[04:09:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T343198)', diff saved to https://phabricator.wikimedia.org/P52350 and previous config saved to /var/cache/conftool/dbconfig/20230909-040947-arnaudb.json
[04:32:15] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:33:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:35:01] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:38:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:40:45] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:03] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:59:13] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (cloudcontrol2001-dev), Fresh: 129 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:15:54] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[05:29:49] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:34:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:16:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:21:37] 
(LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:25:39] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:26:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:39:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:49:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:15:54] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[10:29:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T343198)', diff saved to https://phabricator.wikimedia.org/P52351 and previous config saved to 
/var/cache/conftool/dbconfig/20230909-102928-arnaudb.json
[10:29:32] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[10:35:04] (PS1) Majavah: icinga: drop toolforge.org cert monitor [puppet] - https://gerrit.wikimedia.org/r/956029
[10:44:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P52352 and previous config saved to /var/cache/conftool/dbconfig/20230909-104434-arnaudb.json
[10:59:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P52353 and previous config saved to /var/cache/conftool/dbconfig/20230909-105941-arnaudb.json
[11:14:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T343198)', diff saved to https://phabricator.wikimedia.org/P52354 and previous config saved to /var/cache/conftool/dbconfig/20230909-111447-arnaudb.json
[11:14:49] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[11:14:51] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[11:15:02] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[11:15:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T343198)', diff saved to https://phabricator.wikimedia.org/P52355 and previous config saved to /var/cache/conftool/dbconfig/20230909-111508-arnaudb.json
[11:32:07] SRE-swift-storage, Commons: Uploading large files to Commons almost always fails - https://phabricator.wikimedia.org/T340901 (Hoi) As suggested by @Midleading, I switched to pywikibot from mwclient. The stability when uploading files of GiB magnitude has improved substantially, under async mode.
[13:15:54] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[14:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:10:11] (PS1) Majavah: prometheus: allow external management of blackbox modules [puppet] - https://gerrit.wikimedia.org/r/956037 (https://phabricator.wikimedia.org/T288067)
[14:10:13] (PS1) Majavah: P:wmcs::metricsinfra: add config for custom blackbox scraping [puppet] - https://gerrit.wikimedia.org/r/956038 (https://phabricator.wikimedia.org/T288067)
[14:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:16:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:21:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:22:15] !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.codfw.wmnet with OS bookworm
[15:41:17] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage
[15:44:18] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage
[15:47:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:52:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:27:44] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2001-dev.codfw.wmnet with OS bookworm
[16:33:06] !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2004-dev.codfw.wmnet with OS bookworm
[16:51:38] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2004-dev.codfw.wmnet with reason: host reimage
[16:54:14] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2004-dev.codfw.wmnet with reason: host reimage
[17:06:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:07:54] (PS2) Majavah: prometheus: allow external management of blackbox modules [puppet] - 
https://gerrit.wikimedia.org/r/956037 (https://phabricator.wikimedia.org/T288067)
[17:07:56] (PS2) Majavah: P:wmcs::metricsinfra: add config for custom blackbox scraping [puppet] - https://gerrit.wikimedia.org/r/956038 (https://phabricator.wikimedia.org/T288067)
[17:11:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:13:44] (CR) Winston Sung: [C: +1] Add Akan language [mediawiki-config] - https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) (owner: Srishakatux)
[17:15:54] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[17:35:00] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2004-dev.codfw.wmnet with OS bookworm
[17:42:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T343198)', diff saved to https://phabricator.wikimedia.org/P52356 and previous config saved to /var/cache/conftool/dbconfig/20230909-174222-arnaudb.json
[17:42:26] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[17:57:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P52357 and previous config saved to /var/cache/conftool/dbconfig/20230909-175728-arnaudb.json
[18:12:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P52358 and previous config saved to /var/cache/conftool/dbconfig/20230909-181234-arnaudb.json
[18:20:00] (PS1) Majavah: 
nginx::status_site: allow multiple instances [puppet] - https://gerrit.wikimedia.org/r/956068
[18:20:02] (PS1) Majavah: prometheus::nginx_exporter: manage nginx status site [puppet] - https://gerrit.wikimedia.org/r/956069
[18:24:49] (PS2) Majavah: prometheus::nginx_exporter: manage nginx status site [puppet] - https://gerrit.wikimedia.org/r/956069
[18:27:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T343198)', diff saved to https://phabricator.wikimedia.org/P52359 and previous config saved to /var/cache/conftool/dbconfig/20230909-182741-arnaudb.json
[18:27:43] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[18:27:45] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[18:27:56] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[18:28:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T343198)', diff saved to https://phabricator.wikimedia.org/P52360 and previous config saved to /var/cache/conftool/dbconfig/20230909-182802-arnaudb.json
[19:14:53] !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2005-dev.codfw.wmnet with OS bookworm
[19:35:14] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2005-dev.codfw.wmnet with reason: host reimage
[19:38:17] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2005-dev.codfw.wmnet with reason: host reimage
[19:52:09] (PS1) Majavah: P:toolforge::checker: remove ToolsDB R/W check [puppet] - https://gerrit.wikimedia.org/r/956071 (https://phabricator.wikimedia.org/T313030)
[19:54:42] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43191/console" [puppet] - https://gerrit.wikimedia.org/r/956071 (https://phabricator.wikimedia.org/T313030) (owner: Majavah)
[20:00:00] (PS2) Majavah: icinga: drop toolforge.org cert monitor [puppet] - https://gerrit.wikimedia.org/r/956029
[20:00:02] (PS1) Majavah: icinga: drop tools.wmflabs.org monitoring [puppet] - https://gerrit.wikimedia.org/r/956072
[20:16:16] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2005-dev.codfw.wmnet with OS bookworm
[21:15:54] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[22:54:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:55:31] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:15:35] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status