[00:38:27] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956013
[00:38:33] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956013 (owner: TrainBranchBot)
[00:42:47] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:57] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:30] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956013 (owner: TrainBranchBot)
[01:15:54] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[01:18:46] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye
[01:35:07] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage
[01:35:51] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:10] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage
[02:03:16] <wikibugs>	 (CR) TTO: "Thanks Martin. Scheduled for deployment on Tuesday: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0700" [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:49] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye
[02:29:05] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:29:23] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:36:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:36:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:09] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:39:27] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:24:07] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T343198)', diff saved to https://phabricator.wikimedia.org/P52346 and previous config saved to /var/cache/conftool/dbconfig/20230909-032407-arnaudb.json
[03:24:11] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[03:29:09] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:01] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:36:19] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:39:14] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P52347 and previous config saved to /var/cache/conftool/dbconfig/20230909-033913-arnaudb.json
[03:42:03] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:54:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P52348 and previous config saved to /var/cache/conftool/dbconfig/20230909-035419-arnaudb.json
[04:09:26] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T343198)', diff saved to https://phabricator.wikimedia.org/P52349 and previous config saved to /var/cache/conftool/dbconfig/20230909-040925-arnaudb.json
[04:09:28] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[04:09:29] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[04:09:41] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[04:09:47] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T343198)', diff saved to https://phabricator.wikimedia.org/P52350 and previous config saved to /var/cache/conftool/dbconfig/20230909-040947-arnaudb.json
[04:32:15] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:33:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:35:01] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:38:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:40:45] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:03] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:59:13] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (cloudcontrol2001-dev), Fresh: 129 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:15:54] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[05:29:49] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:34:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:16:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:21:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:25:39] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:26:49] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:39:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:49:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:15:54] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[10:29:29] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T343198)', diff saved to https://phabricator.wikimedia.org/P52351 and previous config saved to /var/cache/conftool/dbconfig/20230909-102928-arnaudb.json
[10:29:32] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[10:35:04] <wikibugs>	 (PS1) Majavah: icinga: drop toolforge.org cert monitor [puppet] - https://gerrit.wikimedia.org/r/956029
[10:44:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P52352 and previous config saved to /var/cache/conftool/dbconfig/20230909-104434-arnaudb.json
[10:59:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P52353 and previous config saved to /var/cache/conftool/dbconfig/20230909-105941-arnaudb.json
[11:14:47] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T343198)', diff saved to https://phabricator.wikimedia.org/P52354 and previous config saved to /var/cache/conftool/dbconfig/20230909-111447-arnaudb.json
[11:14:49] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[11:14:51] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[11:15:02] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[11:15:08] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T343198)', diff saved to https://phabricator.wikimedia.org/P52355 and previous config saved to /var/cache/conftool/dbconfig/20230909-111508-arnaudb.json
[11:32:07] <wikibugs>	 SRE-swift-storage, Commons: Uploading large files to Commons almost always fails - https://phabricator.wikimedia.org/T340901 (Hoi) As suggested by @Midleading, I switched to pywikibot from mwclient. The stability when uploading files of GiB magnitude has improved substantially, under async mode.
[13:15:54] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[14:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:10:11] <wikibugs>	 (PS1) Majavah: prometheus: allow external management of blackbox modules [puppet] - https://gerrit.wikimedia.org/r/956037 (https://phabricator.wikimedia.org/T288067)
[14:10:13] <wikibugs>	 (PS1) Majavah: P:wmcs::metricsinfra: add config for custom blackbox scraping [puppet] - https://gerrit.wikimedia.org/r/956038 (https://phabricator.wikimedia.org/T288067)
[14:11:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:16:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:21:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:22:15] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.codfw.wmnet with OS bookworm
[15:41:17] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage
[15:44:18] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage
[15:47:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:52:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:27:44] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2001-dev.codfw.wmnet with OS bookworm
[16:33:06] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2004-dev.codfw.wmnet with OS bookworm
[16:51:38] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2004-dev.codfw.wmnet with reason: host reimage
[16:54:14] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2004-dev.codfw.wmnet with reason: host reimage
[17:06:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:07:54] <wikibugs>	 (PS2) Majavah: prometheus: allow external management of blackbox modules [puppet] - https://gerrit.wikimedia.org/r/956037 (https://phabricator.wikimedia.org/T288067)
[17:07:56] <wikibugs>	 (PS2) Majavah: P:wmcs::metricsinfra: add config for custom blackbox scraping [puppet] - https://gerrit.wikimedia.org/r/956038 (https://phabricator.wikimedia.org/T288067)
[17:11:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:13:44] <wikibugs>	 (CR) Winston Sung: [C: +1] Add Akan language [mediawiki-config] - https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) (owner: Srishakatux)
[17:15:54] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[17:35:00] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2004-dev.codfw.wmnet with OS bookworm
[17:42:23] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T343198)', diff saved to https://phabricator.wikimedia.org/P52356 and previous config saved to /var/cache/conftool/dbconfig/20230909-174222-arnaudb.json
[17:42:26] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[17:57:29] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P52357 and previous config saved to /var/cache/conftool/dbconfig/20230909-175728-arnaudb.json
[18:12:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P52358 and previous config saved to /var/cache/conftool/dbconfig/20230909-181234-arnaudb.json
[18:20:00] <wikibugs>	 (PS1) Majavah: nginx::status_site: allow multiple instances [puppet] - https://gerrit.wikimedia.org/r/956068
[18:20:02] <wikibugs>	 (PS1) Majavah: prometheus::nginx_exporter: manage nginx status site [puppet] - https://gerrit.wikimedia.org/r/956069
[18:24:49] <wikibugs>	 (PS2) Majavah: prometheus::nginx_exporter: manage nginx status site [puppet] - https://gerrit.wikimedia.org/r/956069
[18:27:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T343198)', diff saved to https://phabricator.wikimedia.org/P52359 and previous config saved to /var/cache/conftool/dbconfig/20230909-182741-arnaudb.json
[18:27:43] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[18:27:45] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[18:27:56] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[18:28:02] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T343198)', diff saved to https://phabricator.wikimedia.org/P52360 and previous config saved to /var/cache/conftool/dbconfig/20230909-182802-arnaudb.json
[19:14:53] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2005-dev.codfw.wmnet with OS bookworm
[19:35:14] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2005-dev.codfw.wmnet with reason: host reimage
[19:38:17] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2005-dev.codfw.wmnet with reason: host reimage
[19:52:09] <wikibugs>	 (PS1) Majavah: P:toolforge::checker: remove ToolsDB R/W check [puppet] - https://gerrit.wikimedia.org/r/956071 (https://phabricator.wikimedia.org/T313030)
[19:54:42] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43191/console" [puppet] - https://gerrit.wikimedia.org/r/956071 (https://phabricator.wikimedia.org/T313030) (owner: Majavah)
[20:00:00] <wikibugs>	 (PS2) Majavah: icinga: drop toolforge.org cert monitor [puppet] - https://gerrit.wikimedia.org/r/956029
[20:00:02] <wikibugs>	 (PS1) Majavah: icinga: drop tools.wmflabs.org monitoring [puppet] - https://gerrit.wikimedia.org/r/956072
[20:16:16] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2005-dev.codfw.wmnet with OS bookworm
[21:15:54] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[22:54:15] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:55:31] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:15:35] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status