[00:04:48] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:38:48] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/974644
[00:38:50] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/974644 (owner: TrainBranchBot)
[00:45:44] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:58:08] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/974644 (owner: TrainBranchBot)
[01:37:28] PROBLEM - snapshot of s6 in eqiad on backupmon1001 is CRITICAL: snapshot for s6 at eqiad (db1225) taken more than 3 days ago: Most recent backup 2023-11-16 01:10:44 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:41:32] PROBLEM - snapshot of s6 in codfw on backupmon1001 is CRITICAL: snapshot for s6 at codfw (db2097) taken more than 3 days ago: Most recent backup 2023-11-16 01:14:47 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:22:54] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[02:23:07] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[02:38:21] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:52:06] PROBLEM - snapshot of s1 in eqiad on backupmon1001 is CRITICAL: snapshot for s1 at eqiad (db1140) taken more than 3 days ago: Most recent backup 2023-11-16 02:32:57 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:02:52] PROBLEM - snapshot of s4 in eqiad on backupmon1001 is CRITICAL: snapshot for s4 at eqiad (db1150) taken more than 3 days ago: Most recent backup 2023-11-16 02:39:56 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:08:21] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:11:14] PROBLEM - snapshot of s4 in codfw on backupmon1001 is CRITICAL: snapshot for s4 at codfw (db2099) taken more than 3 days ago: Most recent backup 2023-11-16 02:45:56 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:14:10] PROBLEM - snapshot of s8 in codfw on backupmon1001 is CRITICAL: snapshot for s8 at codfw (db2098) taken more than 3 days ago: Most recent backup 2023-11-16 03:07:57 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:19:41] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[03:23:21] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:36:18] PROBLEM - snapshot of s8 in eqiad on backupmon1001 is CRITICAL: snapshot for s8 at eqiad (db1171) taken more than 3 days ago: Most recent backup 2023-11-16 03:11:09 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:39:00] (PuppetFailure) firing: Puppet has failed on ml-serve1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:04:48] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:05:32] PROBLEM - snapshot of s2 in codfw on backupmon1001 is CRITICAL: snapshot for s2 at codfw (db2097) taken more than 3 days ago: Most recent backup 2023-11-16 03:47:45 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[04:24:14] PROBLEM - snapshot of s2 in eqiad on backupmon1001 is CRITICAL: snapshot for s2 at eqiad (db1225) taken more than 3 days ago: Most recent backup 2023-11-16 04:05:48 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[05:34:10] PROBLEM - snapshot of s5 in eqiad on backupmon1001 is CRITICAL: snapshot for s5 at eqiad (db1216) taken more than 3 days ago: Most recent backup 2023-11-16 05:15:57 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[05:46:52] PROBLEM - snapshot of s5 in codfw on backupmon1001 is CRITICAL: snapshot for s5 at codfw (db2101) taken more than 3 days ago: Most recent backup 2023-11-16 05:41:36 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[05:51:06] PROBLEM - snapshot of x1 in codfw on backupmon1001 is CRITICAL: snapshot for x1 at codfw (db2097) taken more than 3 days ago: Most recent backup 2023-11-16 05:34:33 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:16:16] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:16:16] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:19:02] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:19:04] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:23:14] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:23:14] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:26:02] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:08] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:46] PROBLEM - snapshot of s3 in eqiad on backupmon1001 is CRITICAL: snapshot for s3 at eqiad (db1150) taken more than 3 days ago: Most recent backup 2023-11-16 06:31:49 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:11:32] PROBLEM - snapshot of s3 in codfw on backupmon1001 is CRITICAL: snapshot for s3 at codfw (db2139) taken more than 3 days ago: Most recent backup 2023-11-16 06:43:49 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:18:14] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:18:16] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:22:20] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance
[07:22:33] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance
[07:23:21] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:25:06] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:25:08] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:29:16] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:29:44] PROBLEM - snapshot of s7 in codfw on backupmon1001 is CRITICAL: snapshot for s7 at codfw (db2098) taken more than 3 days ago: Most recent backup 2023-11-16 07:18:22 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:30:40] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:39:00] (PuppetFailure) firing: Puppet has failed on ml-serve1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:40:32] PROBLEM - snapshot of s7 in eqiad on backupmon1001 is CRITICAL: snapshot for s7 at eqiad (db1171) taken more than 3 days ago: Most recent backup 2023-11-16 07:15:41 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:58:12] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231119T0800)
[08:02:22] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-jobrunner_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:52] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:20:08] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:20:10] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:02] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:32] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:23:21] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:39:00] (PuppetFailure) firing: Puppet has failed on ml-serve1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:04:48] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:34:03] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance
[12:34:27] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance
[12:34:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T348183)', diff saved to https://phabricator.wikimedia.org/P53575 and previous config saved to /var/cache/conftool/dbconfig/20231119-123433-arnaudb.json
[12:34:38] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[12:52:54] SRE, Traffic-Icebox, MediaWiki-Platform-Team (Radar), Patch-For-Review: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187 (Aklapper) There are two open patches left: https://gerrit.wikimedia.org/r/c/operations/puppet/+/594969 and https://gerrit.wikimedia.org/r...
[13:12:44] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:04:34] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:38:21] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:40:46] PROBLEM - snapshot of x1 in eqiad on backupmon1001 is CRITICAL: snapshot for x1 at eqiad (db1225) taken more than 3 days ago: Most recent backup 2023-11-16 14:37:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[14:53:21] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:17:16] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 176 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:22:38] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 60 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:23:21] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:39:00] (PuppetFailure) firing: Puppet has failed on ml-serve1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:08:22] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:52:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T348183)', diff saved to https://phabricator.wikimedia.org/P53576 and previous config saved to /var/cache/conftool/dbconfig/20231119-175217-arnaudb.json
[17:52:23] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[18:07:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P53577 and previous config saved to /var/cache/conftool/dbconfig/20231119-180723-arnaudb.json
[18:22:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P53578 and previous config saved to /var/cache/conftool/dbconfig/20231119-182230-arnaudb.json
[18:37:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T348183)', diff saved to https://phabricator.wikimedia.org/P53579 and previous config saved to /var/cache/conftool/dbconfig/20231119-183736-arnaudb.json
[18:37:39] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[18:37:42] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[18:37:52] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[18:37:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T348183)', diff saved to https://phabricator.wikimedia.org/P53580 and previous config saved to /var/cache/conftool/dbconfig/20231119-183758-arnaudb.json
[19:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:23:23] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:36:22] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:39:15] (PuppetFailure) firing: Puppet has failed on ml-serve1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:09:49] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:32:34] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:58:52] (CR) Pppery: [C: -1] "Found a few bugs in this, will upload another patchset fixing them soon." [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975392 (https://phabricator.wikimedia.org/T351363) (owner: Pppery)
[23:08:18] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[23:24:49] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:39:15] (PuppetFailure) firing: Puppet has failed on ml-serve1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:41:24] (PS3) Pppery: Merge in changes to qqq.json rather than overwriting them [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975392 (https://phabricator.wikimedia.org/T351363)
[23:41:26] (PS1) Pppery: Undo qqq.json overwrites [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363)
[23:53:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T348183)', diff saved to https://phabricator.wikimedia.org/P53581 and previous config saved to /var/cache/conftool/dbconfig/20231119-235305-arnaudb.json
[23:53:10] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183