[00:00:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:05] <wikibugs>	 (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1112113
[00:03:05] <wikibugs>	 (CR) Zabe: [C:+2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1112113 (owner: Zabe)
[00:03:50] <wikibugs>	 (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1112113 (owner: Zabe)
[00:04:20] <wikibugs>	 (PS1) Zabe: Activate arbcom_zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1112114 (https://phabricator.wikimedia.org/T380119)
[00:04:30] <wikibugs>	 (CR) Zabe: [C:+2] Update composer.lock [mediawiki-config] - https://gerrit.wikimedia.org/r/1112111 (owner: Zabe)
[00:05:07] <wikibugs>	 (Merged) jenkins-bot: Update composer.lock [mediawiki-config] - https://gerrit.wikimedia.org/r/1112111 (owner: Zabe)
[00:05:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:01] <wikibugs>	 (CR) Zabe: [C:+2] Activate arbcom_zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1112114 (https://phabricator.wikimedia.org/T380119) (owner: Zabe)
[00:06:43] <wikibugs>	 (Merged) jenkins-bot: Activate arbcom_zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1112114 (https://phabricator.wikimedia.org/T380119) (owner: Zabe)
[00:07:07] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1112113|Update interwiki cache]], [[gerrit:1112111|Update composer.lock]], [[gerrit:1112114|Activate arbcom_zhwiki (T380119)]]
[00:07:11] <stashbot>	 T380119: Create arbcom-zh wiki - https://phabricator.wikimedia.org/T380119
[00:08:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1025:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:11:41] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:1112113|Update interwiki cache]], [[gerrit:1112111|Update composer.lock]], [[gerrit:1112114|Activate arbcom_zhwiki (T380119)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:11:48] <logmsgbot>	 !log zabe@deploy2002 zabe: Continuing with sync
[00:13:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1025:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:13:41] <wikibugs>	 (PS2) Clare Ming: Enable the text experiment on testwiki only [mediawiki-config] - https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715)
[00:16:31] <wikibugs>	 (CR) Clare Ming: "@phuedx@wikimedia.org @sfaci@wikimedia.org if this lgtu, after we get 1 or both patches merged to finalize the config var, i can deploy th" [mediawiki-config] - https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) (owner: Clare Ming)
[00:18:40] <jinxer-wm>	 RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1025:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:18:53] <logmsgbot>	 !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112113|Update interwiki cache]], [[gerrit:1112111|Update composer.lock]], [[gerrit:1112114|Activate arbcom_zhwiki (T380119)]] (duration: 11m 46s)
[00:18:57] <stashbot>	 T380119: Create arbcom-zh wiki - https://phabricator.wikimedia.org/T380119
[00:20:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10469040 (phaultfinder)
[00:21:40] <zabe>	 !log zabe@deploy2002:/srv/mediawiki-staging$ mwscript-k8s -f -- createAndPromote.php --wiki=arbcom_zhwiki ZhaoFJx REDACTED
[00:21:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:08] <zabe>	 !log zabe@deploy2002:/srv/mediawiki-staging$ mwscript-k8s -f -- createAndPromote.php --wiki=arbcom_zhwiki --sysop --bureaucrat --force ZhaoFJx
[00:23:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:29] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:27:29] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:38:14] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1112121
[00:38:15] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1112121 (owner: TrainBranchBot)
[00:39:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1032:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1032 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:40:25] <wikibugs>	 ops-codfw, ops-eqiad, SRE, SRE-swift-storage, and 3 others: Supermicro's Config J hot swap behavior - https://phabricator.wikimedia.org/T383903#10469079 (Papaul) @MatthewVernon @elukey i do agree with you all that "we do need to be able to hot-swap these drivers" and yes by design, all the drives...
[00:49:40] <jinxer-wm>	 FIRING: [3x] KubernetesRsyslogDown: rsyslog on wikikube-worker1032:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:54:40] <jinxer-wm>	 RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1032:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:00:53] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1112121 (owner: TrainBranchBot)
[01:09:20] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1112122
[01:09:20] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1112122 (owner: TrainBranchBot)
[01:29:05] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1112122 (owner: TrainBranchBot)
[02:03:09] <icinga-wm>	 PROBLEM - Disk space on prometheus1006 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s-dse 3176MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1006&var-datasource=eqiad+prometheus/ops
[02:09:33] <jinxer-wm>	 FIRING: [12x] KubernetesCalicoDown: mw1448.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:48:55] <wikibugs>	 (PS4) Pppery: Fix some wrong descriptions of old dumps [puppet] - https://gerrit.wikimedia.org/r/1112123
[02:48:56] <wikibugs>	 (CR) CI reject: [V:-1] Fix some wrong descriptions of old dumps [puppet] - https://gerrit.wikimedia.org/r/1112123 (owner: Pppery)
[02:49:56] <wikibugs>	 (PS5) Pppery: Fix some wrong descriptions of old dumps [puppet] - https://gerrit.wikimedia.org/r/1112123
[02:59:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10469171 (phaultfinder)
[03:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:31:29] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1173 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:31:43] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1174 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:32:37] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:32:43] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:34:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10469191 (phaultfinder)
[03:36:19] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:42:37] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:44:29] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1173 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:46:43] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:46:57] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1105 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:55:43] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1174 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:56:19] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1065 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:03:09] <icinga-wm>	 PROBLEM - Disk space on prometheus1006 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s-dse 3166MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1006&var-datasource=eqiad+prometheus/ops
[04:03:33] <wikibugs>	 ops-codfw, SRE, DC-Ops, procurement: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10469211 (Papaul) I am adding also here the spines/leaves connection diagram for reference. {F58217796}
[04:14:57] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1105 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:25:30] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:28:05] <wikibugs>	 SRE-swift-storage, CX-deployments, LPL Essential, MinT: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10469212 (KartikMistry) a:KartikMistry Assigning this to myself, I'll need help here @elukey :)
[05:06:57] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2025-01-17-043010-production [deployment-charts] - https://gerrit.wikimedia.org/r/1112125 (https://phabricator.wikimedia.org/T377813)
[05:13:11] <kart_>	 Quick cxserver deployment, minor change.
[05:23:09] <icinga-wm>	 PROBLEM - Disk space on prometheus1006 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s-dse 3184MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1006&var-datasource=eqiad+prometheus/ops
[05:24:56] <wikibugs>	 (CR) KartikMistry: [C:+2] Update cxserver to 2025-01-17-043010-production [deployment-charts] - https://gerrit.wikimedia.org/r/1112125 (https://phabricator.wikimedia.org/T377813) (owner: KartikMistry)
[05:25:59] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2025-01-17-043010-production [deployment-charts] - https://gerrit.wikimedia.org/r/1112125 (https://phabricator.wikimedia.org/T377813) (owner: KartikMistry)
[05:26:41] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:27:04] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:31:59] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:32:29] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:32:49] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:33:50] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:47:54] <wikibugs>	 (PS1) Kevin Bazira: changeprop: add liftwing article-country stream to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1112126 (https://phabricator.wikimedia.org/T382295)
[06:09:33] <jinxer-wm>	 FIRING: [12x] KubernetesCalicoDown: mw1448.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:22:17] <jinxer-wm>	 FIRING: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:24:22] <jinxer-wm>	 RESOLVED: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:27:35] <moritzm>	 !log installing rsync security regression updates
[06:27:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1112:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1112 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:47:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72125 and previous config saved to /var/cache/conftool/dbconfig/20250117-064745-root.json
[06:48:01] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[06:48:35] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[06:48:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1112:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1112 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250117T0700)
[07:02:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72127 and previous config saved to /var/cache/conftool/dbconfig/20250117-070250-root.json
[07:17:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72128 and previous config saved to /var/cache/conftool/dbconfig/20250117-071755-root.json
[07:33:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72129 and previous config saved to /var/cache/conftool/dbconfig/20250117-073301-root.json
[07:37:30] <wikibugs>	 (PS3) JMeybohm: Update calico to v3.29.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984)
[07:37:30] <wikibugs>	 (PS4) JMeybohm: Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984)
[07:38:26] <wikibugs>	 (CR) CI reject: [V:-1] Update calico to v3.29.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[07:38:33] <wikibugs>	 (CR) CI reject: [V:-1] Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[07:38:49] <wikibugs>	 (Abandoned) JMeybohm: sq [deployment-charts] - https://gerrit.wikimedia.org/r/1112071 (owner: JMeybohm)
[07:48:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72130 and previous config saved to /var/cache/conftool/dbconfig/20250117-074806-root.json
[07:55:45] <wikibugs>	 (PS4) Muehlenhoff: Add separate maps master/replica roles for the new Bookworm setup [puppet] - https://gerrit.wikimedia.org/r/1111659 (https://phabricator.wikimedia.org/T381565)
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250117T0800)
[08:00:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:12:06] <wikibugs>	 (PS1) Muehlenhoff: Don't setup database config for tilerator on bookworm [puppet] - https://gerrit.wikimedia.org/r/1112167 (https://phabricator.wikimedia.org/T381565)
[08:15:29] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:16:07] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:18:57] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:19:19] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:25:30] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:30:06] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1112167 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:33:39] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[2282,2310-2311].codfw.wmnet
[08:35:26] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[2282,2310-2311].codfw.wmnet
[08:36:05] <wikibugs>	 (CR) Jelto: [C:+2] Rename the remaining mw nodes to wikikube-worker224[0-2] 🥳 [puppet] - https://gerrit.wikimedia.org/r/1112055 (https://phabricator.wikimedia.org/T377877) (owner: Jelto)
[08:40:05] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:42:41] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:48:01] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[08:48:35] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[08:49:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on mw2311:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2311 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:50:41] <wikibugs>	 (PS4) JMeybohm: Update calico to v3.29.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984)
[08:50:41] <wikibugs>	 (PS5) JMeybohm: Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984)
[08:54:40] <jinxer-wm>	 FIRING: [3x] KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:54:41] <wikibugs>	 (CR) CI reject: [V:-1] Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[08:55:44] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2310 to wikikube-worker2240
[08:56:05] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[08:59:32] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2310 to wikikube-worker2240 - jelto@cumin1002"
[09:00:39] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2310 to wikikube-worker2240 - jelto@cumin1002"
[09:00:39] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:00:39] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2240
[09:01:05] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2240
[09:01:44] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2310 to wikikube-worker2240
[09:03:09] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2311 to wikikube-worker2241
[09:03:30] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[09:06:54] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2311 to wikikube-worker2241 - jelto@cumin1002"
[09:07:11] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2311 to wikikube-worker2241 - jelto@cumin1002"
[09:07:11] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:07:11] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2241
[09:07:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2241
[09:08:01] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2311 to wikikube-worker2241
[09:08:18] <Emperor>	 !log depool / restart / repool ms-fe2010 T360913
[09:08:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:21] <stashbot>	 T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913
[09:12:03] <wikibugs>	 (PS1) Muehlenhoff: sre.hosts.reimage: Add link to the help text for move-vlan [cookbooks] - https://gerrit.wikimedia.org/r/1112171
[09:12:11] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: recording rules for mw edit count [puppet] - https://gerrit.wikimedia.org/r/1112172 (https://phabricator.wikimedia.org/T383963)
[09:13:05] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2240 wikikube-worker2241 on all recursors
[09:13:08] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2240 wikikube-worker2241 on all recursors
[09:14:33] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2240.codfw.wmnet wikikube-worker2241.codfw.wmnet on all recursors
[09:14:36] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2240.codfw.wmnet wikikube-worker2241.codfw.wmnet on all recursors
[09:20:15] <wikibugs>	 ops-codfw, ops-eqiad, SRE, SRE-swift-storage, and 3 others: Supermicro's Config J hot swap behavior - https://phabricator.wikimedia.org/T383903#10469386 (MatthewVernon) Thanks for the update @Papaul , and of course you can have some time to look at the cable management issues. Do keep us posted,...
[09:22:52] <wikibugs>	 (PS1) Jelto: remove reserved name wikikube-worker2242 because of mw2282 decom [puppet] - https://gerrit.wikimedia.org/r/1112173 (https://phabricator.wikimedia.org/T383965)
[09:24:08] <wikibugs>	 (PS2) Jelto: remove reserved name wikikube-worker2242 because of mw2282 decom [puppet] - https://gerrit.wikimedia.org/r/1112173 (https://phabricator.wikimedia.org/T383965)
[09:26:46] <wikibugs>	 (PS1) Filippo Giunchedi: thanos: sample 10% traces of thanos-query [puppet] - https://gerrit.wikimedia.org/r/1112174 (https://phabricator.wikimedia.org/T376179)
[09:27:04] <wikibugs>	 (CR) JMeybohm: [C:+1] remove reserved name wikikube-worker2242 because of mw2282 decom [puppet] - https://gerrit.wikimedia.org/r/1112173 (https://phabricator.wikimedia.org/T383965) (owner: Jelto)
[09:28:22] <wikibugs>	 SRE, Infrastructure-Foundations: Spicerack fails to find host physical interface for ganeti nodes - https://phabricator.wikimedia.org/T383207#10469393 (cmooney) Resolved→Open Re-opening as we hit the same issue happening within the reimage cookbook itself.  https://gerrit.wikimedia.org/r/plugins/...
[09:29:07] <wikibugs>	 (CR) Jelto: [C:+2] remove reserved name wikikube-worker2242 because of mw2282 decom [puppet] - https://gerrit.wikimedia.org/r/1112173 (https://phabricator.wikimedia.org/T383965) (owner: Jelto)
[09:29:08] <wikibugs>	 SRE, Infrastructure-Foundations: Spicerack fails to find host physical interface for ganeti nodes - https://phabricator.wikimedia.org/T383207#10469396 (cmooney)
[09:29:08] <wikibugs>	 (PS1) Cathal Mooney: reimage: check if primary IP interface is bridge when getting int [cookbooks] - https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207)
[09:29:09] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832#10469397 (cmooney)
[09:33:18] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2240.codfw.wmnet with OS bookworm
[09:33:28] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2240
[09:33:46] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[09:34:58] <wikibugs>	 (Abandoned) Filippo Giunchedi: thanos: sample 10% traces of thanos-query [puppet] - https://gerrit.wikimedia.org/r/1112174 (https://phabricator.wikimedia.org/T376179) (owner: Filippo Giunchedi)
[09:37:12] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2240 - jelto@cumin1002"
[09:37:16] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2240 - jelto@cumin1002"
[09:37:17] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:37:17] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2240.codfw.wmnet 157.16.192.10.in-addr.arpa 7.5.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:37:19] <wikibugs>	 (PS2) Cathal Mooney: reimage: change check to find physical interface if IP is on bridge [cookbooks] - https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207)
[09:37:20] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2240.codfw.wmnet 157.16.192.10.in-addr.arpa 7.5.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:37:20] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2240
[09:37:37] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2240
[09:37:37] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2240
[09:38:54] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2241.codfw.wmnet with OS bookworm
[09:39:04] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2241
[09:39:09] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[09:42:33] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2241 - jelto@cumin1002"
[09:42:38] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2241 - jelto@cumin1002"
[09:42:38] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:42:38] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2241.codfw.wmnet 158.16.192.10.in-addr.arpa 8.5.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:42:41] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2241.codfw.wmnet 158.16.192.10.in-addr.arpa 8.5.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:42:41] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2241
[09:42:53] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2241
[09:42:53] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2241
[09:43:31] <wikibugs>	 (CR) CI reject: [V:-1] reimage: change check to find physical interface if IP is on bridge [cookbooks] - https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207) (owner: Cathal Mooney)
[09:51:25] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: scrape otelcol metrics [puppet] - https://gerrit.wikimedia.org/r/1112177 (https://phabricator.wikimedia.org/T376179)
[09:55:08] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2240.codfw.wmnet with reason: host reimage
[09:55:51] <wikibugs>	 ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383862#10469458 (Jelto)
[09:59:14] <wikibugs>	 (PS1) Muehlenhoff: Pass the Squid port by parameter [puppet] - https://gerrit.wikimedia.org/r/1112180
[09:59:16] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2240.codfw.wmnet with reason: host reimage
[09:59:36] <wikibugs>	 (CR) CI reject: [V:-1] Pass the Squid port by parameter [puppet] - https://gerrit.wikimedia.org/r/1112180 (owner: Muehlenhoff)
[10:03:36] <wikibugs>	 (PS2) Muehlenhoff: Pass the Squid port by parameter [puppet] - https://gerrit.wikimedia.org/r/1112180
[10:05:46] <wikibugs>	 (CR) CI reject: [V:-1] Pass the Squid port by parameter [puppet] - https://gerrit.wikimedia.org/r/1112180 (owner: Muehlenhoff)
[10:08:03] <logmsgbot>	 !log jelto@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2241.codfw.wmnet with OS bookworm
[10:08:31] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2241.codfw.wmnet with OS bookworm
[10:08:35] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2241
[10:08:35] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2241
[10:09:33] <jinxer-wm>	 FIRING: [12x] KubernetesCalicoDown: mw1448.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:15:52] <wikibugs>	 (PS3) Muehlenhoff: Pass the Squid port by parameter [puppet] - https://gerrit.wikimedia.org/r/1112180
[10:16:35] <wikibugs>	 (PS2) JMeybohm: Pin calico version on all clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1111935 (https://phabricator.wikimedia.org/T341984)
[10:16:35] <wikibugs>	 (PS4) JMeybohm: Update calico-crds to calico v3.29.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984)
[10:16:35] <wikibugs>	 (PS5) JMeybohm: Update calico to v3.29.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984)
[10:16:35] <wikibugs>	 (PS6) JMeybohm: Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984)
[10:16:36] <wikibugs>	 (PS1) JMeybohm: admin_ng: Install VAPs instead of PSPs on k8s >= 1.24 [deployment-charts] - https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984)
[10:18:34] <wikibugs>	 SRE, Traffic, Patch-For-Review: Define an event stream and schema for haproxy_requestctl analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392#10469509 (Fabfur) >>! In T383392#10467504, @Ottomata wrote: > Hi! >  > It looks like [[ https://gitlab.wikimedia.org/repos/data-engineering...
[10:19:31] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2240.codfw.wmnet with OS bookworm
[10:24:08] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] admin_ng: Install VAPs instead of PSPs on k8s >= 1.24 [deployment-charts] - https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[10:24:59] <wikibugs>	 (CR) Brouberol: airflow: refactor/DRY the volume/volumeMounts accross containers (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1112053 (https://phabricator.wikimedia.org/T383430) (owner: Brouberol)
[10:25:54] <wikibugs>	 (PS4) Jcrespo: dbbackups: Migrate db2139 backup generation to db2239 [puppet] - https://gerrit.wikimedia.org/r/1111656 (https://phabricator.wikimedia.org/T373579)
[10:25:54] <wikibugs>	 (PS1) Jcrespo: installserver: Review backup and db hosts [puppet] - https://gerrit.wikimedia.org/r/1112184 (https://phabricator.wikimedia.org/T383902)
[10:26:04] <wikibugs>	 (CR) Brouberol: airflow: refactor/DRY the volume/volumeMounts accross containers (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1112053 (https://phabricator.wikimedia.org/T383430) (owner: Brouberol)
[10:26:10] <wikibugs>	 (PS4) Muehlenhoff: Pass the Squid port by parameter [puppet] - https://gerrit.wikimedia.org/r/1112180
[10:26:22] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2241.codfw.wmnet with reason: host reimage
[10:29:01] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1112180 (owner: Muehlenhoff)
[10:29:30] <wikibugs>	 (PS2) Gkyziridis: ml-services: update articletopic outlink image [deployment-charts] - https://gerrit.wikimedia.org/r/1111580 (https://phabricator.wikimedia.org/T383312)
[10:29:47] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1 C:+2] prometheus: reverse proxy for instances belonging to the host' site too [puppet] - https://gerrit.wikimedia.org/r/1112014 (https://phabricator.wikimedia.org/T371087) (owner: Filippo Giunchedi)
[10:30:34] <wikibugs>	 (CR) Jcrespo: [C:+2] dbbackups: Migrate db2139 backup generation to db2239 [puppet] - https://gerrit.wikimedia.org/r/1111656 (https://phabricator.wikimedia.org/T373579) (owner: Jcrespo)
[10:30:49] <wikibugs>	 (PS2) Jcrespo: installserver: Review backup and db hosts [puppet] - https://gerrit.wikimedia.org/r/1112184 (https://phabricator.wikimedia.org/T383902)
[10:31:15] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2241.codfw.wmnet with reason: host reimage
[10:37:20] <wikibugs>	 (CR) Muehlenhoff: "I made a patch for it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1112180" [puppet] - https://gerrit.wikimedia.org/r/1111681 (owner: CDanis)
[10:49:02] <wikibugs>	 SRE-swift-storage, Thumbor: Image issue on ओम राऊत MrWp - https://phabricator.wikimedia.org/T383859#10469626 (Goresm) Done, now the unknown image is gone.
[10:51:43] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2241.codfw.wmnet with OS bookworm
[10:54:24] <jelto>	 !log homer 'lsw1-b3-codfw*' commit 'T377877'
[10:54:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:28] <stashbot>	 T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877
[10:55:08] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: serve apache vhost on localhost too [puppet] - https://gerrit.wikimedia.org/r/1112186 (https://phabricator.wikimedia.org/T371087)
[10:55:31] <jelto>	 !log homer 'cr*codfw*' commit 'T377877'
[10:55:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:06] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2240-2241].codfw.wmnet
[10:58:08] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2240-2241].codfw.wmnet
[11:02:41] <wikibugs>	 (CR) Kevin Bazira: [C:+1] ml-services: update articletopic outlink image [deployment-charts] - https://gerrit.wikimedia.org/r/1111580 (https://phabricator.wikimedia.org/T383312) (owner: Gkyziridis)
[11:02:52] <wikibugs>	 SRE, Data-Engineering, Dumps 2.0, Dumps-Generation: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10469674 (MPGuy2824)
[11:07:41] <wikibugs>	 (PS1) Alexandros Kosiaris: Map rest_v1/page/(html|title)/ to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1112188 (https://phabricator.wikimedia.org/T374683)
[11:09:49] <wikibugs>	 (CR) CI reject: [V:-1] Map rest_v1/page/(html|title)/ to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1112188 (https://phabricator.wikimedia.org/T374683) (owner: Alexandros Kosiaris)
[11:16:49] <wikibugs>	 (PS1) Aklapper: Phabricator data for WMF QLS: Add MCollins, remove ABittaker [puppet] - https://gerrit.wikimedia.org/r/1112189 (https://phabricator.wikimedia.org/T383884)
[11:19:04] <wikibugs>	 (PS1) Marostegui: es1046: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1112190 (https://phabricator.wikimedia.org/T382569)
[11:20:57] <wikibugs>	 (CR) Marostegui: [C:+2] es1046: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1112190 (https://phabricator.wikimedia.org/T382569) (owner: Marostegui)
[11:25:41] <wikibugs>	 (PS3) Cathal Mooney: reimage: change check to find physical interface if IP is on bridge [cookbooks] - https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207)
[11:27:55] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add es1046 [puppet] - https://gerrit.wikimedia.org/r/1112192 (https://phabricator.wikimedia.org/T382569)
[11:29:49] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s3 on db2239 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1857.67 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:30:05] <marostegui>	 jynus: ^ you aware?
[11:30:06] <wikibugs>	 (PS1) Fabfur: systemd: added option to remain after exit [puppet] - https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976)
[11:30:18] <wikibugs>	 (CR) Jcrespo: [C:+2] installserver: Review backup and db hosts [puppet] - https://gerrit.wikimedia.org/r/1112184 (https://phabricator.wikimedia.org/T383902) (owner: Jcrespo)
[11:30:49] <jynus>	 yeah, I removed notifications, but icinga had a race condition with my manual change
[11:30:57] <jynus>	 i will do it again, should stay
[11:31:39] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1111-1116].eqiad.wmnet
[11:31:39] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker[1111-1116].eqiad.wmnet
[11:33:45] <wikibugs>	 (CR) Btullis: [C:+1] airflow: refactor/DRY the volume/volumeMounts accross containers (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1112053 (https://phabricator.wikimedia.org/T383430) (owner: Brouberol)
[11:33:47] <wikibugs>	 SRE, Data-Engineering, EventStreams: eventstreams is hitting memory limits, causing restarts and paging - https://phabricator.wikimedia.org/T383977#10469769 (hnowlan) p:Triage→High
[11:35:44] <wikibugs>	 (PS4) Cathal Mooney: reimage: change check to find physical interface if IP is on bridge [cookbooks] - https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207)
[11:36:15] <wikibugs>	 SRE, Traffic, Patch-For-Review: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10469771 (Antoine_Quhen) a:Antoine_Quhen
[11:37:07] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: Fabfur)
[11:37:44] <logmsgbot>	 !log jynus@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: reimage
[11:39:36] <wikibugs>	 SRE-swift-storage, Thumbor: Image issue on ओम राऊत MrWp - https://phabricator.wikimedia.org/T383859#10469777 (MatthewVernon) Open→Resolved a:MatthewVernon Thanks for confirming, I'll close this task now.
[11:39:47] <wikibugs>	 (PS1) Hnowlan: eventstreams: bump replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1112197 (https://phabricator.wikimedia.org/T383977)
[11:40:13] <wikibugs>	 (PS5) Cathal Mooney: reimage: change check to find physical interface if IP is on bridge [cookbooks] - https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207)
[11:41:05] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2141.codfw.wmnet with OS bookworm
[11:42:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2027.codfw.wmnet with OS bookworm
[11:42:12] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10469786 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2027.codfw.wmnet with OS bookworm
[11:42:43] <wikibugs>	 (CR) Clément Goubert: [C:+1] eventstreams: bump replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1112197 (https://phabricator.wikimedia.org/T383977) (owner: Hnowlan)
[11:53:01] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[11:53:35] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[11:55:13] <wikibugs>	 (CR) Hnowlan: [C:+2] eventstreams: bump replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1112197 (https://phabricator.wikimedia.org/T383977) (owner: Hnowlan)
[11:56:19] <wikibugs>	 (Merged) jenkins-bot: eventstreams: bump replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1112197 (https://phabricator.wikimedia.org/T383977) (owner: Hnowlan)
[11:57:37] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2141.codfw.wmnet with reason: host reimage
[11:58:51] <moritzm>	 !log installing Linux 6.1.124 on Bookworm hosts
[11:58:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:06] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250117T0800)
[12:00:06] <jouncebot>	 eoghan, jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250117T1200).
[12:00:21] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[12:00:45] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[12:01:52] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2141.codfw.wmnet with reason: host reimage
[12:02:13] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[12:02:43] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[12:03:10] <icinga-wm>	 PROBLEM - Disk space on prometheus1006 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s-dse 3087MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1006&var-datasource=eqiad+prometheus/ops
[12:04:33] <jinxer-wm>	 FIRING: [12x] KubernetesCalicoDown: mw1448.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:07:27] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1111-1116].eqiad.wmnet
[12:07:29] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1111-1116].eqiad.wmnet
[12:09:06] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Spicerack fails to find host physical interface for ganeti nodes - https://phabricator.wikimedia.org/T383207#10469892 (cmooney) Today's blip has made me realise how we didn't hit this more often in the past:  # In June 2023 (after T296832), the reima...
[12:09:22] <wikibugs>	 (CR) Marostegui: [C:+2] instances.yaml: Add es1046 [puppet] - https://gerrit.wikimedia.org/r/1112192 (https://phabricator.wikimedia.org/T382569) (owner: Marostegui)
[12:09:33] <jinxer-wm>	 RESOLVED: [12x] KubernetesCalicoDown: mw1448.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:21:06] <wikibugs>	 (CR) Alexandros Kosiaris: [C:-1] "One comment, otherwise LGTM" [puppet] - https://gerrit.wikimedia.org/r/1112180 (owner: Muehlenhoff)
[12:23:02] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[12:23:36] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[12:25:30] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:25:42] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2141.codfw.wmnet with OS bookworm
[12:25:54] <wikibugs>	 (PS1) Marostegui: sections.yaml: Add pc6 to codfw [puppet] - https://gerrit.wikimedia.org/r/1112199 (https://phabricator.wikimedia.org/T383234)
[12:27:24] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] sections.yaml: Add pc6 to codfw [puppet] - https://gerrit.wikimedia.org/r/1112199 (https://phabricator.wikimedia.org/T383234) (owner: Marostegui)
[12:27:28] <wikibugs>	 (CR) Marostegui: [C:+2] sections.yaml: Add pc6 to codfw [puppet] - https://gerrit.wikimedia.org/r/1112199 (https://phabricator.wikimedia.org/T383234) (owner: Marostegui)
[12:31:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es1046 to dbctl depooled T382569', diff saved to https://phabricator.wikimedia.org/P72136 and previous config saved to /var/cache/conftool/dbconfig/20250117-123153-marostegui.json
[12:32:00] <stashbot>	 T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569
[12:32:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 1%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72137 and previous config saved to /var/cache/conftool/dbconfig/20250117-123235-root.json
[12:33:02] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[12:33:36] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[12:43:53] <wikibugs>	 (PS2) Fabfur: systemd: added option to remain after exit [puppet] - https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976)
[12:43:54] <wikibugs>	 (PS1) Kamila Součková: wikikube: rename mw146[4-9] -> wikikube-worker* [puppet] - https://gerrit.wikimedia.org/r/1112200 (https://phabricator.wikimedia.org/T365571)
[12:47:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72138 and previous config saved to /var/cache/conftool/dbconfig/20250117-124740-root.json
[12:49:31] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: Fabfur)
[12:49:52] <koi>	 Hi urbanecm, can you have a look at patch 
[12:49:54] <koi>	 https://gerrit.wikimedia.org/r/c/1100228 ? thanks
[12:49:59] <urbanecm>	 hey, sure!
[12:50:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[12:50:42] <urbanecm>	 koi: i'd still like to know what the result of the investigations were. do you know anything about that?
[12:52:04] <koi>	 urbanecm, sorry but am i missing some context? what kind of investigation
[12:52:36] <urbanecm>	 koi: the task is stalled per https://phabricator.wikimedia.org/T378287#10341850. before going ahead with the patch, the investigation should conclude
[12:52:42] <urbanecm>	 so that we don't accidentally cause any problems
[12:52:47] <wikibugs>	 (CR) Elukey: "LGTM, I have a doubt about the need of `profile::rsyslog::udp_localhost_compat`. If it is needed feel free to go and merge!" [puppet] - https://gerrit.wikimedia.org/r/1111659 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[12:54:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2282 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:55:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[12:56:06] <wikibugs>	 (CR) Elukey: [C:+1] Don't setup database config for tilerator on bookworm [puppet] - https://gerrit.wikimedia.org/r/1112167 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[12:56:35] <koi>	 urbanecm, um, actually this kinda confuses me, at first i thought they meant T381197, but i'm not sure if they mean prod as well
[12:56:36] <stashbot>	 T381197: Create views for SecurePoll db tables in Toolforge replicas - https://phabricator.wikimedia.org/T381197
[12:57:17] <urbanecm>	 koi: exactly. unfortunately, that comment doesn't provide context, which makes it hard to follow up on that. but we definitely need to clarify that before going ahead. i hope that makes sense to you
[12:57:57] <koi>	 fair enough, i'll ask about the progress of that investigation
[12:58:06] <koi>	 thanks for the reply!
[12:59:54] <wikibugs>	 (PS5) Muehlenhoff: Add separate maps master/replica roles for the new Bookworm setup [puppet] - https://gerrit.wikimedia.org/r/1111659 (https://phabricator.wikimedia.org/T381565)
[12:59:59] <wikibugs>	 (CR) Muehlenhoff: Add separate maps master/replica roles for the new Bookworm setup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1111659 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[13:02:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72139 and previous config saved to /var/cache/conftool/dbconfig/20250117-130245-root.json
[13:03:16] <wikibugs>	 (CR) Gkyziridis: [C:+1] ml-services: update articletopic outlink image [deployment-charts] - https://gerrit.wikimedia.org/r/1111580 (https://phabricator.wikimedia.org/T383312) (owner: Gkyziridis)
[13:04:17] <wikibugs>	 (CR) Brouberol: airflow: define a K8sPodOperator pod template for pods needing access to hadoop (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1112057 (https://phabricator.wikimedia.org/T383430) (owner: Brouberol)
[13:05:31] <wikibugs>	 (CR) Brouberol: airflow: define a K8sPodOperator pod template for pods needing access to hadoop (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1112057 (https://phabricator.wikimedia.org/T383430) (owner: Brouberol)
[13:06:15] <wikibugs>	 (CR) Muehlenhoff: Pass the Squid port by parameter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1112180 (owner: Muehlenhoff)
[13:09:21] <wikibugs>	 (PS3) Brouberol: airflow: define a K8sPodOperator pod template for pods needing access to hadoop [deployment-charts] - https://gerrit.wikimedia.org/r/1112057 (https://phabricator.wikimedia.org/T383430)
[13:10:01] <wikibugs>	 (CR) Elukey: [C:+1] Add separate maps master/replica roles for the new Bookworm setup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1111659 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[13:13:24] <wikibugs>	 (CR) Vgutierrez: [C:-1] systemd: added option to remain after exit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: Fabfur)
[13:13:34] <wikibugs>	 ops-codfw, ops-eqiad, SRE, SRE-swift-storage, and 3 others: Supermicro's Config J hot swap behavior - https://phabricator.wikimedia.org/T383903#10470352 (elukey) Supermicro came back with some nice suggestions to clear the state of a new/replaced disk (if it gets into something like Foreign state...
[13:16:29] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003 (elukey) NEW
[13:17:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72140 and previous config saved to /var/cache/conftool/dbconfig/20250117-131751-root.json
[13:18:10] <wikibugs>	 (CR) AikoChou: [C:+1] ml-services: update articletopic outlink image [deployment-charts] - https://gerrit.wikimedia.org/r/1111580 (https://phabricator.wikimedia.org/T383312) (owner: Gkyziridis)
[13:24:52] <wikibugs>	 (CR) Btullis: [C:+1] airflow: define a K8sPodOperator pod template for pods needing access to hadoop [deployment-charts] - https://gerrit.wikimedia.org/r/1112057 (https://phabricator.wikimedia.org/T383430) (owner: Brouberol)
[13:27:13] <wikibugs>	 (PS3) Fabfur: systemd: added option to remain after exit [puppet] - https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976)
[13:27:51] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: deploy hive config under both hadoop and spark config dirs [deployment-charts] - https://gerrit.wikimedia.org/r/1112052 (https://phabricator.wikimedia.org/T383430) (owner: Brouberol)
[13:27:54] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: refactor/DRY the volume/volumeMounts accross containers [deployment-charts] - https://gerrit.wikimedia.org/r/1112053 (https://phabricator.wikimedia.org/T383430) (owner: Brouberol)
[13:27:57] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: define a K8sPodOperator pod template for pods needing access to hadoop [deployment-charts] - https://gerrit.wikimedia.org/r/1112057 (https://phabricator.wikimedia.org/T383430) (owner: Brouberol)
[13:28:52] <wikibugs>	 (PS10) JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984)
[13:28:52] <wikibugs>	 (PS1) JMeybohm: calico: Create certificates for Typha/Felix mTLS [puppet] - https://gerrit.wikimedia.org/r/1112204 (https://phabricator.wikimedia.org/T365687)
[13:29:33] <wikibugs>	 (Merged) jenkins-bot: airflow: deploy hive config under both hadoop and spark config dirs [deployment-charts] - https://gerrit.wikimedia.org/r/1112052 (https://phabricator.wikimedia.org/T383430) (owner: Brouberol)
[13:29:35] <wikibugs>	 (CR) Fabfur: systemd: added option to remain after exit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: Fabfur)
[13:29:36] <wikibugs>	 (Merged) jenkins-bot: airflow: refactor/DRY the volume/volumeMounts accross containers [deployment-charts] - https://gerrit.wikimedia.org/r/1112053 (https://phabricator.wikimedia.org/T383430) (owner: Brouberol)
[13:29:37] <wikibugs>	 (Merged) jenkins-bot: airflow: define a K8sPodOperator pod template for pods needing access to hadoop [deployment-charts] - https://gerrit.wikimedia.org/r/1112057 (https://phabricator.wikimedia.org/T383430) (owner: Brouberol)
[13:30:37] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10470456 (MatthewVernon)
[13:31:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] calico: Create certificates for Typha/Felix mTLS [puppet] - 10https://gerrit.wikimedia.org/r/1112204 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm)
[13:32:47] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10470469 (10Ottomata) @Fabfur thanks, merged.   Since this is a bug on our side, we can work with you to make a backwards incompatible change to remove...
[13:32:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72142 and previous config saved to /var/cache/conftool/dbconfig/20250117-133256-root.json
[13:33:30] <wikibugs>	 (03CR) 10Vgutierrez: systemd: added option to remain after exit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur)
[13:34:26] <wikibugs>	 (03PS1) 10Arnaudb: peopleweb: disable envoy request timeout, enable log [puppet] - 10https://gerrit.wikimedia.org/r/1112205 (https://phabricator.wikimedia.org/T383750)
[13:34:27] <wikibugs>	 (03CR) 10Arnaudb: "I agree with your statement in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1112056/comments/41ce4051_55c7f928 this is not the pro" [puppet] - 10https://gerrit.wikimedia.org/r/1112205 (https://phabricator.wikimedia.org/T383750) (owner: 10Arnaudb)
[13:34:31] <wikibugs>	 (03PS1) 10Brouberol: airflow: bypass DNS resolution for the PG URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112206 (https://phabricator.wikimedia.org/T383651)
[13:34:45] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10470473 (10Fabfur) >>! In T383914#10470469, @Ottomata wrote: > @Fabfur thanks, merged.  >  > Since this is a bug on our side, we can work with you to m...
[13:35:00] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[13:35:18] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[13:35:44] <wikibugs>	 (03PS2) 10JMeybohm: calico: Create certificates for Typha/Felix mTLS [puppet] - 10https://gerrit.wikimedia.org/r/1112204 (https://phabricator.wikimedia.org/T365687)
[13:35:44] <wikibugs>	 (03PS11) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984)
[13:35:59] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[13:36:38] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[13:37:29] <wikibugs>	 (03PS4) 10Fabfur: systemd: added option to remain after exit [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976)
[13:37:43] <wikibugs>	 (03CR) 10Fabfur: systemd: added option to remain after exit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur)
[13:41:16] <wikibugs>	 (03PS1) 10Urbanecm: [Growth] Remove tybanner campaigns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112207 (https://phabricator.wikimedia.org/T380405)
[13:41:18] <wikibugs>	 (03PS1) 10Urbanecm: [Growth] Add fundraising- as a prefix for fundraising campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112208 (https://phabricator.wikimedia.org/T380405)
[13:42:03] <wikibugs>	 (03PS1) 10Dreamy Jazz: Pin wgCheckUserEnableTempAccountsOnboardingDialog as false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112209 (https://phabricator.wikimedia.org/T384005)
[13:42:03] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur)
[13:44:45] <wikibugs>	 (03PS1) 10Brouberol: airflow: hotfix: do not render empty volumes/volumeMounts blocks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112210 (https://phabricator.wikimedia.org/T383430)
[13:45:23] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow: hotfix: do not render empty volumes/volumeMounts blocks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112210 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol)
[13:46:43] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow: bypass DNS resolution for the PG URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112206 (https://phabricator.wikimedia.org/T383651) (owner: 10Brouberol)
[13:47:03] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Switch an-test-presto1001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1111180 (owner: 10Muehlenhoff)
[13:47:58] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: hotfix: do not render empty volumes/volumeMounts blocks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112210 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol)
[13:48:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72143 and previous config saved to /var/cache/conftool/dbconfig/20250117-134801-root.json
[13:50:09] <wikibugs>	 (03PS2) 10Brouberol: airflow: bypass DNS resolution for the PG URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112206 (https://phabricator.wikimedia.org/T383651)
[13:50:56] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[13:51:12] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[13:54:08] <wikibugs>	 (03PS1) 10Brouberol: airflow: hotfix: fix broken indentation in the pod template configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112213 (https://phabricator.wikimedia.org/T383430)
[13:54:58] <wikibugs>	 (03CR) 10Máté Szabó: [C:03+2] Pin wgCheckUserEnableTempAccountsOnboardingDialog as false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112209 (https://phabricator.wikimedia.org/T384005) (owner: 10Dreamy Jazz)
[13:55:42] <wikibugs>	 (03Merged) 10jenkins-bot: Pin wgCheckUserEnableTempAccountsOnboardingDialog as false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112209 (https://phabricator.wikimedia.org/T384005) (owner: 10Dreamy Jazz)
[13:57:16] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: hotfix: fix broken indentation in the pod template configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112213 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol)
[13:59:50] <wikibugs>	 (03PS3) 10Brouberol: airflow: bypass DNS resolution for the PG URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112206 (https://phabricator.wikimedia.org/T383651)
[14:00:16] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[14:00:32] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[14:01:31] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:01:47] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:03:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72144 and previous config saved to /var/cache/conftool/dbconfig/20250117-140308-root.json
[14:07:19] <wikibugs>	 (03CR) 10Muehlenhoff: "Can I reboot an-test-presto1001 any time or should I sync up beforehand?" [puppet] - 10https://gerrit.wikimedia.org/r/1111180 (owner: 10Muehlenhoff)
[14:07:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add separate maps master/replica roles for the new Bookworm setup [puppet] - 10https://gerrit.wikimedia.org/r/1111659 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[14:10:10] <wikibugs>	 (03PS4) 10Muehlenhoff: Make maps-test2001 a bookworm maps master node (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565)
[14:12:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2027.codfw.wmnet with reason: host reimage
[14:12:30] <wikibugs>	 (03PS5) 10Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565)
[14:14:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good, this made the reimage of ganeti2027 (with test-cookbook) work fine" [cookbooks] - 10https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney)
[14:14:28] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[14:15:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2027.codfw.wmnet with reason: host reimage
[14:18:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72145 and previous config saved to /var/cache/conftool/dbconfig/20250117-141813-root.json
[14:19:25] <mszabo>	 jouncebot: nowandnext
[14:19:25] <jouncebot>	 For the next 17 hour(s) and 40 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250117T0800)
[14:19:25] <jouncebot>	 In 17 hour(s) and 40 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250118T0800)
[14:21:18] <wikibugs>	 (03PS1) 10Máté Szabó: Revert "Pin wgCheckUserEnableTempAccountsOnboardingDialog as false" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112217
[14:22:10] <wikibugs>	 (03PS1) 10Muehlenhoff: Add missing Hiera settings for new bookworm master roles [puppet] - 10https://gerrit.wikimedia.org/r/1112218 (https://phabricator.wikimedia.org/T381565)
[14:24:01] <wikibugs>	 (03CR) 10Santiago Faci: "Just a question to confirm if we are using the right config variable for the experiment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming)
[14:24:03] <wikibugs>	 (03CR) 10STran: [C:03+2] Revert "Pin wgCheckUserEnableTempAccountsOnboardingDialog as false" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112217 (owner: 10Máté Szabó)
[14:25:07] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Disable sidebar cache on the auth domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112219 (https://phabricator.wikimedia.org/T383916)
[14:25:09] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Pin wgCheckUserEnableTempAccountsOnboardingDialog as false" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112217 (owner: 10Máté Szabó)
[14:26:09] <wikibugs>	 (03PS1) 10Dreamy Jazz: Revert^2 "Pin wgCheckUserEnableTempAccountsOnboardingDialog as false" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112220
[14:26:20] <wikibugs>	 (03PS1) 10Slyngshede: Add OIDC support to development environment [software/bitu] - 10https://gerrit.wikimedia.org/r/1112221
[14:32:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:33:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72146 and previous config saved to /var/cache/conftool/dbconfig/20250117-143318-root.json
[14:33:40] <wikibugs>	 (03CR) 10Máté Szabó: [C:03+1] Revert^2 "Pin wgCheckUserEnableTempAccountsOnboardingDialog as false" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112220 (owner: 10Dreamy Jazz)
[14:34:40] <wikibugs>	 (03PS1) 10Elukey: role::kafka::monitoring: re-add Jumbo config [puppet] - 10https://gerrit.wikimedia.org/r/1112223
[14:34:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2027.codfw.wmnet with OS bookworm
[14:34:46] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10470903 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2027.codfw.wmnet with OS bookworm completed: - ganeti202...
[14:36:35] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4814/console" [puppet] - 10https://gerrit.wikimedia.org/r/1112223 (owner: 10Elukey)
[14:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:38:02] <wikibugs>	 (03PS1) 10Slyngshede: C:idm remove associate_by_email pipeline [puppet] - 10https://gerrit.wikimedia.org/r/1112224 (https://phabricator.wikimedia.org/T383707)
[14:41:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] role::kafka::monitoring: re-add Jumbo config [puppet] - 10https://gerrit.wikimedia.org/r/1112223 (owner: 10Elukey)
[14:42:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1112223 (owner: 10Elukey)
[14:42:43] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] role::kafka::monitoring: re-add Jumbo config [puppet] - 10https://gerrit.wikimedia.org/r/1112223 (owner: 10Elukey)
[14:44:16] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove firewall rule for rsync on archiva [puppet] - 10https://gerrit.wikimedia.org/r/1112226 (https://phabricator.wikimedia.org/T367315)
[14:45:56] <MichaelG_WMF>	 Is it just me or is Grafana not having a good time right now?
[14:46:27] <MichaelG_WMF>	 I'm seeing lots of `Firefox can’t establish a connection to the server at wss://grafana-rw.wikimedia.org/api/live/ws.` and equivalent in Chromium
[14:48:10] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove rsync from archiva [puppet] - 10https://gerrit.wikimedia.org/r/1112228 (https://phabricator.wikimedia.org/T367315)
[14:48:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72147 and previous config saved to /var/cache/conftool/dbconfig/20250117-144824-root.json
[14:55:45] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Add missing Hiera settings for new bookworm master roles [puppet] - 10https://gerrit.wikimedia.org/r/1112218 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[14:56:59] <elukey>	 MichaelG_WMF: works fine on my side (but on Chrome) - any specific dashboard that you are looking at?
[14:57:15] <wikibugs>	 (03PS3) 10JMeybohm: calico: Create certificates for Typha/Felix mTLS [puppet] - 10https://gerrit.wikimedia.org/r/1112204 (https://phabricator.wikimedia.org/T365687)
[14:57:16] <wikibugs>	 (03PS12) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984)
[14:58:39] <MichaelG_WMF>	 elukey: I'm working right now on https://grafana-rw.wikimedia.org/d/ff15559c-b4a2-4363-94c8-190a086b3315/michael-s-playground?forceLogin&forceLogin=true&from=now-7d&orgId=1&to=now
[14:59:15] <MichaelG_WMF>	 it is mainly the editing/live features that are giving me trouble
[14:59:43] <MichaelG_WMF>	 but hearing in #wikimedia-observability right now that this might be expected
[14:59:48] <elukey>	 I see that observability is answering in their chan, I'll leave it to them since they are more knowledgeable
[14:59:51] <elukey>	 exactly :)
[15:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:13] <icinga-wm>	 PROBLEM - SSH on prometheus1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:09:41] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Run validator when creating switch-interface in provision script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1112069 (https://phabricator.wikimedia.org/T383915) (owner: 10Cathal Mooney)
[15:11:32] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Pin calico version on all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111935 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[15:12:18] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#prometheus1006:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:12:51] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017 (10Neslihan_Turan_WMDE) 03NEW
[15:13:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:13:30] <moritzm>	 !log powercycle prometheus1006
[15:13:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:46] <wikibugs>	 (03Merged) 10jenkins-bot: Run validator when creating switch-interface in provision script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1112069 (https://phabricator.wikimedia.org/T383915) (owner: 10Cathal Mooney)
[15:14:49] <icinga-wm>	 PROBLEM - Host prometheus1006 is DOWN: PING CRITICAL - Packet loss = 100%
[15:15:21] <wikibugs>	 (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112204 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm)
[15:15:24] <wikibugs>	 (03Merged) 10jenkins-bot: Pin calico version on all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111935 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[15:15:27] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018 (10SuzanneWood-WMDE) 03NEW
[15:15:45] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10471120 (10Neslihan_Turan_WMDE) @WMDECyn kindly pinging you to approve this :)
[15:15:51] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10471121 (10SuzanneWood-WMDE)
[15:17:01] <icinga-wm>	 RECOVERY - Host prometheus1006 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[15:17:03] <icinga-wm>	 RECOVERY - SSH on prometheus1006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:17:24] <sukhe>	 thanks moritzm!
[15:18:12] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:19:22] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#prometheus1006:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:20:42] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[15:20:57] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[15:21:14] <wikibugs>	 (03PS1) 10JMeybohm: calico: Add support for Typha/Felix mTLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112235 (https://phabricator.wikimedia.org/T365687)
[15:21:16] <wikibugs>	 (03PS1) 10JMeybohm: Update calico to 0.2.11 in staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112236 (https://phabricator.wikimedia.org/T365687)
[15:25:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update calico to 0.2.11 in staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112236 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm)
[15:25:54] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox: execute interface validator in provision script for switch interfaces - https://phabricator.wikimedia.org/T383915#10471187 (10cmooney) 05Open→03Resolved a:03cmooney Merged now and working as expected in tests. ` Script abo...
[15:26:15] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[15:26:52] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1112204 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm)
[15:28:33] <wikibugs>	 (03PS2) 10JMeybohm: Update calico to 0.2.11 in staging-codfw and enable mTLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112236 (https://phabricator.wikimedia.org/T365687)
[15:29:26] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] calico: Create certificates for Typha/Felix mTLS [puppet] - 10https://gerrit.wikimedia.org/r/1112204 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm)
[15:32:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update calico to 0.2.11 in staging-codfw and enable mTLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112236 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm)
[15:39:13] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:39:29] <logmsgbot>	 !log cmooney@cumin1002 END (FAIL) - Cookbook sre.netbox.update-extras (exit_code=1) rolling restart_daemons on A:netbox
[15:40:03] <topranks>	 !log manually restarting netbox service on netbox1003 
[15:40:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:43] <wikibugs>	 (03Abandoned) 10Bernard Wang: Enable web search AB test stream in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112082 (owner: 10Bernard Wang)
[15:48:13] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Phabricator data for WMF QLS: Add MCollins, remove ABittaker [puppet] - 10https://gerrit.wikimedia.org/r/1112189 (https://phabricator.wikimedia.org/T383884) (owner: 10Aklapper)
[15:54:11] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+1] Disable sidebar cache on the auth domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112219 (https://phabricator.wikimedia.org/T383916) (owner: 10Bartosz Dziewoński)
[15:55:06] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "Overall LGTM. See my comment on how to improve conditionals a bit." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert)
[15:58:57] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubestage2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:59:07] <wikibugs>	 (03PS1) 10Filippo Giunchedi: benthos: add nocookies and tls session metadata [puppet] - 10https://gerrit.wikimedia.org/r/1112248 (https://phabricator.wikimedia.org/T383900)
[15:59:09] <jayme>	 that's me, back in a second
[16:03:57] <jinxer-wm>	 FIRING: [7x] KubernetesCalicoDown: kubestage2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:06:22] <wikibugs>	 (03PS1) 10Scott French: sre.switchdc.mediawiki: add -next services [cookbooks] - 10https://gerrit.wikimedia.org/r/1112246 (https://phabricator.wikimedia.org/T377040)
[16:08:57] <jinxer-wm>	 RESOLVED: [7x] KubernetesCalicoDown: kubestage2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:10:49] <wikibugs>	 (03CR) 10Hnowlan: "Overall lgtm, one nit. There's lots of scope for cleanup/harmonisation once we look at a refactor but I think for now this works." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert)
[16:12:26] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] sre.switchdc.mediawiki: add -next services [cookbooks] - 10https://gerrit.wikimedia.org/r/1112246 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French)
[16:16:17] <wikibugs>	 (03CR) 10Scott French: [C:03+2] sre.switchdc.mediawiki: add -next services [cookbooks] - 10https://gerrit.wikimedia.org/r/1112246 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French)
[16:20:06] <wikibugs>	 (03CR) 10Herron: [C:03+1] prometheus: serve apache vhost on localhost too [puppet] - 10https://gerrit.wikimedia.org/r/1112186 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi)
[16:20:32] <wikibugs>	 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10471506 (10MatthewVernon) I've spent some more time with these logs, and I think I may have reached the point of diminishing returns. I extracted logs for...
[16:21:11] <wikibugs>	 (03PS3) 10Scott French: service::catalog: enable monitoring for mw-(web|api-ext)-next [puppet] - 10https://gerrit.wikimedia.org/r/1101124 (https://phabricator.wikimedia.org/T377040)
[16:21:12] <wikibugs>	 (03PS1) 10Scott French: mw-(web|api-ext)-next: bump replicas and update TODO [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112078 (https://phabricator.wikimedia.org/T377040)
[16:24:51] <wikibugs>	 (03PS13) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984)
[16:24:51] <wikibugs>	 (03PS1) 10JMeybohm: calico: mTLS certificate symlinks have to be relative [puppet] - 10https://gerrit.wikimedia.org/r/1112250 (https://phabricator.wikimedia.org/T365687)
[16:25:28] <wikibugs>	 (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112250 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm)
[16:25:30] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:26:28] <wikibugs>	 (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: add -next services [cookbooks] - 10https://gerrit.wikimedia.org/r/1112246 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French)
[16:27:35] <wikibugs>	 (03PS4) 10Scott French: service::catalog: enable monitoring for mw-(web|api-ext)-next [puppet] - 10https://gerrit.wikimedia.org/r/1101124 (https://phabricator.wikimedia.org/T377040)
[16:29:11] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] calico: mTLS certificate symlinks have to be relative [puppet] - 10https://gerrit.wikimedia.org/r/1112250 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm)
[16:31:40] <wikibugs>	 (03PS1) 10Hnowlan: wikikube: reimage 5 former jobrunner/videoscaler hosts to workers [puppet] - 10https://gerrit.wikimedia.org/r/1112252 (https://phabricator.wikimedia.org/T354791)
[16:31:59] <wikibugs>	 (03Abandoned) 10Hnowlan: wikikube: reimage 5 former jobrunner/videoscaler hosts to workers [puppet] - 10https://gerrit.wikimedia.org/r/1111670 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[16:32:39] <wikibugs>	 (03CR) 10DCausse: "is there anything blocking this patch?" [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking)
[16:33:09] <wikibugs>	 (03CR) 10Clare Ming: Enable the text experiment on testwiki only (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming)
[16:36:54] <wikibugs>	 (03CR) 10Tacsipacsi: "Thanks, but if it cannot be a real extremal value, I’m not sure if it’s worth it (especially considering the human resources needed to do " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112054 (owner: 10Lucas Werkmeister (WMDE))
[16:37:07] <wikibugs>	 (03PS2) 10Hnowlan: wikikube: reimage 5 former jobrunner/videoscaler hosts to workers [puppet] - 10https://gerrit.wikimedia.org/r/1112252 (https://phabricator.wikimedia.org/T354791)
[16:39:24] <wikibugs>	 (03CR) 10Scott French: [C:03+1] wikikube: reimage 5 former jobrunner/videoscaler hosts to workers [puppet] - 10https://gerrit.wikimedia.org/r/1112252 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[16:40:19] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] wikikube: reimage 5 former jobrunner/videoscaler hosts to workers [puppet] - 10https://gerrit.wikimedia.org/r/1112252 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[16:40:38] <wikibugs>	 (03CR) 10DCausse: "you mean a test in puppet? or perhaps allow only a set of specific keys and fail the puppet run?" [puppet] - 10https://gerrit.wikimedia.org/r/1091325 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson)
[16:47:23] <wikibugs>	 (03CR) 10Santiago Faci: Enable the text experiment on testwiki only (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming)
[16:54:24] <wikibugs>	 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10471670 (10MatthewVernon) 05Open→03Stalled Reported as [[ https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1093304 | Debian #1093304 ]]; more so we've...
[16:54:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2282 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:03:35] <wikibugs>	 06SRE, 10Ceph, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10471741 (10cmooney) 05Open→03Resolved a:03cmooney Gonna close this one, everything looks good great to see.  The real test will be...
[17:05:16] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10471771 (10hnowlan)
[17:05:26] <wikibugs>	 (03PS5) 10JMeybohm: Update calico-crds to calico v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984)
[17:05:26] <wikibugs>	 (03PS6) 10JMeybohm: Update calico to v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984)
[17:05:26] <wikibugs>	 (03PS2) 10JMeybohm: admin_ng: Install VAPs instead of PSPs on k8s >= 1.24 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984)
[17:05:26] <wikibugs>	 (03PS7) 10JMeybohm: Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984)
[17:09:10] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update calico to v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[17:09:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update calico-crds to calico v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[17:09:19] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "Apart from the commit message, this LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1112200 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková)
[17:09:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin_ng: Install VAPs instead of PSPs on k8s >= 1.24 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[17:09:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[17:09:56] <wikibugs>	 (03PS1) 10Audrey Penven: Add known-good regexes for WikibaseQualityConstraints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112261 (https://phabricator.wikimedia.org/T380751)
[17:11:51] <logmsgbot>	 !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@b0cd4df]: Deploy latest DAGs for 'analytics' Airflow instance. T366542.
[17:11:56] <stashbot>	 T366542: Consider renaming columns and/or table to abide by the data modeling guidelines - https://phabricator.wikimedia.org/T366542
[17:12:21] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:12:24] <logmsgbot>	 !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@b0cd4df]: Deploy latest DAGs for 'analytics' Airflow instance. T366542. (duration: 00m 32s)
[17:13:53] <wikibugs>	 (03PS1) 10Hnowlan: kubernetes: remove out-of-warranty jobrunner hosts awaiting reimage [puppet] - 10https://gerrit.wikimedia.org/r/1112262 (https://phabricator.wikimedia.org/T384043)
[17:18:12] <wikibugs>	 (03CR) 10Scott French: [C:03+1] kubernetes: remove out-of-warranty jobrunner hosts awaiting reimage [puppet] - 10https://gerrit.wikimedia.org/r/1112262 (https://phabricator.wikimedia.org/T384043) (owner: 10Hnowlan)
[17:20:03] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] kubernetes: remove out-of-warranty jobrunner hosts awaiting reimage [puppet] - 10https://gerrit.wikimedia.org/r/1112262 (https://phabricator.wikimedia.org/T384043) (owner: 10Hnowlan)
[17:25:34] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.decommission for hosts mw[2259,2263-2266].codfw.wmnet
[17:27:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10471923 (10phaultfinder)
[17:27:57] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Migrate port utilisation alert from LibreNMS to alertmanage - https://phabricator.wikimedia.org/T384052 (10cmooney) 03NEW p:05Triage→03Medium
[17:28:10] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10471938 (10cmooney)
[17:28:29] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Migrate port utilisation alert from LibreNMS to alertmanage - https://phabricator.wikimedia.org/T384052#10471937 (10cmooney)
[17:29:36] <wikibugs>	 06SRE, 06SRE-OnFire, 06serviceops, 10Release-Engineering-Team (Radar), 07Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162#10471941 (10LSobanski) #serviceops own the deployment servers, reassigning.
[17:31:15] <jinxer-wm>	 FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[17:31:48] <hnowlan>	 ^ me, will be resolved after a puppet run
[17:34:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:36:14] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.dns.netbox
[17:36:21] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:39:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:40:04] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Migrate port utilisation alert from LibreNMS to alertmanager - https://phabricator.wikimedia.org/T384052#10471964 (10cmooney)
[17:40:35] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2259,2263-2266].codfw.wmnet decommissioned, removing all IPs except the asset tag one - hnowlan@cumin2002"
[17:40:49] <denisse>	 !log Upgrading LibreNMS in production - T384036
[17:40:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:15] <jinxer-wm>	 RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[17:41:45] <jinxer-wm>	 FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[17:41:46] <logmsgbot>	 !log denisse@deploy2002 Started deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 24.12.0 - T384036
[17:42:00] <logmsgbot>	 !log denisse@deploy2002 Finished deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 24.12.0 - T384036 (duration: 00m 14s)
[17:44:04] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2259,2263-2266].codfw.wmnet decommissioned, removing all IPs except the asset tag one - hnowlan@cumin2002"
[17:44:04] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:44:05] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[2259,2263-2266].codfw.wmnet
[17:46:45] <jinxer-wm>	 RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[17:46:58] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission mw2259,mw225[3-6] - https://phabricator.wikimedia.org/T384043#10471975 (10hnowlan)
[17:47:45] <jinxer-wm>	 FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[17:51:30] <jinxer-wm>	 RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[17:59:18] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10472018 (10hnowlan)
[17:59:41] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[18:04:53] <wikibugs>	 (03CR) 10Subramanya Sastry: [C:03+1] Remove KartographerParsoidSupport flag from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111932 (https://phabricator.wikimedia.org/T340134) (owner: 10Isabelle Hurbain-Palatin)
[18:05:22] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names for newly assigned wmcs private ipv6 entries - cmooney@cumin1002"
[18:05:27] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names for newly assigned wmcs private ipv6 entries - cmooney@cumin1002"
[18:05:27] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:06:06] <wikibugs>	 (03PS3) 10Clare Ming: Enable the text experiment on testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715)
[18:08:44] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[18:12:54] <logmsgbot>	 !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:13:31] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[18:16:25] <wikibugs>	 (03CR) 10CDanis: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1112177 (https://phabricator.wikimedia.org/T376179) (owner: 10Filippo Giunchedi)
[18:17:40] <wikibugs>	 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10472072 (10A_smart_kitten)
[18:18:41] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names for newly assigned wmcs private ipv6 entries - cmooney@cumin1002"
[18:18:45] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names for newly assigned wmcs private ipv6 entries - cmooney@cumin1002"
[18:18:45] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:23:27] <wikibugs>	 (03PS1) 10Cathal Mooney: Modifications to CR BGP policy for eqiad cloud-private IPv6 aggregate [homer/public] - 10https://gerrit.wikimedia.org/r/1112268 (https://phabricator.wikimedia.org/T37947)
[18:24:14] <wikibugs>	 (03CR) 10CDanis: [C:03+1] "lgtm but what's your use case?" [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur)
[18:26:05] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:27:39] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Modifications to CR BGP policy for eqiad cloud-private IPv6 aggregate [homer/public] - 10https://gerrit.wikimedia.org/r/1112268 (https://phabricator.wikimedia.org/T37947) (owner: 10Cathal Mooney)
[18:29:52] <wikibugs>	 (03Merged) 10jenkins-bot: Modifications to CR BGP policy for eqiad cloud-private IPv6 aggregate [homer/public] - 10https://gerrit.wikimedia.org/r/1112268 (https://phabricator.wikimedia.org/T37947) (owner: 10Cathal Mooney)
[18:32:59] <wikibugs>	 (03PS1) 10Mstyles: security-landing-page: deploying update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112270 (https://phabricator.wikimedia.org/T383098)
[18:35:56] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10472104 (10colewhite)
[18:41:00] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10472126 (10colewhite) @KFrancis, would you mind helping Suzanne with the NDA?  @WMDECyn, do you approve of this request?  @SuzanneWood-WMDE, would you mind emailing me the SSH key y...
[18:41:10] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10472127 (10colewhite) p:05Triage→03Medium
[18:43:10] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10472137 (10colewhite) p:05Triage→03Medium @KFrancis, would you mind helping Neslihan with an NDA?  Thanks!
[18:43:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10472142 (10phaultfinder)
[18:46:31] <jinxer-wm>	 FIRING: Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr3-ulsfo.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[18:48:51] <icinga-wm>	 RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 199, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:50:43] <wikibugs>	 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: BGP status (instance cr2-eqord) - https://phabricator.wikimedia.org/T383302#10472149 (10cmooney) 05Open→03Resolved I mailed these guys at the start of the week but haven't heard back so I deleted the peering.  If they respond we...
[18:51:31] <jinxer-wm>	 FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[18:53:43] <wikibugs>	 (03CR) 10Herron: [C:03+1] "Good call!" [puppet] - 10https://gerrit.wikimedia.org/r/1112177 (https://phabricator.wikimedia.org/T376179) (owner: 10Filippo Giunchedi)
[19:11:14] <wikibugs>	 (03CR) 10CDanis: "lgtm, do you think we might want to also precompute 2m or 5m rates as well?" [puppet] - 10https://gerrit.wikimedia.org/r/1112172 (https://phabricator.wikimedia.org/T383963) (owner: 10Filippo Giunchedi)
[19:31:58] <wikibugs>	 (03PS1) 10Cathal Mooney: Add WMCS cloud-private eqiad ranges to private6 prefix list [homer/public] - 10https://gerrit.wikimedia.org/r/1112273 (https://phabricator.wikimedia.org/T37947)
[19:33:59] <wikibugs>	 (03CR) 10CDanis: [C:03+2] prometheus: scrape otelcol metrics [puppet] - 10https://gerrit.wikimedia.org/r/1112177 (https://phabricator.wikimedia.org/T376179) (owner: 10Filippo Giunchedi)
[19:42:22] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+1] "Looks good!!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming)
[19:45:10] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming)
[19:45:11] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming)
[19:47:51] <wikibugs>	 (03PS4) 10Clare Ming: Enable the text experiment on testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715)
[20:21:05] <wikibugs>	 (03CR) 10Fabfur: "I'd like to split the puppet agent timer for first run (on boot) and periodic (30m) into different units. The "on boot" one will have `Rem" [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur)
[20:25:30] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:42:44] <wikibugs>	 (03CR) 10Andrea Denisse: prometheus: serve apache vhost on localhost too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1112186 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi)
[20:43:39] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1112172 (https://phabricator.wikimedia.org/T383963) (owner: 10Filippo Giunchedi)
[20:54:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2282 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:00:25] <wikibugs>	 (03CR) 10SBassett: [C:03+2] security-landing-page: deploying update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112270 (https://phabricator.wikimedia.org/T383098) (owner: 10Mstyles)
[21:01:56] <wikibugs>	 (03Merged) 10jenkins-bot: security-landing-page: deploying update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112270 (https://phabricator.wikimedia.org/T383098) (owner: 10Mstyles)
[21:34:05] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10472705 (10colewhite)
[21:34:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:39:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:35:14] <wikibugs>	 (03PS1) 10CDanis: draft: allow k8s NodeJS apps to opt-in to auto-ECS [puppet] - 10https://gerrit.wikimedia.org/r/1112295
[22:51:31] <jinxer-wm>	 FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[23:58:47] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:59:37] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring