[00:00:25] FIRING: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:05] (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1112113
[00:03:05] (CR) Zabe: [C:+2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1112113 (owner: Zabe)
[00:03:50] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1112113 (owner: Zabe)
[00:04:20] (PS1) Zabe: Activate arbcom_zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1112114 (https://phabricator.wikimedia.org/T380119)
[00:04:30] (CR) Zabe: [C:+2] Update composer.lock [mediawiki-config] - https://gerrit.wikimedia.org/r/1112111 (owner: Zabe)
[00:05:07] (Merged) jenkins-bot: Update composer.lock [mediawiki-config] - https://gerrit.wikimedia.org/r/1112111 (owner: Zabe)
[00:05:25] RESOLVED: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:01] (CR) Zabe: [C:+2] Activate arbcom_zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1112114 (https://phabricator.wikimedia.org/T380119) (owner: Zabe)
[00:06:43] (Merged) jenkins-bot: Activate arbcom_zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1112114 (https://phabricator.wikimedia.org/T380119) (owner: Zabe)
[00:07:07] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1112113|Update interwiki cache]], [[gerrit:1112111|Update composer.lock]], [[gerrit:1112114|Activate arbcom_zhwiki (T380119)]]
[00:07:11] T380119: Create arbcom-zh wiki - 
https://phabricator.wikimedia.org/T380119
[00:08:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1025:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:11:41] !log zabe@deploy2002 zabe: Backport for [[gerrit:1112113|Update interwiki cache]], [[gerrit:1112111|Update composer.lock]], [[gerrit:1112114|Activate arbcom_zhwiki (T380119)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:11:48] !log zabe@deploy2002 zabe: Continuing with sync
[00:13:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1025:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:13:41] (PS2) Clare Ming: Enable the text experiment on testwiki only [mediawiki-config] - https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715)
[00:16:31] (CR) Clare Ming: "@phuedx@wikimedia.org @sfaci@wikimedia.org if this lgtu, after we get 1 or both patches merged to finalize the config var, i can deploy th" [mediawiki-config] - https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) (owner: Clare Ming)
[00:18:40] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1025:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:18:53] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112113|Update interwiki cache]], [[gerrit:1112111|Update composer.lock]], [[gerrit:1112114|Activate arbcom_zhwiki (T380119)]] (duration: 11m 46s)
[00:18:57] T380119: Create arbcom-zh wiki - https://phabricator.wikimedia.org/T380119
[00:20:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - 
https://phabricator.wikimedia.org/T383383#10469040 (phaultfinder)
[00:21:40] !log zabe@deploy2002:/srv/mediawiki-staging$ mwscript-k8s -f -- createAndPromote.php --wiki=arbcom_zhwiki ZhaoFJx REDACTED
[00:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:08] !log zabe@deploy2002:/srv/mediawiki-staging$ mwscript-k8s -f -- createAndPromote.php --wiki=arbcom_zhwiki --sysop --bureaucrat --force ZhaoFJx
[00:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:29] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:27:29] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:38:14] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1112121
[00:38:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1112121 (owner: TrainBranchBot)
[00:39:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1032:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1032 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:40:25] ops-codfw, ops-eqiad, SRE, SRE-swift-storage, and 3 others: Supermicro's Config J hot swap behavior - https://phabricator.wikimedia.org/T383903#10469079 (Papaul) @MatthewVernon @elukey i do agree with you all that "we do need to be able to hot-swap these 
drivers" and yes by design, all the drives...
[00:49:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on wikikube-worker1032:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:54:40] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1032:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:00:53] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1112121 (owner: TrainBranchBot)
[01:09:20] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1112122
[01:09:20] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1112122 (owner: TrainBranchBot)
[01:29:05] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1112122 (owner: TrainBranchBot)
[02:03:09] PROBLEM - Disk space on prometheus1006 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s-dse 3176MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1006&var-datasource=eqiad+prometheus/ops
[02:09:33] FIRING: [12x] KubernetesCalicoDown: mw1448.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:48:55] (PS4) 
Pppery: Fix some wrong descriptions of old dumps [puppet] - https://gerrit.wikimedia.org/r/1112123
[02:48:56] (CR) CI reject: [V:-1] Fix some wrong descriptions of old dumps [puppet] - https://gerrit.wikimedia.org/r/1112123 (owner: Pppery)
[02:49:56] (PS5) Pppery: Fix some wrong descriptions of old dumps [puppet] - https://gerrit.wikimedia.org/r/1112123
[02:59:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10469171 (phaultfinder)
[03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:31:29] PROBLEM - Hadoop NodeManager on an-worker1173 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:31:43] PROBLEM - Hadoop NodeManager on an-worker1174 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:32:37] PROBLEM - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:32:43] PROBLEM - Hadoop NodeManager on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:34:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - 
https://phabricator.wikimedia.org/T383383#10469191 (phaultfinder)
[03:36:19] PROBLEM - Hadoop NodeManager on an-worker1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:42:37] RECOVERY - Hadoop NodeManager on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:44:29] RECOVERY - Hadoop NodeManager on an-worker1173 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:46:43] RECOVERY - Hadoop NodeManager on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:46:57] PROBLEM - Hadoop NodeManager on an-worker1105 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:55:43] RECOVERY - Hadoop NodeManager on an-worker1174 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:56:19] RECOVERY - Hadoop NodeManager on an-worker1065 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:03:09] PROBLEM - Disk space on prometheus1006 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s-dse 3166MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1006&var-datasource=eqiad+prometheus/ops
[04:03:33] ops-codfw, SRE, DC-Ops, procurement: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10469211 (Papaul) I am adding also here the spines/leaves connection diagram for reference. {F58217796}
[04:14:57] RECOVERY - Hadoop NodeManager on an-worker1105 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:25:30] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:28:05] SRE-swift-storage, CX-deployments, LPL Essential, MinT: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10469212 (KartikMistry) a:KartikMistry Assigning this to myself, I'll need help here @elukey :)
[05:06:57] (PS1) KartikMistry: Update cxserver to 2025-01-17-043010-production [deployment-charts] - https://gerrit.wikimedia.org/r/1112125 (https://phabricator.wikimedia.org/T377813)
[05:13:11] Quick cxserver deployment, minor change.
[05:23:09] PROBLEM - Disk space on prometheus1006 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s-dse 3184MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1006&var-datasource=eqiad+prometheus/ops
[05:24:56] (CR) KartikMistry: [C:+2] Update cxserver to 2025-01-17-043010-production [deployment-charts] - https://gerrit.wikimedia.org/r/1112125 (https://phabricator.wikimedia.org/T377813) (owner: KartikMistry)
[05:25:59] (Merged) jenkins-bot: Update cxserver to 2025-01-17-043010-production [deployment-charts] - https://gerrit.wikimedia.org/r/1112125 (https://phabricator.wikimedia.org/T377813) (owner: KartikMistry)
[05:26:41] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:27:04] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:31:59] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:32:29] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:32:49] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:33:50] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:47:54] (PS1) Kevin Bazira: changeprop: add liftwing article-country stream to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1112126 (https://phabricator.wikimedia.org/T382295)
[06:09:33] FIRING: [12x] KubernetesCalicoDown: mw1448.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:22:17] FIRING: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:24:22] RESOLVED: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:27:35] !log installing rsync security regression updates
[06:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1112:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1112 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:47:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72125 and previous config saved to /var/cache/conftool/dbconfig/20250117-064745-root.json
[06:48:01] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[06:48:35] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[06:48:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1112:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1112 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250117T0700)
[07:02:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72127 and previous config saved to /var/cache/conftool/dbconfig/20250117-070250-root.json
[07:17:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72128 and previous config saved to /var/cache/conftool/dbconfig/20250117-071755-root.json
[07:33:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72129 and previous config saved to /var/cache/conftool/dbconfig/20250117-073301-root.json
[07:37:30] (PS3) JMeybohm: Update calico to v3.29.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984)
[07:37:30] (PS4) JMeybohm: Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984)
[07:38:26] (CR) CI reject: [V:-1] Update calico to v3.29.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[07:38:33] (CR) CI reject: [V:-1] Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[07:38:49] (Abandoned) JMeybohm: sq [deployment-charts] - https://gerrit.wikimedia.org/r/1112071 (owner: JMeybohm)
[07:48:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72130 and previous config saved to /var/cache/conftool/dbconfig/20250117-074806-root.json
[07:55:45] (PS4) Muehlenhoff: Add separate maps master/replica roles for the new 
Bookworm setup [puppet] - https://gerrit.wikimedia.org/r/1111659 (https://phabricator.wikimedia.org/T381565)
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250117T0800)
[08:00:25] FIRING: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:25] RESOLVED: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:12:06] (PS1) Muehlenhoff: Don't setup database config for tilerator on bookworm [puppet] - https://gerrit.wikimedia.org/r/1112167 (https://phabricator.wikimedia.org/T381565)
[08:15:29] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:16:07] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:18:57] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:19:19] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:25:30] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:30:06] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1112167 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:33:39] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[2282,2310-2311].codfw.wmnet
[08:35:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[2282,2310-2311].codfw.wmnet
[08:36:05] (CR) Jelto: [C:+2] Rename the remaining mw nodes to wikikube-worker224[0-2] 🥳 [puppet] - https://gerrit.wikimedia.org/r/1112055 (https://phabricator.wikimedia.org/T377877) (owner: Jelto)
[08:40:05] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:42:41] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:48:01] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[08:48:35] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[08:49:40] 
FIRING: KubernetesRsyslogDown: rsyslog on mw2311:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2311 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:50:41] (PS4) JMeybohm: Update calico to v3.29.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984)
[08:50:41] (PS5) JMeybohm: Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984)
[08:54:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:54:41] (CR) CI reject: [V:-1] Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[08:55:44] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2310 to wikikube-worker2240
[08:56:05] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[08:59:32] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2310 to wikikube-worker2240 - jelto@cumin1002"
[09:00:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2310 to wikikube-worker2240 - jelto@cumin1002"
[09:00:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:00:39] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2240
[09:01:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host 
wikikube-worker2240
[09:01:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2310 to wikikube-worker2240
[09:03:09] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2311 to wikikube-worker2241
[09:03:30] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[09:06:54] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2311 to wikikube-worker2241 - jelto@cumin1002"
[09:07:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2311 to wikikube-worker2241 - jelto@cumin1002"
[09:07:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:07:11] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2241
[09:07:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2241
[09:08:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2311 to wikikube-worker2241
[09:08:18] !log depool / restart / repool ms-fe2010 T360913
[09:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:21] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) 
- https://phabricator.wikimedia.org/T360913
[09:12:03] (PS1) Muehlenhoff: sre.hosts.reimage: Add link to the help text for move-vlan [cookbooks] - https://gerrit.wikimedia.org/r/1112171
[09:12:11] (PS1) Filippo Giunchedi: prometheus: recording rules for mw edit count [puppet] - https://gerrit.wikimedia.org/r/1112172 (https://phabricator.wikimedia.org/T383963)
[09:13:05] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2240 wikikube-worker2241 on all recursors
[09:13:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2240 wikikube-worker2241 on all recursors
[09:14:33] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2240.codfw.wmnet wikikube-worker2241.codfw.wmnet on all recursors
[09:14:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2240.codfw.wmnet wikikube-worker2241.codfw.wmnet on all recursors
[09:20:15] ops-codfw, ops-eqiad, SRE, SRE-swift-storage, and 3 others: Supermicro's Config J hot swap behavior - https://phabricator.wikimedia.org/T383903#10469386 (MatthewVernon) Thanks for the update @Papaul , and of course you can have some time to look at the cable management issues. Do keep us posted,...
[09:22:52] (PS1) Jelto: remove reserved name wikikube-worker2242 because of mw2282 decom [puppet] - https://gerrit.wikimedia.org/r/1112173 (https://phabricator.wikimedia.org/T383965)
[09:24:08] (PS2) Jelto: remove reserved name wikikube-worker2242 because of mw2282 decom [puppet] - https://gerrit.wikimedia.org/r/1112173 (https://phabricator.wikimedia.org/T383965)
[09:26:46] (PS1) Filippo Giunchedi: thanos: sample 10% traces of thanos-query [puppet] - https://gerrit.wikimedia.org/r/1112174 (https://phabricator.wikimedia.org/T376179)
[09:27:04] (CR) JMeybohm: [C:+1] remove reserved name wikikube-worker2242 because of mw2282 decom [puppet] - https://gerrit.wikimedia.org/r/1112173 (https://phabricator.wikimedia.org/T383965) (owner: Jelto)
[09:28:22] SRE, Infrastructure-Foundations: Spicerack fails to find host physical interface for ganeti nodes - https://phabricator.wikimedia.org/T383207#10469393 (cmooney) Resolved→Open Re-opening as we hit the same issue happening within the reimage cookbook itself. https://gerrit.wikimedia.org/r/plugins/...
[09:29:07] (CR) Jelto: [C:+2] remove reserved name wikikube-worker2242 because of mw2282 decom [puppet] - https://gerrit.wikimedia.org/r/1112173 (https://phabricator.wikimedia.org/T383965) (owner: Jelto)
[09:29:08] SRE, Infrastructure-Foundations: Spicerack fails to find host physical interface for ganeti nodes - https://phabricator.wikimedia.org/T383207#10469396 (cmooney)
[09:29:08] (PS1) Cathal Mooney: reimage: check if primary IP interface is bridge when getting int [cookbooks] - https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207)
[09:29:09] SRE, Infrastructure-Foundations, netbox, netops: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832#10469397 (cmooney)
[09:33:18] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2240.codfw.wmnet with OS bookworm
[09:33:28] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2240
[09:33:46] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[09:34:58] (Abandoned) Filippo Giunchedi: thanos: sample 10% traces of thanos-query [puppet] - https://gerrit.wikimedia.org/r/1112174 (https://phabricator.wikimedia.org/T376179) (owner: Filippo Giunchedi)
[09:37:12] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2240 - jelto@cumin1002"
[09:37:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2240 - jelto@cumin1002"
[09:37:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:37:17] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2240.codfw.wmnet 157.16.192.10.in-addr.arpa 
7.5.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:37:19] (PS2) Cathal Mooney: reimage: change check to find physical interface if IP is on bridge [cookbooks] - https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207)
[09:37:20] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2240.codfw.wmnet 157.16.192.10.in-addr.arpa 7.5.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:37:20] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2240
[09:37:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2240
[09:37:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2240
[09:38:54] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2241.codfw.wmnet with OS bookworm
[09:39:04] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2241
[09:39:09] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[09:42:33] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2241 - jelto@cumin1002"
[09:42:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2241 - jelto@cumin1002"
[09:42:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:42:38] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2241.codfw.wmnet 158.16.192.10.in-addr.arpa 8.5.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:42:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache 
(exit_code=0) wikikube-worker2241.codfw.wmnet 158.16.192.10.in-addr.arpa 8.5.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:42:41] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2241 [09:42:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2241 [09:42:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2241 [09:43:31] (03CR) 10CI reject: [V:04-1] reimage: change check to find physical interface if IP is on bridge [cookbooks] - 10https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney) [09:51:25] (03PS1) 10Filippo Giunchedi: prometheus: scrape otelcol metrics [puppet] - 10https://gerrit.wikimedia.org/r/1112177 (https://phabricator.wikimedia.org/T376179) [09:55:08] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2240.codfw.wmnet with reason: host reimage [09:55:51] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383862#10469458 (10Jelto) [09:59:14] (03PS1) 10Muehlenhoff: Pass the Squid port by parameter [puppet] - 10https://gerrit.wikimedia.org/r/1112180 [09:59:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2240.codfw.wmnet with reason: host reimage [09:59:36] (03CR) 10CI reject: [V:04-1] Pass the Squid port by parameter [puppet] - 10https://gerrit.wikimedia.org/r/1112180 (owner: 10Muehlenhoff) [10:03:36] (03PS2) 10Muehlenhoff: Pass the Squid port by parameter [puppet] - 10https://gerrit.wikimedia.org/r/1112180 [10:05:46] (03CR) 10CI reject: [V:04-1] Pass the Squid port by parameter [puppet] - 10https://gerrit.wikimedia.org/r/1112180 (owner: 10Muehlenhoff) [10:08:03] !log jelto@cumin1002 END (ERROR) - Cookbook 
sre.hosts.reimage (exit_code=97) for host wikikube-worker2241.codfw.wmnet with OS bookworm [10:08:31] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2241.codfw.wmnet with OS bookworm [10:08:35] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2241 [10:08:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2241 [10:09:33] FIRING: [12x] KubernetesCalicoDown: mw1448.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:15:52] (03PS3) 10Muehlenhoff: Pass the Squid port by parameter [puppet] - 10https://gerrit.wikimedia.org/r/1112180 [10:16:35] (03PS2) 10JMeybohm: Pin calico version on all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111935 (https://phabricator.wikimedia.org/T341984) [10:16:35] (03PS4) 10JMeybohm: Update calico-crds to calico v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984) [10:16:35] (03PS5) 10JMeybohm: Update calico to v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) [10:16:35] (03PS6) 10JMeybohm: Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) [10:16:36] (03PS1) 10JMeybohm: admin_ng: Install VAPs instead of PSPs on k8s >= 1.24 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984) [10:18:34] 06SRE, 06Traffic, 13Patch-For-Review: Define an event stream and schema for haproxy_requestctl analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392#10469509 (10Fabfur) >>! In T383392#10467504, @Ottomata wrote: > Hi! > > It looks like [[ https://gitlab.wikimedia.org/repos/data-engineering... 
[10:19:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2240.codfw.wmnet with OS bookworm [10:24:08] (03CR) 10Alexandros Kosiaris: [C:03+1] admin_ng: Install VAPs instead of PSPs on k8s >= 1.24 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:24:59] (03CR) 10Brouberol: airflow: refactor/DRY the volume/volumeMounts accross containers (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112053 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [10:25:54] (03PS4) 10Jcrespo: dbbackups: Migrate db2139 backup generation to db2239 [puppet] - 10https://gerrit.wikimedia.org/r/1111656 (https://phabricator.wikimedia.org/T373579) [10:25:54] (03PS1) 10Jcrespo: installserver: Review backup and db hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112184 (https://phabricator.wikimedia.org/T383902) [10:26:04] (03CR) 10Brouberol: airflow: refactor/DRY the volume/volumeMounts accross containers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112053 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [10:26:10] (03PS4) 10Muehlenhoff: Pass the Squid port by parameter [puppet] - 10https://gerrit.wikimedia.org/r/1112180 [10:26:22] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2241.codfw.wmnet with reason: host reimage [10:29:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112180 (owner: 10Muehlenhoff) [10:29:30] (03PS2) 10Gkyziridis: ml-services: update articletopic outlink image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111580 (https://phabricator.wikimedia.org/T383312) [10:29:47] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] prometheus: reverse proxy for instances belonging to the host' site too [puppet] - 10https://gerrit.wikimedia.org/r/1112014 (https://phabricator.wikimedia.org/T371087) 
(owner: 10Filippo Giunchedi) [10:30:34] (03CR) 10Jcrespo: [C:03+2] dbbackups: Migrate db2139 backup generation to db2239 [puppet] - 10https://gerrit.wikimedia.org/r/1111656 (https://phabricator.wikimedia.org/T373579) (owner: 10Jcrespo) [10:30:49] (03PS2) 10Jcrespo: installserver: Review backup and db hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112184 (https://phabricator.wikimedia.org/T383902) [10:31:15] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2241.codfw.wmnet with reason: host reimage [10:37:20] (03CR) 10Muehlenhoff: "I made a patch for it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1112180" [puppet] - 10https://gerrit.wikimedia.org/r/1111681 (owner: 10CDanis) [10:49:02] 10SRE-swift-storage, 10Thumbor: Image issue on ओम राऊत MrWp - https://phabricator.wikimedia.org/T383859#10469626 (10Goresm) Done, now the unknown image is gone. [10:51:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2241.codfw.wmnet with OS bookworm [10:54:24] !log homer 'lsw1-b3-codfw*' commit 'T377877' [10:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:28] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [10:55:08] (03PS1) 10Filippo Giunchedi: prometheus: serve apache vhost on localhost too [puppet] - 10https://gerrit.wikimedia.org/r/1112186 (https://phabricator.wikimedia.org/T371087) [10:55:31] !log homer 'cr*codfw*' commit 'T377877' [10:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:06] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2240-2241].codfw.wmnet [10:58:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2240-2241].codfw.wmnet [11:02:41] (03CR) 10Kevin Bazira: [C:03+1] ml-services: update articletopic outlink image 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1111580 (https://phabricator.wikimedia.org/T383312) (owner: 10Gkyziridis) [11:02:52] 06SRE, 06Data-Engineering, 10Dumps 2.0, 10Dumps-Generation: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10469674 (10MPGuy2824) [11:07:41] (03PS1) 10Alexandros Kosiaris: Map rest_v1/page/(html|title)/ to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1112188 (https://phabricator.wikimedia.org/T374683) [11:09:49] (03CR) 10CI reject: [V:04-1] Map rest_v1/page/(html|title)/ to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1112188 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [11:16:49] (03PS1) 10Aklapper: Phabricator data for WMF QLS: Add MCollins, remove ABittaker [puppet] - 10https://gerrit.wikimedia.org/r/1112189 (https://phabricator.wikimedia.org/T383884) [11:19:04] (03PS1) 10Marostegui: es1046: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1112190 (https://phabricator.wikimedia.org/T382569) [11:20:57] (03CR) 10Marostegui: [C:03+2] es1046: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1112190 (https://phabricator.wikimedia.org/T382569) (owner: 10Marostegui) [11:25:41] (03PS3) 10Cathal Mooney: reimage: change check to find physical interface if IP is on bridge [cookbooks] - 10https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207) [11:27:55] (03PS1) 10Marostegui: instances.yaml: Add es1046 [puppet] - 10https://gerrit.wikimedia.org/r/1112192 (https://phabricator.wikimedia.org/T382569) [11:29:49] PROBLEM - MariaDB Replica Lag: s3 on db2239 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1857.67 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:30:05] jynus: ^ you aware? 
[11:30:06] (03PS1) 10Fabfur: systemd: added option to remain after exit [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) [11:30:18] (03CR) 10Jcrespo: [C:03+2] installserver: Review backup and db hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112184 (https://phabricator.wikimedia.org/T383902) (owner: 10Jcrespo) [11:30:49] yeah, I removed notifications, but icinga had a race condition with my manual change [11:30:57] i will do it again, should stay [11:31:39] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1111-1116].eqiad.wmnet [11:31:39] !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker[1111-1116].eqiad.wmnet [11:33:45] (03CR) 10Btullis: [C:03+1] airflow: refactor/DRY the volume/volumeMounts accross containers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112053 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [11:33:47] 06SRE, 06Data-Engineering, 10EventStreams: eventstreams is hitting memory limits, causing restarts and paging - https://phabricator.wikimedia.org/T383977#10469769 (10hnowlan) p:05Triage→03High [11:35:44] (03PS4) 10Cathal Mooney: reimage: change check to find physical interface if IP is on bridge [cookbooks] - 10https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207) [11:36:15] 06SRE, 06Traffic, 13Patch-For-Review: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10469771 (10Antoine_Quhen) a:03Antoine_Quhen [11:37:07] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur) [11:37:44] !log jynus@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: reimage [11:39:36] 10SRE-swift-storage, 10Thumbor: 
Image issue on ओम राऊत MrWp - https://phabricator.wikimedia.org/T383859#10469777 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Thanks for confirming, I'll close this task now. [11:39:47] (03PS1) 10Hnowlan: eventstreams: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112197 (https://phabricator.wikimedia.org/T383977) [11:40:13] (03PS5) 10Cathal Mooney: reimage: change check to find physical interface if IP is on bridge [cookbooks] - 10https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207) [11:41:05] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2141.codfw.wmnet with OS bookworm [11:42:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2027.codfw.wmnet with OS bookworm [11:42:12] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10469786 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2027.codfw.wmnet with OS bookworm [11:42:43] (03CR) 10Clément Goubert: [C:03+1] eventstreams: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112197 (https://phabricator.wikimedia.org/T383977) (owner: 10Hnowlan) [11:53:01] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:53:35] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:55:13] (03CR) 10Hnowlan: [C:03+2] eventstreams: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112197 (https://phabricator.wikimedia.org/T383977) (owner: 10Hnowlan) [11:56:19] (03Merged) 10jenkins-bot: eventstreams: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112197 
(https://phabricator.wikimedia.org/T383977) (owner: 10Hnowlan) [11:57:37] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2141.codfw.wmnet with reason: host reimage [11:58:51] !log installing Linux 6.1.124 on Bookworm hosts [11:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250117T0800) [12:00:06] eoghan, jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250117T1200). [12:00:21] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [12:00:45] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [12:01:52] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2141.codfw.wmnet with reason: host reimage [12:02:13] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply [12:02:43] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [12:03:10] PROBLEM - Disk space on prometheus1006 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s-dse 3087MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1006&var-datasource=eqiad+prometheus/ops [12:04:33] FIRING: [12x] KubernetesCalicoDown: mw1448.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:07:27] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1111-1116].eqiad.wmnet [12:07:29] !log kamila@cumin1002 END (PASS) - 
Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1111-1116].eqiad.wmnet [12:09:06] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Spicerack fails to find host physical interface for ganeti nodes - https://phabricator.wikimedia.org/T383207#10469892 (10cmooney) Today's blip has made me realise how we didn't hit this more often in the past: # In June 2023 (after T296832), the reima... [12:09:22] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add es1046 [puppet] - 10https://gerrit.wikimedia.org/r/1112192 (https://phabricator.wikimedia.org/T382569) (owner: 10Marostegui) [12:09:33] RESOLVED: [12x] KubernetesCalicoDown: mw1448.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:21:06] (03CR) 10Alexandros Kosiaris: [C:04-1] "One comment, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1112180 (owner: 10Muehlenhoff) [12:23:02] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:23:36] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:25:30] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:25:42] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2141.codfw.wmnet with OS bookworm [12:25:54] (03PS1) 10Marostegui: sections.yaml: Add pc6 to codfw 
[puppet] - 10https://gerrit.wikimedia.org/r/1112199 (https://phabricator.wikimedia.org/T383234) [12:27:24] (03CR) 10Giuseppe Lavagetto: [C:03+1] sections.yaml: Add pc6 to codfw [puppet] - 10https://gerrit.wikimedia.org/r/1112199 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [12:27:28] (03CR) 10Marostegui: [C:03+2] sections.yaml: Add pc6 to codfw [puppet] - 10https://gerrit.wikimedia.org/r/1112199 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [12:31:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es1046 to dbctl depooled T382569', diff saved to https://phabricator.wikimedia.org/P72136 and previous config saved to /var/cache/conftool/dbconfig/20250117-123153-marostegui.json [12:32:00] T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569 [12:32:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 1%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72137 and previous config saved to /var/cache/conftool/dbconfig/20250117-123235-root.json [12:33:02] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:33:36] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:43:53] (03PS2) 10Fabfur: systemd: added option to remain after exit [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) [12:43:54] (03PS1) 10Kamila Součková: wikikube: rename mw146[4-9] -> wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1112200 (https://phabricator.wikimedia.org/T365571) [12:47:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72138 and previous config 
saved to /var/cache/conftool/dbconfig/20250117-124740-root.json [12:49:31] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur) [12:49:52] Hi urbanecm, can you have a look at patch [12:49:54] https://gerrit.wikimedia.org/r/c/1100228 ? thanks [12:49:59] hey, sure! [12:50:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [12:50:42] koi: i'd still like to know what the result of the investigations were. do you know anything about that? [12:52:04] urbanecm, sorry but am i missing some context? what kind of investigation [12:52:36] koi: the task is stalled per https://phabricator.wikimedia.org/T378287#10341850. before going ahead with the patch, the investigation should conclude [12:52:42] so that we don't accidentally cause any problems [12:52:47] (03CR) 10Elukey: "LGTM, I have a doubt about the need of `profile::rsyslog::udp_localhost_compat`. If it is needed feel free to go and merge!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1111659 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [12:54:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2282 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:55:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [12:56:06] (03CR) 10Elukey: [C:03+1] Don't setup database config for tilerator on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1112167 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [12:56:35] urbanecm, um, actually this kinda confuse me, at first i thought they mean T381197, but i'm not sure if they mean prod as well [12:56:36] T381197: Create views for SecurePoll db tables in Toolforge replicas - https://phabricator.wikimedia.org/T381197 [12:57:17] koi: exactly. unfortunately, that comment doesn't provide context, which makes it hard to follow up on that. but we definitely need to clarify that before going ahead. i hope that makes sense to you [12:57:57] fair enough, i'll ask about the progress of such investigate [12:58:06] thanks for the reply! 
[12:59:54] (03PS5) 10Muehlenhoff: Add separate maps master/replica roles for the new Bookworm setup [puppet] - 10https://gerrit.wikimedia.org/r/1111659 (https://phabricator.wikimedia.org/T381565) [12:59:59] (03CR) 10Muehlenhoff: Add separate maps master/replica roles for the new Bookworm setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1111659 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:02:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72139 and previous config saved to /var/cache/conftool/dbconfig/20250117-130245-root.json [13:03:16] (03CR) 10Gkyziridis: [C:03+1] ml-services: update articletopic outlink image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111580 (https://phabricator.wikimedia.org/T383312) (owner: 10Gkyziridis) [13:04:17] (03CR) 10Brouberol: airflow: define a K8sPodOperator pod template for pods needing access to hadoop (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112057 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [13:05:31] (03CR) 10Brouberol: airflow: define a K8sPodOperator pod template for pods needing access to hadoop (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112057 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [13:06:15] (03CR) 10Muehlenhoff: Pass the Squid port by parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1112180 (owner: 10Muehlenhoff) [13:09:21] (03PS3) 10Brouberol: airflow: define a K8sPodOperator pod template for pods needing access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112057 (https://phabricator.wikimedia.org/T383430) [13:10:01] (03CR) 10Elukey: [C:03+1] Add separate maps master/replica roles for the new Bookworm setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1111659 (https://phabricator.wikimedia.org/T381565) 
(owner: 10Muehlenhoff) [13:13:24] (03CR) 10Vgutierrez: [C:04-1] systemd: added option to remain after exit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur) [13:13:34] 10ops-codfw, 10ops-eqiad, 06SRE, 10SRE-swift-storage, and 3 others: Supermicro's Config J hot swap behavior - https://phabricator.wikimedia.org/T383903#10470352 (10elukey) Supermicro came back with some nice suggestions to clear the state of a new/replaced disk (if it gets into something like Foreign state... [13:16:29] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003 (10elukey) 03NEW [13:17:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72140 and previous config saved to /var/cache/conftool/dbconfig/20250117-131751-root.json [13:18:10] (03CR) 10AikoChou: [C:03+1] ml-services: update articletopic outlink image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111580 (https://phabricator.wikimedia.org/T383312) (owner: 10Gkyziridis) [13:24:52] (03CR) 10Btullis: [C:03+1] airflow: define a K8sPodOperator pod template for pods needing access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112057 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [13:27:13] (03PS3) 10Fabfur: systemd: added option to remain after exit [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) [13:27:51] (03CR) 10Brouberol: [C:03+2] airflow: deploy hive config under both hadoop and spark config dirs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112052 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [13:27:54] (03CR) 10Brouberol: [C:03+2] airflow: refactor/DRY the volume/volumeMounts accross containers [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1112053 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [13:27:57] (03CR) 10Brouberol: [C:03+2] airflow: define a K8sPodOperator pod template for pods needing access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112057 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [13:28:52] (03PS10) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) [13:28:52] (03PS1) 10JMeybohm: calico: Create certificates for Typha/Felix mTLS [puppet] - 10https://gerrit.wikimedia.org/r/1112204 (https://phabricator.wikimedia.org/T365687) [13:29:33] (03Merged) 10jenkins-bot: airflow: deploy hive config under both hadoop and spark config dirs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112052 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [13:29:35] (03CR) 10Fabfur: systemd: added option to remain after exit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur) [13:29:36] (03Merged) 10jenkins-bot: airflow: refactor/DRY the volume/volumeMounts accross containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112053 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [13:29:37] (03Merged) 10jenkins-bot: airflow: define a K8sPodOperator pod template for pods needing access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112057 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [13:30:37] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10470456 (10MatthewVernon) [13:31:38] (03CR) 10CI reject: [V:04-1] calico: Create certificates for Typha/Felix mTLS [puppet] - 10https://gerrit.wikimedia.org/r/1112204 
(https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [13:32:47] 06SRE, 06Traffic, 13Patch-For-Review: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10470469 (10Ottomata) @Fabfur thanks, merged. Since this is a bug on our side, we can work with you to make a backwards incompatible change to remove... [13:32:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72142 and previous config saved to /var/cache/conftool/dbconfig/20250117-133256-root.json [13:33:30] (03CR) 10Vgutierrez: systemd: added option to remain after exit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur) [13:34:26] (03PS1) 10Arnaudb: peopleweb: disable envoy request timeout, enable log [puppet] - 10https://gerrit.wikimedia.org/r/1112205 (https://phabricator.wikimedia.org/T383750) [13:34:27] (03CR) 10Arnaudb: "I agree with your statement in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1112056/comments/41ce4051_55c7f928 this is not the pro" [puppet] - 10https://gerrit.wikimedia.org/r/1112205 (https://phabricator.wikimedia.org/T383750) (owner: 10Arnaudb) [13:34:31] (03PS1) 10Brouberol: airflow: bypass DNS resolution for the PG URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112206 (https://phabricator.wikimedia.org/T383651) [13:34:45] 06SRE, 06Traffic, 13Patch-For-Review: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10470473 (10Fabfur) >>! In T383914#10470469, @Ottomata wrote: > @Fabfur thanks, merged. > > Since this is a bug on our side, we can work with you to m... 
[13:35:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:35:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:35:44] (03PS2) 10JMeybohm: calico: Create certificates for Typha/Felix mTLS [puppet] - 10https://gerrit.wikimedia.org/r/1112204 (https://phabricator.wikimedia.org/T365687) [13:35:44] (03PS11) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) [13:35:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:36:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:37:29] (03PS4) 10Fabfur: systemd: added option to remain after exit [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) [13:37:43] (03CR) 10Fabfur: systemd: added option to remain after exit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur) [13:41:16] (03PS1) 10Urbanecm: [Growth] Remove tybanner campaigns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112207 (https://phabricator.wikimedia.org/T380405) [13:41:18] (03PS1) 10Urbanecm: [Growth] Add fundraising- as a prefix for fundraising campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112208 (https://phabricator.wikimedia.org/T380405) [13:42:03] (03PS1) 10Dreamy Jazz: Pin wgCheckUserEnableTempAccountsOnboardingDialog as false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112209 (https://phabricator.wikimedia.org/T384005) [13:42:03] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur) [13:44:45] (03PS1) 10Brouberol: airflow: hotfix: do 
not render empty volumes/volumeMounts blocks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112210 (https://phabricator.wikimedia.org/T383430) [13:45:23] (03CR) 10Btullis: [C:03+1] airflow: hotfix: do not render empty volumes/volumeMounts blocks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112210 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [13:46:43] (03CR) 10Btullis: [C:03+1] airflow: bypass DNS resolution for the PG URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112206 (https://phabricator.wikimedia.org/T383651) (owner: 10Brouberol) [13:47:03] (03CR) 10Btullis: [C:03+1] Switch an-test-presto1001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1111180 (owner: 10Muehlenhoff) [13:47:58] (03CR) 10Brouberol: [C:03+2] airflow: hotfix: do not render empty volumes/volumeMounts blocks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112210 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [13:48:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72143 and previous config saved to /var/cache/conftool/dbconfig/20250117-134801-root.json [13:50:09] (03PS2) 10Brouberol: airflow: bypass DNS resolution for the PG URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112206 (https://phabricator.wikimedia.org/T383651) [13:50:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:51:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:54:08] (03PS1) 10Brouberol: airflow: hotfix: fix broken indentation in the pod template configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112213 (https://phabricator.wikimedia.org/T383430) [13:54:58] (03CR) 10Máté Szabó: [C:03+2] Pin wgCheckUserEnableTempAccountsOnboardingDialog as false 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112209 (https://phabricator.wikimedia.org/T384005) (owner: 10Dreamy Jazz) [13:55:42] (03Merged) 10jenkins-bot: Pin wgCheckUserEnableTempAccountsOnboardingDialog as false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112209 (https://phabricator.wikimedia.org/T384005) (owner: 10Dreamy Jazz) [13:57:16] (03CR) 10Brouberol: [C:03+2] airflow: hotfix: fix broken indentation in the pod template configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112213 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [13:59:50] (03PS3) 10Brouberol: airflow: bypass DNS resolution for the PG URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112206 (https://phabricator.wikimedia.org/T383651) [14:00:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:00:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:01:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:01:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:03:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72144 and previous config saved to /var/cache/conftool/dbconfig/20250117-140308-root.json [14:07:19] (03CR) 10Muehlenhoff: "Can I reboot an-test-presto1001 any time or should I sync up beforehand?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1111180 (owner: 10Muehlenhoff) [14:07:46] (03CR) 10Muehlenhoff: [C:03+2] Add separate maps master/replica roles for the new Bookworm setup [puppet] - 10https://gerrit.wikimedia.org/r/1111659 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:10:10] (03PS4) 10Muehlenhoff: Make maps-test2001 a bookworm maps master node (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) [14:12:11] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2027.codfw.wmnet with reason: host reimage [14:12:30] (03PS5) 10Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) [14:14:17] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, this made the reimage of ganeti2027 (with test-cookbook) work fine" [cookbooks] - 10https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney) [14:14:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:15:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2027.codfw.wmnet with reason: host reimage [14:18:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72145 and previous config saved to /var/cache/conftool/dbconfig/20250117-141813-root.json [14:19:25] jouncebot: nowandnext [14:19:25] For the next 17 hour(s) and 40 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250117T0800) [14:19:25] In 17 hour(s) and 40 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250118T0800) [14:21:18] (03PS1) 10Máté Szabó: Revert "Pin wgCheckUserEnableTempAccountsOnboardingDialog as false" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112217 [14:22:10] (03PS1) 10Muehlenhoff: Add missing Hiera settings for new bookworm master roles [puppet] - 10https://gerrit.wikimedia.org/r/1112218 (https://phabricator.wikimedia.org/T381565) [14:24:01] (03CR) 10Santiago Faci: "Just a question to confirm if we are using the right config variable for the experiment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [14:24:03] (03CR) 10STran: [C:03+2] Revert "Pin wgCheckUserEnableTempAccountsOnboardingDialog as false" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112217 (owner: 10Máté Szabó) [14:25:07] (03PS1) 10Bartosz Dziewoński: Disable sidebar cache on the auth domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112219 (https://phabricator.wikimedia.org/T383916) [14:25:09] (03Merged) 10jenkins-bot: Revert "Pin wgCheckUserEnableTempAccountsOnboardingDialog as false" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112217 (owner: 10Máté Szabó) [14:26:09] (03PS1) 10Dreamy Jazz: Revert^2 "Pin wgCheckUserEnableTempAccountsOnboardingDialog as false" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112220 [14:26:20] (03PS1) 10Slyngshede: Add OIDC support to development environment [software/bitu] - 10https://gerrit.wikimedia.org/r/1112221 [14:32:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:33:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 75%: Repooling for the first time', 
diff saved to https://phabricator.wikimedia.org/P72146 and previous config saved to /var/cache/conftool/dbconfig/20250117-143318-root.json [14:33:40] (03CR) 10Máté Szabó: [C:03+1] Revert^2 "Pin wgCheckUserEnableTempAccountsOnboardingDialog as false" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112220 (owner: 10Dreamy Jazz) [14:34:40] (03PS1) 10Elukey: role::kafka::monitoring: re-add Jumbo config [puppet] - 10https://gerrit.wikimedia.org/r/1112223 [14:34:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2027.codfw.wmnet with OS bookworm [14:34:46] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10470903 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2027.codfw.wmnet with OS bookworm completed: - ganeti202... [14:36:35] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4814/console" [puppet] - 10https://gerrit.wikimedia.org/r/1112223 (owner: 10Elukey) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:38:02] (03PS1) 10Slyngshede: C:idm remove associate_by_email pipeline [puppet] - 10https://gerrit.wikimedia.org/r/1112224 (https://phabricator.wikimedia.org/T383707) [14:41:42] (03CR) 10Filippo Giunchedi: 
[C:03+1] role::kafka::monitoring: re-add Jumbo config [puppet] - 10https://gerrit.wikimedia.org/r/1112223 (owner: 10Elukey) [14:42:06] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1112223 (owner: 10Elukey) [14:42:43] (03CR) 10Elukey: [V:03+1 C:03+2] role::kafka::monitoring: re-add Jumbo config [puppet] - 10https://gerrit.wikimedia.org/r/1112223 (owner: 10Elukey) [14:44:16] (03PS1) 10Muehlenhoff: Remove firewall rule for rsync on archiva [puppet] - 10https://gerrit.wikimedia.org/r/1112226 (https://phabricator.wikimedia.org/T367315) [14:45:56] Is it just me or is Grafana not having a good time right now? [14:46:27] I'm seeing lots of `Firefox can’t establish a connection to the server at wss://grafana-rw.wikimedia.org/api/live/ws.` and equivalent in Chromium [14:48:10] (03PS1) 10Muehlenhoff: Remove rsync from archiva [puppet] - 10https://gerrit.wikimedia.org/r/1112228 (https://phabricator.wikimedia.org/T367315) [14:48:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P72147 and previous config saved to /var/cache/conftool/dbconfig/20250117-144824-root.json [14:55:45] (03CR) 10Elukey: [C:03+1] Add missing Hiera settings for new bookworm master roles [puppet] - 10https://gerrit.wikimedia.org/r/1112218 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:56:59] MichaelG_WMF: works fine on my side (but on Chrome) - any specific dashboard that you are looking at? 
[14:57:15] (03PS3) 10JMeybohm: calico: Create certificates for Typha/Felix mTLS [puppet] - 10https://gerrit.wikimedia.org/r/1112204 (https://phabricator.wikimedia.org/T365687) [14:57:16] (03PS12) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) [14:58:39] elukey: I'm working right now on https://grafana-rw.wikimedia.org/d/ff15559c-b4a2-4363-94c8-190a086b3315/michael-s-playground?forceLogin&forceLogin=true&from=now-7d&orgId=1&to=now [14:59:15] it is mainly the editing/live features that are giving me trouble [14:59:43] but hearing in #wikimedia-observability right now that this might be expected [14:59:48] I see that observability is answering in their chan, I'll leave it to them that are more knowledgeable [14:59:51] exactly :) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:13] PROBLEM - SSH on prometheus1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:09:41] (03CR) 10Cathal Mooney: [C:03+2] Run validator when creating switch-interface in provision script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1112069 (https://phabricator.wikimedia.org/T383915) (owner: 10Cathal Mooney) [15:11:32] (03CR) 10JMeybohm: [C:03+2] Pin calico version on all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111935 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [15:12:18] FIRING: [2x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#prometheus1006:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:12:51] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017 (10Neslihan_Turan_WMDE) 03NEW [15:13:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:13:30] !log powercycle prometheus1006 [15:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:46] (03Merged) 10jenkins-bot: Run validator when creating switch-interface in provision script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1112069 (https://phabricator.wikimedia.org/T383915) (owner: 10Cathal Mooney) [15:14:49] PROBLEM - Host prometheus1006 is DOWN: PING CRITICAL - Packet loss = 100% [15:15:21] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112204 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [15:15:24] (03Merged) 10jenkins-bot: Pin calico version on all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111935 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [15:15:27] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018 (10SuzanneWood-WMDE) 03NEW [15:15:45] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10471120 (10Neslihan_Turan_WMDE) @WMDECyn kindly pinging you to approve this :) [15:15:51] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10471121 (10SuzanneWood-WMDE) 
[15:17:01] RECOVERY - Host prometheus1006 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [15:17:03] RECOVERY - SSH on prometheus1006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:17:24] thanks moritzm! [15:18:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:19:22] RESOLVED: [2x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#prometheus1006:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:42] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [15:20:57] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [15:21:14] (03PS1) 10JMeybohm: calico: Add support for Typha/Felix mTLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112235 (https://phabricator.wikimedia.org/T365687) [15:21:16] (03PS1) 10JMeybohm: Update calico to 0.2.11 in staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112236 (https://phabricator.wikimedia.org/T365687) [15:25:42] (03CR) 10CI reject: [V:04-1] Update calico to 0.2.11 in staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112236 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [15:25:54] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox: execute interface validator in provision script for switch interfaces - https://phabricator.wikimedia.org/T383915#10471187 (10cmooney) 05Open→03Resolved a:03cmooney Merged 
now and working as expected in tests. ` Script abo... [15:26:15] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [15:26:52] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1112204 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [15:28:33] (03PS2) 10JMeybohm: Update calico to 0.2.11 in staging-codfw and enable mTLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112236 (https://phabricator.wikimedia.org/T365687) [15:29:26] (03CR) 10JMeybohm: [C:03+2] calico: Create certificates for Typha/Felix mTLS [puppet] - 10https://gerrit.wikimedia.org/r/1112204 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [15:32:54] (03CR) 10CI reject: [V:04-1] Update calico to 0.2.11 in staging-codfw and enable mTLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112236 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [15:39:13] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:39:29] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.netbox.update-extras (exit_code=1) rolling restart_daemons on A:netbox [15:40:03] !log manually restarting netbox service on netbox1003 [15:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:43] (03Abandoned) 10Bernard Wang: Enable web search AB test stream in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112082 (owner: 10Bernard Wang) [15:48:13] (03CR) 10Dzahn: [C:03+2] Phabricator data for WMF QLS: Add MCollins, remove ABittaker [puppet] - 10https://gerrit.wikimedia.org/r/1112189 (https://phabricator.wikimedia.org/T383884) (owner: 10Aklapper) [15:54:11] (03CR) 10Gergő Tisza: [C:03+1] Disable sidebar cache on the auth domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112219 (https://phabricator.wikimedia.org/T383916) (owner: 10Bartosz 
Dziewoński) [15:55:06] (03CR) 10Giuseppe Lavagetto: "Overall LGTM. See my comment on how to improve conditionals a bit." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [15:58:57] FIRING: KubernetesCalicoDown: kubestage2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:59:07] (03PS1) 10Filippo Giunchedi: benthos: add nocookies and tls session metadata [puppet] - 10https://gerrit.wikimedia.org/r/1112248 (https://phabricator.wikimedia.org/T383900) [15:59:09] that's me, back in a second [16:03:57] FIRING: [7x] KubernetesCalicoDown: kubestage2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:06:22] (03PS1) 10Scott French: sre.switchdc.mediawiki: add -next services [cookbooks] - 10https://gerrit.wikimedia.org/r/1112246 (https://phabricator.wikimedia.org/T377040) [16:08:57] RESOLVED: [7x] KubernetesCalicoDown: kubestage2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:10:49] (03CR) 10Hnowlan: "Overall lgtm, one nit. There's lots of scope for cleanup/harmonisation once we look at a refactor but I think for now this works." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [16:12:26] (03CR) 10Giuseppe Lavagetto: [C:03+1] sre.switchdc.mediawiki: add -next services [cookbooks] - 10https://gerrit.wikimedia.org/r/1112246 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [16:16:17] (03CR) 10Scott French: [C:03+2] sre.switchdc.mediawiki: add -next services [cookbooks] - 10https://gerrit.wikimedia.org/r/1112246 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [16:20:06] (03CR) 10Herron: [C:03+1] prometheus: serve apache vhost on localhost too [puppet] - 10https://gerrit.wikimedia.org/r/1112186 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [16:20:32] 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10471506 (10MatthewVernon) I've spent some more time with these logs, and I think I may have reached the point of diminishing returns. I extracted logs for... 
[16:21:11] (03PS3) 10Scott French: service::catalog: enable monitoring for mw-(web|api-ext)-next [puppet] - 10https://gerrit.wikimedia.org/r/1101124 (https://phabricator.wikimedia.org/T377040) [16:21:12] (03PS1) 10Scott French: mw-(web|api-ext)-next: bump replicas and update TODO [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112078 (https://phabricator.wikimedia.org/T377040) [16:24:51] (03PS13) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) [16:24:51] (03PS1) 10JMeybohm: calico: mTLS certificate symlinks have to be relative [puppet] - 10https://gerrit.wikimedia.org/r/1112250 (https://phabricator.wikimedia.org/T365687) [16:25:28] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112250 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [16:25:30] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:26:28] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: add -next services [cookbooks] - 10https://gerrit.wikimedia.org/r/1112246 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [16:27:35] (03PS4) 10Scott French: service::catalog: enable monitoring for mw-(web|api-ext)-next [puppet] - 10https://gerrit.wikimedia.org/r/1101124 (https://phabricator.wikimedia.org/T377040) [16:29:11] (03CR) 10JMeybohm: [C:03+2] calico: mTLS certificate symlinks have to be relative [puppet] - 10https://gerrit.wikimedia.org/r/1112250 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [16:31:40] (03PS1) 10Hnowlan: wikikube: reimage 5 former jobrunner/videoscaler hosts to workers [puppet] - 10https://gerrit.wikimedia.org/r/1112252 
(https://phabricator.wikimedia.org/T354791) [16:31:59] (03Abandoned) 10Hnowlan: wikikube: reimage 5 former jobrunner/videoscaler hosts to workers [puppet] - 10https://gerrit.wikimedia.org/r/1111670 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [16:32:39] (03CR) 10DCausse: "is there anything blocking this patch?" [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking) [16:33:09] (03CR) 10Clare Ming: Enable the text experiment on testwiki only (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [16:36:54] (03CR) 10Tacsipacsi: "Thanks, but if it cannot be a real extremal value, I’m not sure if it’s worth it (especially considering the human resources needed to do " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112054 (owner: 10Lucas Werkmeister (WMDE)) [16:37:07] (03PS2) 10Hnowlan: wikikube: reimage 5 former jobrunner/videoscaler hosts to workers [puppet] - 10https://gerrit.wikimedia.org/r/1112252 (https://phabricator.wikimedia.org/T354791) [16:39:24] (03CR) 10Scott French: [C:03+1] wikikube: reimage 5 former jobrunner/videoscaler hosts to workers [puppet] - 10https://gerrit.wikimedia.org/r/1112252 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [16:40:19] (03CR) 10Hnowlan: [C:03+2] wikikube: reimage 5 former jobrunner/videoscaler hosts to workers [puppet] - 10https://gerrit.wikimedia.org/r/1112252 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [16:40:38] (03CR) 10DCausse: "you mean a test in puppet? or perhaps allow only a set of specific keys and fail the puppet run?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1091325 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [16:47:23] (03CR) 10Santiago Faci: Enable the text experiment on testwiki only (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [16:54:24] 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10471670 (10MatthewVernon) 05Open→03Stalled Reported as [[ https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1093304 | Debian #1093304 ]]; more so we've... [16:54:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2282 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:03:35] 06SRE, 10Ceph, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10471741 (10cmooney) 05Open→03Resolved a:03cmooney Gonna close this one, everything looks good great to see. The real test will be... 
[17:05:16] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10471771 (10hnowlan) [17:05:26] (03PS5) 10JMeybohm: Update calico-crds to calico v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984) [17:05:26] (03PS6) 10JMeybohm: Update calico to v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) [17:05:26] (03PS2) 10JMeybohm: admin_ng: Install VAPs instead of PSPs on k8s >= 1.24 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984) [17:05:26] (03PS7) 10JMeybohm: Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) [17:09:10] (03CR) 10CI reject: [V:04-1] Update calico to v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [17:09:12] (03CR) 10CI reject: [V:04-1] Update calico-crds to calico v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [17:09:19] (03CR) 10JMeybohm: [C:03+1] "Apart from the commit message, this LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1112200 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [17:09:19] (03CR) 10CI reject: [V:04-1] admin_ng: Install VAPs instead of PSPs on k8s >= 1.24 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [17:09:41] (03CR) 10CI reject: [V:04-1] Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [17:09:56] (03PS1) 10Audrey Penven: Add known-good regexes for 
WikibaseQualityConstraints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112261 (https://phabricator.wikimedia.org/T380751) [17:11:51] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@b0cd4df]: Deploy latest DAGs for 'analytics' Airflow instance. T366542. [17:11:56] T366542: Consider renaming columns and/or table to abide by the data modeling guidelines - https://phabricator.wikimedia.org/T366542 [17:12:21] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:12:24] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@b0cd4df]: Deploy latest DAGs for 'analytics' Airflow instance. T366542. (duration: 00m 32s) [17:13:53] (03PS1) 10Hnowlan: kubernetes: remove out-of-warranty jobrunner hosts awaiting reimage [puppet] - 10https://gerrit.wikimedia.org/r/1112262 (https://phabricator.wikimedia.org/T384043) [17:18:12] (03CR) 10Scott French: [C:03+1] kubernetes: remove out-of-warranty jobrunner hosts awaiting reimage [puppet] - 10https://gerrit.wikimedia.org/r/1112262 (https://phabricator.wikimedia.org/T384043) (owner: 10Hnowlan) [17:20:03] (03CR) 10Hnowlan: [C:03+2] kubernetes: remove out-of-warranty jobrunner hosts awaiting reimage [puppet] - 10https://gerrit.wikimedia.org/r/1112262 (https://phabricator.wikimedia.org/T384043) (owner: 10Hnowlan) [17:25:34] !log hnowlan@cumin2002 START - Cookbook sre.hosts.decommission for hosts mw[2259,2263-2266].codfw.wmnet [17:27:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10471923 (10phaultfinder) [17:27:57] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Migrate port utilisation alert from LibreNMS to alertmanage - https://phabricator.wikimedia.org/T384052 (10cmooney) 03NEW p:05Triage→03Medium [17:28:10] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - 
https://phabricator.wikimedia.org/T369384#10471938 (10cmooney) [17:28:29] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Migrate port utilisation alert from LibreNMS to alertmanage - https://phabricator.wikimedia.org/T384052#10471937 (10cmooney) [17:29:36] 06SRE, 06SRE-OnFire, 06serviceops, 10Release-Engineering-Team (Radar), 07Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162#10471941 (10LSobanski) #serviceops own the deployment servers, reassigning. [17:31:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [17:31:48] ^ me, will be resolved after a puppet run [17:34:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:36:14] !log hnowlan@cumin2002 START - Cookbook sre.dns.netbox [17:36:21] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:39:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:40:04] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Migrate port utilisation alert from LibreNMS to alertmanager - 
https://phabricator.wikimedia.org/T384052#10471964 (10cmooney) [17:40:35] !log hnowlan@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2259,2263-2266].codfw.wmnet decommissioned, removing all IPs except the asset tag one - hnowlan@cumin2002" [17:40:49] !log Upgrading LibreNMS in production - T384036 [17:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [17:41:45] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [17:41:46] !log denisse@deploy2002 Started deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 24.12.0 - T384036 [17:42:00] !log denisse@deploy2002 Finished deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 24.12.0 - T384036 (duration: 00m 14s) [17:44:04] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2259,2263-2266].codfw.wmnet decommissioned, removing all IPs except the asset tag one - hnowlan@cumin2002" [17:44:04] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:44:05] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[2259,2263-2266].codfw.wmnet [17:46:45] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at 
codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [17:46:58] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission mw2259,mw225[3-6] - https://phabricator.wikimedia.org/T384043#10471975 (10hnowlan) [17:47:45] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [17:51:30] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [17:59:18] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10472018 (10hnowlan) [17:59:41] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [18:04:53] (03CR) 10Subramanya Sastry: [C:03+1] Remove KartographerParsoidSupport flag from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111932 (https://phabricator.wikimedia.org/T340134) (owner: 10Isabelle Hurbain-Palatin) [18:05:22] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names for newly assigned wmcs private ipv6 entries - cmooney@cumin1002" [18:05:27] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names for newly 
assigned wmcs private ipv6 entries - cmooney@cumin1002" [18:05:27] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:06:06] (03PS3) 10Clare Ming: Enable the text experiment on testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) [18:08:44] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [18:12:54] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:13:31] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [18:16:25] (03CR) 10CDanis: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1112177 (https://phabricator.wikimedia.org/T376179) (owner: 10Filippo Giunchedi) [18:17:40] 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10472072 (10A_smart_kitten) [18:18:41] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names for newly assigned wmcs private ipv6 entries - cmooney@cumin1002" [18:18:45] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names for newly assigned wmcs private ipv6 entries - cmooney@cumin1002" [18:18:45] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:23:27] (03PS1) 10Cathal Mooney: Modifications to CR BGP policy for eqiad cloud-private IPv6 aggregate [homer/public] - 10https://gerrit.wikimedia.org/r/1112268 (https://phabricator.wikimedia.org/T37947) [18:24:14] (03CR) 10CDanis: [C:03+1] "lgtm but what's your use case?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur) [18:26:05] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:27:39] (03CR) 10Cathal Mooney: [C:03+2] Modifications to CR BGP policy for eqiad cloud-private IPv6 aggregate [homer/public] - 10https://gerrit.wikimedia.org/r/1112268 (https://phabricator.wikimedia.org/T37947) (owner: 10Cathal Mooney) [18:29:52] (03Merged) 10jenkins-bot: Modifications to CR BGP policy for eqiad cloud-private IPv6 aggregate [homer/public] - 10https://gerrit.wikimedia.org/r/1112268 (https://phabricator.wikimedia.org/T37947) (owner: 10Cathal Mooney) [18:32:59] (03PS1) 10Mstyles: security-landing-page: deploying update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112270 (https://phabricator.wikimedia.org/T383098) [18:35:56] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10472104 (10colewhite) [18:41:00] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10472126 (10colewhite) @KFrancis, would you mind helping Suzanne with the NDA? @WMDECyn, do you approve of this request? @SuzanneWood-WMDE, would you mind emailing me the SSH key y... [18:41:10] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10472127 (10colewhite) p:05Triage→03Medium [18:43:10] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10472137 (10colewhite) p:05Triage→03Medium @KFrancis, would you mind helping Neslihan with an NDA? Thanks! 
[18:43:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10472142 (10phaultfinder) [18:46:31] FIRING: Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr3-ulsfo.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [18:48:51] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 199, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:50:43] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: BGP status (instance cr2-eqord) - https://phabricator.wikimedia.org/T383302#10472149 (10cmooney) 05Open→03Resolved I mailed these guys at the start of the week but haven't heard back so I deleted the peering. If they respond we... [18:51:31] FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [18:53:43] (03CR) 10Herron: [C:03+1] "Good call!" [puppet] - 10https://gerrit.wikimedia.org/r/1112177 (https://phabricator.wikimedia.org/T376179) (owner: 10Filippo Giunchedi) [19:11:14] (03CR) 10CDanis: "lgtm, do you think we might want to also precompute 2m or 5m rates as well?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1112172 (https://phabricator.wikimedia.org/T383963) (owner: 10Filippo Giunchedi) [19:31:58] (03PS1) 10Cathal Mooney: Add WMCS cloud-private eqiad ranges to private6 prefix list [homer/public] - 10https://gerrit.wikimedia.org/r/1112273 (https://phabricator.wikimedia.org/T37947) [19:33:59] (03CR) 10CDanis: [C:03+2] prometheus: scrape otelcol metrics [puppet] - 10https://gerrit.wikimedia.org/r/1112177 (https://phabricator.wikimedia.org/T376179) (owner: 10Filippo Giunchedi) [19:42:22] (03CR) 10Santiago Faci: [C:03+1] "Looks good!!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [19:45:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [19:45:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [19:47:51] (03PS4) 10Clare Ming: Enable the text experiment on testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) [20:21:05] (03CR) 10Fabfur: "I'd like to split the puppet agent timer for first run (on boot) and periodic (30m) into different units. 
The "on boot" one will have `Rem" [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur) [20:25:30] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:42:44] (03CR) 10Andrea Denisse: prometheus: serve apache vhost on localhost too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1112186 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [20:43:39] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1112172 (https://phabricator.wikimedia.org/T383963) (owner: 10Filippo Giunchedi) [20:54:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2282 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:00:25] (03CR) 10SBassett: [C:03+2] security-landing-page: deploying update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112270 (https://phabricator.wikimedia.org/T383098) (owner: 10Mstyles) [21:01:56] (03Merged) 10jenkins-bot: security-landing-page: deploying update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112270 (https://phabricator.wikimedia.org/T383098) (owner: 10Mstyles) [21:34:05] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10472705 (10colewhite) [21:34:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All 
- https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:39:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:35:14] (03PS1) 10CDanis: draft: allow k8s NodeJS apps to opt-in to auto-ECS [puppet] - 10https://gerrit.wikimedia.org/r/1112295 [22:51:31] FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [23:58:47] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:59:37] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring