[00:26:42] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3081.esams.wmnet with OS bullseye
[00:36:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:38:22] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/948687
[00:38:28] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/948687 (owner: TrainBranchBot)
[00:41:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:46:08] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:09] <wikibugs>	 (CR) CI reject: [V: -1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/948687 (owner: TrainBranchBot)
[00:56:24] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:16] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3081.esams.wmnet with OS bullseye
[01:03:48] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[01:08:34] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T344213 (phaultfinder)
[01:13:57] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T344213 (Jhancock.wm) Open→Resolved a:Jhancock.wm known issue with no impact
[01:18:24] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3081.esams.wmnet with OS bullseye
[01:23:18] <wikibugs>	 (PS1) Ssingh: cp3073: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/948685 (https://phabricator.wikimedia.org/T327438)
[01:23:20] <wikibugs>	 (PS1) Ssingh: cp3071: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/948706 (https://phabricator.wikimedia.org/T327438)
[01:30:04] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:30:20] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:30] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:39:35] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3081.esams.wmnet with OS bullseye
[01:50:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:57:04] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:57:34] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T0200)
[02:07:52] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.22 [core] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/948688 (https://phabricator.wikimedia.org/T343724)
[02:07:58] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.22 [core] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/948688 (https://phabricator.wikimedia.org/T343724) (owner: TrainBranchBot)
[02:11:38] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:23:07] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.22 [core] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/948688 (https://phabricator.wikimedia.org/T343724) (owner: TrainBranchBot)
[02:31:36] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:38] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:43:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:48:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:55:44] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T0300)
[03:01:30] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.41.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/948707 (https://phabricator.wikimedia.org/T343724)
[03:01:32] <wikibugs>	 (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.41.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/948707 (https://phabricator.wikimedia.org/T343724) (owner: TrainBranchBot)
[03:02:12] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.41.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/948707 (https://phabricator.wikimedia.org/T343724) (owner: TrainBranchBot)
[03:02:43] <logmsgbot>	 !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.22  refs T343724
[03:02:47] <stashbot>	 T343724: 1.41.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T343724
[03:24:43] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:28:46] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:55:48] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:56:25] <logmsgbot>	 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.22  refs T343724 (duration: 53m 42s)
[03:56:29] <stashbot>	 T343724: 1.41.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T343724
[03:58:41] <logmsgbot>	 !log mwpresync@deploy1002 Pruned MediaWiki: 1.41.0-wmf.19 (duration: 02m 13s)
[04:24:23] <wikibugs>	 (PS1) Majavah: Add a comment why PdfHandler does not use Shellbox [mediawiki-config] - https://gerrit.wikimedia.org/r/948710
[04:24:59] <wikibugs>	 (CR) Jforrester: [C: +1] Add a comment why PdfHandler does not use Shellbox [mediawiki-config] - https://gerrit.wikimedia.org/r/948710 (owner: Majavah)
[04:25:19] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/948710 (owner: Majavah)
[04:25:58] <wikibugs>	 (Merged) jenkins-bot: Add a comment why PdfHandler does not use Shellbox [mediawiki-config] - https://gerrit.wikimedia.org/r/948710 (owner: Majavah)
[04:26:32] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:948710|Add a comment why PdfHandler does not use Shellbox]]
[04:28:14] <logmsgbot>	 !log taavi@deploy1002 taavi: Backport for [[gerrit:948710|Add a comment why PdfHandler does not use Shellbox]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[04:28:31] <logmsgbot>	 !log taavi@deploy1002 taavi: Continuing with sync
[04:33:10] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:34:57] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:948710|Add a comment why PdfHandler does not use Shellbox]] (duration: 08m 24s)
[04:42:24] <wikibugs>	 (CR) Majavah: [C: +1] "Aaaah the inconsistent spacing between `proxy_hide_header` and the header. I think this patch is fine, though." [puppet] - https://gerrit.wikimedia.org/r/940506 (owner: Lucas Werkmeister)
[04:57:00] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:18:27] <wikibugs>	 (PS1) Ayounsi: realm.pp fix new esams range [puppet] - https://gerrit.wikimedia.org/r/948713 (https://phabricator.wikimedia.org/T329219)
[05:31:22] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:38:10] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:39:40] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:56:56] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T0600).
[06:28:08] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:29:34] <wikibugs>	 (CR) Ayounsi: [C: +2] realm.pp fix new esams range [puppet] - https://gerrit.wikimedia.org/r/948713 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi)
[06:30:18] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:44] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:38:46] <wikibugs>	 (PS1) Ayounsi: Revert "pybal: Make check conform to the Nagios plugin API" [puppet] - https://gerrit.wikimedia.org/r/948594
[06:41:00] <wikibugs>	 (CR) Ayounsi: [C: +2] Revert "pybal: Make check conform to the Nagios plugin API" [puppet] - https://gerrit.wikimedia.org/r/948594 (owner: Ayounsi)
[06:42:30] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[06:46:31] <wikibugs>	 (PS1) Dreamy Jazz: clienthints: Collect Client Hints data on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/948985 (https://phabricator.wikimedia.org/T341110)
[06:47:55] <taavi>	 jouncebot: nowandnext
[06:47:55] <jouncebot>	 For the next 0 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T0600)
[06:47:55] <jouncebot>	 In 0 hour(s) and 12 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T0700)
[06:48:54] <taavi>	 looks like the mw infra window is unused. so I'm starting the backport window a bit early
[06:49:12] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/948985 (https://phabricator.wikimedia.org/T341110) (owner: Dreamy Jazz)
[06:49:54] <wikibugs>	 (Merged) jenkins-bot: clienthints: Collect Client Hints data on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/948985 (https://phabricator.wikimedia.org/T341110) (owner: Dreamy Jazz)
[06:50:33] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:948985|clienthints: Collect Client Hints data on group0 wikis (T341110)]]
[06:50:37] <stashbot>	 T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110
[06:52:16] <logmsgbot>	 !log taavi@deploy1002 taavi and dreamyjazz: Backport for [[gerrit:948985|clienthints: Collect Client Hints data on group0 wikis (T341110)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[06:55:40] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:55:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org
[06:57:26] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[06:59:25] <logmsgbot>	 !log taavi@deploy1002 taavi and dreamyjazz: Continuing with sync
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T0700).
[07:00:05] <jouncebot>	 aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:21] <wikibugs>	 (CR) JMeybohm: "I would say that, before doing this again, there should be at least a notification to ops@ informing everybody of the change and the requi" [puppet] - https://gerrit.wikimedia.org/r/948125 (https://phabricator.wikimedia.org/T343987) (owner: Herron)
[07:00:27] <taavi>	 o/
[07:00:27] <aanzx>	 o/
[07:00:52] <taavi>	 I'll deploy
[07:00:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host lists2001.codfw.wmnet
[07:01:06] <taavi>	 aanzx: your patch is marked as a draft
[07:01:14] <wikibugs>	 (CR) JMeybohm: [C: +1] miscweb: add wikiworkshop and reasearch-landing-page to staging wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/948539 (https://phabricator.wikimedia.org/T334511) (owner: Jelto)
[07:02:23] <wikibugs>	 (PS5) Anzx: jawiki: reassign the changetags user right [mediawiki-config] - https://gerrit.wikimedia.org/r/948590 (https://phabricator.wikimedia.org/T344150)
[07:02:38] <aanzx>	 taavi: now set as active
[07:03:23] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] trafficserver: Use svc urls for eqiad/codfw [puppet] - https://gerrit.wikimedia.org/r/948624 (https://phabricator.wikimedia.org/T326657) (owner: BCornwall)
[07:03:58] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[07:04:16] <taavi>	 thx. will deploy that once Dreamy_Jazz's patch is synced
[07:04:54] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[07:05:36] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[07:05:57] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:948985|clienthints: Collect Client Hints data on group0 wikis (T341110)]] (duration: 15m 23s)
[07:05:58] <wikibugs>	 (CR) Majavah: [C: +2] jawiki: reassign the changetags user right [mediawiki-config] - https://gerrit.wikimedia.org/r/948590 (https://phabricator.wikimedia.org/T344150) (owner: Anzx)
[07:06:00] <stashbot>	 T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110
[07:06:10] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:06:37] <wikibugs>	 (Merged) jenkins-bot: jawiki: reassign the changetags user right [mediawiki-config] - https://gerrit.wikimedia.org/r/948590 (https://phabricator.wikimedia.org/T344150) (owner: Anzx)
[07:06:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lists2001.codfw.wmnet
[07:07:12] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:948590|jawiki: reassign the changetags user right (T344150)]]
[07:07:17] <stashbot>	 T344150: Reassign the changetags user right on jawiki - https://phabricator.wikimedia.org/T344150
[07:08:48] <logmsgbot>	 !log taavi@deploy1002 anzx and taavi: Backport for [[gerrit:948590|jawiki: reassign the changetags user right (T344150)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:08:57] <taavi>	 aanzx: please test
[07:09:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host titan2001.codfw.wmnet
[07:09:27] <aanzx>	 testing
[07:10:26] <wikibugs>	 (CR) JMeybohm: [C: +2] hieradata: complete cadvisor rollout on k8s [puppet] - https://gerrit.wikimedia.org/r/942426 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[07:11:45] <aanzx>	 taavi: looks good
[07:12:02] <logmsgbot>	 !log taavi@deploy1002 anzx and taavi: Continuing with sync
[07:12:52] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [dns] - https://gerrit.wikimedia.org/r/948562 (https://phabricator.wikimedia.org/T341717) (owner: Jbond)
[07:15:05] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cp3081 - ayounsi@cumin1001"
[07:15:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2001.codfw.wmnet
[07:16:03] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cp3081 - ayounsi@cumin1001"
[07:16:45] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add insetup variant for undefined ownership [puppet] - https://gerrit.wikimedia.org/r/869777 (owner: Muehlenhoff)
[07:16:48] <icinga-wm>	 PROBLEM - Check systemd state on bast2003 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:17:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host titan2002.codfw.wmnet
[07:18:12] <wikibugs>	 (PS7) Sohom Datta: Enable EditInSequence on all wikisources [mediawiki-config] - https://gerrit.wikimedia.org/r/947883 (https://phabricator.wikimedia.org/T308098)
[07:18:18] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:948590|jawiki: reassign the changetags user right (T344150)]] (duration: 11m 05s)
[07:18:21] <stashbot>	 T344150: Reassign the changetags user right on jawiki - https://phabricator.wikimedia.org/T344150
[07:18:50] <taavi>	 aanzx: done
[07:18:55] <aanzx>	 thanks Taavi
[07:19:04] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/947883 (https://phabricator.wikimedia.org/T308098) (owner: Sohom Datta)
[07:19:45] <wikibugs>	 (Merged) jenkins-bot: Enable EditInSequence on all wikisources [mediawiki-config] - https://gerrit.wikimedia.org/r/947883 (https://phabricator.wikimedia.org/T308098) (owner: Sohom Datta)
[07:20:14] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:947883|Enable EditInSequence on all wikisources (T308098)]]
[07:20:17] <stashbot>	 T308098: Integrate edit-in-sequence inside ProofreadPage - https://phabricator.wikimedia.org/T308098
[07:20:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2002.codfw.wmnet
[07:21:53] <logmsgbot>	 !log taavi@deploy1002 soda and taavi: Backport for [[gerrit:947883|Enable EditInSequence on all wikisources (T308098)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:24:43] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:25:31] <wikibugs>	 SRE, Observability-Metrics, Patch-For-Review, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (JMeybohm)
[07:27:12] <logmsgbot>	 !log taavi@deploy1002 soda and taavi: Continuing with sync
[07:29:13] <gehel>	 !log restarting wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph on wdqs2012
[07:29:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:40] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:04] <icinga-wm>	 RECOVERY - Check systemd state on bast2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:33:43] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:947883|Enable EditInSequence on all wikisources (T308098)]] (duration: 13m 29s)
[07:33:47] <stashbot>	 T308098: Integrate edit-in-sequence inside ProofreadPage - https://phabricator.wikimedia.org/T308098
[07:34:00] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:34:43] <jinxer-wm>	 (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:37:11] <wikibugs>	 (CR) Umherirrender: "Failure is known as T344191 and fix is Ia2a8a24a14f7af7e18928da9c7cc412829be8e20" [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/948687 (owner: TrainBranchBot)
[07:47:48] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] aux: add grpc/http ports for jaeger collector [deployment-charts] - https://gerrit.wikimedia.org/r/946551 (https://phabricator.wikimedia.org/T343302) (owner: Filippo Giunchedi)
[07:47:53] <wikibugs>	 (PS2) Filippo Giunchedi: aux: add grpc/http ports for jaeger collector [deployment-charts] - https://gerrit.wikimedia.org/r/946551 (https://phabricator.wikimedia.org/T343302)
[07:49:02] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[07:49:14] <logmsgbot>	 !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[07:49:27] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[07:52:03] <wikibugs>	 (CR) Jelto: [C: +2] Class gitlab: Use gitlab-settings v1.3.0 [puppet] - https://gerrit.wikimedia.org/r/948665 (owner: Ahmon Dancy)
[07:53:32] <wikibugs>	 (PS1) Zabe: Add messages for Pa'O Wiktionary (blkwiktionary) [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/948595 (https://phabricator.wikimedia.org/T343540)
[07:53:39] <wikibugs>	 (CR) Zabe: [C: +2] Add messages for Pa'O Wiktionary (blkwiktionary) [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/948595 (https://phabricator.wikimedia.org/T343540) (owner: Zabe)
[07:53:59] <wikibugs>	 (PS1) Zabe: Add messages for Sundanese Wikisource (suwikisource) [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/948596 (https://phabricator.wikimedia.org/T343539)
[07:54:16] <wikibugs>	 (PS2) Zabe: Add messages for Sundanese Wikisource (suwikisource) [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/948596 (https://phabricator.wikimedia.org/T343539)
[07:54:21] <wikibugs>	 (CR) Zabe: [C: +2] Add messages for Sundanese Wikisource (suwikisource) [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/948596 (https://phabricator.wikimedia.org/T343539) (owner: Zabe)
[07:55:48] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cr2-esams mgmt - ayounsi@cumin1001"
[07:56:24] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:57:52] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cr2-esams mgmt - ayounsi@cumin1001"
[07:57:52] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:07:09] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by zabe@deploy1002 using scap backport" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/948595 (https://phabricator.wikimedia.org/T343540) (owner: Zabe)
[08:07:11] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by zabe@deploy1002 using scap backport" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/948596 (https://phabricator.wikimedia.org/T343539) (owner: Zabe)
[08:08:36] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:08:39] <wikibugs>	 (CR) Ayounsi: [C: +2] Rename mr devices at Amsterdam POP sites [homer/public] - https://gerrit.wikimedia.org/r/948635 (https://phabricator.wikimedia.org/T329219) (owner: Cathal Mooney)
[08:08:55] <wikibugs>	 (Merged) jenkins-bot: Add messages for Pa'O Wiktionary (blkwiktionary) [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/948595 (https://phabricator.wikimedia.org/T343540) (owner: Zabe)
[08:08:57] <wikibugs>	 (Merged) jenkins-bot: Add messages for Sundanese Wikisource (suwikisource) [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/948596 (https://phabricator.wikimedia.org/T343539) (owner: Zabe)
[08:09:12] <wikibugs>	 (Merged) jenkins-bot: Rename mr devices at Amsterdam POP sites [homer/public] - https://gerrit.wikimedia.org/r/948635 (https://phabricator.wikimedia.org/T329219) (owner: Cathal Mooney)
[08:09:28] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:948595|Add messages for Pa'O Wiktionary (blkwiktionary) (T343540)]], [[gerrit:948596|Add messages for Sundanese Wikisource (suwikisource) (T343539)]]
[08:09:34] <stashbot>	 T343539: Create Wikisource Sundanese - https://phabricator.wikimedia.org/T343539
[08:09:34] <stashbot>	 T343540: Create Wiktionary Pa'O - https://phabricator.wikimedia.org/T343540
[08:16:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:16:40] <logmsgbot>	 !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:16:40] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add ganeti config for knams [puppet] - https://gerrit.wikimedia.org/r/948129 (owner: Muehlenhoff)
[08:20:52] <wikibugs>	 (PS1) Ayounsi: More cr3-knams -> cr2-esams and mr1 -> old [homer/public] - https://gerrit.wikimedia.org/r/948991 (https://phabricator.wikimedia.org/T329219)
[08:21:21] <wikibugs>	 (CR) CI reject: [V: -1] More cr3-knams -> cr2-esams and mr1 -> old [homer/public] - https://gerrit.wikimedia.org/r/948991 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi)
[08:21:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:21:50] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:23:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] nftables::file: Expand prefix to three digits [puppet] - 10https://gerrit.wikimedia.org/r/945586 (owner: 10Muehlenhoff)
[08:28:02] <logmsgbot>	 !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:29:12] <wikibugs>	 (03PS2) 10Ayounsi: More esams router renaming [homer/public] - 10https://gerrit.wikimedia.org/r/948991 (https://phabricator.wikimedia.org/T329219)
[08:30:45] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:948595|Add messages for Pa'O Wiktionary (blkwiktionary) (T343540)]], [[gerrit:948596|Add messages for Sundanese Wikisource (suwikisource) (T343539)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[08:30:56] <stashbot>	 T343539: Create Wikisource Sundanese - https://phabricator.wikimedia.org/T343539
[08:30:57] <stashbot>	 T343540: Create Wiktionary Pa'O - https://phabricator.wikimedia.org/T343540
[08:31:14] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] miscweb: add wikiworkshop and reasearch-landing-page to staging wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/948539 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto)
[08:31:19] <logmsgbot>	 !log zabe@deploy1002 zabe: Continuing with sync
[08:32:00] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: add wikiworkshop and reasearch-landing-page to staging wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/948539 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto)
[08:32:08] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/948991 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[08:32:46] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] More esams router renaming [homer/public] - 10https://gerrit.wikimedia.org/r/948991 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[08:33:06] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:33:19] <wikibugs>	 (03Merged) 10jenkins-bot: More esams router renaming [homer/public] - 10https://gerrit.wikimedia.org/r/948991 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[08:35:26] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+1] "LGTM, just note that a lot of serviceops people are on holiday today." [deployment-charts] - 10https://gerrit.wikimedia.org/r/948136 (owner: 10Elukey)
[08:36:34] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[08:37:34] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[08:42:55] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:948595|Add messages for Pa'O Wiktionary (blkwiktionary) (T343540)]], [[gerrit:948596|Add messages for Sundanese Wikisource (suwikisource) (T343539)]] (duration: 33m 26s)
[08:42:59] <stashbot>	 T343539: Create Wikisource Sundanese - https://phabricator.wikimedia.org/T343539
[08:43:00] <stashbot>	 T343540: Create Wiktionary Pa'O - https://phabricator.wikimedia.org/T343540
[08:46:22] <klausman>	 !log Draining ml2002 for kubelet partition resize
[08:46:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: aux: set calico typha to two replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/948994 (https://phabricator.wikimedia.org/T333302)
[08:54:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Add site.pp entries for new Ganeti nodes in knams [puppet] - 10https://gerrit.wikimedia.org/r/948995
[08:54:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add site.pp entries for new Ganeti nodes in knams [puppet] - 10https://gerrit.wikimedia.org/r/948995 (owner: 10Muehlenhoff)
[08:55:17] <klausman>	 !log Draining ml2003 for kubelet partition resize
[08:55:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:34] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:55:42] <wikibugs>	 (03PS2) 10Filippo Giunchedi: aux: set calico typha to one replica [deployment-charts] - 10https://gerrit.wikimedia.org/r/948994 (https://phabricator.wikimedia.org/T333302)
[08:56:53] <wikibugs>	 (03PS2) 10Muehlenhoff: Add site.pp entries for new Ganeti nodes in knams [puppet] - 10https://gerrit.wikimedia.org/r/948995
[08:57:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add site.pp entries for new Ganeti nodes in knams [puppet] - 10https://gerrit.wikimedia.org/r/948995 (owner: 10Muehlenhoff)
[09:01:21] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "We need to make sure this gets reverted as soon as there is a third node in aux!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/948994 (https://phabricator.wikimedia.org/T333302) (owner: 10Filippo Giunchedi)
[09:01:28] <wikibugs>	 (03PS1) 10David Caro: openstack: use the right proxy names [alerts] - 10https://gerrit.wikimedia.org/r/948996
[09:03:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] aux: set calico typha to one replica [deployment-charts] - 10https://gerrit.wikimedia.org/r/948994 (https://phabricator.wikimedia.org/T333302) (owner: 10Filippo Giunchedi)
[09:03:17] <wikibugs>	 (03PS3) 10Muehlenhoff: Add site.pp entries for new Ganeti nodes in knams [puppet] - 10https://gerrit.wikimedia.org/r/948995
[09:04:37] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] openstack: use the right proxy names [alerts] - 10https://gerrit.wikimedia.org/r/948996 (owner: 10David Caro)
[09:04:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add site.pp entries for new Ganeti nodes in knams [puppet] - 10https://gerrit.wikimedia.org/r/948995 (owner: 10Muehlenhoff)
[09:05:18] <logmsgbot>	 !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:05:50] <wikibugs>	 (03Merged) 10jenkins-bot: openstack: use the right proxy names [alerts] - 10https://gerrit.wikimedia.org/r/948996 (owner: 10David Caro)
[09:08:14] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] Grant Kara Payne shell access [puppet] - 10https://gerrit.wikimedia.org/r/948568 (https://phabricator.wikimedia.org/T342546) (owner: 10Stevemunene)
[09:08:41] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp3079.esams.wmnet with OS bullseye
[09:10:24] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] airflow-wmde: Add Kara Payne to analytics-wmde [puppet] - 10https://gerrit.wikimedia.org/r/940863 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[09:11:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3005.esams.wmnet with OS bullseye
[09:11:12] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Shut down two en-arbcom mailing lists (audit, appeals-en) - https://phabricator.wikimedia.org/T344112 (10Legoktm) The audit list is  arbcom-audit-en@.
[09:11:44] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti3005.esams.wmnet with OS bullseye
[09:15:31] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Shut down two en-arbcom mailing lists (audit, appeals-en) - https://phabricator.wikimedia.org/T344112 (10Legoktm) 05Open→03Resolved >>! In T344112#9092517, @Legoktm wrote: > The audit list is  arbcom-audit-en@.  Which I've now archived. So I think we're all set here!
[09:15:32] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3079.esams.wmnet with OS bullseye
[09:15:54] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp3079.esams.wmnet with OS bullseye
[09:16:37] <wikibugs>	 (03PS2) 10Muehlenhoff: Add a Firewall::Portrange define [puppet] - 10https://gerrit.wikimedia.org/r/947316
[09:17:03] <logmsgbot>	 !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:18:00] <wikibugs>	 10sre-alert-triage, 10Machine-Learning-Team, 10Patch-For-Review: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (10klausman) I have done ml2002 and ml2003 today (two machines to force some pods back...
[09:20:00] <wikibugs>	 (03PS1) 10Jelto: miscweb: add wikiworkshop and reasearch-landing-page to eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/948998 (https://phabricator.wikimedia.org/T334511)
[09:25:44] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "ok" [puppet] - 10https://gerrit.wikimedia.org/r/948685 (https://phabricator.wikimedia.org/T327438) (owner: 10Ssingh)
[09:28:08] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] cp3073: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948685 (https://phabricator.wikimedia.org/T327438) (owner: 10Ssingh)
[09:28:38] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:30:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3005.esams.wmnet with OS bullseye
[09:30:15] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3073.esams.wmnet with OS bullseye
[09:30:37] <wikibugs>	 (03PS1) 10Ayounsi: Add dns3003 to asw1-by27 anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/948999 (https://phabricator.wikimedia.org/T329219)
[09:31:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add dns3003 to asw1-by27 anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/948999 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[09:31:53] <wikibugs>	 (03PS2) 10Ayounsi: Add dns3003 to asw1-by27 anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/948999 (https://phabricator.wikimedia.org/T329219)
[09:34:11] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1096.eqiad.wmnet with OS bullseye
[09:34:35] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Looks good, thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/948999 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[09:34:45] <wikibugs>	 (03PS1) 10Ayounsi: esams: remove profile::bird::neighbors_list [puppet] - 10https://gerrit.wikimedia.org/r/949000 (https://phabricator.wikimedia.org/T329219)
[09:35:17] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add dns3003 to asw1-by27 anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/948999 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[09:35:49] <wikibugs>	 (03Merged) 10jenkins-bot: Add dns3003 to asw1-by27 anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/948999 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[09:37:45] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3079.esams.wmnet with reason: host reimage
[09:38:56] <wikibugs>	 (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/949000/42893/dns3001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/949000 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[09:40:03] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] cp3071: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948706 (https://phabricator.wikimedia.org/T327438) (owner: 10Ssingh)
[09:40:32] <wikibugs>	 (03PS2) 10Ssingh: cp3071: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948706 (https://phabricator.wikimedia.org/T327438)
[09:41:11] <wikibugs>	 (03CR) 10Ssingh: [V: 03+2] cp3071: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948706 (https://phabricator.wikimedia.org/T327438) (owner: 10Ssingh)
[09:41:49] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp3079.esams.wmnet with reason: host reimage
[09:43:19] <wikibugs>	 (03CR) 10Ayounsi: esams: remove profile::bird::neighbors_list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949000 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[09:43:26] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3071.esams.wmnet with OS bullseye
[09:44:00] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] esams: remove profile::bird::neighbors_list [puppet] - 10https://gerrit.wikimedia.org/r/949000 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[09:44:18] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] esams: remove profile::bird::neighbors_list [puppet] - 10https://gerrit.wikimedia.org/r/949000 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[09:49:05] <wikibugs>	 (03PS3) 10Ssingh: cp3079: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948652 (https://phabricator.wikimedia.org/T327438)
[09:49:51] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] cp3079: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948652 (https://phabricator.wikimedia.org/T327438) (owner: 10Ssingh)
[09:49:55] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] cp3079: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948652 (https://phabricator.wikimedia.org/T327438) (owner: 10Ssingh)
[09:50:56] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp3079.esams.wmnet with OS bullseye
[09:51:56] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti3005.esams.wmnet with OS bullseye
[09:52:19] <wikibugs>	 (03PS1) 10Stevemunene: airflow-wmde: Create analytics-wmde airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648)
[09:52:29] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp3079.esams.wmnet with OS bullseye
[09:52:50] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp3079.esams.wmnet with OS bullseye
[09:53:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] airflow-wmde: Create analytics-wmde airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[09:54:01] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp3079.esams.wmnet with OS bullseye
[09:55:34] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:58:03] <wikibugs>	 (03PS1) 10JMeybohm: Remove podAntiAffinity for calico-typha on aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/949002 (https://phabricator.wikimedia.org/T292077)
[09:58:32] <wikibugs>	 (03PS2) 10JMeybohm: Remove podAntiAffinity for calico-typha on aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/949002 (https://phabricator.wikimedia.org/T344230)
[09:59:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Remove podAntiAffinity for calico-typha on aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/949002 (https://phabricator.wikimedia.org/T344230) (owner: 10JMeybohm)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T1000)
[10:00:32] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Requesting access to  analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (10Stevemunene) The changes have been merged and @karapayneWMDE now has shell access and is a member of `analytics-wmde-users`
[10:00:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[10:01:09] <wikibugs>	 (03PS1) 10JMeybohm: Revert "aux: set calico typha to one replica" [deployment-charts] - 10https://gerrit.wikimedia.org/r/948597 (https://phabricator.wikimedia.org/T333302)
[10:01:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3005.esams.wmnet with OS bullseye
[10:02:08] <taavi>	 jouncebot: nowandnext
[10:02:08] <jouncebot>	 For the next 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T1000)
[10:02:08] <jouncebot>	 In 1 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T1200)
[10:02:50] <wikibugs>	 (03PS3) 10JMeybohm: Remove podAntiAffinity for calico-typha on aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/949002 (https://phabricator.wikimedia.org/T344230)
[10:04:55] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3073.esams.wmnet with reason: host reimage
[10:05:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "aux: set calico typha to one replica" [deployment-charts] - 10https://gerrit.wikimedia.org/r/948597 (https://phabricator.wikimedia.org/T333302) (owner: 10JMeybohm)
[10:05:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[10:06:10] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:06:16] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3071.esams.wmnet with reason: host reimage
[10:09:08] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp3073.esams.wmnet with reason: host reimage
[10:11:37] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp3071.esams.wmnet with reason: host reimage
[10:12:26] <wikibugs>	 (03PS2) 10Stevemunene: airflow-wmde: Create analytics-wmde airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648)
[10:12:36] <wikibugs>	 10sre-alert-triage, 10Machine-Learning-Team, 10Patch-For-Review: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (10JMeybohm) I might be missing something here, but what issues did you have with the...
[10:13:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] airflow-wmde: Create analytics-wmde airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[10:14:36] <wikibugs>	 (03PS1) 10Muehlenhoff: Add dummy certs for ganeti02.svc.esams.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/949004
[10:16:12] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3079.esams.wmnet with reason: host reimage
[10:17:55] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add dummy certs for ganeti02.svc.esams.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/949004 (owner: 10Muehlenhoff)
[10:18:57] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/947316 (owner: 10Muehlenhoff)
[10:19:43] <wikibugs>	 (03PS1) 10Ssingh: cp306[79], cp307[57]:  update site.pp and related configs for cp roles [puppet] - 10https://gerrit.wikimedia.org/r/949006 (https://phabricator.wikimedia.org/T344174)
[10:19:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cp306[79], cp307[57]:  update site.pp and related configs for cp roles [puppet] - 10https://gerrit.wikimedia.org/r/949006 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[10:20:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:20:15] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp3079.esams.wmnet with reason: host reimage
[10:21:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3005.esams.wmnet with reason: host reimage
[10:21:56] <wikibugs>	 (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949006 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[10:24:20] <wikibugs>	 (03PS1) 10Ssingh: cp306[79], cp307[57]: update site.pp and related configs for cp roles [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174)
[10:24:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cp306[79], cp307[57]: update site.pp and related configs for cp roles [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[10:24:40] <sukhe>	 sigh
[10:24:48] <wikibugs>	 (03Abandoned) 10Ssingh: cp306[79], cp307[57]:  update site.pp and related configs for cp roles [puppet] - 10https://gerrit.wikimedia.org/r/949006 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[10:24:52] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubestagemaster1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:25:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:25:15] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti3005.esams.wmnet with reason: host reimage
[10:27:11] <wikibugs>	 (03PS1) 10Ssingh: hiera: add new DNS host in esams, dns3003 [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174)
[10:27:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] hiera: add new DNS host in esams, dns3003 [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[10:27:23] <sukhe>	 ok
[10:27:27] <sukhe>	 so something is up with CI
[10:29:21] <sukhe>	 GitCommandError: Cmd('git') failed due to: exit code(128) cmdline: git fetch --force --tags -v origin stderr: 'fatal: Could not read from remote repository.
[10:29:35] <sukhe>	 contint1002
[10:30:25] <wikibugs>	 (03PS2) 10Fabfur: cp306[79], cp307[57]: update site.pp and related configs for cp roles [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[10:30:34] <wikibugs>	 (03CR) 10jenkins-bot: cp306[79], cp307[57]: update site.pp and related configs for cp roles [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[10:31:56] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[10:32:54] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[10:32:54] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3073.esams.wmnet with OS bullseye
[10:33:22] <wikibugs>	 (03PS1) 10Muehlenhoff: Add cert for ganeti02.svc.esams.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/949011
[10:33:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add cert for ganeti02.svc.esams.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/949011 (owner: 10Muehlenhoff)
[10:33:50] <sukhe>	 moritzm: ^ broken CI
[10:34:41] <wikibugs>	 (03PS2) 10Muehlenhoff: Add cert for ganeti02.svc.esams.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/949011
[10:34:49] <moritzm>	 lovely :-)
[10:34:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add cert for ganeti02.svc.esams.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/949011 (owner: 10Muehlenhoff)
[10:35:44] <sukhe>	 yeah I asked in releng
[10:36:38] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[10:36:55] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10ssingh)
[10:37:33] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[10:37:33] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3071.esams.wmnet with OS bullseye
[10:38:41] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10ssingh)
[10:40:23] <wikibugs>	 (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[10:41:08] <sukhe>	 no, still broken :)
[10:41:56] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001"
[10:42:53] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001"
[10:42:53] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3079.esams.wmnet with OS bullseye
[10:43:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[10:43:46] <wikibugs>	 10sre-alert-triage, 10Machine-Learning-Team, 10Patch-For-Review: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (10klausman) The problem is only really relevant for LLMs (Large Language Models), sin...
[10:44:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[10:44:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3005.esams.wmnet with OS bullseye
[10:44:52] <wikibugs>	 (03PS1) 10EoghanGaffney: gitlab: Update config to fix compatibility with swift [puppet] - 10https://gerrit.wikimedia.org/r/949014
[10:45:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] gitlab: Update config to fix compatibility with swift [puppet] - 10https://gerrit.wikimedia.org/r/949014 (owner: 10EoghanGaffney)
[10:45:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3007.esams.wmnet with OS bullseye
[10:46:09] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42894/console" [puppet] - 10https://gerrit.wikimedia.org/r/949014 (owner: 10EoghanGaffney)
[10:48:20] <wikibugs>	 (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[10:52:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:54:10] <wikibugs>	 (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[10:54:24] <sukhe>	 !log zuul@contint1002:/srv/zuul/git/operations/puppet$ git fetch --force --tags -v origin
[10:54:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:52] <sukhe>	 ok that worked
[10:54:55] <sukhe>	 for how long, we will see
[10:55:09] <sukhe>	 moritzm: ^
[10:55:19] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949014 (owner: 10EoghanGaffney)
[10:56:02] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1096.eqiad.wmnet with OS bullseye
[10:56:03] <moritzm>	 sukhe: thanks, I'll give it a shot
[10:56:04] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-db1001.eqiad.wmnet
[10:56:29] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[10:57:51] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] cp306[79], cp307[57]: update site.pp and related configs for cp roles [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[10:58:04] <XioNoX>	 sukhe: what reimage script step is it failing at?
[10:58:28] <sukhe>	 XioNoX: which one?
[10:58:36] <sukhe>	 nothing failed for us, CI was broken :)
[10:58:52] <sukhe>	 reimaging fine so far
[10:58:56] <XioNoX>	 ohh ok
[10:59:14] <XioNoX>	 sukhe: I thought the CI issue blocked the reimage cookbook
[10:59:34] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:59:58] <sukhe>	 ah yeah, well it did in a way, I didn't want to proceed till I got a CI check
[11:00:12] <sukhe>	 we should be done with the cp's soon, doing multiple at once
[11:00:36] <XioNoX>	 cool
[11:00:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:00:42] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp3067.esams.wmnet with OS bullseye
[11:01:08] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10ssingh)
[11:01:52] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3075.esams.wmnet with OS bullseye
[11:02:44] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:02:48] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:03:18] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3069.esams.wmnet with OS bullseye
[11:04:50] <wikibugs>	 (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[11:04:54] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp3077.esams.wmnet with OS bullseye
[11:05:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:05:39] <wikibugs>	 (03PS2) 10Ssingh: hiera: add new DNS host in esams, dns3003 [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174)
[11:05:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] hiera: add new DNS host in esams, dns3003 [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[11:06:12] <wikibugs>	 (03PS3) 10Muehlenhoff: Add cert for ganeti02.svc.esams.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/949011
[11:06:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add cert for ganeti02.svc.esams.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/949011 (owner: 10Muehlenhoff)
[11:06:25] <sukhe>	 yeah it's back
[11:07:03] <sukhe>	 manually fetched again
[11:07:09] <sukhe>	 not a solution, but no time right now to debug :)
[11:07:09] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host an-db1001.eqiad.wmnet
[11:07:17] <wikibugs>	 (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[11:08:06] <icinga-wm>	 PROBLEM - Check systemd state on an-db1001 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:08:13] <wikibugs>	 (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[11:10:28] <wikibugs>	 (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[11:16:11] <wikibugs>	 (03PS1) 10Stevemunene: airflow-wmde: Add wmde service user to the Yarn production queue [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648)
[11:16:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] airflow-wmde: Add wmde service user to the Yarn production queue [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[11:20:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:22:15] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3067.esams.wmnet with reason: host reimage
[11:22:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet
[11:22:37] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti3005.esams.wmnet
[11:22:41] <sukhe>	 sukhe@contint2002:~$ sudo systemctl restart zuul
[11:22:47] <sukhe>	 !log sukhe@contint2002:~$ sudo systemctl restart zuul
[11:22:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:14] <wikibugs>	 (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[11:23:46] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3075.esams.wmnet with reason: host reimage
[11:24:16] <sukhe>	 CI should be back
[11:24:24] <sukhe>	 thanks to RhinosF1 for finding the right task
[11:24:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet
[11:24:26] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti3005.esams.wmnet
[11:24:44] <RhinosF1>	 sukhe: it's fine
[11:24:55] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3069.esams.wmnet with reason: host reimage
[11:25:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:25:54] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubestagemaster1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:26:36] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:26:39] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp3067.esams.wmnet with reason: host reimage
[11:26:47] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3077.esams.wmnet with reason: host reimage
[11:26:51] <wikibugs>	 (03CR) 10Stevemunene: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[11:27:39] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp3075.esams.wmnet with reason: host reimage
[11:28:34] <wikibugs>	 (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949011 (owner: 10Muehlenhoff)
[11:29:15] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp3069.esams.wmnet with reason: host reimage
[11:30:12] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[11:30:33] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:30:42] <icinga-wm>	 RECOVERY - Check systemd state on an-db1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:38] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1623 days) https://wikitech.wikimedia.org/wiki/Logs
[11:31:47] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp3077.esams.wmnet with reason: host reimage
[11:32:24] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve1007 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:35:33] <jinxer-wm>	 (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:36:58] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:39:56] <wikibugs>	 (03PS2) 10EoghanGaffney: gitlab: Update config to fix compatibility with swift [puppet] - 10https://gerrit.wikimedia.org/r/949014
[11:40:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] gitlab: Update config to fix compatibility with swift [puppet] - 10https://gerrit.wikimedia.org/r/949014 (owner: 10EoghanGaffney)
[11:49:15] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[11:50:10] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[11:50:10] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3075.esams.wmnet with OS bullseye
[11:50:30] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: user-runtime-dir@23938.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:03] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001"
[11:52:58] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[11:53:43] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001"
[11:53:43] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3067.esams.wmnet with OS bullseye
[11:54:13] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[11:54:13] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3069.esams.wmnet with OS bullseye
[11:54:39] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001"
[11:55:43] <wikibugs>	 (03CR) 10Stevemunene: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[11:58:32] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:37] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10ssingh)
[11:58:50] <wikibugs>	 (03PS1) 10Ayounsi: Add new esams switches to icinga hostgroups [puppet] - 10https://gerrit.wikimedia.org/r/949023 (https://phabricator.wikimedia.org/T329219)
[11:58:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add new esams switches to icinga hostgroups [puppet] - 10https://gerrit.wikimedia.org/r/949023 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[11:59:48] <wikibugs>	 (03PS2) 10Ayounsi: Add new esams switches to icinga hostgroups [puppet] - 10https://gerrit.wikimedia.org/r/949023 (https://phabricator.wikimedia.org/T329219)
[11:59:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add new esams switches to icinga hostgroups [puppet] - 10https://gerrit.wikimedia.org/r/949023 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[12:00:00] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/949023 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T1200)
[12:01:00] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10Fabfur)
[12:02:27] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001"
[12:02:27] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3077.esams.wmnet with OS bullseye
[12:02:39] <sukhe>	 !log sukhe@contint2002:~$ sudo systemctl restart zuul
[12:02:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:49] <sukhe>	 !log sukhe@contint2002:~$ sudo systemctl restart zuul: T344238
[12:02:49] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add new esams switches to icinga hostgroups [puppet] - 10https://gerrit.wikimedia.org/r/949023 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[12:02:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:52] <stashbot>	 T344238: CI "Merge Failed. because cross-repo dependencies" on CI jobs, even up-to-date ones - https://phabricator.wikimedia.org/T344238
[12:02:56] <wikibugs>	 (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949023 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[12:03:15] <sukhe>	 XioNoX: oh ok
[12:03:26] <sukhe>	 I just did a recheck but yeah, pcc wins
[12:04:18] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti3007.esams.wmnet with OS bullseye
[12:08:51] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] idp_test: add datahub_staging as a OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/944231 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene)
[12:10:05] <wikibugs>	 (03CR) 10EoghanGaffney: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949014 (owner: 10EoghanGaffney)
[12:10:34] <wikibugs>	 (03PS1) 10Jelto: gerrit: raise maxConnectionsPerUser to 8 [puppet] - 10https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238)
[12:12:07] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sw - ayounsi@cumin1001"
[12:12:07] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "LGTM, good idea" [puppet] - 10https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) (owner: 10Jelto)
[12:12:52] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sw - ayounsi@cumin1001"
[12:14:09] <wikibugs>	 (03CR) 10Jelto: "note: the value was reduced from 32 to 4 in 2019: I30afd4ff3d8527aa3eb3280b81a840367f64918c" [puppet] - 10https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) (owner: 10Jelto)
[12:16:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] admin_ng: increase resources for calico on wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/948091 (https://phabricator.wikimedia.org/T343900) (owner: 10Elukey)
[12:18:29] <wikibugs>	 (03PS1) 10Anzx: Remove knwiktionary tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949029 (https://phabricator.wikimedia.org/T343662)
[12:18:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Remove knwiktionary tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949029 (https://phabricator.wikimedia.org/T343662) (owner: 10Anzx)
[12:24:54] <icinga-wm>	 PROBLEM - carbon-cache@b service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:24:56] <icinga-wm>	 PROBLEM - carbon-cache@d service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@d is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:25:00] <icinga-wm>	 PROBLEM - carbon-cache@g service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@g is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:25:08] <icinga-wm>	 PROBLEM - carbon-local-relay service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-local-relay is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:25:26] <icinga-wm>	 PROBLEM - carbon-frontend-relay service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-frontend-relay is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:25:28] <icinga-wm>	 PROBLEM - carbon-cache@f service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@f is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:25:36] <icinga-wm>	 PROBLEM - carbon-cache@e service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:25:46] <icinga-wm>	 PROBLEM - carbon-cache@a service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:25:48] <icinga-wm>	 PROBLEM - carbon-cache@h service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@h is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:26:26] <icinga-wm>	 RECOVERY - carbon-cache@b service on cloudmetrics1003 is OK: OK - carbon-cache@b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:26:26] <icinga-wm>	 RECOVERY - carbon-cache@d service on cloudmetrics1003 is OK: OK - carbon-cache@d is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:26:32] <icinga-wm>	 RECOVERY - carbon-cache@g service on cloudmetrics1003 is OK: OK - carbon-cache@g is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:26:40] <icinga-wm>	 RECOVERY - carbon-local-relay service on cloudmetrics1003 is OK: OK - carbon-local-relay is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:26:54] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:26:58] <icinga-wm>	 RECOVERY - carbon-frontend-relay service on cloudmetrics1003 is OK: OK - carbon-frontend-relay is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:27:00] <icinga-wm>	 RECOVERY - carbon-cache@f service on cloudmetrics1003 is OK: OK - carbon-cache@f is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:27:10] <icinga-wm>	 RECOVERY - carbon-cache@e service on cloudmetrics1003 is OK: OK - carbon-cache@e is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:27:20] <icinga-wm>	 RECOVERY - carbon-cache@a service on cloudmetrics1003 is OK: OK - carbon-cache@a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:27:20] <icinga-wm>	 RECOVERY - carbon-cache@h service on cloudmetrics1003 is OK: OK - carbon-cache@h is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:32:42] <wikibugs>	 (03CR) 10Michael Große: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949030 (https://phabricator.wikimedia.org/T343409) (owner: 10Michael Große)
[12:34:39] <wikibugs>	 (03PS1) 10Anzx: Remove knwiktionary tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949032 (https://phabricator.wikimedia.org/T343662)
[12:35:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Remove podAntiAffinity for calico-typha on aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/949002 (https://phabricator.wikimedia.org/T344230) (owner: 10JMeybohm)
[12:35:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "aux: set calico typha to one replica" [deployment-charts] - 10https://gerrit.wikimedia.org/r/948597 (https://phabricator.wikimedia.org/T333302) (owner: 10JMeybohm)
[12:36:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:36:30] <wikibugs>	 (03PS2) 10Anzx: Remove knwiktionary tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949029 (https://phabricator.wikimedia.org/T343662)
[12:36:38] <logmsgbot>	 !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[12:37:06] <wikibugs>	 (03PS1) 10Urbanecm: Growth: Enable new Impact backend on large Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949033 (https://phabricator.wikimedia.org/T344143)
[12:37:08] <wikibugs>	 (03PS1) 10Urbanecm: Growth: Enable new Impact backend everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949034 (https://phabricator.wikimedia.org/T344143)
[12:37:17] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949034 (https://phabricator.wikimedia.org/T344143) (owner: 10Urbanecm)
[12:37:23] <wikibugs>	 (03Abandoned) 10Anzx: Remove knwiktionary tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949032 (https://phabricator.wikimedia.org/T343662) (owner: 10Anzx)
[12:40:34] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:42:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:42:30] <wikibugs>	 (03PS1) 10Ayounsi: Add new-esams network infra to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/949035 (https://phabricator.wikimedia.org/T329219)
[12:43:03] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] admin_ng: increase resources for calico on wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/948091 (https://phabricator.wikimedia.org/T343900) (owner: 10Elukey)
[12:43:30] <wikibugs>	 (03PS1) 10Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890)
[12:44:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira)
[12:44:12] <wikibugs>	 (03PS2) 10Ayounsi: Add new-esams network infra to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/949035 (https://phabricator.wikimedia.org/T329219)
[12:44:35] <wikibugs>	 (03PS3) 10Ayounsi: Add new-esams network infra to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/949035 (https://phabricator.wikimedia.org/T329219)
[12:44:41] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/949035 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[12:44:48] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Add new-esams network infra to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/949035 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[12:45:44] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: increase resources for calico on wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/948091 (https://phabricator.wikimedia.org/T343900) (owner: 10Elukey)
[12:45:59] <wikibugs>	 (03PS4) 10Ayounsi: Add new-esams network infra to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/949035 (https://phabricator.wikimedia.org/T329219)
[12:46:26] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:47:47] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/949035 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[12:47:54] <logmsgbot>	 !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[12:48:30] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/949014 (owner: 10EoghanGaffney)
[12:49:51] <wikibugs>	 (03Abandoned) 10Jelto: gerrit: add blackbox check for json endpoint [puppet] - 10https://gerrit.wikimedia.org/r/948555 (owner: 10Jelto)
[12:50:21] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add new-esams network infra to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/949035 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[12:51:10] <wikibugs>	 (03PS1) 10David Caro: role::wmcs::monitoring: remove unused envoy options [puppet] - 10https://gerrit.wikimedia.org/r/949037 (https://phabricator.wikimedia.org/T344242)
[12:51:16] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] icinga: remove obsolete gerrit checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948552 (owner: 10Filippo Giunchedi)
[12:52:37] <wikibugs>	 (03PS1) 10Anzx: Remove knwiktionary tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949038 (https://phabricator.wikimedia.org/T343662)
[12:56:24] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti3005']
[12:59:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:59:48] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T1300).
[13:00:04] <jouncebot>	 sergi0 and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:18] <sergi0>	 hello
[13:00:23] <urbanecm>	 I can deploy today
[13:00:29] <aanzx>	 o/
[13:00:34] <urbanecm>	 or actually, sergi0, since you're also a deployer, do you want to try to deploy your patch? :)
[13:00:55] <taavi>	 o/ also around, but would prefer not to deploy
[13:01:03] <sergi0>	 urbanecm: I didn't do my training :(
[13:02:41] <urbanecm>	 sergi0: ah. i can share my screen instead if you're interested in watching the deployment. i'm also happy to supervise your deployment monitoring your screen. with `scap backport`, it's not as difficult as it used to be :))
[13:03:09] <sergi0>	 urbanecm: yeah let's do it
[13:03:17] <urbanecm>	 which one? 
[13:03:32] <sergi0>	 let me watch first :)
[13:03:35] <urbanecm>	 ok
[13:04:09] <urbanecm>	 sergi0: see slack for meeting link
[13:04:17] <wikibugs>	 (03PS2) 10David Caro: role::wmcs::monitoring: pass through the ensure option [puppet] - 10https://gerrit.wikimedia.org/r/949037 (https://phabricator.wikimedia.org/T344242)
[13:04:36] <wikibugs>	 (03PS1) 10Ayounsi: Remove esams from ripeatlas_measurements [puppet] - 10https://gerrit.wikimedia.org/r/949039
[13:04:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:05:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Remove esams from ripeatlas_measurements [puppet] - 10https://gerrit.wikimedia.org/r/949039 (owner: 10Ayounsi)
[13:05:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948631 (https://phabricator.wikimedia.org/T308138) (owner: 10Sergio Gimeno)
[13:05:15] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Remove esams from ripeatlas_measurements [puppet] - 10https://gerrit.wikimedia.org/r/949039 (owner: 10Ayounsi)
[13:05:46] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: enable AddLink backend 13th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948631 (https://phabricator.wikimedia.org/T308138) (owner: 10Sergio Gimeno)
[13:05:58] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti3005']
[13:06:14] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:948631|GrowthExperiments: enable AddLink backend 13th round of wikis (T308138)]]
[13:06:18] <stashbot>	 T308138: Deploy "add a link" to 13th round of wikis - https://phabricator.wikimedia.org/T308138
[13:06:36] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti3007']
[13:06:52] <wikibugs>	 (03PS3) 10David Caro: role::wmcs::monitoring: pass through the ensure option [puppet] - 10https://gerrit.wikimedia.org/r/949037 (https://phabricator.wikimedia.org/T344242)
[13:07:53] <logmsgbot>	 !log urbanecm@deploy1002 sgimeno and urbanecm: Backport for [[gerrit:948631|GrowthExperiments: enable AddLink backend 13th round of wikis (T308138)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:08:13] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42897/console" [puppet] - 10https://gerrit.wikimedia.org/r/949037 (https://phabricator.wikimedia.org/T344242) (owner: 10David Caro)
[13:08:36] <urbanecm>	 sergi0: please test on mwdebug1001 
[13:09:11] <sergi0>	 urbanecm: this is a noop change, will trigger a periodic job but I can run one manually
[13:09:57] <wikibugs>	 (03PS4) 10Muehlenhoff: Add cert for ganeti02.svc.esams.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/949011
[13:10:26] <logmsgbot>	 !log urbanecm@deploy1002 sgimeno and urbanecm: Continuing with sync
[13:15:57] <wikibugs>	 (03PS6) 10AOkoth: vrts: add test VM to site [puppet] - 10https://gerrit.wikimedia.org/r/939349 (https://phabricator.wikimedia.org/T340027)
[13:15:59] <wikibugs>	 (03PS1) 10AOkoth: contint2001: puppet cleanup post decom [puppet] - 10https://gerrit.wikimedia.org/r/949040 (https://phabricator.wikimedia.org/T342017)
[13:16:19] <wikibugs>	 (03CR) 10Stevemunene: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[13:16:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] vrts: add test VM to site [puppet] - 10https://gerrit.wikimedia.org/r/939349 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[13:17:02] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:948631|GrowthExperiments: enable AddLink backend 13th round of wikis (T308138)]] (duration: 10m 47s)
[13:17:06] <stashbot>	 T308138: Deploy "add a link" to 13th round of wikis - https://phabricator.wikimedia.org/T308138
[13:17:08] <urbanecm>	 sergi0: deployed :)
[13:17:09] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti3007']
[13:17:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add cert for ganeti02.svc.esams.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/949011 (owner: 10Muehlenhoff)
[13:17:29] <wikibugs>	 (03PS2) 10Urbanecm: Remove knwiktionary tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949038 (https://phabricator.wikimedia.org/T343662) (owner: 10Anzx)
[13:17:51] <sergi0>	 urbanecm: thank you so much! Almost graduated :)
[13:17:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949038 (https://phabricator.wikimedia.org/T343662) (owner: 10Anzx)
[13:18:09] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti3007']
[13:18:37] <wikibugs>	 (03Merged) 10jenkins-bot: Remove knwiktionary tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949038 (https://phabricator.wikimedia.org/T343662) (owner: 10Anzx)
[13:19:03] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949038|Remove knwiktionary tagline (T343662)]]
[13:19:05] <wikibugs>	 (03PS2) 10AOkoth: contint2001: puppet cleanup post decom [puppet] - 10https://gerrit.wikimedia.org/r/949040 (https://phabricator.wikimedia.org/T342017)
[13:19:06] <stashbot>	 T343662: update knwiktionary logos - https://phabricator.wikimedia.org/T343662
[13:19:20] <urbanecm>	 aanzx: your patch is up next. will ping you once testable on mwdebug.
[13:19:30] <aanzx>	 urbanecm: ok 
[13:20:19] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "IP looks good, checked on netbox" [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[13:20:41] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and anzx: Backport for [[gerrit:949038|Remove knwiktionary tagline (T343662)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:20:48] <aanzx>	 Testing 
[13:20:49] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] hiera: add new DNS host in esams, dns3003 [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[13:20:53] <urbanecm>	 thanks :)
[13:21:40] <wikibugs>	 (03PS3) 10AOkoth: contint2001: puppet cleanup post decom [puppet] - 10https://gerrit.wikimedia.org/r/949040 (https://phabricator.wikimedia.org/T342017)
[13:22:42] <aanzx>	 urbanecm: tested, looks good
[13:22:57] <urbanecm>	 aanzx: thanks, proceeding
[13:22:58] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[13:22:59] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and anzx: Continuing with sync
[13:23:04] <logmsgbot>	 !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:23:17] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host dns3003.wikimedia.org with OS bullseye
[13:23:55] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti3007']
[13:23:57] <icinga-wm>	 PROBLEM - Host cr1-esams is DOWN: PING CRITICAL - Packet loss = 100%
[13:23:57] <icinga-wm>	 PROBLEM - Host cr1-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[13:24:10] <logmsgbot>	 !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:25:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3007.esams.wmnet with OS bullseye
[13:28:42] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] contint2001: puppet cleanup post decom [puppet] - 10https://gerrit.wikimedia.org/r/949040 (https://phabricator.wikimedia.org/T342017) (owner: 10AOkoth)
[13:29:23] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949038|Remove knwiktionary tagline (T343662)]] (duration: 10m 20s)
[13:29:24] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:29:27] <stashbot>	 T343662: update knwiktionary logos - https://phabricator.wikimedia.org/T343662
[13:29:31] <urbanecm>	 aanzx: deployed :)
[13:29:40] <aanzx>	 urbanecm: thanks 
[13:29:42] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:29:52] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3081 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[13:29:52] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3081 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[13:29:52] <icinga-wm>	 PROBLEM - Check systemd state on cp3081 is CRITICAL: CRITICAL - degraded: The following units failed: haproxy_stek_job.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:29:52] <icinga-wm>	 PROBLEM - traffic-pool service on cp3081 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:29:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet
[13:30:14] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:31:16] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:31:50] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 5.346 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:32:02] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3081 is OK: SSL OK - OCSP staple validity for wikipedia.org has 575701 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 87 days) https://wikitech.wikimedia.org/wiki/HTTPS
[13:32:02] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3081 is OK: SSL OK - OCSP staple validity for wikipedia.org has 590461 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 94 days) https://wikitech.wikimedia.org/wiki/HTTPS
[13:32:24] <sukhe>	 fabfur: ^ recovered :)
[13:32:34] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:32:40] <fabfur>	 sukhe: yep tnx
[13:35:08] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:14] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/947401 (owner: 10PipelineBot)
[13:36:59] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] gitlab: Update config to fix compatibility with swift [puppet] - 10https://gerrit.wikimedia.org/r/949014 (owner: 10EoghanGaffney)
[13:37:05] <wikibugs>	 (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/947401 (owner: 10PipelineBot)
[13:37:30] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:38:31] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp3081.esams.wmnet with OS bullseye
[13:38:50] <icinga-wm>	 PROBLEM - puppet last run on gitlab1003 is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:42:56] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:44:12] <icinga-wm>	 RECOVERY - puppet last run on gitlab1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:44:34] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[13:44:45] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns3003.wikimedia.org with OS bullseye
[13:45:01] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[13:45:17] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[13:45:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3007.esams.wmnet with reason: host reimage
[13:45:48] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[13:46:58] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[13:47:32] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[13:48:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3007.esams.wmnet with reason: host reimage
[13:49:33] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10Jhancock.wm) 05Open→03Resolved stayed steady for 24 hours. closing.
[13:51:31] <wikibugs>	 10ops-codfw, 10serviceops-radar, 10Maps (Maps-data): ManagementSSHDown - https://phabricator.wikimedia.org/T344110 (10Jhancock.wm) Dell sent me a list of checks to determine if it's the motherboard or the backplane. followed directions and replied. my guess is the MB will need to be replaced. will update whe...
[13:51:37] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Add IP pre-assignments for new lvs servers in Amsterdam [puppet] - 10https://gerrit.wikimedia.org/r/944875 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney)
[13:51:44] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host dns3003.wikimedia.org with OS bullseye
[13:53:08] <wikibugs>	 (03PS1) 10Jgiannelos: wikifeeds: Use GET instead of POST for mwapi requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/949046
[13:54:04] <wikibugs>	 (03PS2) 10Jgiannelos: wikifeeds: Use GET instead of POST for mwapi requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/949046 (https://phabricator.wikimedia.org/T343950)
[13:59:55] <wikibugs>	 10sre-alert-triage, 10Machine-Learning-Team: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (10JMeybohm) >>! In T343900#9092723, @klausman wrote: > The problem is only really relevant for LLMs (Large...
[14:00:10] <wikibugs>	 (03CR) 10Jgiannelos: "We already made the same change in the codebase repo but the config is overridden here." [deployment-charts] - 10https://gerrit.wikimedia.org/r/949046 (https://phabricator.wikimedia.org/T343950) (owner: 10Jgiannelos)
[14:00:52] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3081.esams.wmnet with reason: host reimage
[14:03:58] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3081.esams.wmnet with reason: host reimage
[14:07:48] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ganeti3005.esams.wmnet
[14:09:05] <wikibugs>	 (03PS3) 10Stevemunene: airflow-wmde: Create analytics-wmde airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648)
[14:09:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet
[14:11:18] <wikibugs>	 (03PS1) 10Ssingh: esams: add new LVS high-traffic2 host, lvs3009 [puppet] - 10https://gerrit.wikimedia.org/r/949052 (https://phabricator.wikimedia.org/T344174)
[14:11:38] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:11:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] esams: add new LVS high-traffic2 host, lvs3009 [puppet] - 10https://gerrit.wikimedia.org/r/949052 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[14:12:57] <wikibugs>	 (03CR) 10Ssingh: "CI? KeyError: key not found: "PARALLEL_PID_FILE"" [puppet] - 10https://gerrit.wikimedia.org/r/949052 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[14:13:00] <wikibugs>	 (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949052 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[14:13:52] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:14:15] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns3003.wikimedia.org with OS bullseye
[14:14:50] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] role::wmcs::monitoring: pass through the ensure option [puppet] - 10https://gerrit.wikimedia.org/r/949037 (https://phabricator.wikimedia.org/T344242) (owner: 10David Caro)
[14:16:38] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:17:51] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1097.eqiad.wmnet with OS bullseye
[14:21:34] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/949052 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[14:21:43] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] esams: add new LVS high-traffic2 host, lvs3009 [puppet] - 10https://gerrit.wikimedia.org/r/949052 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[14:22:57] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs3009.esams.wmnet with OS bullseye
[14:23:13] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti3005.esams.wmnet
[14:25:48] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:26:12] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3081.esams.wmnet with OS bullseye
[14:26:24] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1015.eqiad.wmnet with OS bullseye
[14:26:47] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1014.eqiad.wmnet with OS bullseye
[14:27:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet
[14:27:54] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:28:58] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:33:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[14:33:09] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:34:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:34:51] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1097.eqiad.wmnet with reason: host reimage
[14:34:52] <wikibugs>	 (03Abandoned) 10Jdrewniak: Enable Vector "Zebra" AB test on Hebrew wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922564 (https://phabricator.wikimedia.org/T335972) (owner: 10Jdrewniak)
[14:37:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[14:37:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3007.esams.wmnet with OS bullseye
[14:38:06] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1097.eqiad.wmnet with reason: host reimage
[14:38:27] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:38:35] <icinga-wm>	 PROBLEM - Recursive DNS on 185.15.59.34 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[14:38:49] <sukhe>	 yeah this is fine
[14:38:51] <sukhe>	 ^ 
[14:39:44] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3005.mgmt.esams.wmnet with reboot policy GRACEFUL
[14:42:59] <wikibugs>	 10ops-eqiad, 10Cassandra: restbase1030: Cassandra crashing (signal 11) - https://phabricator.wikimedia.org/T344210 (10Eevans) p:05Triage→03High I think the upgrade to 4.1.1 is a red herring.  This seems to be limited to a single instance (one of three), and each sig 11 corresponds with the device errors ab...
[14:43:05] <wikibugs>	 (03CR) 10Herron: thanos-fe: switch to cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948125 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron)
[14:43:36] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ganeti3005.esams.wmnet
[14:43:38] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10fkaelin) Thanks @colewhite. - As part of the data eng onboarding (T267817), I signed the L3 and a LDAP user should have been created.  - This is the wikitech [[ https://wikitech.wikimedia.or...
[14:45:08] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk2001.codfw.wmnet
[14:45:09] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[14:45:55] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:48:05] <wikibugs>	 (03PS1) 10Ssingh: hiera: update ifaces names for lvs3009 [puppet] - 10https://gerrit.wikimedia.org/r/949057 (https://phabricator.wikimedia.org/T344174)
[14:49:04] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: update ifaces names for lvs3009 [puppet] - 10https://gerrit.wikimedia.org/r/949057 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[14:49:09] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1015.eqiad.wmnet with reason: host reimage
[14:49:17] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1014.eqiad.wmnet with reason: host reimage
[14:50:27] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3005.mgmt.esams.wmnet with reboot policy GRACEFUL
[14:50:41] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10RhinosF1)
[14:51:24] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2001.codfw.wmnet - bking@cumin1001"
[14:51:31] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs3009.esams.wmnet with OS bullseye
[14:52:10] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2001.codfw.wmnet - bking@cumin1001"
[14:52:10] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:52:10] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk2001.codfw.wmnet on all recursors
[14:52:14] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) flink-zk2001.codfw.wmnet on all recursors
[14:52:26] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, sql_labm SSH key entry, Kerberos Principal, Team Shell (posix) membership for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10OSefu-WMF)
[14:52:36] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1015.eqiad.wmnet with reason: host reimage
[14:53:08] <icinga-wm>	 PROBLEM - NTP peers on dns3003 is CRITICAL: NTP CRITICAL: No response from NTP server https://wikitech.wikimedia.org/wiki/NTP
[14:54:06] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[14:54:50] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1014.eqiad.wmnet with reason: host reimage
[14:54:52] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host lvs3009.mgmt.esams.wmnet with reboot policy FORCED
[14:55:27] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs3009.mgmt.esams.wmnet with reboot policy FORCED
[14:56:02] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host lvs3009.mgmt.esams.wmnet with reboot policy FORCED
[14:57:49] <wikibugs>	 (03PS13) 10David Caro: WIP: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[14:59:15] <wikibugs>	 (03PS1) 10Ssingh: Revert "hiera: add new DNS host in esams, dns3003" [puppet] - 10https://gerrit.wikimedia.org/r/948600
[14:59:38] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:59:55] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Revert "hiera: add new DNS host in esams, dns3003" [puppet] - 10https://gerrit.wikimedia.org/r/948600 (owner: 10Ssingh)
[15:00:02] <icinga-wm>	 PROBLEM - Auth DNS on dns3003 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[15:00:12] <sukhe>	 ^ not fine but fine
[15:00:19] <sukhe>	 as in, not serving prod traffic
[15:00:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[15:04:03] <wikibugs>	 (03CR) 10David Caro: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[15:05:26] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs3009.mgmt.esams.wmnet with reboot policy FORCED
[15:06:17] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs3009.esams.wmnet with OS bullseye
[15:07:48] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1097.eqiad.wmnet with OS bullseye
[15:08:06] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:11:40] <wikibugs>	 (03PS1) 10BCornwall: Revert "Revert "pybal: Make check conform to the Nagios plugin API"" [puppet] - 10https://gerrit.wikimedia.org/r/948601
[15:12:12] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:12:52] <icinga-wm>	 PROBLEM - Recursive DNS on 2a02:ec80:300:2:185:15:59:34 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[15:13:13] <sukhe>	 expected
[15:14:05] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] Revert "Revert "pybal: Make check conform to the Nagios plugin API"" [puppet] - 10https://gerrit.wikimedia.org/r/948601 (owner: 10BCornwall)
[15:15:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:22] <wikibugs>	 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Eevans)
[15:19:28] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:21:49] <wikibugs>	 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Eevans)
[15:26:00] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:27:05] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3009.esams.wmnet with reason: host reimage
[15:27:42] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1014.eqiad.wmnet with OS bullseye
[15:29:05] <wikibugs>	 (03PS2) 10Aklapper: Add botadmin group on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940486 (https://phabricator.wikimedia.org/T342484) (owner: 10Hamish)
[15:29:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3007.esams.wmnet
[15:29:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add botadmin group on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940486 (https://phabricator.wikimedia.org/T342484) (owner: 10Hamish)
[15:29:47] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[15:29:49] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk2001.codfw.wmnet - bking@cumin1001"
[15:30:42] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk2001.codfw.wmnet - bking@cumin1001"
[15:30:42] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:30:42] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk2001.codfw.wmnet on all recursors
[15:30:45] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) flink-zk2001.codfw.wmnet on all recursors
[15:32:19] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3009.esams.wmnet with reason: host reimage
[15:32:52] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rack bw27 hosts - robh@cumin1001"
[15:33:05] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:33:36] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rack bw27 hosts - robh@cumin1001"
[15:33:36] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:34:05] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:34:59] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk2001.codfw.wmnet
[15:35:07] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host lvs3010
[15:35:19] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs3010
[15:35:24] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host lvs3008
[15:35:45] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs3008
[15:36:15] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk2001.codfw.wmnet
[15:36:17] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[15:36:42] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Requesting access to  analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (10Gehel) 05Open→03Resolved
[15:37:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:37:15] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp3080
[15:37:27] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp3080
[15:37:31] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dns3004
[15:37:43] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns3004
[15:37:50] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp3078
[15:38:02] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp3078
[15:38:03] <wikibugs>	 (PS14) David Caro: WIP: replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[15:38:06] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp3076
[15:38:13] <sukhe>	 robh: 
[15:38:16] <sukhe>	 I see DNS changes for dns3004
[15:38:19] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp3076
[15:38:20] <sukhe>	 going to add them
[15:38:24] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp3074
[15:38:27] <sukhe>	 is that fine?
[15:38:36] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp3074
[15:38:41] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp3072
[15:38:57] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp3072
[15:39:01] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp3070
[15:39:12] <wikibugs>	 (PS1) Muehlenhoff: Assign ganeti role to BY27 cluster hosts [puppet] - https://gerrit.wikimedia.org/r/949063
[15:39:15] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp3070
[15:39:21] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2001.codfw.wmnet - bking@cumin1001"
[15:39:25] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp3068
[15:39:37] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp3068
[15:39:41] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp3066
[15:39:49] <wikibugs>	 (PS3) Hamish: Add botadmin group on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/940486 (https://phabricator.wikimedia.org/T342484)
[15:39:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3007.esams.wmnet
[15:40:07] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp3066
[15:40:08] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2001.codfw.wmnet - bking@cumin1001"
[15:40:08] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:40:08] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk2001.codfw.wmnet on all recursors
[15:40:11] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) flink-zk2001.codfw.wmnet on all recursors
[15:40:23] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:40:37] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2001.codfw.wmnet - bking@cumin1001"
[15:41:22] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2001.codfw.wmnet - bking@cumin1001"
[15:41:34] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk2001.codfw.wmnet with OS bookworm
[15:42:04] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[15:42:11] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Assign ganeti role to BY27 cluster hosts [puppet] - https://gerrit.wikimedia.org/r/949063 (owner: Muehlenhoff)
[15:44:09] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns3004 - robh@cumin1001"
[15:44:19] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1015.eqiad.wmnet with OS bullseye
[15:44:54] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns3004 - robh@cumin1001"
[15:44:55] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:47:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (Eevans)
[15:47:49] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:47:50] <wikibugs>	 (PS5) Ryan Kemper: elastic: allow only 1 enwiki_content per host [mediawiki-config] - https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T343820)
[15:49:03] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:49:20] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host lvs3010.mgmt.esams.wmnet with reboot policy FORCED
[15:49:23] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[15:50:03] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host lvs3008.mgmt.esams.wmnet with reboot policy FORCED
[15:50:18] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[15:50:18] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs3009.esams.wmnet with OS bullseye
[15:50:19] <wikibugs>	 (PS15) David Caro: WIP: replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[15:50:21] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3008.mgmt.esams.wmnet with reboot policy FORCED
[15:50:50] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3006.mgmt.esams.wmnet with reboot policy FORCED
[15:51:11] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host dns3004.mgmt.esams.wmnet with reboot policy FORCED
[15:51:47] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3080.mgmt.esams.wmnet with reboot policy FORCED
[15:53:29] <wikibugs>	 (PS16) David Caro: WIP: replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[15:56:42] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124
[15:56:46] <stashbot>	 T343124: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124
[15:56:56] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 14s)
[15:57:44] <wikibugs>	 (PS1) Ayounsi: Rancid: esams migration [puppet] - https://gerrit.wikimedia.org/r/949072 (https://phabricator.wikimedia.org/T329219)
[15:58:03] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124
[15:58:18] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 15s)
[15:58:39] <wikibugs>	 (CR) Ayounsi: [C: +2] Rancid: esams migration [puppet] - https://gerrit.wikimedia.org/r/949072 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi)
[15:59:43] <wikibugs>	 (PS1) Ayounsi: Rancid: typo [puppet] - https://gerrit.wikimedia.org/r/949073
[15:59:53] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[16:00:05] <jouncebot>	 jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:20] <wikibugs>	 (CR) Ayounsi: [C: +2] Rancid: typo [puppet] - https://gerrit.wikimedia.org/r/949073 (owner: Ayounsi)
[16:00:30] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[16:02:44] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs3008.mgmt.esams.wmnet with reboot policy FORCED
[16:02:52] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3006.mgmt.esams.wmnet with reboot policy FORCED
[16:02:53] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:03:49] <wikibugs>	 (CR) BCornwall: [V: +1 C: +2] trafficserver: Use svc urls for eqiad/codfw [puppet] - https://gerrit.wikimedia.org/r/948624 (https://phabricator.wikimedia.org/T326657) (owner: BCornwall)
[16:05:10] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1098.eqiad.wmnet with OS bullseye
[16:08:33] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3008.mgmt.esams.wmnet with reboot policy FORCED
[16:09:06] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns3004.mgmt.esams.wmnet with reboot policy FORCED
[16:09:13] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs3010.mgmt.esams.wmnet with reboot policy FORCED
[16:09:45] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3078.mgmt.esams.wmnet with reboot policy FORCED
[16:09:48] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3080.mgmt.esams.wmnet with reboot policy FORCED
[16:10:27] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3076.mgmt.esams.wmnet with reboot policy FORCED
[16:10:49] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3074.mgmt.esams.wmnet with reboot policy FORCED
[16:11:17] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3072.mgmt.esams.wmnet with reboot policy FORCED
[16:11:26] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3070.mgmt.esams.wmnet with reboot policy FORCED
[16:11:38] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3068.mgmt.esams.wmnet with reboot policy FORCED
[16:12:40] <wikibugs>	 ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T344269 (phaultfinder)
[16:14:45] <wikibugs>	 (CR) BCornwall: [C: +2] Release 1.9-4 to target Bookworm [software/varnish/libvmod-netmapper] (debian) - https://gerrit.wikimedia.org/r/946604 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[16:18:14] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[16:19:39] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:20:30] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[16:20:51] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[16:21:35] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1098.eqiad.wmnet with reason: host reimage
[16:24:10] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[16:24:27] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1098.eqiad.wmnet with reason: host reimage
[16:25:19] <wikibugs>	 (PS1) Ssingh: Revert "sre.hosts.reimage: connect to the micro service port" [cookbooks] - https://gerrit.wikimedia.org/r/948603
[16:26:31] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:42] <wikibugs>	 (CR) Ssingh: "For posterity and based on discussions with Jesse and Arzhel, this line causes multiple entries: https://gerrit.wikimedia.org/r/plugins/gi" [cookbooks] - https://gerrit.wikimedia.org/r/948603 (owner: Ssingh)
[16:27:30] <wikibugs>	 (PS1) Ayounsi: Revert "sre.hosts.reimage: connect to the micro service port" [cookbooks] - https://gerrit.wikimedia.org/r/948604
[16:27:56] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3076.mgmt.esams.wmnet with reboot policy FORCED
[16:28:10] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3078.mgmt.esams.wmnet with reboot policy FORCED
[16:28:40] <wikibugs>	 (CR) BCornwall: [C: +1] Revert "sre.hosts.reimage: connect to the micro service port" [cookbooks] - https://gerrit.wikimedia.org/r/948603 (owner: Ssingh)
[16:28:56] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3074.mgmt.esams.wmnet with reboot policy FORCED
[16:29:18] <wikibugs>	 (CR) Ssingh: [C: +1] Revert "sre.hosts.reimage: connect to the micro service port" [cookbooks] - https://gerrit.wikimedia.org/r/948604 (owner: Ayounsi)
[16:29:34] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3072.mgmt.esams.wmnet with reboot policy FORCED
[16:29:36] <wikibugs>	 (Abandoned) Ssingh: Revert "sre.hosts.reimage: connect to the micro service port" [cookbooks] - https://gerrit.wikimedia.org/r/948603 (owner: Ssingh)
[16:29:42] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3068.mgmt.esams.wmnet with reboot policy FORCED
[16:30:00] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3070.mgmt.esams.wmnet with reboot policy FORCED
[16:30:08] <wikibugs>	 (CR) Ayounsi: [C: +2] Revert "sre.hosts.reimage: connect to the micro service port" [cookbooks] - https://gerrit.wikimedia.org/r/948604 (owner: Ayounsi)
[16:32:04] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3066.mgmt.esams.wmnet with reboot policy FORCED
[16:32:46] <wikibugs>	 (Merged) jenkins-bot: Revert "sre.hosts.reimage: connect to the micro service port" [cookbooks] - https://gerrit.wikimedia.org/r/948604 (owner: Ayounsi)
[16:33:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:33:33] <wikibugs>	 (CR) Ayounsi: [C: +2] "Longer fix might be to ease the "len(json_response) != 1" check." [cookbooks] - https://gerrit.wikimedia.org/r/948604 (owner: Ayounsi)
[16:33:38] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs3010']
[16:36:28] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs3008']
[16:37:01] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti3008']
[16:37:35] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti3006']
[16:37:41] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:37:59] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns3004']
[16:42:17] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs3008']
[16:42:30] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti3008']
[16:43:04] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs3010']
[16:43:55] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns3004']
[16:44:05] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti3006']
[16:44:45] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:44:59] <wikibugs>	 (PS1) Ssingh: esams: provision all cp hosts in rack B27 [puppet] - https://gerrit.wikimedia.org/r/949078 (https://phabricator.wikimedia.org/T344174)
[16:45:02] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs3010']
[16:45:12] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk2002.codfw.wmnet
[16:45:14] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[16:45:37] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3080']
[16:46:01] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3078']
[16:46:59] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3076']
[16:47:24] <wikibugs>	 (PS1) Ssingh: Revert "Revert "hiera: add new DNS host in esams, dns3003"" [puppet] - https://gerrit.wikimedia.org/r/948605
[16:47:47] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:47:54] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[16:47:58] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk2002.codfw.wmnet
[16:48:33] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3074']
[16:48:46] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp3074']
[16:49:31] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1098.eqiad.wmnet with OS bullseye
[16:49:37] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "seems simple enough!" [puppet] - https://gerrit.wikimedia.org/r/948566 (https://phabricator.wikimedia.org/T334585) (owner: David Caro)
[16:49:39] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3066.mgmt.esams.wmnet with reboot policy FORCED
[16:50:45] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3074']
[16:51:46] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3080']
[16:52:27] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3078']
[16:53:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:53:56] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs3010']
[16:54:01] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3076']
[16:54:12] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3072']
[16:55:04] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3070']
[16:55:26] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3068']
[16:55:47] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3066']
[16:56:55] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3074']
[16:57:37] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk2001.codfw.wmnet with OS bookworm
[16:57:37] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk2001.codfw.wmnet
[16:57:39] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T1700)
[17:00:13] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:00:32] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3072']
[17:01:29] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3068']
[17:01:38] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3066']
[17:01:53] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3070']
[17:06:32] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T344135 (RobH) Open→Declined
[17:06:51] <wikibugs>	 (PS1) Ssingh: esams: add new LVS high-traffic1 host, lvs3008 [puppet] - https://gerrit.wikimedia.org/r/949082 (https://phabricator.wikimedia.org/T344174)
[17:06:53] <wikibugs>	 (PS1) Ssingh: esams: add new LVS secondary host, lvs3010 [puppet] - https://gerrit.wikimedia.org/r/949083 (https://phabricator.wikimedia.org/T344174)
[17:08:24] <wikibugs>	 SRE, ops-knams, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (RobH)
[17:09:41] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[17:12:19] <sukhe>	 ^ yeah
[17:14:14] <wikibugs>	 (CR) Fabfur: [C: +2] Revert "Revert "hiera: add new DNS host in esams, dns3003"" [puppet] - https://gerrit.wikimedia.org/r/948605 (owner: Ssingh)
[17:15:25] <wikibugs>	 (CR) BCornwall: [C: +1] esams: provision all cp hosts in rack B27 [puppet] - https://gerrit.wikimedia.org/r/949078 (https://phabricator.wikimedia.org/T344174) (owner: Ssingh)
[17:15:48] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host dns3003.wikimedia.org with OS bullseye
[17:16:48] <fabfur>	 ^^ due to this cookbook running dns changes *may* fail
[17:16:56] <wikibugs>	 (CR) Ssingh: [C: +2] esams: provision all cp hosts in rack B27 [puppet] - https://gerrit.wikimedia.org/r/949078 (https://phabricator.wikimedia.org/T344174) (owner: Ssingh)
[17:20:58] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3066.esams.wmnet with OS bullseye
[17:21:11] <brett>	 !log Upload libvmod-netmapper 1.9-4 (bookworm) to archive - T342154
[17:21:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:14] <stashbot>	 T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154
[17:21:56] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[17:22:11] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3074.esams.wmnet with OS bullseye
[17:24:05] <icinga-wm>	 PROBLEM - Host 2a02:ec80:300:2:185:15:59:34 is DOWN: CRITICAL - Destination Unreachable (2a02:ec80:300:2:185:15:59:34)
[17:24:25] <sukhe>	 ^ expected
[17:24:27] <sukhe>	 fabfur reimaging
[17:25:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:26:01] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:26:37] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve1006 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:29:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes1011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1011 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:29:19] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:29:55] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:31:19] <wikibugs>	 (PS1) Fabfur: hiera: add new DNS host in esams, dns3004 [puppet] - https://gerrit.wikimedia.org/r/949088 (https://phabricator.wikimedia.org/T344174)
[17:32:01] <wikibugs>	 (CR) Ssingh: [C: +1] hiera: add new DNS host in esams, dns3004 [puppet] - https://gerrit.wikimedia.org/r/949088 (https://phabricator.wikimedia.org/T344174) (owner: Fabfur)
[17:33:02] <wikibugs>	 (CR) Fabfur: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/949083 (https://phabricator.wikimedia.org/T344174) (owner: Ssingh)
[17:33:57] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:35:25] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:37:18] <wikibugs>	 (CR) AOkoth: [C: +2] contint2001: puppet cleanup post decom [puppet] - https://gerrit.wikimedia.org/r/949040 (https://phabricator.wikimedia.org/T342017) (owner: AOkoth)
[17:39:21] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dns3003.wikimedia.org with reason: host reimage
[17:40:14] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3068.esams.wmnet with OS bullseye
[17:42:39] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3076.esams.wmnet with OS bullseye
[17:42:47] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns3003.wikimedia.org with reason: host reimage
[17:42:51] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3066.esams.wmnet with reason: host reimage
[17:44:06] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3074.esams.wmnet with reason: host reimage
[17:45:44] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:45:59] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3066.esams.wmnet with reason: host reimage
[17:48:35] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3074.esams.wmnet with reason: host reimage
[17:50:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:50:15] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[17:50:23] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:51:15] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1623 days) https://wikitech.wikimedia.org/wiki/Logs
[17:51:23] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:51:25] <icinga-wm>	 PROBLEM - Recursive DNS on 185.15.59.34 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[17:53:23] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[17:54:40] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[17:55:04] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:59:01] <icinga-wm>	 RECOVERY - Recursive DNS on 185.15.59.34 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[18:00:04] <jouncebot>	 brennen and dancy: That opportune time is upon us again. Time for a MediaWiki train - Utc-7 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T1800).
[18:00:12] <dancy>	 o/
[18:01:18] <dancy>	 ooh, a lot of risky patches this train. Good times.  I do very much appreciate the forewarnings!
[18:01:27] * dancy tries to read them carefully
[18:01:45] <rzl>	 break a leg!
[18:01:55] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3068.esams.wmnet with reason: host reimage
[18:02:50] <dancy>	 Looks like train is blocked on T344223 so I'm not pressing any buttons at this time.
[18:02:51] <stashbot>	 T344223: User logging in on mw-on-k8s triggers "RuntimeException: firejail is enabled, but cannot be found" - https://phabricator.wikimedia.org/T344223
[18:05:04] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3076.esams.wmnet with reason: host reimage
[18:05:09] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3068.esams.wmnet with reason: host reimage
[18:08:19] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3076.esams.wmnet with reason: host reimage
[18:09:31] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:09:44] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[18:10:29] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001"
[18:11:07] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:11:11] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[18:11:53] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:12:07] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.144 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:12:35] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 4.691 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:12:47] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:14:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:16:14] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[18:16:14] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3066.esams.wmnet with OS bullseye
[18:16:14] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[18:16:14] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3074.esams.wmnet with OS bullseye
[18:17:41] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001"
[18:17:41] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns3003.wikimedia.org with OS bullseye
[18:19:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:21:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:23:52] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10Fabfur)
[18:24:01] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] hiera: add new DNS host in esams, dns3004 [puppet] - 10https://gerrit.wikimedia.org/r/949088 (https://phabricator.wikimedia.org/T344174) (owner: 10Fabfur)
[18:24:57] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:25:13] <fabfur>	 starting reimage of dns3004 (DNS changes may fail during this time; in case of error, skip)
[18:26:13] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host dns3004.wikimedia.org with OS bullseye
[18:27:59] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:28:24] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002"
[18:29:47] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:30:19] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002"
[18:30:43] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002"
[18:30:43] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3068.esams.wmnet with OS bullseye
[18:32:31] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002"
[18:32:32] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3076.esams.wmnet with OS bullseye
[18:33:46] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10BCornwall)
[18:36:11] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3070.esams.wmnet with OS bullseye
[18:36:34] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3078.esams.wmnet with OS bullseye
[18:36:41] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:36:58] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns3004.wikimedia.org with OS bullseye
[18:37:12] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host dns3004.wikimedia.org with OS bullseye
[18:57:11] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:57:21] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3070.esams.wmnet with reason: host reimage
[18:57:52] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3078.esams.wmnet with reason: host reimage
[18:58:01] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:58:16] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dns3004.wikimedia.org with reason: host reimage
[19:01:30] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3070.esams.wmnet with reason: host reimage
[19:03:48] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns3004.wikimedia.org with reason: host reimage
[19:06:24] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3078.esams.wmnet with reason: host reimage
[19:06:39] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:06:55] <icinga-wm>	 PROBLEM - Recursive DNS on 185.15.59.2 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[19:07:42] <wikibugs>	 (03PS2) 10Ahmon Dancy: Update kask container image path [deployment-charts] - 10https://gerrit.wikimedia.org/r/913949 (https://phabricator.wikimedia.org/T335691)
[19:10:18] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] Update kask container image path [deployment-charts] - 10https://gerrit.wikimedia.org/r/913949 (https://phabricator.wikimedia.org/T335691) (owner: 10Ahmon Dancy)
[19:16:15] <icinga-wm>	 PROBLEM - Recursive DNS on 2a02:ec80:300:1:185:15:59:2 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[19:18:41] <icinga-wm>	 RECOVERY - Recursive DNS on 2a02:ec80:300:1:185:15:59:2 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[19:19:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:19:25] <icinga-wm>	 RECOVERY - Recursive DNS on 185.15.59.2 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[19:20:23] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:23:01] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:23:23] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10ssingh)
[19:25:39] <wikibugs>	 (03PS1) 10Ssingh: devices: add anycast_ and lvs_neigbhors for esams (bw27/by27) [homer/public] - 10https://gerrit.wikimedia.org/r/949100 (https://phabricator.wikimedia.org/T329219)
[19:26:00] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[19:26:43] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002"
[19:27:51] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] devices: add anycast_ and lvs_neigbhors for esams (bw27/by27) [homer/public] - 10https://gerrit.wikimedia.org/r/949100 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[19:28:21] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002"
[19:28:28] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3072.esams.wmnet with OS bullseye
[19:29:36] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3080.esams.wmnet with OS bullseye
[19:30:19] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merge flink-zk2002 DNS changes - sukhe@cumin2002"
[19:30:57] <logmsgbot>	 !log brett@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002"
[19:30:58] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3070.esams.wmnet with OS bullseye
[19:30:58] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002"
[19:30:58] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3078.esams.wmnet with OS bullseye
[19:31:06] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merge flink-zk2002 DNS changes - sukhe@cumin2002"
[19:31:06] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:32:00] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "manual trigger - cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002 - brett@cumin2002"
[19:32:43] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "manual trigger - cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002 - brett@cumin2002"
[19:33:33] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:33:58] <wikibugs>	 (03PS1) 10Bking: query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856)
[19:34:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking)
[19:36:01] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] esams: add new LVS high-traffic1 host, lvs3008 [puppet] - 10https://gerrit.wikimedia.org/r/949082 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[19:36:43] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:38:02] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs3008.esams.wmnet with OS bullseye
[19:40:50] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10colewhite)
[19:42:12] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10BCornwall)
[19:44:05] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001"
[19:44:29] <wikibugs>	 (03PS2) 10Ryan Kemper: query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking)
[19:45:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking)
[19:45:37] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001"
[19:45:38] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns3004.wikimedia.org with OS bullseye
[19:45:45] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10colewhite) Thanks @fkaelin!  Found the L3 signature. Good to go! Found based on the shell name and existing data entry.  The email is subaddressed making ldap search return false negative....
[19:46:02] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10colewhite)
[19:47:40] <wikibugs>	 (03PS3) 10Bking: query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856)
[19:47:54] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10Fabfur)
[19:48:00] <wikibugs>	 (03PS1) 10Cwhite: admin: add fab to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/948693 (https://phabricator.wikimedia.org/T343957)
[19:48:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking)
[19:49:18] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frdev1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T341869 (10Jgreen)
[19:49:44] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3072.esams.wmnet with reason: host reimage
[19:49:57] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10colewhite) a:03Mabualruz
[19:50:09] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission civi1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T341868 (10Jgreen)
[19:51:01] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3080.esams.wmnet with reason: host reimage
[19:53:22] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3072.esams.wmnet with reason: host reimage
[19:53:56] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Ricki Jay (WMDE) - https://phabricator.wikimedia.org/T343700 (10colewhite) >>! In T343700#9091877, @KFrancis wrote: > Please provide Ricki Jay's email address and I will start processing this request.  You may send it to kfrancis@wikimedia.org if you...
[19:55:47] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3080.esams.wmnet with reason: host reimage
[19:57:04] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3008.esams.wmnet with reason: host reimage
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T2000)
[20:00:05] <jouncebot>	 hmonroy and ryankemper: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:18] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:20] <sukhe>	 !log running dummy authdns-update
[20:01:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:35] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, sql_labm SSH key entry, Kerberos Principal, Team Shell (posix) membership for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10colewhite)
[20:01:46] <urbanecm>	 o/ I'm around if needed; hmonroy ryankemper do you plan on self-deploying, or should I go ahead?
[20:01:52] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3008.esams.wmnet with reason: host reimage
[20:02:22] <wikibugs>	 (03PS2) 10BCornwall: Release 0.36-2 for Bookworm [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/948672 (https://phabricator.wikimedia.org/T342154)
[20:02:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Release 0.36-2 for Bookworm [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/948672 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[20:03:47] <ryankemper>	 urbanecm: can go ahead with mine!
[20:04:17] <urbanecm>	 Feel free to proceed :)
[20:05:54] <ryankemper>	 ack, rolling in a couple mins
[20:07:33] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, sql_labm SSH key entry, Kerberos Principal, Team Shell (posix) membership for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10colewhite) Hi and welcome!  Please help me confirm the ssh key out-of-band (off phabricator) by...
[20:09:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ryankemper@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T343820) (owner: 10Ryan Kemper)
[20:10:42] <wikibugs>	 (03Merged) 10jenkins-bot: elastic: allow only 1 enwiki_content per host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T343820) (owner: 10Ryan Kemper)
[20:11:10] <logmsgbot>	 !log ryankemper@deploy1002 Started scap: Backport for [[gerrit:833861|elastic: allow only 1 enwiki_content per host (T343820)]]
[20:11:14] <stashbot>	 T343820: Retune enwiki_content shard settings - https://phabricator.wikimedia.org/T343820
[20:12:03] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] esams: add new LVS secondary host, lvs3010 [puppet] - 10https://gerrit.wikimedia.org/r/949083 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[20:12:11] <wikibugs>	 (03PS2) 10Ssingh: esams: add new LVS secondary host, lvs3010 [puppet] - 10https://gerrit.wikimedia.org/r/949083 (https://phabricator.wikimedia.org/T344174)
[20:12:40] <wikibugs>	 (03CR) 10Ssingh: [V: 03+2] esams: add new LVS secondary host, lvs3010 [puppet] - 10https://gerrit.wikimedia.org/r/949083 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh)
[20:12:49] <logmsgbot>	 !log ryankemper@deploy1002 ryankemper: Backport for [[gerrit:833861|elastic: allow only 1 enwiki_content per host (T343820)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:13:48] <logmsgbot>	 !log ryankemper@deploy1002 ryankemper: Continuing with sync
[20:14:51] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs3010.esams.wmnet with OS bullseye
[20:16:19] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[20:17:29] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[20:17:29] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3072.esams.wmnet with OS bullseye
[20:18:16] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[20:19:13] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[20:19:13] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3080.esams.wmnet with OS bullseye
[20:19:35] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[20:20:29] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[20:20:29] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs3008.esams.wmnet with OS bullseye
[20:20:36] <logmsgbot>	 !log ryankemper@deploy1002 Finished scap: Backport for [[gerrit:833861|elastic: allow only 1 enwiki_content per host (T343820)]] (duration: 09m 25s)
[20:20:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:20:40] <stashbot>	 T343820: Retune enwiki_content shard settings - https://phabricator.wikimedia.org/T343820
[20:25:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:31:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:33:52] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3010.esams.wmnet with reason: host reimage
[20:36:47] <ebernhardson>	 !log T342444 start cirrussearch reindex of all wikis to enable new text analysis components from mwmaint1002
[20:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:50] <stashbot>	 T342444: Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, and word_break_helper - https://phabricator.wikimedia.org/T342444
[20:37:19] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3010.esams.wmnet with reason: host reimage
[20:37:49] <hmonroy>	 urbanecm My apologies! I got carried away in some task. Should I reschedule?
[20:38:44] <hmonroy>	 urbanecm: ^^
[20:39:36] <urbanecm>	 I'm afk now unfortunately, so yes please hmonroy. Or, feel free to self-deploy if you can :)
[20:40:19] <hmonroy>	 urbanecm: Sounds good. Thank you!!
[20:50:26] <wikibugs>	 (03PS1) 10Bking: spdx.rb: Skip SPDX enforcement of txt files [puppet] - 10https://gerrit.wikimedia.org/r/949112 (https://phabricator.wikimedia.org/T344291)
[20:52:33] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:54:53] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[20:55:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:55:51] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:55:53] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[20:55:53] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs3010.esams.wmnet with OS bullseye
[20:57:06] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Traffic: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10ssingh)
[20:57:58] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Traffic: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10ssingh)
[21:00:04] <wikibugs>	 (03PS1) 10Ssingh: esams/ntp: point to dns3003 [dns] - 10https://gerrit.wikimedia.org/r/949113 (https://phabricator.wikimedia.org/T329219)
[21:03:17] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:04:09] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:04:48] <wikibugs>	 (03Abandoned) 10Ssingh: hiera: enable single backend on esams and switch to F4-U hardware config [puppet] - 10https://gerrit.wikimedia.org/r/948581 (https://phabricator.wikimedia.org/T288106) (owner: 10Ssingh)
[21:05:11] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:07:08] <wikibugs>	 (03CR) 10Ssingh: "Merging this Wednesday morning." [homer/public] - 10https://gerrit.wikimedia.org/r/949100 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[21:07:40] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[21:07:43] <logmsgbot>	 !log robh@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[21:09:12] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[21:13:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[21:13:31] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[21:14:04] <rzl>	 yo
[21:14:41] <jhathaway>	 yo yo
[21:15:35] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pdus - robh@cumin1001"
[21:16:21] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pdus - robh@cumin1001"
[21:16:21] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:16:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:17:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:18:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) resolved: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[21:18:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) resolved: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[21:21:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:21:45] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:25:15] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:29:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes1011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1011 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:32:19] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1012.eqiad.wmnet with OS bullseye
[21:32:56] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1013.eqiad.wmnet with OS bullseye
[21:37:37] <wikibugs>	 SRE, ops-knams, DC-Ops: Q4:knams: PDU installation - https://phabricator.wikimedia.org/T334280 (RobH)
[21:47:33] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1012.eqiad.wmnet with reason: host reimage
[21:47:46] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1013.eqiad.wmnet with reason: host reimage
[21:50:29] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1012.eqiad.wmnet with reason: host reimage
[21:53:02] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1013.eqiad.wmnet with reason: host reimage
[21:55:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:00:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:06:28] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Ricki Jay (WMDE) - https://phabricator.wikimedia.org/T343700 (KFrancis) Thank you so much!  I've sent out the agreement for signatures.
[22:08:28] <wikibugs>	 (PS4) Bking: query_service: let puppet manage whitelist [puppet] - https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856)
[22:09:01] <wikibugs>	 (CR) CI reject: [V: -1] query_service: let puppet manage whitelist [puppet] - https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[22:10:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:14:45] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:22:20] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1013.eqiad.wmnet with OS bullseye
[22:27:55] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1012.eqiad.wmnet with OS bullseye
[22:32:15] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:35:09] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:38:35] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[22:39:45] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:49:33] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:49:47] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:50:55] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50421 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:51:09] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.403 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:52:49] <jinxer-wm>	 (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[22:56:45] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:56:51] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[22:57:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[22:57:49] <jinxer-wm>	 (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[22:58:34] <wikibugs>	 (PS2) Tim Starling: Set wikidiff2 maxSplitSize = 10 on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/947986 (https://phabricator.wikimedia.org/T341754)
[23:02:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[23:11:29] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns3003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[23:12:45] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:13:13] <wikibugs>	 (CR) HMonroy: [C: +2] Set wikidiff2 maxSplitSize = 10 on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/947986 (https://phabricator.wikimedia.org/T341754) (owner: Tim Starling)
[23:14:00] <wikibugs>	 (Merged) jenkins-bot: Set wikidiff2 maxSplitSize = 10 on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/947986 (https://phabricator.wikimedia.org/T341754) (owner: Tim Starling)
[23:14:01] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:14:01] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:25:01] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.021 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:26:07] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:26:09] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:26:30] <logmsgbot>	 !log hmonroy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Set wikidiff2 maxSplitSize = 10 on group0 wikis T341754 (duration: 07m 39s)
[23:26:34] <stashbot>	 T341754: Deploy wikidiff2 paragraph split detection - https://phabricator.wikimedia.org/T341754
[23:30:25] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:45:50] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users, sql_labm SSH key entry, Kerberos Principal, Team Shell (posix) membership for Omari Sefu - https://phabricator.wikimedia.org/T344257 (kzimmerman) Approved as Omari's manager, thank you!
[23:57:07] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state