[00:26:42] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3081.esams.wmnet with OS bullseye [00:36:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:38:22] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/948687 [00:38:28] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/948687 (owner: 10TrainBranchBot) [00:41:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:46:08] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:53:09] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/948687 (owner: 10TrainBranchBot) [00:56:24] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:01:16] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3081.esams.wmnet with OS bullseye [01:03:48] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [01:08:34] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T344213 (10phaultfinder) [01:13:57] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T344213 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [01:18:24] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3081.esams.wmnet with OS bullseye [01:23:18] (03PS1) 10Ssingh: cp3073: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948685 (https://phabricator.wikimedia.org/T327438) [01:23:20] (03PS1) 10Ssingh: cp3071: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948706 (https://phabricator.wikimedia.org/T327438) [01:30:04] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:30:20] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:30] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:39:35] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3081.esams.wmnet with OS bullseye [01:50:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:57:04] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:57:34] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T0200) [02:07:52] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.22 [core] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/948688 (https://phabricator.wikimedia.org/T343724) [02:07:58] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.22 [core] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/948688 (https://phabricator.wikimedia.org/T343724) (owner: 10TrainBranchBot) [02:11:38] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:23:07] (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.22 [core] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/948688 (https://phabricator.wikimedia.org/T343724) (owner: 10TrainBranchBot) [02:31:36] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: 
cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:38] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:48:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:55:44] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T0300) [03:01:30] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948707 (https://phabricator.wikimedia.org/T343724) [03:01:32] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948707 (https://phabricator.wikimedia.org/T343724) (owner: 10TrainBranchBot) [03:02:12] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948707 (https://phabricator.wikimedia.org/T343724) (owner: 10TrainBranchBot) [03:02:43] 
!log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.22 refs T343724 [03:02:47] T343724: 1.41.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T343724 [03:24:43] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:28:46] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:55:48] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:56:25] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.22 refs T343724 (duration: 53m 42s) [03:56:29] T343724: 1.41.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T343724 [03:58:41] !log mwpresync@deploy1002 Pruned MediaWiki: 1.41.0-wmf.19 (duration: 02m 13s) [04:24:23] (03PS1) 10Majavah: Add a comment why PdfHandler does not use Shellbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948710 [04:24:59] (03CR) 10Jforrester: [C: 03+1] Add a comment why PdfHandler does not use Shellbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948710 (owner: 10Majavah) [04:25:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948710 (owner: 10Majavah) [04:25:58] (03Merged) 10jenkins-bot: Add a comment why PdfHandler does not use Shellbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948710 (owner: 10Majavah) [04:26:32] !log taavi@deploy1002 Started scap: Backport for [[gerrit:948710|Add a comment why PdfHandler does not use Shellbox]] [04:28:14] 
!log taavi@deploy1002 taavi: Backport for [[gerrit:948710|Add a comment why PdfHandler does not use Shellbox]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [04:28:31] !log taavi@deploy1002 taavi: Continuing with sync [04:33:10] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:34:57] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:948710|Add a comment why PdfHandler does not use Shellbox]] (duration: 08m 24s) [04:42:24] (03CR) 10Majavah: [C: 03+1] "Aaaah the inconsistent spacing between `proxy_hide_header` and the header. I think this patch is fine, though." [puppet] - 10https://gerrit.wikimedia.org/r/940506 (owner: 10Lucas Werkmeister) [04:57:00] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:18:27] (03PS1) 10Ayounsi: realm.pp fix new esams range [puppet] - 10https://gerrit.wikimedia.org/r/948713 (https://phabricator.wikimedia.org/T329219) [05:31:22] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:38:10] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:39:40] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:56:56] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T0600) [06:00:05] kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T0600). [06:28:08] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:34] (03CR) 10Ayounsi: [C: 03+2] realm.pp fix new esams range [puppet] - 10https://gerrit.wikimedia.org/r/948713 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [06:30:18] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:44] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:38:46] (03PS1) 10Ayounsi: Revert "pybal: Make check conform to the Nagios plugin API" [puppet] - 10https://gerrit.wikimedia.org/r/948594 [06:41:00] (03CR) 10Ayounsi: [C: 03+2] Revert "pybal: Make check conform to the Nagios plugin API" [puppet] - 10https://gerrit.wikimedia.org/r/948594 (owner: 10Ayounsi) [06:42:30] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [06:46:31] (03PS1) 10Dreamy Jazz: clienthints: Collect Client Hints data on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948985 (https://phabricator.wikimedia.org/T341110) [06:47:55] jouncebot: nowandnext [06:47:55] For the next 0 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T0600) [06:47:55] In 0 hour(s) and 12 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T0700) [06:48:54] looks like the mw infra window is unused. so I'm starting the backport window a bit early [06:49:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948985 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [06:49:54] (03Merged) 10jenkins-bot: clienthints: Collect Client Hints data on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948985 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [06:50:33] !log taavi@deploy1002 Started scap: Backport for [[gerrit:948985|clienthints: Collect Client Hints data on group0 wikis (T341110)]] [06:50:37] T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110 [06:52:16] !log taavi@deploy1002 taavi and dreamyjazz: Backport for [[gerrit:948985|clienthints: Collect Client Hints data on group0 wikis (T341110)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [06:55:40] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:55:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single 
for host bast2003.wikimedia.org [06:57:26] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [06:59:25] !log taavi@deploy1002 taavi and dreamyjazz: Continuing with sync [07:00:05] Amir1, Urbanecm, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T0700). [07:00:05] aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:21] (03CR) 10JMeybohm: "I would say that, before doing this again, there should be at least a notification to ops@ informing everybody of the change and the requi" [puppet] - 10https://gerrit.wikimedia.org/r/948125 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron) [07:00:27] o/ [07:00:27] o/ [07:00:52] I'll deploy [07:00:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host lists2001.codfw.wmnet [07:01:06] aanzx: your patch is marked as a draft [07:01:14] (03CR) 10JMeybohm: [C: 03+1] miscweb: add wikiworkshop and reasearch-landing-page to staging wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/948539 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto) [07:02:23] (03PS5) 10Anzx: jawiki: reassign the changetags user right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948590 (https://phabricator.wikimedia.org/T344150) [07:02:38] taavi: now set as active [07:03:23] (03CR) 10Filippo Giunchedi: [C: 03+1] trafficserver: Use svc urls for eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/948624 (https://phabricator.wikimedia.org/T326657) (owner: 10BCornwall) [07:03:58] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [07:04:16] thx. 
will deploy that once Dreamy_Jazz's patch is synced [07:04:54] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [07:05:36] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [07:05:57] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:948985|clienthints: Collect Client Hints data on group0 wikis (T341110)]] (duration: 15m 23s) [07:05:58] (03CR) 10Majavah: [C: 03+2] jawiki: reassign the changetags user right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948590 (https://phabricator.wikimedia.org/T344150) (owner: 10Anzx) [07:06:00] T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110 [07:06:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:06:37] (03Merged) 10jenkins-bot: jawiki: reassign the changetags user right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948590 (https://phabricator.wikimedia.org/T344150) (owner: 10Anzx) [07:06:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lists2001.codfw.wmnet [07:07:12] !log taavi@deploy1002 Started scap: Backport for [[gerrit:948590|jawiki: reassign the changetags user right (T344150)]] [07:07:17] T344150: Reassign the changetags user right on jawiki - https://phabricator.wikimedia.org/T344150 [07:08:48] !log taavi@deploy1002 anzx and taavi: Backport for [[gerrit:948590|jawiki: reassign the changetags user right (T344150)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:08:57] aanzx: please test [07:09:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host titan2001.codfw.wmnet [07:09:27] testing [07:10:26] (03CR) 10JMeybohm: [C: 03+2] hieradata: complete cadvisor rollout on k8s [puppet] - 
10https://gerrit.wikimedia.org/r/942426 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [07:11:45] taavi: looks good [07:12:02] !log taavi@deploy1002 anzx and taavi: Continuing with sync [07:12:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/948562 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [07:15:05] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cp3081 - ayounsi@cumin1001" [07:15:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2001.codfw.wmnet [07:16:03] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cp3081 - ayounsi@cumin1001" [07:16:45] (03CR) 10Muehlenhoff: [C: 03+2] Add insetup variant for undefined ownership [puppet] - 10https://gerrit.wikimedia.org/r/869777 (owner: 10Muehlenhoff) [07:16:48] PROBLEM - Check systemd state on bast2003 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:17:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host titan2002.codfw.wmnet [07:18:12] (03PS7) 10Sohom Datta: Enable EditInSequence on all wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883 (https://phabricator.wikimedia.org/T308098) [07:18:18] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:948590|jawiki: reassign the changetags user right (T344150)]] (duration: 11m 05s) [07:18:21] T344150: Reassign the changetags user right on jawiki - https://phabricator.wikimedia.org/T344150 [07:18:50] aanzx: done [07:18:55] thanks Taavi [07:19:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883 (https://phabricator.wikimedia.org/T308098) (owner: 10Sohom Datta) [07:19:45] (03Merged) 
10jenkins-bot: Enable EditInSequence on all wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883 (https://phabricator.wikimedia.org/T308098) (owner: 10Sohom Datta) [07:20:14] !log taavi@deploy1002 Started scap: Backport for [[gerrit:947883|Enable EditInSequence on all wikisources (T308098)]] [07:20:17] T308098: Integrate edit-in-sequence inside ProofreadPage - https://phabricator.wikimedia.org/T308098 [07:20:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2002.codfw.wmnet [07:21:53] !log taavi@deploy1002 soda and taavi: Backport for [[gerrit:947883|Enable EditInSequence on all wikisources (T308098)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:24:43] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:25:31] 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10JMeybohm) [07:27:12] !log taavi@deploy1002 soda and taavi: Continuing with sync [07:29:13] !log restarting wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph on wdqs2012 [07:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:40] RECOVERY - Check systemd state on wdqs2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:30:04] RECOVERY - Check systemd state on bast2003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:33:43] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:947883|Enable EditInSequence on all wikisources (T308098)]] (duration: 13m 29s) [07:33:47] T308098: Integrate edit-in-sequence inside ProofreadPage - https://phabricator.wikimedia.org/T308098 [07:34:00] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:34:43] (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:37:11] (03CR) 10Umherirrender: "Failure is known as T344191 and fix is Ia2a8a24a14f7af7e18928da9c7cc412829be8e20" [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/948687 (owner: 10TrainBranchBot) [07:47:48] (03CR) 10Filippo Giunchedi: [C: 03+2] aux: add grpc/http ports for jaeger collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/946551 (https://phabricator.wikimedia.org/T343302) (owner: 10Filippo Giunchedi) [07:47:53] (03PS2) 10Filippo Giunchedi: aux: add grpc/http ports for jaeger collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/946551 (https://phabricator.wikimedia.org/T343302) [07:49:02] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [07:49:14] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [07:49:27] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [07:52:03] (03CR) 10Jelto: [C: 03+2] Class gitlab: Use gitlab-settings v1.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/948665 (owner: 10Ahmon Dancy) [07:53:32] (03PS1) 10Zabe: Add messages for Pa'O Wiktionary (blkwiktionary) [extensions/WikimediaMessages] 
(wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948595 (https://phabricator.wikimedia.org/T343540) [07:53:39] (03CR) 10Zabe: [C: 03+2] Add messages for Pa'O Wiktionary (blkwiktionary) [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948595 (https://phabricator.wikimedia.org/T343540) (owner: 10Zabe) [07:53:59] (03PS1) 10Zabe: Add messages for Sundanese Wikisource (suwikisource) [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948596 (https://phabricator.wikimedia.org/T343539) [07:54:16] (03PS2) 10Zabe: Add messages for Sundanese Wikisource (suwikisource) [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948596 (https://phabricator.wikimedia.org/T343539) [07:54:21] (03CR) 10Zabe: [C: 03+2] Add messages for Sundanese Wikisource (suwikisource) [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948596 (https://phabricator.wikimedia.org/T343539) (owner: 10Zabe) [07:55:48] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cr2-esams mgmt - ayounsi@cumin1001" [07:56:24] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:57:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cr2-esams mgmt - ayounsi@cumin1001" [07:57:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:07:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948595 (https://phabricator.wikimedia.org/T343540) (owner: 10Zabe) [08:07:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using 
scap backport" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948596 (https://phabricator.wikimedia.org/T343539) (owner: 10Zabe) [08:08:36] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:08:39] (03CR) 10Ayounsi: [C: 03+2] Rename mr devices at Amsterdam POP sites [homer/public] - 10https://gerrit.wikimedia.org/r/948635 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney) [08:08:55] (03Merged) 10jenkins-bot: Add messages for Pa'O Wiktionary (blkwiktionary) [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948595 (https://phabricator.wikimedia.org/T343540) (owner: 10Zabe) [08:08:57] (03Merged) 10jenkins-bot: Add messages for Sundanese Wikisource (suwikisource) [extensions/WikimediaMessages] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948596 (https://phabricator.wikimedia.org/T343539) (owner: 10Zabe) [08:09:12] (03Merged) 10jenkins-bot: Rename mr devices at Amsterdam POP sites [homer/public] - 10https://gerrit.wikimedia.org/r/948635 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney) [08:09:28] !log zabe@deploy1002 Started scap: Backport for [[gerrit:948595|Add messages for Pa'O Wiktionary (blkwiktionary) (T343540)]], [[gerrit:948596|Add messages for Sundanese Wikisource (suwikisource) (T343539)]] [08:09:34] T343539: Create Wikisource Sundanese - https://phabricator.wikimedia.org/T343539 [08:09:34] T343540: Create Wiktionary Pa'O - https://phabricator.wikimedia.org/T343540 [08:16:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:16:40] !log 
filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [08:16:40] (03CR) 10Muehlenhoff: [C: 03+2] Add ganeti config for knams [puppet] - 10https://gerrit.wikimedia.org/r/948129 (owner: 10Muehlenhoff) [08:20:52] (03PS1) 10Ayounsi: More cr3-knams -> cr2-esams and mr1 -> old [homer/public] - 10https://gerrit.wikimedia.org/r/948991 (https://phabricator.wikimedia.org/T329219) [08:21:21] (03CR) 10CI reject: [V: 04-1] More cr3-knams -> cr2-esams and mr1 -> old [homer/public] - 10https://gerrit.wikimedia.org/r/948991 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [08:21:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:21:50] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:23:56] (03CR) 10Muehlenhoff: [C: 03+2] nftables::file: Expand prefix to three digits [puppet] - 10https://gerrit.wikimedia.org/r/945586 (owner: 10Muehlenhoff) [08:28:02] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[08:29:12] (03PS2) 10Ayounsi: More esams router renaming [homer/public] - 10https://gerrit.wikimedia.org/r/948991 (https://phabricator.wikimedia.org/T329219) [08:30:45] !log zabe@deploy1002 zabe: Backport for [[gerrit:948595|Add messages for Pa'O Wiktionary (blkwiktionary) (T343540)]], [[gerrit:948596|Add messages for Sundanese Wikisource (suwikisource) (T343539)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [08:30:56] T343539: Create Wikisource Sundanese - https://phabricator.wikimedia.org/T343539 [08:30:57] T343540: Create Wiktionary Pa'O - https://phabricator.wikimedia.org/T343540 [08:31:14] (03CR) 10Jelto: [C: 03+2] miscweb: add wikiworkshop and reasearch-landing-page to staging wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/948539 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto) [08:31:19] !log zabe@deploy1002 zabe: Continuing with sync [08:32:00] (03Merged) 10jenkins-bot: miscweb: add wikiworkshop and reasearch-landing-page to staging wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/948539 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto) [08:32:08] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM thanks!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/948991 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [08:32:46] (03CR) 10Ayounsi: [C: 03+2] More esams router renaming [homer/public] - 10https://gerrit.wikimedia.org/r/948991 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [08:33:06] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:33:19] (03Merged) 10jenkins-bot: More esams router renaming [homer/public] - 10https://gerrit.wikimedia.org/r/948991 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [08:35:26] (03CR) 10Kamila Součková: [C: 03+1] "LGTM, just note that a lot of serviceops people are on holiday today." [deployment-charts] - 10https://gerrit.wikimedia.org/r/948136 (owner: 10Elukey) [08:36:34] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [08:37:34] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [08:42:55] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:948595|Add messages for Pa'O Wiktionary (blkwiktionary) (T343540)]], [[gerrit:948596|Add messages for Sundanese Wikisource (suwikisource) (T343539)]] (duration: 33m 26s) [08:42:59] T343539: Create Wikisource Sundanese - https://phabricator.wikimedia.org/T343539 [08:43:00] T343540: Create Wiktionary Pa'O - https://phabricator.wikimedia.org/T343540 [08:46:22] !log Draining ml2002 for kubelet partition resize [08:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:36] (03PS1) 10Filippo Giunchedi: aux: set calico typha to two replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/948994 (https://phabricator.wikimedia.org/T333302) [08:54:20] (03PS1) 10Muehlenhoff: Add site.pp entries for new Ganeti nodes in knams [puppet] - 10https://gerrit.wikimedia.org/r/948995 [08:54:55] (03CR) 10CI reject: [V: 
04-1] Add site.pp entries for new Ganeti nodes in knams [puppet] - 10https://gerrit.wikimedia.org/r/948995 (owner: 10Muehlenhoff) [08:55:17] !log Draining ml2003 for kubelet partition resize [08:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:34] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:55:42] (03PS2) 10Filippo Giunchedi: aux: set calico typha to one replica [deployment-charts] - 10https://gerrit.wikimedia.org/r/948994 (https://phabricator.wikimedia.org/T333302) [08:56:53] (03PS2) 10Muehlenhoff: Add site.pp entries for new Ganeti nodes in knams [puppet] - 10https://gerrit.wikimedia.org/r/948995 [08:57:28] (03CR) 10CI reject: [V: 04-1] Add site.pp entries for new Ganeti nodes in knams [puppet] - 10https://gerrit.wikimedia.org/r/948995 (owner: 10Muehlenhoff) [09:01:21] (03CR) 10JMeybohm: [C: 03+1] "We need to make sure this gets reverted as soon as there is a third node in aux!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/948994 (https://phabricator.wikimedia.org/T333302) (owner: 10Filippo Giunchedi) [09:01:28] (03PS1) 10David Caro: openstack: use the right proxy names [alerts] - 10https://gerrit.wikimedia.org/r/948996 [09:03:01] (03CR) 10Filippo Giunchedi: [C: 03+2] aux: set calico typha to one replica [deployment-charts] - 10https://gerrit.wikimedia.org/r/948994 (https://phabricator.wikimedia.org/T333302) (owner: 10Filippo Giunchedi) [09:03:17] (03PS3) 10Muehlenhoff: Add site.pp entries for new Ganeti nodes in knams [puppet] - 10https://gerrit.wikimedia.org/r/948995 [09:04:37] (03CR) 10David Caro: [C: 03+2] openstack: use the right proxy names [alerts] - 10https://gerrit.wikimedia.org/r/948996 (owner: 10David Caro) [09:04:49] (03CR) 10Muehlenhoff: [C: 03+2] Add site.pp entries for new Ganeti nodes in knams [puppet] - 10https://gerrit.wikimedia.org/r/948995 (owner: 10Muehlenhoff) [09:05:18] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [09:05:50] (03Merged) 10jenkins-bot: openstack: use the right proxy names [alerts] - 10https://gerrit.wikimedia.org/r/948996 (owner: 10David Caro) [09:08:14] (03CR) 10Stevemunene: [C: 03+2] Grant Kara Payne shell access [puppet] - 10https://gerrit.wikimedia.org/r/948568 (https://phabricator.wikimedia.org/T342546) (owner: 10Stevemunene) [09:08:41] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp3079.esams.wmnet with OS bullseye [09:10:24] (03CR) 10Stevemunene: [C: 03+2] airflow-wmde: Add Kara Payne to analytics-wmde [puppet] - 10https://gerrit.wikimedia.org/r/940863 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [09:11:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3005.esams.wmnet with OS bullseye [09:11:12] 10SRE, 10Wikimedia-Mailing-lists: Shut down two en-arbcom mailing lists (audit, appeals-en) - https://phabricator.wikimedia.org/T344112 (10Legoktm) The audit list is arbcom-audit-en@. 
[09:11:44] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti3005.esams.wmnet with OS bullseye [09:15:31] 10SRE, 10Wikimedia-Mailing-lists: Shut down two en-arbcom mailing lists (audit, appeals-en) - https://phabricator.wikimedia.org/T344112 (10Legoktm) 05Open→03Resolved >>! In T344112#9092517, @Legoktm wrote: > The audit list is arbcom-audit-en@. Which I've now archived. So I think we're all set here! [09:15:32] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3079.esams.wmnet with OS bullseye [09:15:54] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp3079.esams.wmnet with OS bullseye [09:16:37] (03PS2) 10Muehlenhoff: Add a Firewall::Portrange define [puppet] - 10https://gerrit.wikimedia.org/r/947316 [09:17:03] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:18:00] 10sre-alert-triage, 10Machine-Learning-Team, 10Patch-For-Review: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (10klausman) I have done ml2002 and ml2003 today (two machines to force some pods back... 
[09:20:00] (03PS1) 10Jelto: miscweb: add wikiworkshop and reasearch-landing-page to eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/948998 (https://phabricator.wikimedia.org/T334511) [09:25:44] (03CR) 10Fabfur: [C: 03+1] "ok" [puppet] - 10https://gerrit.wikimedia.org/r/948685 (https://phabricator.wikimedia.org/T327438) (owner: 10Ssingh) [09:28:08] (03CR) 10Ssingh: [C: 03+2] cp3073: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948685 (https://phabricator.wikimedia.org/T327438) (owner: 10Ssingh) [09:28:38] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3005.esams.wmnet with OS bullseye [09:30:15] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3073.esams.wmnet with OS bullseye [09:30:37] (03PS1) 10Ayounsi: Add dns3003 to asw1-by27 anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/948999 (https://phabricator.wikimedia.org/T329219) [09:31:07] (03CR) 10CI reject: [V: 04-1] Add dns3003 to asw1-by27 anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/948999 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [09:31:53] (03PS2) 10Ayounsi: Add dns3003 to asw1-by27 anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/948999 (https://phabricator.wikimedia.org/T329219) [09:34:11] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1096.eqiad.wmnet with OS bullseye [09:34:35] (03CR) 10Ssingh: [C: 03+1] "Looks good, thanks!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/948999 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [09:34:45] (03PS1) 10Ayounsi: esams: remove profile::bird::neighbors_list [puppet] - 10https://gerrit.wikimedia.org/r/949000 (https://phabricator.wikimedia.org/T329219) [09:35:17] (03CR) 10Ayounsi: [C: 03+2] Add dns3003 to asw1-by27 anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/948999 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [09:35:49] (03Merged) 10jenkins-bot: Add dns3003 to asw1-by27 anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/948999 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [09:37:45] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3079.esams.wmnet with reason: host reimage [09:38:56] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/949000/42893/dns3001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/949000 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [09:40:03] (03CR) 10Ssingh: [C: 03+2] cp3071: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948706 (https://phabricator.wikimedia.org/T327438) (owner: 10Ssingh) [09:40:32] (03PS2) 10Ssingh: cp3071: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948706 (https://phabricator.wikimedia.org/T327438) [09:41:11] (03CR) 10Ssingh: [V: 03+2] cp3071: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948706 (https://phabricator.wikimedia.org/T327438) (owner: 10Ssingh) [09:41:49] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp3079.esams.wmnet with reason: host reimage [09:43:19] (03CR) 10Ayounsi: esams: remove profile::bird::neighbors_list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949000 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [09:43:26] 
!log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3071.esams.wmnet with OS bullseye [09:44:00] (03CR) 10Ssingh: [C: 03+1] esams: remove profile::bird::neighbors_list [puppet] - 10https://gerrit.wikimedia.org/r/949000 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [09:44:18] (03CR) 10Ayounsi: [C: 03+2] esams: remove profile::bird::neighbors_list [puppet] - 10https://gerrit.wikimedia.org/r/949000 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [09:49:05] (03PS3) 10Ssingh: cp3079: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948652 (https://phabricator.wikimedia.org/T327438) [09:49:51] (03CR) 10Fabfur: [C: 03+1] cp3079: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948652 (https://phabricator.wikimedia.org/T327438) (owner: 10Ssingh) [09:49:55] (03CR) 10Fabfur: [C: 03+2] cp3079: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948652 (https://phabricator.wikimedia.org/T327438) (owner: 10Ssingh) [09:50:56] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp3079.esams.wmnet with OS bullseye [09:51:56] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti3005.esams.wmnet with OS bullseye [09:52:19] (03PS1) 10Stevemunene: airflow-wmde: Create analytics-wmde airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) [09:52:29] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp3079.esams.wmnet with OS bullseye [09:52:50] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp3079.esams.wmnet with OS bullseye [09:53:12] (03CR) 10CI reject: [V: 04-1] airflow-wmde: Create analytics-wmde airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) 
[09:54:01] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp3079.esams.wmnet with OS bullseye [09:55:34] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:03] (03PS1) 10JMeybohm: Remove podAntiAffinity for calico-typha on aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/949002 (https://phabricator.wikimedia.org/T292077) [09:58:32] (03PS2) 10JMeybohm: Remove podAntiAffinity for calico-typha on aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/949002 (https://phabricator.wikimedia.org/T344230) [09:59:44] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove podAntiAffinity for calico-typha on aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/949002 (https://phabricator.wikimedia.org/T344230) (owner: 10JMeybohm) [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T1000) [10:00:32] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (10Stevemunene) The changes have been merged and @karapayneWMDE now has shell access and is a member of `analytics-wmde-users` [10:00:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [10:01:09] (03PS1) 10JMeybohm: Revert "aux: set calico typha to one replica" [deployment-charts] - 10https://gerrit.wikimedia.org/r/948597 (https://phabricator.wikimedia.org/T333302) [10:01:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3005.esams.wmnet with OS bullseye [10:02:08] jouncebot: 
nowandnext [10:02:08] For the next 0 hour(s) and 57 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T1000) [10:02:08] In 1 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T1200) [10:02:50] (03PS3) 10JMeybohm: Remove podAntiAffinity for calico-typha on aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/949002 (https://phabricator.wikimedia.org/T344230) [10:04:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3073.esams.wmnet with reason: host reimage [10:05:24] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "aux: set calico typha to one replica" [deployment-charts] - 10https://gerrit.wikimedia.org/r/948597 (https://phabricator.wikimedia.org/T333302) (owner: 10JMeybohm) [10:05:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [10:06:10] PROBLEM - Check systemd state on kubestagemaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:16] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3071.esams.wmnet with reason: host reimage [10:09:08] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp3073.esams.wmnet with reason: host reimage [10:11:37] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp3071.esams.wmnet with reason: host reimage [10:12:26] (03PS2) 10Stevemunene: airflow-wmde: Create analytics-wmde airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 
(https://phabricator.wikimedia.org/T340648) [10:12:36] 10sre-alert-triage, 10Machine-Learning-Team, 10Patch-For-Review: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (10JMeybohm) I might be missing something here, but what issues did you have with the... [10:13:20] (03CR) 10CI reject: [V: 04-1] airflow-wmde: Create analytics-wmde airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [10:14:36] (03PS1) 10Muehlenhoff: Add dummy certs for ganeti02.svc.esams.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/949004 [10:16:12] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3079.esams.wmnet with reason: host reimage [10:17:55] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add dummy certs for ganeti02.svc.esams.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/949004 (owner: 10Muehlenhoff) [10:18:57] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/947316 (owner: 10Muehlenhoff) [10:19:43] (03PS1) 10Ssingh: cp306[79], cp307[57]: update site.pp and related configs for cp roles [puppet] - 10https://gerrit.wikimedia.org/r/949006 (https://phabricator.wikimedia.org/T344174) [10:19:51] (03CR) 10CI reject: [V: 04-1] cp306[79], cp307[57]: update site.pp and related configs for cp roles [puppet] - 10https://gerrit.wikimedia.org/r/949006 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [10:20:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:20:15] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) 
for 2:00:00 on cp3079.esams.wmnet with reason: host reimage [10:21:09] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3005.esams.wmnet with reason: host reimage [10:21:56] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949006 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [10:24:20] (03PS1) 10Ssingh: cp306[79], cp307[57]: update site.pp and related configs for cp roles [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) [10:24:31] (03CR) 10CI reject: [V: 04-1] cp306[79], cp307[57]: update site.pp and related configs for cp roles [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [10:24:40] sigh [10:24:48] (03Abandoned) 10Ssingh: cp306[79], cp307[57]: update site.pp and related configs for cp roles [puppet] - 10https://gerrit.wikimedia.org/r/949006 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [10:24:52] PROBLEM - Check whether ferm is active by checking the default input chain on kubestagemaster1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:25:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:25:15] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti3005.esams.wmnet with reason: host reimage [10:27:11] (03PS1) 10Ssingh: hiera: add new DNS host in esams, dns3003 [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) [10:27:20] (03CR) 10CI reject: [V: 04-1] hiera: add new DNS host in esams, 
dns3003 [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [10:27:23] ok [10:27:27] so something is up with CI [10:29:21] GitCommandError: Cmd('git') failed due to: exit code(128) cmdline: git fetch --force --tags -v origin stderr: 'fatal: Could not read from remote repository. [10:29:35] contint1002 [10:30:25] (03PS2) 10Fabfur: cp306[79], cp307[57]: update site.pp and related configs for cp roles [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [10:30:34] (03CR) 10jenkins-bot: cp306[79], cp307[57]: update site.pp and related configs for cp roles [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [10:31:56] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [10:32:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [10:32:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3073.esams.wmnet with OS bullseye [10:33:22] (03PS1) 10Muehlenhoff: Add cert for ganeti02.svc.esams.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/949011 [10:33:31] (03CR) 10CI reject: [V: 04-1] Add cert for ganeti02.svc.esams.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/949011 (owner: 10Muehlenhoff) [10:33:50] moritzm: ^ broken CI [10:34:41] (03PS2) 10Muehlenhoff: Add cert for ganeti02.svc.esams.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/949011 [10:34:49] lovely :-) [10:34:50] (03CR) 10CI reject: [V: 04-1] Add cert for ganeti02.svc.esams.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/949011 (owner: 10Muehlenhoff) [10:35:44] yeah I asked in releng [10:36:38] !log sukhe@cumin2002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [10:36:55] 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10ssingh) [10:37:33] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [10:37:33] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3071.esams.wmnet with OS bullseye [10:38:41] 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10ssingh) [10:40:23] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [10:41:08] no, still broken :) [10:41:56] !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [10:42:53] !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [10:42:53] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3079.esams.wmnet with OS bullseye [10:43:39] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [10:43:46] 10sre-alert-triage, 10Machine-Learning-Team, 10Patch-For-Review: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (10klausman) The problem is only really relevant for LLMs (Large Language Models), 
sin... [10:44:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [10:44:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3005.esams.wmnet with OS bullseye [10:44:52] (03PS1) 10EoghanGaffney: gitlab: Update config to fix compatibility with swift [puppet] - 10https://gerrit.wikimedia.org/r/949014 [10:45:01] (03CR) 10CI reject: [V: 04-1] gitlab: Update config to fix compatibility with swift [puppet] - 10https://gerrit.wikimedia.org/r/949014 (owner: 10EoghanGaffney) [10:45:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3007.esams.wmnet with OS bullseye [10:46:09] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42894/console" [puppet] - 10https://gerrit.wikimedia.org/r/949014 (owner: 10EoghanGaffney) [10:48:20] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [10:52:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:54:10] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [10:54:24] !log zuul@contint1002:/srv/zuul/git/operations/puppet$ git fetch --force --tags -v origin [10:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:52] ok that worked [10:54:55] for how long, we will see [10:55:09] moritzm: ^ [10:55:19] (03CR) 10EoghanGaffney: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949014 (owner: 10EoghanGaffney) [10:56:02] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
an-worker1096.eqiad.wmnet with OS bullseye [10:56:03] sukhe: thanks, I'll give it a shot [10:56:04] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-db1001.eqiad.wmnet [10:56:29] (03CR) 10Fabfur: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [10:57:51] (03CR) 10Ssingh: [C: 03+2] cp306[79], cp307[57]: update site.pp and related configs for cp roles [puppet] - 10https://gerrit.wikimedia.org/r/949009 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [10:58:04] sukhe: what reimage script step is it failing at? [10:58:28] XioNoX: which one? [10:58:36] nothing failed for us, CI was broken :) [10:58:52] reimaging fine so far [10:58:56] ohh ok [10:59:14] sukhe: I thought the CI issue blocked the reimage cookbook [10:59:34] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:58] ah yeah, well it did in a way, I didn't want to proceed till I got a CI check [11:00:12] we should be done with the cp's soon, doing multiple at once [11:00:36] cool [11:00:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:00:42] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp3067.esams.wmnet with OS bullseye [11:01:08] 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10ssingh) [11:01:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3075.esams.wmnet with OS bullseye [11:02:44] PROBLEM - Check systemd state on restbase1030 is 
CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:48] RECOVERY - Check systemd state on kubestagemaster1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:18] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3069.esams.wmnet with OS bullseye [11:04:50] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [11:04:54] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp3077.esams.wmnet with OS bullseye [11:05:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:05:39] (03PS2) 10Ssingh: hiera: add new DNS host in esams, dns3003 [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) [11:05:47] (03CR) 10CI reject: [V: 04-1] hiera: add new DNS host in esams, dns3003 [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [11:06:12] (03PS3) 10Muehlenhoff: Add cert for ganeti02.svc.esams.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/949011 [11:06:21] (03CR) 10CI reject: [V: 04-1] Add cert for ganeti02.svc.esams.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/949011 (owner: 10Muehlenhoff) [11:06:25] yeah it's back [11:07:03] manually fetched again [11:07:09] not a solution, but no time right now to debug :) [11:07:09] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host an-db1001.eqiad.wmnet [11:07:17] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949010 
(https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [11:08:06] PROBLEM - Check systemd state on an-db1001 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:13] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [11:10:28] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [11:16:11] (03PS1) 10Stevemunene: airflow-wmde: Add wmde service user to the Yarn production queue [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) [11:16:20] (03CR) 10CI reject: [V: 04-1] airflow-wmde: Add wmde service user to the Yarn production queue [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [11:20:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:22:15] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3067.esams.wmnet with reason: host reimage [11:22:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet [11:22:37] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti3005.esams.wmnet [11:22:41] sukhe@contint2002:~$ sudo systemctl restart zuul [11:22:47] !log sukhe@contint2002:~$ sudo systemctl restart zuul [11:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:14] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949010 
(https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [11:23:46] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3075.esams.wmnet with reason: host reimage [11:24:16] CI should be back [11:24:24] thanks to RhinosF1 for finding the right task [11:24:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet [11:24:26] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti3005.esams.wmnet [11:24:44] sukhe: it's fine [11:24:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3069.esams.wmnet with reason: host reimage [11:25:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:25:54] RECOVERY - Check whether ferm is active by checking the default input chain on kubestagemaster1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:26:36] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:39] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp3067.esams.wmnet with reason: host reimage [11:26:47] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3077.esams.wmnet with reason: host reimage [11:26:51] (03CR) 10Stevemunene: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [11:27:39] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp3075.esams.wmnet with reason: host reimage [11:28:34] (03CR) 
10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949011 (owner: 10Muehlenhoff) [11:29:15] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp3069.esams.wmnet with reason: host reimage [11:30:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [11:30:33] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:30:42] RECOVERY - Check systemd state on an-db1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:38] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1623 days) https://wikitech.wikimedia.org/wiki/Logs [11:31:47] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp3077.esams.wmnet with reason: host reimage [11:32:24] PROBLEM - Check systemd state on ml-serve1007 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:33] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:36:58] RECOVERY - Check systemd state on ml-serve1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:56] (03PS2) 10EoghanGaffney: gitlab: Update config to fix compatibility with 
swift [puppet] - 10https://gerrit.wikimedia.org/r/949014 [11:40:04] (03CR) 10CI reject: [V: 04-1] gitlab: Update config to fix compatibility with swift [puppet] - 10https://gerrit.wikimedia.org/r/949014 (owner: 10EoghanGaffney) [11:49:15] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [11:50:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [11:50:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3075.esams.wmnet with OS bullseye [11:50:30] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: user-runtime-dir@23938.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:03] !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [11:52:58] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [11:53:43] !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [11:53:43] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3067.esams.wmnet with OS bullseye [11:54:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [11:54:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3069.esams.wmnet with OS bullseye 
[11:54:39] !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [11:55:43] (03CR) 10Stevemunene: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [11:58:32] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:37] 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10ssingh) [11:58:50] (03PS1) 10Ayounsi: Add new esams switches to icinga hostgroups [puppet] - 10https://gerrit.wikimedia.org/r/949023 (https://phabricator.wikimedia.org/T329219) [11:58:59] (03CR) 10CI reject: [V: 04-1] Add new esams switches to icinga hostgroups [puppet] - 10https://gerrit.wikimedia.org/r/949023 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [11:59:48] (03PS2) 10Ayounsi: Add new esams switches to icinga hostgroups [puppet] - 10https://gerrit.wikimedia.org/r/949023 (https://phabricator.wikimedia.org/T329219) [11:59:56] (03CR) 10CI reject: [V: 04-1] Add new esams switches to icinga hostgroups [puppet] - 10https://gerrit.wikimedia.org/r/949023 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [12:00:00] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/949023 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T1200) [12:01:00] 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10Fabfur) [12:02:27] !log fabfur@cumin1001 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [12:02:27] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3077.esams.wmnet with OS bullseye [12:02:39] !log sukhe@contint2002:~$ sudo systemctl restart zuul [12:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:49] !log sukhe@contint2002:~$ sudo systemctl restart zuul: T344238 [12:02:49] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add new esams switches to icinga hostgroups [puppet] - 10https://gerrit.wikimedia.org/r/949023 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [12:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:52] T344238: CI "Merge Failed. because cross-repo dependencies" on CI jobs, even up-to-date ones - https://phabricator.wikimedia.org/T344238 [12:02:56] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949023 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [12:03:15] XioNoX: oh ok [12:03:26] I just did a recheck but yeah, pcc wins [12:04:18] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti3007.esams.wmnet with OS bullseye [12:08:51] (03CR) 10Stevemunene: [C: 03+2] idp_test: add datahub_staging as a OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/944231 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [12:10:05] (03CR) 10EoghanGaffney: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949014 (owner: 10EoghanGaffney) [12:10:34] (03PS1) 10Jelto: gerrit: raise maxConnectionsPerUser to 8 [puppet] - 10https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) [12:12:07] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sw - ayounsi@cumin1001" [12:12:07] (03CR) 10Ssingh: [C: 03+1] "LGTM, good idea" [puppet] - 
10https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) (owner: 10Jelto) [12:12:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sw - ayounsi@cumin1001" [12:14:09] (03CR) 10Jelto: "note: the value was reduced from 32 to 4 in 2019: I30afd4ff3d8527aa3eb3280b81a840367f64918c" [puppet] - 10https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) (owner: 10Jelto) [12:16:24] (03CR) 10Filippo Giunchedi: [C: 03+1] admin_ng: increase resources for calico on wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/948091 (https://phabricator.wikimedia.org/T343900) (owner: 10Elukey) [12:18:29] (03PS1) 10Anzx: Remove knwiktionary tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949029 (https://phabricator.wikimedia.org/T343662) [12:18:36] (03CR) 10CI reject: [V: 04-1] Remove knwiktionary tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949029 (https://phabricator.wikimedia.org/T343662) (owner: 10Anzx) [12:24:54] PROBLEM - carbon-cache@b service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:24:56] PROBLEM - carbon-cache@d service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@d is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:25:00] PROBLEM - carbon-cache@g service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@g is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:25:08] PROBLEM - carbon-local-relay service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-local-relay is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:25:26] PROBLEM - carbon-frontend-relay service on cloudmetrics1003 is CRITICAL: CRITICAL 
- Expecting active but unit carbon-frontend-relay is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:25:28] PROBLEM - carbon-cache@f service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@f is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:25:36] PROBLEM - carbon-cache@e service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:25:46] PROBLEM - carbon-cache@a service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:25:48] PROBLEM - carbon-cache@h service on cloudmetrics1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@h is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:26:26] RECOVERY - carbon-cache@b service on cloudmetrics1003 is OK: OK - carbon-cache@b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:26:26] RECOVERY - carbon-cache@d service on cloudmetrics1003 is OK: OK - carbon-cache@d is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:26:32] RECOVERY - carbon-cache@g service on cloudmetrics1003 is OK: OK - carbon-cache@g is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:26:40] RECOVERY - carbon-local-relay service on cloudmetrics1003 is OK: OK - carbon-local-relay is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:26:54] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:58] RECOVERY - carbon-frontend-relay service on cloudmetrics1003 is OK: OK - carbon-frontend-relay is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state 
[12:27:00] RECOVERY - carbon-cache@f service on cloudmetrics1003 is OK: OK - carbon-cache@f is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:27:10] RECOVERY - carbon-cache@e service on cloudmetrics1003 is OK: OK - carbon-cache@e is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:27:20] RECOVERY - carbon-cache@a service on cloudmetrics1003 is OK: OK - carbon-cache@a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:27:20] RECOVERY - carbon-cache@h service on cloudmetrics1003 is OK: OK - carbon-cache@h is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:32:42] (03CR) 10Michael Große: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949030 (https://phabricator.wikimedia.org/T343409) (owner: 10Michael Große) [12:34:39] (03PS1) 10Anzx: Remove knwiktionary tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949032 (https://phabricator.wikimedia.org/T343662) [12:35:02] (03CR) 10Filippo Giunchedi: [C: 03+2] Remove podAntiAffinity for calico-typha on aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/949002 (https://phabricator.wikimedia.org/T344230) (owner: 10JMeybohm) [12:35:09] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "aux: set calico typha to one replica" [deployment-charts] - 10https://gerrit.wikimedia.org/r/948597 (https://phabricator.wikimedia.org/T333302) (owner: 10JMeybohm) [12:36:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:36:30] (03PS2) 10Anzx: Remove knwiktionary tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949029 (https://phabricator.wikimedia.org/T343662) [12:36:38] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[12:37:06] (03PS1) 10Urbanecm: Growth: Enable new Impact backend on large Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949033 (https://phabricator.wikimedia.org/T344143) [12:37:08] (03PS1) 10Urbanecm: Growth: Enable new Impact backend everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949034 (https://phabricator.wikimedia.org/T344143) [12:37:17] (03CR) 10Urbanecm: [C: 04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949034 (https://phabricator.wikimedia.org/T344143) (owner: 10Urbanecm) [12:37:23] (03Abandoned) 10Anzx: Remove knwiktionary tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949032 (https://phabricator.wikimedia.org/T343662) (owner: 10Anzx) [12:40:34] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:42:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:42:30] (03PS1) 10Ayounsi: Add new-esams network infra to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/949035 (https://phabricator.wikimedia.org/T329219) [12:43:03] (03CR) 10JMeybohm: [C: 03+2] admin_ng: increase resources for calico on wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/948091 (https://phabricator.wikimedia.org/T343900) (owner: 10Elukey) [12:43:30] (03PS1) 10Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) [12:44:04] (03CR) 10CI reject: [V: 04-1] [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [12:44:12] (03PS2) 10Ayounsi: Add new-esams network infra to monitoring 
[puppet] - 10https://gerrit.wikimedia.org/r/949035 (https://phabricator.wikimedia.org/T329219) [12:44:35] (03PS3) 10Ayounsi: Add new-esams network infra to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/949035 (https://phabricator.wikimedia.org/T329219) [12:44:41] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/949035 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [12:44:48] (03CR) 10Filippo Giunchedi: [C: 03+1] Add new-esams network infra to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/949035 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [12:45:44] (03Merged) 10jenkins-bot: admin_ng: increase resources for calico on wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/948091 (https://phabricator.wikimedia.org/T343900) (owner: 10Elukey) [12:45:59] (03PS4) 10Ayounsi: Add new-esams network infra to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/949035 (https://phabricator.wikimedia.org/T329219) [12:46:26] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:47] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/949035 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [12:47:54] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[12:48:30] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/949014 (owner: 10EoghanGaffney) [12:49:51] (03Abandoned) 10Jelto: gerrit: add blackbox check for json endpoint [puppet] - 10https://gerrit.wikimedia.org/r/948555 (owner: 10Jelto) [12:50:21] (03CR) 10Ayounsi: [C: 03+2] Add new-esams network infra to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/949035 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [12:51:10] (03PS1) 10David Caro: role::wmcs::monitoring: remove unused envoy options [puppet] - 10https://gerrit.wikimedia.org/r/949037 (https://phabricator.wikimedia.org/T344242) [12:51:16] (03CR) 10Jelto: [C: 03+1] icinga: remove obsolete gerrit checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948552 (owner: 10Filippo Giunchedi) [12:52:37] (03PS1) 10Anzx: Remove knwiktionary tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949038 (https://phabricator.wikimedia.org/T343662) [12:56:24] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti3005'] [12:59:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:59:48] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T1300). [13:00:04] sergi0 and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:18] hello [13:00:23] I can deploy today [13:00:29] o/ [13:00:34] or actually, sergi0, since you're also a deployer, do you want to try to deploy your patch? :) [13:00:55] o/ also around, but would prefer not to deploy [13:01:03] urbanecm: I didn't do my training :( [13:02:41] sergi0: ah. i can share my screen instead if you're interested in watching the deployment. i'm also happy to supervise your deployment monitoring your screen. with `scap backport`, it's not as difficult as it used to be :)) [13:03:09] urbanecm: yeah let's do it [13:03:17] which one? [13:03:32] let me watch first :) [13:03:35] ok [13:04:09] sergi0: see slack for meeting link [13:04:17] (03PS2) 10David Caro: role::wmcs::monitoring: pass through the ensure option [puppet] - 10https://gerrit.wikimedia.org/r/949037 (https://phabricator.wikimedia.org/T344242) [13:04:36] (03PS1) 10Ayounsi: Remove esams from ripeatlas_measurements [puppet] - 10https://gerrit.wikimedia.org/r/949039 [13:04:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:05:02] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove esams from ripeatlas_measurements [puppet] - 10https://gerrit.wikimedia.org/r/949039 (owner: 10Ayounsi) [13:05:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948631 (https://phabricator.wikimedia.org/T308138) (owner: 10Sergio Gimeno) [13:05:15] (03CR) 10Ayounsi: [C: 03+2] Remove esams from ripeatlas_measurements [puppet] - 10https://gerrit.wikimedia.org/r/949039 (owner: 10Ayounsi) [13:05:46] (03Merged) 10jenkins-bot: GrowthExperiments: enable AddLink 
backend 13th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948631 (https://phabricator.wikimedia.org/T308138) (owner: 10Sergio Gimeno) [13:05:58] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti3005'] [13:06:14] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:948631|GrowthExperiments: enable AddLink backend 13th round of wikis (T308138)]] [13:06:18] T308138: Deploy "add a link" to 13th round of wikis - https://phabricator.wikimedia.org/T308138 [13:06:36] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti3007'] [13:06:52] (03PS3) 10David Caro: role::wmcs::monitoring: pass through the ensure option [puppet] - 10https://gerrit.wikimedia.org/r/949037 (https://phabricator.wikimedia.org/T344242) [13:07:53] !log urbanecm@deploy1002 sgimeno and urbanecm: Backport for [[gerrit:948631|GrowthExperiments: enable AddLink backend 13th round of wikis (T308138)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:08:13] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42897/console" [puppet] - 10https://gerrit.wikimedia.org/r/949037 (https://phabricator.wikimedia.org/T344242) (owner: 10David Caro) [13:08:36] sergi0: please test on mwdebug1001 [13:09:11] urbanecm: this is a noop change, will trigger a periodic a job but I can run one manually [13:09:57] (03PS4) 10Muehlenhoff: Add cert for ganeti02.svc.esams.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/949011 [13:10:26] !log urbanecm@deploy1002 sgimeno and urbanecm: Continuing with sync [13:15:57] (03PS6) 10AOkoth: vrts: add test VM to site [puppet] - 10https://gerrit.wikimedia.org/r/939349 
(https://phabricator.wikimedia.org/T340027) [13:15:59] (03PS1) 10AOkoth: contint2001: puppet cleanup post decom [puppet] - 10https://gerrit.wikimedia.org/r/949040 (https://phabricator.wikimedia.org/T342017) [13:16:19] (03CR) 10Stevemunene: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [13:16:35] (03CR) 10CI reject: [V: 04-1] vrts: add test VM to site [puppet] - 10https://gerrit.wikimedia.org/r/939349 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [13:17:02] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:948631|GrowthExperiments: enable AddLink backend 13th round of wikis (T308138)]] (duration: 10m 47s) [13:17:06] T308138: Deploy "add a link" to 13th round of wikis - https://phabricator.wikimedia.org/T308138 [13:17:08] sergi0: deployed :) [13:17:09] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti3007'] [13:17:22] (03CR) 10Muehlenhoff: [C: 03+2] Add cert for ganeti02.svc.esams.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/949011 (owner: 10Muehlenhoff) [13:17:29] (03PS2) 10Urbanecm: Remove knwiktionary tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949038 (https://phabricator.wikimedia.org/T343662) (owner: 10Anzx) [13:17:51] urbanecm: thank you so much! 
Almost graduated :) [13:17:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949038 (https://phabricator.wikimedia.org/T343662) (owner: 10Anzx) [13:18:09] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti3007'] [13:18:37] (03Merged) 10jenkins-bot: Remove knwiktionary tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949038 (https://phabricator.wikimedia.org/T343662) (owner: 10Anzx) [13:19:03] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949038|Remove knwiktionary tagline (T343662)]] [13:19:05] (03PS2) 10AOkoth: contint2001: puppet cleanup post decom [puppet] - 10https://gerrit.wikimedia.org/r/949040 (https://phabricator.wikimedia.org/T342017) [13:19:06] T343662: update knwiktionary logos - https://phabricator.wikimedia.org/T343662 [13:19:20] aanzx: your patch is up next. will ping you once testable on mwdebug. [13:19:30] urbanecm: ok [13:20:19] (03CR) 10Fabfur: [C: 03+1] "IP looks good, checked on netbox" [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [13:20:41] !log urbanecm@deploy1002 urbanecm and anzx: Backport for [[gerrit:949038|Remove knwiktionary tagline (T343662)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:20:48] Testing [13:20:49] (03CR) 10Fabfur: [C: 03+2] hiera: add new DNS host in esams, dns3003 [puppet] - 10https://gerrit.wikimedia.org/r/949010 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [13:20:53] thanks :) [13:21:40] (03PS3) 10AOkoth: contint2001: puppet cleanup post decom [puppet] - 10https://gerrit.wikimedia.org/r/949040 (https://phabricator.wikimedia.org/T342017) [13:22:42] Urbanecm tested looks good [13:22:57] aanzx: thanks, 
proceeding [13:22:58] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [13:22:59] !log urbanecm@deploy1002 urbanecm and anzx: Continuing with sync [13:23:04] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [13:23:17] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host dns3003.wikimedia.org with OS bullseye [13:23:55] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti3007'] [13:23:57] PROBLEM - Host cr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:23:57] PROBLEM - Host cr1-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:24:10] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:25:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3007.esams.wmnet with OS bullseye [13:28:42] (03CR) 10EoghanGaffney: [C: 03+1] contint2001: puppet cleanup post decom [puppet] - 10https://gerrit.wikimedia.org/r/949040 (https://phabricator.wikimedia.org/T342017) (owner: 10AOkoth) [13:29:23] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949038|Remove knwiktionary tagline (T343662)]] (duration: 10m 20s) [13:29:24] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:29:27] T343662: update knwiktionary logos - https://phabricator.wikimedia.org/T343662 [13:29:31] aanzx: deployed :) [13:29:40] urbanecm: thanks [13:29:42] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:29:52] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3081 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [13:29:52] PROBLEM - HAProxy 
HTTPS wikipedia.org RSA on cp3081 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [13:29:52] PROBLEM - Check systemd state on cp3081 is CRITICAL: CRITICAL - degraded: The following units failed: haproxy_stek_job.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:52] PROBLEM - traffic-pool service on cp3081 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:29:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet [13:30:14] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:31:16] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:31:50] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 5.346 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:32:02] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3081 is OK: SSL OK - OCSP staple validity for wikipedia.org has 575701 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 87 days) https://wikitech.wikimedia.org/wiki/HTTPS [13:32:02] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3081 is OK: SSL OK - OCSP staple validity for wikipedia.org has 590461 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 94 days) https://wikitech.wikimedia.org/wiki/HTTPS [13:32:24] fabfur: ^ recovered :) [13:32:34] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:32:40] sukhe: yep tnx [13:35:08] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:14] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/947401 (owner: 10PipelineBot) [13:36:59] (03CR) 10EoghanGaffney: [C: 03+2] gitlab: Update config to fix compatibility with swift [puppet] - 10https://gerrit.wikimedia.org/r/949014 (owner: 10EoghanGaffney) [13:37:05] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/947401 (owner: 10PipelineBot) [13:37:30] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:38:31] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp3081.esams.wmnet with OS bullseye [13:38:50] PROBLEM - puppet last run on gitlab1003 is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:42:56] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:12] RECOVERY - puppet last run on gitlab1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:44:34] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:44:45] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns3003.wikimedia.org with OS bullseye [13:45:01] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:45:17] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [13:45:38] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3007.esams.wmnet with reason: host reimage [13:45:48] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [13:46:58] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [13:47:32] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [13:48:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3007.esams.wmnet with reason: host reimage [13:49:33] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10Jhancock.wm) 05Open→03Resolved stayed steady for 24 hours. closing. 
[13:51:31] 10ops-codfw, 10serviceops-radar, 10Maps (Maps-data): ManagementSSHDown - https://phabricator.wikimedia.org/T344110 (10Jhancock.wm) Dell sent me a list of checks to determine if it's the motherboard or the backplane. followed directions and replied. my guess is the MB will need to be replaced. will update whe... [13:51:37] (03CR) 10Ssingh: [C: 03+2] Add IP pre-assignments for new lvs servers in Amsterdam [puppet] - 10https://gerrit.wikimedia.org/r/944875 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [13:51:44] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host dns3003.wikimedia.org with OS bullseye [13:53:08] (03PS1) 10Jgiannelos: wikifeeds: Use GET instead of POST for mwapi requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/949046 [13:54:04] (03PS2) 10Jgiannelos: wikifeeds: Use GET instead of POST for mwapi requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/949046 (https://phabricator.wikimedia.org/T343950) [13:59:55] 10sre-alert-triage, 10Machine-Learning-Team: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (10JMeybohm) >>! In T343900#9092723, @klausman wrote: > The problem is only really relevant for LLMs (Large... [14:00:10] (03CR) 10Jgiannelos: "We already made the same change in the codebase repo but the config is overridden here." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/949046 (https://phabricator.wikimedia.org/T343950) (owner: 10Jgiannelos) [14:00:52] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3081.esams.wmnet with reason: host reimage [14:03:58] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3081.esams.wmnet with reason: host reimage [14:07:48] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ganeti3005.esams.wmnet [14:09:05] (03PS3) 10Stevemunene: airflow-wmde: Create analytics-wmde airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) [14:09:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet [14:11:18] (03PS1) 10Ssingh: esams: add new LVS high-traffic2 host, lvs3009 [puppet] - 10https://gerrit.wikimedia.org/r/949052 (https://phabricator.wikimedia.org/T344174) [14:11:38] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:52] (03CR) 10CI reject: [V: 04-1] esams: add new LVS high-traffic2 host, lvs3009 [puppet] - 10https://gerrit.wikimedia.org/r/949052 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [14:12:57] (03CR) 10Ssingh: "CI? 
KeyError: key not found: "PARALLEL_PID_FILE"" [puppet] - 10https://gerrit.wikimedia.org/r/949052 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [14:13:00] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949052 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [14:13:52] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:15] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns3003.wikimedia.org with OS bullseye [14:14:50] (03CR) 10Andrew Bogott: [C: 03+1] role::wmcs::monitoring: pass through the ensure option [puppet] - 10https://gerrit.wikimedia.org/r/949037 (https://phabricator.wikimedia.org/T344242) (owner: 10David Caro) [14:16:38] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:51] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1097.eqiad.wmnet with OS bullseye [14:21:34] (03CR) 10Fabfur: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/949052 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [14:21:43] (03CR) 10Ssingh: [C: 03+2] esams: add new LVS high-traffic2 host, lvs3009 [puppet] - 10https://gerrit.wikimedia.org/r/949052 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [14:22:57] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs3009.esams.wmnet with OS bullseye [14:23:13] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti3005.esams.wmnet [14:25:48] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:12] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3081.esams.wmnet with OS bullseye [14:26:24] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1015.eqiad.wmnet with OS bullseye [14:26:47] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1014.eqiad.wmnet with OS bullseye [14:27:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet [14:27:54] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:58] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:33:03] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [14:33:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:51] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1097.eqiad.wmnet with reason: host reimage [14:34:52] (03Abandoned) 10Jdrewniak: Enable Vector "Zebra" AB test on Hebrew wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922564 (https://phabricator.wikimedia.org/T335972) (owner: 10Jdrewniak) [14:37:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: 
Host reimage - jmm@cumin2002" [14:37:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3007.esams.wmnet with OS bullseye [14:38:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1097.eqiad.wmnet with reason: host reimage [14:38:27] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:35] PROBLEM - Recursive DNS on 185.15.59.34 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [14:38:49] yeah this is fine [14:38:51] ^ [14:39:44] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3005.mgmt.esams.wmnet with reboot policy GRACEFUL [14:42:59] 10ops-eqiad, 10Cassandra: restbase1030: Cassandra crashing (signal 11) - https://phabricator.wikimedia.org/T344210 (10Eevans) p:05Triage→03High I think the upgrade to 4.1.1 is a red herring. This seems to be limited to a single instance (one of three), and each sig 11 corresponds with the device errors ab... [14:43:05] (03CR) 10Herron: thanos-fe: switch to cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948125 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron) [14:43:36] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ganeti3005.esams.wmnet [14:43:38] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10fkaelin) Thanks @colewhite. - As part of the data eng onboarding (T267817), I signed the L3 and a LDAP user should have been created. - This is the wikitech [[ https://wikitech.wikimedia.or... 
[14:45:08] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk2001.codfw.wmnet [14:45:09] !log bking@cumin1001 START - Cookbook sre.dns.netbox [14:45:55] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:05] (03PS1) 10Ssingh: hiera: update ifaces names for lvs3009 [puppet] - 10https://gerrit.wikimedia.org/r/949057 (https://phabricator.wikimedia.org/T344174) [14:49:04] (03CR) 10Ssingh: [C: 03+2] hiera: update ifaces names for lvs3009 [puppet] - 10https://gerrit.wikimedia.org/r/949057 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [14:49:09] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1015.eqiad.wmnet with reason: host reimage [14:49:17] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1014.eqiad.wmnet with reason: host reimage [14:50:27] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3005.mgmt.esams.wmnet with reboot policy GRACEFUL [14:50:41] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10RhinosF1) [14:51:24] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2001.codfw.wmnet - bking@cumin1001" [14:51:31] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs3009.esams.wmnet with OS bullseye [14:52:10] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2001.codfw.wmnet - bking@cumin1001" [14:52:10] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:52:10] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache 
flink-zk2001.codfw.wmnet on all recursors [14:52:14] !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) flink-zk2001.codfw.wmnet on all recursors [14:52:26] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, sql_labm SSH key entry, Kerberos Principal, Team Shell (posix) membership for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10OSefu-WMF) [14:52:36] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1015.eqiad.wmnet with reason: host reimage [14:53:08] PROBLEM - NTP peers on dns3003 is CRITICAL: NTP CRITICAL: No response from NTP server https://wikitech.wikimedia.org/wiki/NTP [14:54:06] !log bking@cumin1001 START - Cookbook sre.dns.netbox [14:54:50] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1014.eqiad.wmnet with reason: host reimage [14:54:52] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host lvs3009.mgmt.esams.wmnet with reboot policy FORCED [14:55:27] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs3009.mgmt.esams.wmnet with reboot policy FORCED [14:56:02] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host lvs3009.mgmt.esams.wmnet with reboot policy FORCED [14:57:49] (03PS13) 10David Caro: WIP: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [14:59:15] (03PS1) 10Ssingh: Revert "hiera: add new DNS host in esams, dns3003" [puppet] - 10https://gerrit.wikimedia.org/r/948600 [14:59:38] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:55] (03CR) 10Ssingh: [C: 03+2] Revert "hiera: add new DNS host in esams, dns3003" [puppet] - 10https://gerrit.wikimedia.org/r/948600 (owner: 10Ssingh) [15:00:02] PROBLEM - 
Auth DNS on dns3003 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [15:00:12] ^ not fine but fine [15:00:19] as in, not serving prod traffic [15:00:32] (03CR) 10CI reject: [V: 04-1] WIP: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [15:04:03] (03CR) 10David Caro: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [15:05:26] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs3009.mgmt.esams.wmnet with reboot policy FORCED [15:06:17] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs3009.esams.wmnet with OS bullseye [15:07:48] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1097.eqiad.wmnet with OS bullseye [15:08:06] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:40] (03PS1) 10BCornwall: Revert "Revert "pybal: Make check conform to the Nagios plugin API"" [puppet] - 10https://gerrit.wikimedia.org/r/948601 [15:12:12] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:52] PROBLEM - Recursive DNS on 2a02:ec80:300:2:185:15:59:34 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [15:13:13] expected [15:14:05] (03CR) 10BCornwall: [C: 03+2] Revert "Revert "pybal: Make check conform to the Nagios plugin API"" [puppet] - 10https://gerrit.wikimedia.org/r/948601 (owner: 10BCornwall) [15:15:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:22] ops-eqiad, DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (Eevans) [15:19:28] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:49] ops-eqiad, DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (Eevans) [15:26:00] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:27:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3009.esams.wmnet with reason: host reimage [15:27:42] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1014.eqiad.wmnet with OS bullseye [15:29:05] (PS2) Aklapper: Add botadmin group on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/940486 (https://phabricator.wikimedia.org/T342484) (owner: Hamish) [15:29:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3007.esams.wmnet [15:29:43] (CR) CI reject: [V: -1] Add botadmin group on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/940486 (https://phabricator.wikimedia.org/T342484) (owner: Hamish) [15:29:47] !log robh@cumin1001 START - Cookbook sre.dns.netbox [15:29:49] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk2001.codfw.wmnet - bking@cumin1001" [15:30:42] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk2001.codfw.wmnet - 
bking@cumin1001" [15:30:42] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:30:42] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk2001.codfw.wmnet on all recursors [15:30:45] !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) flink-zk2001.codfw.wmnet on all recursors [15:32:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3009.esams.wmnet with reason: host reimage [15:32:52] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rack bw27 hosts - robh@cumin1001" [15:33:05] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:33:36] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rack bw27 hosts - robh@cumin1001" [15:33:36] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:34:05] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:34:59] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk2001.codfw.wmnet [15:35:07] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host lvs3010 [15:35:19] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs3010 [15:35:24] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host lvs3008 [15:35:45] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs3008 [15:36:15] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk2001.codfw.wmnet [15:36:17] !log bking@cumin1001 START - Cookbook sre.dns.netbox [15:36:42] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (10Gehel) 05Open→03Resolved [15:37:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:15] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp3080 [15:37:27] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp3080 [15:37:31] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dns3004 [15:37:43] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns3004 [15:37:50] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp3078 [15:38:02] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp3078 [15:38:03] (03PS14) 10David Caro: WIP: replica_cnf_api: add envvars backend [puppet] - 
10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [15:38:06] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp3076 [15:38:13] robh: [15:38:16] I see DNS changes for dns3004 [15:38:19] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp3076 [15:38:20] going to add them [15:38:24] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp3074 [15:38:27] is that fine? [15:38:36] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp3074 [15:38:41] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp3072 [15:38:57] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp3072 [15:39:01] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp3070 [15:39:12] (03PS1) 10Muehlenhoff: Assign ganeti role to BY27 cluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/949063 [15:39:15] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp3070 [15:39:21] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2001.codfw.wmnet - bking@cumin1001" [15:39:25] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp3068 [15:39:37] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp3068 [15:39:41] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp3066 [15:39:49] (03PS3) 10Hamish: Add botadmin group on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940486 (https://phabricator.wikimedia.org/T342484) [15:39:52] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host ganeti3007.esams.wmnet [15:40:07] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp3066 [15:40:08] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2001.codfw.wmnet - bking@cumin1001" [15:40:08] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:40:08] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk2001.codfw.wmnet on all recursors [15:40:11] !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) flink-zk2001.codfw.wmnet on all recursors [15:40:23] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:40:37] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2001.codfw.wmnet - bking@cumin1001" [15:41:22] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2001.codfw.wmnet - bking@cumin1001" [15:41:34] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk2001.codfw.wmnet with OS bookworm [15:42:04] !log robh@cumin1001 START - Cookbook sre.dns.netbox [15:42:11] (03CR) 10Muehlenhoff: [C: 03+2] Assign ganeti role to BY27 cluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/949063 (owner: 10Muehlenhoff) [15:44:09] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns3004 - robh@cumin1001" [15:44:19] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for 
host wdqs1015.eqiad.wmnet with OS bullseye [15:44:54] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns3004 - robh@cumin1001" [15:44:55] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:47:39] SRE, ops-eqiad, DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (Eevans) [15:47:49] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:47:50] (PS5) Ryan Kemper: elastic: allow only 1 enwiki_content per host [mediawiki-config] - https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T343820) [15:49:03] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:49:20] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host lvs3010.mgmt.esams.wmnet with reboot policy FORCED [15:49:23] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [15:50:03] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host lvs3008.mgmt.esams.wmnet with reboot policy FORCED [15:50:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [15:50:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs3009.esams.wmnet with OS bullseye [15:50:19] (03PS15) 10David Caro: WIP: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [15:50:21] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3008.mgmt.esams.wmnet with reboot policy FORCED [15:50:50] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3006.mgmt.esams.wmnet with reboot policy FORCED [15:51:11] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host dns3004.mgmt.esams.wmnet with reboot policy FORCED [15:51:47] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3080.mgmt.esams.wmnet with reboot policy FORCED [15:53:29] (03PS16) 10David Caro: WIP: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [15:56:42] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 [15:56:46] T343124: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 [15:56:56] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: 
deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 14s) [15:57:44] (PS1) Ayounsi: Rancid: esams migration [puppet] - https://gerrit.wikimedia.org/r/949072 (https://phabricator.wikimedia.org/T329219) [15:58:03] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 [15:58:18] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 15s) [15:58:39] (CR) Ayounsi: [C: +2] Rancid: esams migration [puppet] - https://gerrit.wikimedia.org/r/949072 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi) [15:59:43] (PS1) Ayounsi: Rancid: typo [puppet] - https://gerrit.wikimedia.org/r/949073 [15:59:53] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [16:00:05] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:20] (03CR) 10Ayounsi: [C: 03+2] Rancid: typo [puppet] - 10https://gerrit.wikimedia.org/r/949073 (owner: 10Ayounsi) [16:00:30] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [16:02:44] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs3008.mgmt.esams.wmnet with reboot policy FORCED [16:02:52] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3006.mgmt.esams.wmnet with reboot policy FORCED [16:02:53] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:49] (03CR) 10BCornwall: [V: 03+1 C: 03+2] trafficserver: Use svc urls for eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/948624 (https://phabricator.wikimedia.org/T326657) (owner: 10BCornwall) [16:05:10] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1098.eqiad.wmnet with OS bullseye [16:08:33] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3008.mgmt.esams.wmnet with reboot policy FORCED [16:09:06] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns3004.mgmt.esams.wmnet with reboot policy FORCED [16:09:13] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs3010.mgmt.esams.wmnet with reboot policy FORCED [16:09:45] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3078.mgmt.esams.wmnet with reboot policy FORCED [16:09:48] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3080.mgmt.esams.wmnet with reboot policy FORCED [16:10:27] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3076.mgmt.esams.wmnet with reboot policy FORCED [16:10:49] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3074.mgmt.esams.wmnet with reboot policy FORCED 
[16:11:17] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3072.mgmt.esams.wmnet with reboot policy FORCED [16:11:26] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3070.mgmt.esams.wmnet with reboot policy FORCED [16:11:38] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3068.mgmt.esams.wmnet with reboot policy FORCED [16:12:40] 10ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T344269 (10phaultfinder) [16:14:45] (03CR) 10BCornwall: [C: 03+2] Release 1.9-4 to target Bookworm [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/946604 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [16:18:14] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:19:39] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:20:30] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:20:51] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:21:35] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1098.eqiad.wmnet with reason: host reimage [16:24:10] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:24:27] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1098.eqiad.wmnet with reason: host reimage [16:25:19] (03PS1) 10Ssingh: Revert "sre.hosts.reimage: connect to the micro service port" [cookbooks] - 10https://gerrit.wikimedia.org/r/948603 [16:26:31] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:42] (03CR) 10Ssingh: "For posterity and based on discussions with Jesse and Arzhel, this line causes multiple entries: 
https://gerrit.wikimedia.org/r/plugins/gi" [cookbooks] - 10https://gerrit.wikimedia.org/r/948603 (owner: 10Ssingh) [16:27:30] (03PS1) 10Ayounsi: Revert "sre.hosts.reimage: connect to the micro service port" [cookbooks] - 10https://gerrit.wikimedia.org/r/948604 [16:27:56] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3076.mgmt.esams.wmnet with reboot policy FORCED [16:28:10] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3078.mgmt.esams.wmnet with reboot policy FORCED [16:28:40] (03CR) 10BCornwall: [C: 03+1] Revert "sre.hosts.reimage: connect to the micro service port" [cookbooks] - 10https://gerrit.wikimedia.org/r/948603 (owner: 10Ssingh) [16:28:56] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3074.mgmt.esams.wmnet with reboot policy FORCED [16:29:18] (03CR) 10Ssingh: [C: 03+1] Revert "sre.hosts.reimage: connect to the micro service port" [cookbooks] - 10https://gerrit.wikimedia.org/r/948604 (owner: 10Ayounsi) [16:29:34] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3072.mgmt.esams.wmnet with reboot policy FORCED [16:29:36] (03Abandoned) 10Ssingh: Revert "sre.hosts.reimage: connect to the micro service port" [cookbooks] - 10https://gerrit.wikimedia.org/r/948603 (owner: 10Ssingh) [16:29:42] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3068.mgmt.esams.wmnet with reboot policy FORCED [16:30:00] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3070.mgmt.esams.wmnet with reboot policy FORCED [16:30:08] (03CR) 10Ayounsi: [C: 03+2] Revert "sre.hosts.reimage: connect to the micro service port" [cookbooks] - 10https://gerrit.wikimedia.org/r/948604 (owner: 10Ayounsi) [16:32:04] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3066.mgmt.esams.wmnet with reboot policy FORCED [16:32:46] (03Merged) 10jenkins-bot: Revert 
"sre.hosts.reimage: connect to the micro service port" [cookbooks] - 10https://gerrit.wikimedia.org/r/948604 (owner: 10Ayounsi) [16:33:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:33] (03CR) 10Ayounsi: [C: 03+2] "Longer fix might be to ease the "len(json_response) != 1" check." [cookbooks] - 10https://gerrit.wikimedia.org/r/948604 (owner: 10Ayounsi) [16:33:38] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs3010'] [16:36:28] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs3008'] [16:37:01] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti3008'] [16:37:35] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti3006'] [16:37:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:37:59] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns3004'] [16:42:17] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs3008'] [16:42:30] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti3008'] [16:43:04] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs3010'] [16:43:55] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns3004'] [16:44:05] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti3006'] [16:44:45] PROBLEM - 
mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:44:59] (03PS1) 10Ssingh: esams: provision all cp hosts in rack B27 [puppet] - 10https://gerrit.wikimedia.org/r/949078 (https://phabricator.wikimedia.org/T344174) [16:45:02] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs3010'] [16:45:12] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk2002.codfw.wmnet [16:45:14] !log bking@cumin1001 START - Cookbook sre.dns.netbox [16:45:37] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3080'] [16:46:01] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3078'] [16:46:59] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3076'] [16:47:24] (03PS1) 10Ssingh: Revert "Revert "hiera: add new DNS host in esams, dns3003"" [puppet] - 10https://gerrit.wikimedia.org/r/948605 [16:47:47] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:47:54] !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:47:58] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk2002.codfw.wmnet [16:48:33] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3074'] [16:48:46] !log robh@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp3074'] [16:49:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1098.eqiad.wmnet with OS bullseye [16:49:37] (03CR) 10Andrew Bogott: [C: 03+1] "seems simple enough!" [puppet] - 10https://gerrit.wikimedia.org/r/948566 (https://phabricator.wikimedia.org/T334585) (owner: 10David Caro) [16:49:39] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3066.mgmt.esams.wmnet with reboot policy FORCED [16:50:45] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3074'] [16:51:46] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3080'] [16:52:27] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3078'] [16:53:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:56] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs3010'] [16:54:01] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3076'] [16:54:12] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3072'] [16:55:04] !log robh@cumin1001 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3070'] [16:55:26] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3068'] [16:55:47] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3066'] [16:56:55] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3074'] [16:57:37] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk2001.codfw.wmnet with OS bookworm [16:57:37] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk2001.codfw.wmnet [16:57:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T1700) [17:00:13] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:32] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3072'] [17:01:29] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3068'] [17:01:38] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3066'] [17:01:53] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3070'] [17:06:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T344135 (10RobH) 05Open→03Declined [17:06:51] (03PS1) 10Ssingh: esams: add new LVS high-traffic1 host, lvs3008 
[puppet] - 10https://gerrit.wikimedia.org/r/949082 (https://phabricator.wikimedia.org/T344174) [17:06:53] (03PS1) 10Ssingh: esams: add new LVS secondary host, lvs3010 [puppet] - 10https://gerrit.wikimedia.org/r/949083 (https://phabricator.wikimedia.org/T344174) [17:08:24] 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10RobH) [17:09:41] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [17:12:19] ^ yeah [17:14:14] (03CR) 10Fabfur: [C: 03+2] Revert "Revert "hiera: add new DNS host in esams, dns3003"" [puppet] - 10https://gerrit.wikimedia.org/r/948605 (owner: 10Ssingh) [17:15:25] (03CR) 10BCornwall: [C: 03+1] esams: provision all cp hosts in rack B27 [puppet] - 10https://gerrit.wikimedia.org/r/949078 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [17:15:48] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host dns3003.wikimedia.org with OS bullseye [17:16:48] ^^ due to this cookbook running dns changes *may* fail [17:16:56] (03CR) 10Ssingh: [C: 03+2] esams: provision all cp hosts in rack B27 [puppet] - 10https://gerrit.wikimedia.org/r/949078 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [17:20:58] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3066.esams.wmnet with OS bullseye [17:21:11] !log Upload libvmod-netmapper 1.9-4 (bookworm) to archive - T342154 [17:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:14] T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 [17:21:56] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [17:22:11] !log sukhe@cumin2002 
START - Cookbook sre.hosts.reimage for host cp3074.esams.wmnet with OS bullseye [17:24:05] PROBLEM - Host 2a02:ec80:300:2:185:15:59:34 is DOWN: CRITICAL - Destination Unreachable (2a02:ec80:300:2:185:15:59:34) [17:24:25] ^ expected [17:24:27] fabfur reimaging [17:25:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:01] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:37] PROBLEM - Check systemd state on ml-serve1006 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:29:16] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1011 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:29:19] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:29:55] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:19] (03PS1) 10Fabfur: hiera: add new DNS host in esams, dns3004 [puppet] - 10https://gerrit.wikimedia.org/r/949088 (https://phabricator.wikimedia.org/T344174) [17:32:01] (03CR) 10Ssingh: [C: 03+1] hiera: add new DNS host in esams, dns3004 [puppet] - 10https://gerrit.wikimedia.org/r/949088 (https://phabricator.wikimedia.org/T344174) (owner: 10Fabfur) [17:33:02] (03CR) 10Fabfur: [C: 03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/949083 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [17:33:57] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:35:25] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:37:18] (03CR) 10AOkoth: [C: 03+2] contint2001: puppet cleanup post decom [puppet] - 10https://gerrit.wikimedia.org/r/949040 (https://phabricator.wikimedia.org/T342017) (owner: 10AOkoth) [17:39:21] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dns3003.wikimedia.org with reason: host reimage [17:40:14] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3068.esams.wmnet with OS bullseye [17:42:39] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3076.esams.wmnet with OS bullseye [17:42:47] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns3003.wikimedia.org with reason: host reimage [17:42:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3066.esams.wmnet with reason: host reimage [17:44:06] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3074.esams.wmnet with reason: host reimage [17:45:44] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:45:59] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3066.esams.wmnet with reason: host reimage [17:48:35] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3074.esams.wmnet 
with reason: host reimage [17:50:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:50:15] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [17:50:23] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:51:15] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1623 days) https://wikitech.wikimedia.org/wiki/Logs [17:51:23] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:51:25] PROBLEM - Recursive DNS on 185.15.59.34 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:53:23] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [17:54:40] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [17:55:04] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:59:01] RECOVERY - Recursive DNS on 185.15.59.34 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [18:00:04] brennen and dancy: That opportune time is upon us again. Time for a MediaWiki train - Utc-7 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T1800). [18:00:12] o/ [18:01:18] ooh, a lot of risky patches this train. Good times. I do very much appreciate the forewarnings! [18:01:27] * dancy tries to read them carefully [18:01:45] break a leg! [18:01:55] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3068.esams.wmnet with reason: host reimage [18:02:50] Looks like train is blocked on T344223 so I'm not pressing any buttons at this time. 
[18:02:51] T344223: User logging in on mw-on-k8s triggers "RuntimeException: firejail is enabled, but cannot be found" - https://phabricator.wikimedia.org/T344223 [18:05:04] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3076.esams.wmnet with reason: host reimage [18:05:09] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3068.esams.wmnet with reason: host reimage [18:08:19] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3076.esams.wmnet with reason: host reimage [18:09:31] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:09:44] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [18:10:29] !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [18:11:07] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:11:11] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [18:11:53] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:12:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.144 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:12:35] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 4.691 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring 
[18:12:47] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:14:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:16:14] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [18:16:14] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3066.esams.wmnet with OS bullseye [18:16:14] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [18:16:14] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3074.esams.wmnet with OS bullseye [18:17:41] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [18:17:41] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns3003.wikimedia.org with OS bullseye [18:19:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:21:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:23:52] 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10Fabfur) [18:24:01] (03CR) 10Fabfur: [C: 03+2] hiera: add new DNS host in esams, dns3004 [puppet] - 10https://gerrit.wikimedia.org/r/949088 (https://phabricator.wikimedia.org/T344174) (owner: 10Fabfur) [18:24:57] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:13] start reimaging dns3004 (dns changes may fail during this time, in case of error skip) [18:26:13] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host dns3004.wikimedia.org with OS bullseye [18:27:59] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:24] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [18:29:47] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:19] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [18:30:43] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [18:30:43] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3068.esams.wmnet with OS bullseye [18:32:31] !log brett@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [18:32:32] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3076.esams.wmnet with OS bullseye [18:33:46] 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10BCornwall) [18:36:11] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3070.esams.wmnet with OS bullseye [18:36:34] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3078.esams.wmnet with OS bullseye [18:36:41] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:36:58] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns3004.wikimedia.org with OS bullseye [18:37:12] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host dns3004.wikimedia.org with OS bullseye [18:57:11] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:21] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3070.esams.wmnet with reason: host reimage [18:57:52] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3078.esams.wmnet with reason: host reimage [18:58:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:16] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dns3004.wikimedia.org with reason: host reimage [19:01:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3070.esams.wmnet 
with reason: host reimage [19:03:48] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns3004.wikimedia.org with reason: host reimage [19:06:24] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3078.esams.wmnet with reason: host reimage [19:06:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:06:55] PROBLEM - Recursive DNS on 185.15.59.2 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:07:42] (03PS2) 10Ahmon Dancy: Update kask container image path [deployment-charts] - 10https://gerrit.wikimedia.org/r/913949 (https://phabricator.wikimedia.org/T335691) [19:10:18] (03CR) 10Ahmon Dancy: [C: 03+1] Update kask container image path [deployment-charts] - 10https://gerrit.wikimedia.org/r/913949 (https://phabricator.wikimedia.org/T335691) (owner: 10Ahmon Dancy) [19:16:15] PROBLEM - Recursive DNS on 2a02:ec80:300:1:185:15:59:2 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:18:41] RECOVERY - Recursive DNS on 2a02:ec80:300:1:185:15:59:2 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [19:19:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:25] RECOVERY - Recursive DNS on 185.15.59.2 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [19:20:23] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:23:01] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:23:23] 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10ssingh) [19:25:39] (03PS1) 10Ssingh: devices: add anycast_ and lvs_neigbhors for esams (bw27/by27) [homer/public] - 10https://gerrit.wikimedia.org/r/949100 (https://phabricator.wikimedia.org/T329219) [19:26:00] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [19:26:43] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [19:27:51] (03CR) 10Ayounsi: [C: 03+1] devices: add anycast_ and lvs_neigbhors for esams (bw27/by27) [homer/public] - 10https://gerrit.wikimedia.org/r/949100 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [19:28:21] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [19:28:28] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3072.esams.wmnet with OS bullseye [19:29:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3080.esams.wmnet with OS bullseye [19:30:19] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merge flink-zk2002 DNS changes - sukhe@cumin2002" [19:30:57] !log brett@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [19:30:58] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3070.esams.wmnet with OS bullseye [19:30:58] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: 
Host reimage - brett@cumin2002" [19:30:58] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3078.esams.wmnet with OS bullseye [19:31:06] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merge flink-zk2002 DNS changes - sukhe@cumin2002" [19:31:06] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:32:00] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "manual trigger - cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002 - brett@cumin2002" [19:32:43] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "manual trigger - cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002 - brett@cumin2002" [19:33:33] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:58] (03PS1) 10Bking: query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) [19:34:31] (03CR) 10CI reject: [V: 04-1] query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [19:36:01] (03CR) 10Ssingh: [C: 03+2] esams: add new LVS high-traffic1 host, lvs3008 [puppet] - 10https://gerrit.wikimedia.org/r/949082 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [19:36:43] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:38:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs3008.esams.wmnet with OS bullseye [19:40:50] 10SRE, 
10SRE-Access-Requests: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10colewhite) [19:42:12] 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10BCornwall) [19:44:05] !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [19:44:29] (03PS2) 10Ryan Kemper: query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [19:45:01] (03CR) 10CI reject: [V: 04-1] query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [19:45:37] !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [19:45:38] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns3004.wikimedia.org with OS bullseye [19:45:45] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10colewhite) Thanks @fkaelin! Found the L3 signature. Good to go! Found based on the shell name and existing data entry. The email is subaddressed making ldap search return false negative.... 
[19:46:02] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10colewhite) [19:47:40] (03PS3) 10Bking: query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) [19:47:54] 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10Fabfur) [19:48:00] (03PS1) 10Cwhite: admin: add fab to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/948693 (https://phabricator.wikimedia.org/T343957) [19:48:13] (03CR) 10CI reject: [V: 04-1] query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [19:49:18] 10SRE, 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frdev1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T341869 (10Jgreen) [19:49:44] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3072.esams.wmnet with reason: host reimage [19:49:57] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10colewhite) a:03Mabualruz [19:50:09] 10SRE, 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission civi1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T341868 (10Jgreen) [19:51:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3080.esams.wmnet with reason: host reimage [19:53:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3072.esams.wmnet with reason: host reimage [19:53:56] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Ricki Jay (WMDE) - https://phabricator.wikimedia.org/T343700 (10colewhite) >>! 
In T343700#9091877, @KFrancis wrote: > Please provide Ricki Jay's email address and I will start processing this request. You may send it to kfrancis@wikimedia.org if you... [19:55:47] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3080.esams.wmnet with reason: host reimage [19:57:04] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3008.esams.wmnet with reason: host reimage [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230815T2000) [20:00:05] hmonroy and ryankemper: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:18] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:20] !log running dummy authdns-update [20:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:35] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, sql_labm SSH key entry, Kerberos Principal, Team Shell (posix) membership for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10colewhite) [20:01:46] o/ I'm around if needed; hmonroy ryankemper do you plan on self-deploying, or should I go ahead? 
[20:01:52] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3008.esams.wmnet with reason: host reimage [20:02:22] (03PS2) 10BCornwall: Release 0.36-2 for Bookworm [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/948672 (https://phabricator.wikimedia.org/T342154) [20:02:40] (03CR) 10CI reject: [V: 04-1] Release 0.36-2 for Bookworm [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/948672 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [20:03:47] urbanecm: can go ahead with mine! [20:04:17] Feel free to proceed :) [20:05:54] ack, rolling in a couple mins [20:07:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, sql_labm SSH key entry, Kerberos Principal, Team Shell (posix) membership for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10colewhite) Hi and welcome! Please help me confirm the ssh key out-of-band (off phabricator) by... [20:09:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ryankemper@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T343820) (owner: 10Ryan Kemper) [20:10:42] (03Merged) 10jenkins-bot: elastic: allow only 1 enwiki_content per host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T343820) (owner: 10Ryan Kemper) [20:11:10] !log ryankemper@deploy1002 Started scap: Backport for [[gerrit:833861|elastic: allow only 1 enwiki_content per host (T343820)]] [20:11:14] T343820: Retune enwiki_content shard settings - https://phabricator.wikimedia.org/T343820 [20:12:03] (03CR) 10Ssingh: [C: 03+2] esams: add new LVS secondary host, lvs3010 [puppet] - 10https://gerrit.wikimedia.org/r/949083 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [20:12:11] (03PS2) 10Ssingh: esams: add new LVS secondary host, lvs3010 [puppet] - 10https://gerrit.wikimedia.org/r/949083 
(https://phabricator.wikimedia.org/T344174) [20:12:40] (03CR) 10Ssingh: [V: 03+2] esams: add new LVS secondary host, lvs3010 [puppet] - 10https://gerrit.wikimedia.org/r/949083 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [20:12:49] !log ryankemper@deploy1002 ryankemper: Backport for [[gerrit:833861|elastic: allow only 1 enwiki_content per host (T343820)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:13:48] !log ryankemper@deploy1002 ryankemper: Continuing with sync [20:14:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs3010.esams.wmnet with OS bullseye [20:16:19] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [20:17:29] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [20:17:29] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3072.esams.wmnet with OS bullseye [20:18:16] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [20:19:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [20:19:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3080.esams.wmnet with OS bullseye [20:19:35] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [20:20:29] !log 
sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [20:20:29] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs3008.esams.wmnet with OS bullseye [20:20:36] !log ryankemper@deploy1002 Finished scap: Backport for [[gerrit:833861|elastic: allow only 1 enwiki_content per host (T343820)]] (duration: 09m 25s) [20:20:38] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:20:40] T343820: Retune enwiki_content shard settings - https://phabricator.wikimedia.org/T343820 [20:25:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:31:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3010.esams.wmnet with reason: host reimage [20:36:47] !log T342444 start cirrussearch reindex of all wikis to enable new text analysis components from mwmaint1002 [20:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:50] T342444: Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, and word_break_helper - https://phabricator.wikimedia.org/T342444 [20:37:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3010.esams.wmnet with reason: host reimage [20:37:49] 
urbanecm My apologies! I got carried away in some task. Should I reschedule? [20:38:44] urbanecm: ^^ [20:39:36] I'm afk now unfortunately, so yes please hmonroy. Or, feel free to self-deploy if you can :) [20:40:19] urbanecm: Sounds good. Thank you!! [20:50:26] (03PS1) 10Bking: spdx.rb: Skip SPDX enforcement of txt files [puppet] - 10https://gerrit.wikimedia.org/r/949112 (https://phabricator.wikimedia.org/T344291) [20:52:33] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:53] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [20:55:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:55:51] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:55:53] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [20:55:53] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs3010.esams.wmnet with OS bullseye [20:57:06] 10SRE, 10ops-knams, 10DC-Ops, 10Traffic: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10ssingh) [20:57:58] 10SRE, 10ops-knams, 10DC-Ops, 10Traffic: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10ssingh) [21:00:04] (03PS1) 10Ssingh: esams/ntp: point to dns3003 [dns] - 10https://gerrit.wikimedia.org/r/949113 (https://phabricator.wikimedia.org/T329219) [21:03:17] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:09] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:04:48] (03Abandoned) 10Ssingh: hiera: enable single backend on esams and switch to F4-U hardware config [puppet] - 10https://gerrit.wikimedia.org/r/948581 (https://phabricator.wikimedia.org/T288106) (owner: 10Ssingh) [21:05:11] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:07:08] (03CR) 10Ssingh: "Merging this Wednesday morning." 
[homer/public] - 10https://gerrit.wikimedia.org/r/949100 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [21:07:40] !log robh@cumin1001 START - Cookbook sre.dns.netbox [21:07:43] !log robh@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [21:09:12] !log robh@cumin1001 START - Cookbook sre.dns.netbox [21:13:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [21:13:31] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [21:14:04] yo [21:14:41] yo yo [21:15:35] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pdus - robh@cumin1001" [21:16:21] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pdus - robh@cumin1001" [21:16:21] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:16:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:17:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:18:30] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - 
https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [21:18:30] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [21:21:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:21:45] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:15] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:29:16] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1011 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:32:19] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1012.eqiad.wmnet with OS bullseye [21:32:56] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1013.eqiad.wmnet with OS bullseye [21:37:37] 10SRE, 10ops-knams, 10DC-Ops: Q4:knams: PDU installation - https://phabricator.wikimedia.org/T334280 (10RobH) [21:47:33] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1012.eqiad.wmnet with reason: host reimage [21:47:46] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1013.eqiad.wmnet with reason: host reimage 
[21:50:29] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1012.eqiad.wmnet with reason: host reimage [21:53:02] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1013.eqiad.wmnet with reason: host reimage [21:55:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:00:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:06:28] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Ricki Jay (WMDE) - https://phabricator.wikimedia.org/T343700 (10KFrancis) Thank you so much! I've sent out the agreement for signatures. 
[22:08:28] (03PS4) 10Bking: query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) [22:09:01] (03CR) 10CI reject: [V: 04-1] query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [22:10:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:14:45] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:22:20] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1013.eqiad.wmnet with OS bullseye [22:27:55] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1012.eqiad.wmnet with OS bullseye [22:32:15] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:35:09] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:38:35] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [22:39:45] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:49:33] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:49:47] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:50:55] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50421 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:51:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.403 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:52:49] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [22:56:45] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:56:51] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [22:57:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [22:57:49] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [22:58:34] (03PS2) 10Tim Starling: Set wikidiff2 maxSplitSize = 10 on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947986 (https://phabricator.wikimedia.org/T341754) [23:02:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [23:11:29] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns3003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [23:12:45] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:13:13] (03CR) 10HMonroy: [C: 03+2] Set wikidiff2 maxSplitSize = 10 on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947986 (https://phabricator.wikimedia.org/T341754) (owner: 10Tim Starling) [23:14:00] (03Merged) 10jenkins-bot: Set wikidiff2 maxSplitSize = 10 on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947986 (https://phabricator.wikimedia.org/T341754) (owner: 10Tim Starling) [23:14:01] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:14:01] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:25:01] RECOVERY - mailman list info on lists1001 is 
OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.021 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:26:07] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:26:09] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:26:30] !log hmonroy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Set wikidiff2 maxSplitSize = 10 on group0 wikis T341754 (duration: 07m 39s) [23:26:34] T341754: Deploy wikidiff2 paragraph split detection - https://phabricator.wikimedia.org/T341754 [23:30:25] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:50] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, sql_labm SSH key entry, Kerberos Principal, Team Shell (posix) membership for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10kzimmerman) Approved as Omari's manager, thank you! [23:57:07] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state