[00:06:19] 10ops-codfw, 10DBA: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [00:06:21] 10ops-codfw, 10DBA: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [00:06:25] 10ops-codfw, 10DBA: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [00:06:31] 10ops-codfw, 10DBA: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [00:11:19] 10ops-codfw, 10DBA: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [00:11:22] 10ops-codfw, 10DBA: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [00:11:24] 10ops-codfw, 10DBA: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [00:38:10] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/947400 [00:38:16] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/947400 (owner: 10TrainBranchBot) [00:54:37] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/947400 (owner: 10TrainBranchBot) [01:15:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:19:37] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:21:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:25:37] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:11:37] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:18:31] (03PS20) 10Andrew Bogott: Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [02:18:33] (03PS1) 10Andrew Bogott: wmcs-backup: support config-file-based volume selection [puppet] - 10https://gerrit.wikimedia.org/r/948234 (https://phabricator.wikimedia.org/T344065) [02:20:42] (03PS2) 10Andrew Bogott: wmcs-backup: support config-file-based volume selection [puppet] - 10https://gerrit.wikimedia.org/r/948234 (https://phabricator.wikimedia.org/T344065) [02:20:44] (03PS21) 10Andrew Bogott: Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [02:21:47] (03CR) 10CI reject: [V: 04-1] wmcs-backup: support config-file-based volume selection [puppet] - 10https://gerrit.wikimedia.org/r/948234 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [02:23:41] (03CR) 10CI reject: [V: 04-1] wmcs-backup: support config-file-based volume selection [puppet] - 10https://gerrit.wikimedia.org/r/948234 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [02:27:01] (03PS3) 10Andrew Bogott: wmcs-backup: support config-file-based volume selection 
[puppet] - 10https://gerrit.wikimedia.org/r/948234 (https://phabricator.wikimedia.org/T344065) [02:27:03] (03PS22) 10Andrew Bogott: Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [02:30:10] (03CR) 10CI reject: [V: 04-1] wmcs-backup: support config-file-based volume selection [puppet] - 10https://gerrit.wikimedia.org/r/948234 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [02:30:12] (03CR) 10CI reject: [V: 04-1] Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [02:31:37] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:46] (03PS4) 10Andrew Bogott: wmcs-backup: support config-file-based volume selection [puppet] - 10https://gerrit.wikimedia.org/r/948234 (https://phabricator.wikimedia.org/T344065) [02:33:48] (03PS23) 10Andrew Bogott: Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [02:38:28] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-backup: support config-file-based volume selection [puppet] - 10https://gerrit.wikimedia.org/r/948234 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [02:38:38] (03CR) 10Andrew Bogott: [C: 03+2] Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [02:45:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:47:44] (03PS1) 10Andrew Bogott: Disable cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/948235 (https://phabricator.wikimedia.org/T344065) [02:49:37] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:50:58] (03PS1) 10Andrew Bogott: backup_cinder_volumes: Install OpenStack client packages [puppet] - 10https://gerrit.wikimedia.org/r/948236 (https://phabricator.wikimedia.org/T344065) [02:51:10] (03CR) 10Andrew Bogott: [C: 03+2] Disable cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/948235 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [02:54:05] (03CR) 10Andrew Bogott: [C: 03+2] backup_cinder_volumes: Install OpenStack client packages [puppet] - 10https://gerrit.wikimedia.org/r/948236 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [03:01:46] (03PS1) 10Andrew Bogott: backup_cinder_volumes: install the right config file! [puppet] - 10https://gerrit.wikimedia.org/r/948237 (https://phabricator.wikimedia.org/T344065) [03:05:20] (03CR) 10Andrew Bogott: [C: 03+2] backup_cinder_volumes: install the right config file! 
[puppet] - 10https://gerrit.wikimedia.org/r/948237 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [03:36:43] (03PS1) 10Andrew Bogott: cinder_backups: point backup hosts in codfw to the eqiad cloudceph mons [puppet] - 10https://gerrit.wikimedia.org/r/948238 (https://phabricator.wikimedia.org/T344065) [03:38:23] (03CR) 10Andrew Bogott: [C: 03+2] cinder_backups: point backup hosts in codfw to the eqiad cloudceph mons [puppet] - 10https://gerrit.wikimedia.org/r/948238 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [03:48:03] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:51:57] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:01:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:03:31] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:05:01] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:07:16] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T344135 (10phaultfinder) [04:08:33] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:11:04] 10ops-codfw, 10DBA: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [04:11:08] 10ops-codfw, 10DBA: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [04:11:14] 10ops-codfw, 10DBA: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [04:11:20] 10ops-codfw, 10DBA: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [04:12:16] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T344135 (10phaultfinder) [04:13:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:16:03] 10ops-codfw, 10DBA: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [04:16:05] 10ops-codfw, 10DBA: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [04:17:15] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T344135 (10phaultfinder) [04:20:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:39:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:43:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:08:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:12:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:18:29] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10Marostegui) [05:23:20] (03CR) 10KartikMistry: [C: 03+1] wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930798 (owner: 10D3r1ck01) [05:28:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:32:43] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:36:01] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [05:37:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:41:33] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [06:09:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:13:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:38:30] (Traffic bill over quota) firing: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [06:43:39] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/947393 (owner: 10PipelineBot) [06:44:38] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/947393 (owner: 10PipelineBot) [06:54:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:57:37] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [06:58:30] (Traffic bill over quota) resolved: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [06:59:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:00:06] Amir1, Urbanecm, and taavi: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230814T0700). [07:00:06] Sohom_Datta: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:18] o/ [07:02:44] (03PS5) 10Stevemunene: airflow-wmde: create analytics-wmde user for airflow [puppet] - 10https://gerrit.wikimedia.org/r/947714 [07:03:19] (03CR) 10CI reject: [V: 04-1] airflow-wmde: create analytics-wmde user for airflow [puppet] - 10https://gerrit.wikimedia.org/r/947714 (owner: 10Stevemunene) [07:04:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:06:09] (03PS6) 10Stevemunene: airflow-wmde: create analytics-wmde user for airflow [puppet] - 10https://gerrit.wikimedia.org/r/947714 [07:06:18] I'm about 39,000 feet in the air at the moment, can't deploy today [07:06:57] Ah, cool, best of luck with the flight :) [07:08:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:08:49] I don't think there's anything I can do to affect it at the moment :P [07:08:51] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [07:15:02] 10ops-codfw, 10serviceops-radar: ManagementSSHDown - https://phabricator.wikimedia.org/T344110 (10Joe) p:05Triage→03High The console is still unreachable; this server is part of a cluster actively servicing traffic [07:16:32] <_joe_> Sohom_Datta: if the patch had a +1 I'd be happy to deploy it [07:21:46] I mean it's the same thing that has been deployed previously on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/934723 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/939392, just deployed across every other wikisource as well [07:22:32] But yeah I think Sam is traveling as well, so maybe I'll reschedule it for sometime tomorrow? 
[07:22:39] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, 10observability: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (10ayounsi) More details: LibreNMS (via SNMP) already collects data. There seems to be some minor bugs which will hopefully be fixed with t... [07:23:07] <_joe_> tomorrow morning I won't be around as it's a bank holiday here, but the people in flight should have landed by then :) [07:23:53] <_joe_> Sohom_Datta: I have no doubts about your good faith or the legitimacy of the change btw, I'd just like to do things by the book given I'm not supposed to be running this backport window [07:27:14] Sohom_Datta: I think we're going to be in the same place tomorrow, we can do it in-person :-) [07:27:32] Oh right yes, sure that works as well :) [07:30:03] (03CR) 10Urbanecm: Enable EditInSequence on all wikisources (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883 (owner: 10Sohom Datta) [07:30:25] taavi: safe flight! 
:) [07:30:43] * urbanecm has a morning meeting unfortunately [07:30:48] (03PS2) 10Ayounsi: Revert "mgmt: allow prometheus" [homer/public] - 10https://gerrit.wikimedia.org/r/948113 (https://phabricator.wikimedia.org/T326322) [07:31:23] (03CR) 10Urbanecm: Enable EditInSequence on all wikisources (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883 (owner: 10Sohom Datta) [07:34:00] (03PS5) 10Sohom Datta: Enable EditInSequence on all wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883 (https://phabricator.wikimedia.org/T308098) [07:36:03] (03PS7) 10Stevemunene: airflow-wmde: create analytics-wmde user for airflow [puppet] - 10https://gerrit.wikimedia.org/r/947714 (https://phabricator.wikimedia.org/T340648) [07:36:08] (03CR) 10Sohom Datta: Enable EditInSequence on all wikisources (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883 (https://phabricator.wikimedia.org/T308098) (owner: 10Sohom Datta) [07:36:37] (03CR) 10CI reject: [V: 04-1] airflow-wmde: create analytics-wmde user for airflow [puppet] - 10https://gerrit.wikimedia.org/r/947714 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [07:37:41] (03PS8) 10Stevemunene: airflow-wmde: create analytics-wmde user for airflow [puppet] - 10https://gerrit.wikimedia.org/r/947714 (https://phabricator.wikimedia.org/T340648) [07:39:44] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/947714 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [07:41:54] (03PS2) 10Ayounsi: Revert "prometheus::ops: add demo node exporter job for SONiC" [puppet] - 10https://gerrit.wikimedia.org/r/948112 (https://phabricator.wikimedia.org/T335027) [07:42:43] (03PS9) 10Stevemunene: airflow-wmde: create analytics-wmde user for airflow [puppet] - 10https://gerrit.wikimedia.org/r/947714 (https://phabricator.wikimedia.org/T340648) [07:43:11] (03PS3) 10Ayounsi: Revert "prometheus::ops: add demo node exporter job for 
SONiC" [puppet] - 10https://gerrit.wikimedia.org/r/948112 (https://phabricator.wikimedia.org/T335027) [07:43:38] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/948112 (https://phabricator.wikimedia.org/T335027) (owner: 10Ayounsi) [07:45:31] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42884/console" [puppet] - 10https://gerrit.wikimedia.org/r/947714 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [07:49:45] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "prometheus::ops: add demo node exporter job for SONiC" [puppet] - 10https://gerrit.wikimedia.org/r/948112 (https://phabricator.wikimedia.org/T335027) (owner: 10Ayounsi) [08:03:18] (03CR) 10Ayounsi: [C: 03+2] Revert "prometheus::ops: add demo node exporter job for SONiC" [puppet] - 10https://gerrit.wikimedia.org/r/948112 (https://phabricator.wikimedia.org/T335027) (owner: 10Ayounsi) [08:11:17] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [08:11:19] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [08:11:21] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [08:16:18] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [08:16:20] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [08:16:22] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [08:18:30] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/948090 (https://phabricator.wikimedia.org/T344042) (owner: 10EoghanGaffney) [08:25:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:27:38] (KubernetesAPILatency) firing: High 
Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:29:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:30:21] (03PS3) 10Ayounsi: Revert "mgmt: allow prometheus" [homer/public] - 10https://gerrit.wikimedia.org/r/948113 (https://phabricator.wikimedia.org/T335027) [08:32:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:33:01] 10SRE, 10Data-Platform-SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM request for WMDE Airflow - https://phabricator.wikimedia.org/T342424 (10Stevemunene) a:03Stevemunene [08:35:56] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322 (10fgiunchedi) >>! In T326322#9087109, @ayounsi wrote: > Next steps here: > * Decide which hosts will run gnmic, I can think of 4 option... 
[08:53:03] (03CR) 10EoghanGaffney: [C: 03+2] gitlab: Add default value for thanos swift key [puppet] - 10https://gerrit.wikimedia.org/r/948090 (https://phabricator.wikimedia.org/T344042) (owner: 10EoghanGaffney) [08:54:57] 10SRE, 10Data-Platform-SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM request for WMDE Airflow - https://phabricator.wikimedia.org/T342424 (10Stevemunene) Verifying the cluster availability and resources via ` stevemunene@cumin1001:~$ sudo cookbook -d sre.ganeti.resource-report eqiad DRY-RU... [09:00:14] (03CR) 10JMeybohm: [C: 03+2] CI: Bail out if admin_ng build fails completely [deployment-charts] - 10https://gerrit.wikimedia.org/r/947865 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [09:01:46] (03PS1) 10Stevemunene: airflow-wmde: Add wmde airflow instance to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/948534 (https://phabricator.wikimedia.org/T340648) [09:02:57] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1004.eqiad.wmnet [09:04:36] (ProbeDown) firing: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:07:40] (03Merged) 10jenkins-bot: CI: Bail out if admin_ng build fails completely [deployment-charts] - 10https://gerrit.wikimedia.org/r/947865 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [09:08:15] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [09:09:36] (ProbeDown) resolved: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:10:37] !log cmooney@cumin1001 START - 
Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename mr1-esams dns to mr1-eams-old. - cmooney@cumin1001" [09:11:46] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename mr1-esams dns to mr1-eams-old. - cmooney@cumin1001" [09:11:46] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:13:01] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/948158 (owner: 10Jbond) [09:13:44] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1004.eqiad.wmnet [09:16:36] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1005.eqiad.wmnet [09:19:40] (03PS1) 10Ayounsi: Allow gNMI from netflow hosts and to Juniper devices [homer/public] - 10https://gerrit.wikimedia.org/r/948535 (https://phabricator.wikimedia.org/T326322) [09:23:33] PROBLEM - Check systemd state on an-worker1124 is CRITICAL: CRITICAL - degraded: The following units failed: user-runtime-dir@116.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:23:36] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10JMeybohm) [09:23:41] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:26:59] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1124.eqiad.wmnet [09:27:37] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1005.eqiad.wmnet [09:28:18] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1006.eqiad.wmnet [09:32:03] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [09:32:04] !log 
cmooney@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [09:32:14] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [09:34:01] (03PS1) 10Ayounsi: Add password support for users [homer/public] - 10https://gerrit.wikimedia.org/r/948538 (https://phabricator.wikimedia.org/T326322) [09:34:28] (03PS1) 10Jelto: miscweb: add wikiworkshop and reasearch-landing-page to staging wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/948539 (https://phabricator.wikimedia.org/T334511) [09:37:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:37:14] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1006.eqiad.wmnet [09:39:57] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1007.eqiad.wmnet [09:41:02] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:41:33] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [09:42:36] (03PS2) 10Jelto: miscweb: add wikiworkshop and reasearch-landing-page to staging wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/948539 (https://phabricator.wikimedia.org/T334511) [09:47:01] PROBLEM - Host an-worker1124 is DOWN: PING CRITICAL - Packet loss = 100% [09:47:43] (03CR) 10Btullis: [C: 03+1] "Looks OK. 
Technically, we don't need to use the regex pattern on these hosts, but some of the others use it, so it's fine not to diverge f" [puppet] - 10https://gerrit.wikimedia.org/r/948534 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [09:48:00] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1007.eqiad.wmnet [09:48:05] (03PS1) 10Ayounsi: Enable gNMI on all devices [homer/public] - 10https://gerrit.wikimedia.org/r/948540 (https://phabricator.wikimedia.org/T326322) [09:49:27] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1008.eqiad.wmnet [09:55:07] RECOVERY - Host an-worker1124 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [09:55:09] PROBLEM - Check systemd state on an-worker1124 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:40] (03PS3) 10Effie Mouzeli: Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947801 (https://phabricator.wikimedia.org/T300033) [09:56:41] RECOVERY - Check systemd state on an-worker1124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:15] (03CR) 10CI reject: [V: 04-1] Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947801 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [09:59:03] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1124.eqiad.wmnet [10:00:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:06] Deploy window MediaWiki infrastucture (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230814T1000) [10:02:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, with the public/fake credentials in place and PCC passing I'll merge" [puppet] - 10https://gerrit.wikimedia.org/r/935523 (owner: 10Krinkle) [10:04:01] (03CR) 10Filippo Giunchedi: [C: 03+2] Disable access to grafana-labs/cloud.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/947933 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [10:04:35] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:59] (03PS11) 10Effie Mouzeli: kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/531699 (https://phabricator.wikimedia.org/T231006) (owner: 10Mathew.onipe) [10:05:37] (03CR) 10Filippo Giunchedi: dns::dotls: expose and gather haproxy internal metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [10:05:47] (03CR) 10CI reject: [V: 04-1] kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/531699 (https://phabricator.wikimedia.org/T231006) (owner: 10Mathew.onipe) [10:07:21] (03CR) 10Klausman: [C: 03+1] changeprop: allow retries for liftwing streams with 502 responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/948136 (owner: 10Elukey) [10:08:05] (03CR) 10Filippo Giunchedi: "This isn't needed FYI, prometheus hosts in production can already access all ports. 
The issue is the firewall between production and cloud" [puppet] - 10https://gerrit.wikimedia.org/r/948084 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [10:08:24] (03CR) 10Btullis: "The code looks good, but could you please update the commit message to describe the new purpose of this change and explain why it should " [puppet] - 10https://gerrit.wikimedia.org/r/947714 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [10:09:45] (03CR) 10Stevemunene: [C: 03+2] airflow-wmde: Add wmde airflow instance to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/948534 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [10:12:25] (03PS1) 10Ilias Sarantopoulos: ores-extension: replace thresholdswith values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) [10:12:34] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1009.eqiad.wmnet [10:13:05] (03CR) 10CI reject: [V: 04-1] ores-extension: replace thresholdswith values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [10:13:11] !log stevemunene@cumin1001 START - Cookbook sre.ganeti.makevm for new host an-airflow1007.eqiad.wmnet [10:13:13] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [10:15:05] It looks like maps2009 is unreachable for 2 days, also this is a master node which means replication is not working. Is this a known issue ? 
[10:20:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1009.eqiad.wmnet [10:21:46] It looks like there is a ticket for that: https://phabricator.wikimedia.org/T344110 [10:24:47] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename mr1-esams-new to mr1-esams in dns. - cmooney@cumin1001" [10:24:53] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:55] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1008.eqiad.wmnet [10:25:33] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename mr1-esams-new to mr1-esams in dns. - cmooney@cumin1001" [10:25:33] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:26:43] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:46] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:27:29] 10ops-codfw, 10serviceops-radar: ManagementSSHDown - https://phabricator.wikimedia.org/T344110 (10Jgiannelos) I randomly encountered this while manually checking maps2009. Just a heads up this node is a master node which means that if services are down, OSM data syncing/invalidation and postgres replication is... 
[10:29:59] 10ops-codfw, 10serviceops-radar, 10Maps (Maps-data): ManagementSSHDown - https://phabricator.wikimedia.org/T344110 (10MSantos) [10:30:20] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [10:30:41] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1006 is CRITICAL: 12 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [10:31:27] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1007 is CRITICAL: 14 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [10:32:50] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [10:33:21] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename mr1-esams-new to mr1-esams in dns. - cmooney@cumin1001" [10:34:09] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename mr1-esams-new to mr1-esams in dns. 
- cmooney@cumin1001" [10:34:09] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:34:50] (03PS2) 10Ilias Sarantopoulos: ores-extension: replace thresholdswith values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) [10:35:23] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1006 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [10:35:57] (03CR) 10CI reject: [V: 04-1] ores-extension: replace thresholdswith values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [10:36:09] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [10:38:22] !log stevemunene@cumin1001 START - Cookbook sre.dns.wipe-cache an-airflow1007.eqiad.wmnet on all recursors [10:38:26] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-airflow1007.eqiad.wmnet on all recursors [10:38:29] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename mr1-esams-new to mr1-esams in dns. - cmooney@cumin1001" [10:38:51] !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM an-airflow1007.eqiad.wmnet - stevemunene@cumin1001" [10:39:12] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename mr1-esams-new to mr1-esams in dns. 
- cmooney@cumin1001" [10:39:12] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:39:35] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM an-airflow1007.eqiad.wmnet - stevemunene@cumin1001" [10:39:59] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-airflow1007.eqiad.wmnet with OS buster [10:40:12] (03PS3) 10Ilias Sarantopoulos: ores-extension: replace thresholdswith values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) [10:40:33] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:40:37] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:11] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:51:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:51:36] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-airflow1007.eqiad.wmnet with reason: host reimage [10:53:00] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [10:54:42] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-airflow1007.eqiad.wmnet 
with reason: host reimage [10:56:02] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/948159 (owner: 10Jbond) [10:56:36] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:02:32] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:38] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1007 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [11:09:55] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-airflow1007.eqiad.wmnet with OS buster [11:09:55] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host an-airflow1007.eqiad.wmnet [11:13:54] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [11:16:37] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mr1-esams oob - ayounsi@cumin1001" [11:17:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mr1-esams oob - ayounsi@cumin1001" [11:19:35] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:21:01] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [11:21:18] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:44] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1006 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [11:23:02] (03CR) 10Stevemunene: [V: 03+1] airflow-wmde: create analytics-wmde user for airflow (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947714 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [11:23:26] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mr1-esams oob - ayounsi@cumin1001" [11:24:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mr1-esams oob - ayounsi@cumin1001" [11:24:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:30:26] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1006 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [11:35:47] (03PS1) 10JMeybohm: CI: Fix dependencies to refresh_fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/948550 [11:37:48] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [11:38:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:38] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10Platform Team Initiatives (Containerise Services): Migrate node-based 
services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Aklapper) Looking at the remaining items in this task description, * Is `3d2png` superseded by T267327 (per... [11:39:50] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/948159 (owner: 10Jbond) [11:41:10] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/948159 (owner: 10Jbond) [11:42:32] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:19] (03CR) 10JMeybohm: "LGTM, minor nit about health checking" [deployment-charts] - 10https://gerrit.wikimedia.org/r/948539 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto) [11:43:26] (03CR) 10JMeybohm: [C: 03+1] miscweb: add wikiworkshop and reasearch-landing-page to staging wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/948539 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto) [11:44:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:10] (03CR) 10JMeybohm: [C: 03+2] CI: Fix dependencies to refresh_fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/948550 (owner: 10JMeybohm) [11:45:20] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [11:47:22] (03PS4) 10Effie Mouzeli: Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947801 (https://phabricator.wikimedia.org/T300033) [11:48:25] (03PS1) 10Effie Mouzeli: (WIP) kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/948551 
(https://phabricator.wikimedia.org/T231006) [11:50:36] (ProbeDown) firing: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:50:49] (03CR) 10CI reject: [V: 04-1] Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947801 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [11:50:56] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:51:18] PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:51:36] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:52:43] hmmm I don't get a response from gerrit either [11:52:49] (03CR) 10CI reject: [V: 04-1] (WIP) kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006) (owner: 10Effie Mouzeli) [11:54:13] same [11:56:28] (03CR) 10CI reject: [V: 04-1] CI: Fix dependencies to refresh_fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/948550 (owner: 10JMeybohm) [11:56:38] (JobUnavailable) firing: (4) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:56:41] it's back but took a bit to show up [11:56:48] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is 
OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 2.847 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:57:08] RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Sun 08 Oct 2023 09:52:13 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:57:26] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 62843 bytes in 0.044 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:57:30] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:45] (03CR) 10JMeybohm: [C: 03+2] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/948550 (owner: 10JMeybohm) [11:58:22] huh, didn't touch gerrit, but was poking around for a second. Gerrit was responsive. Apache seemed hung up. Doesn't look like anyone restarted either of them. [11:59:03] beats me but it was definitely unreachable there for a few. [11:59:10] yep [12:00:36] (ProbeDown) resolved: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:01:27] objections to removing the old gerrit health checks in favor of the prometheus-based probes ? 
[12:01:38] (JobUnavailable) resolved: (4) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:02:15] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (10karapayneWMDE) hello, apologies for the delay (was on holiday) public key is: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIN2rcD7HPK... [12:03:37] godog: I'm happy with only the prometheus one. [12:05:21] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Move config-master to dedicated VMs - https://phabricator.wikimedia.org/T341717 (10jbond) [12:05:30] jelto: ok! thank you, I'll send a quick patch [12:05:45] thanks! [12:06:38] (03Abandoned) 10Effie Mouzeli: kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/531699 (https://phabricator.wikimedia.org/T231006) (owner: 10Mathew.onipe) [12:06:51] (03Merged) 10jenkins-bot: CI: Fix dependencies to refresh_fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/948550 (owner: 10JMeybohm) [12:07:46] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/947801 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [12:08:12] (03PS1) 10Filippo Giunchedi: icinga: remove obsolete gerrit checks [puppet] - 10https://gerrit.wikimedia.org/r/948552 [12:08:18] ah a whole puppet class is gone, even more satisfying [12:09:49] (03PS2) 10Effie Mouzeli: (WIP) kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006) [12:10:29] (03PS10) 10Stevemunene: airflow-wmde: create analytics-wmde users class for wmde services [puppet] - 
10https://gerrit.wikimedia.org/r/947714 (https://phabricator.wikimedia.org/T340648) [12:10:54] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [12:11:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:05] (03PS1) 10Ayounsi: WIP: add gNMI (+cert) check for network devices [puppet] - 10https://gerrit.wikimedia.org/r/948553 (https://phabricator.wikimedia.org/T326322) [12:15:37] (03CR) 10CI reject: [V: 04-1] WIP: add gNMI (+cert) check for network devices [puppet] - 10https://gerrit.wikimedia.org/r/948553 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [12:16:02] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [12:16:04] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [12:16:08] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [12:16:10] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [12:17:06] (03PS2) 10Ayounsi: WIP: add gNMI (+cert) check for network devices [puppet] - 10https://gerrit.wikimedia.org/r/948553 (https://phabricator.wikimedia.org/T326322) [12:18:34] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:41] (03PS3) 10Ayounsi: WIP: add gNMI (+cert) check for network devices [puppet] - 10https://gerrit.wikimedia.org/r/948553 (https://phabricator.wikimedia.org/T326322) [12:21:05] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [12:21:10] 10ops-codfw: 
PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [12:21:12] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [12:23:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:23:34] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10fgiunchedi) >>! In T344101#9088827, @Marostegui wrote: > Did we just lose a pdu? quite possible, all the hosts updated in this task are in B6 cc @Papaul @Jhancock.wm [12:27:34] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:20] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1006 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [12:35:27] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (10Stevemunene) >>! In T342546#9089720, @karapayneWMDE wrote: > hello, apologies for the delay (was on holiday) > > public key... 
[12:35:37] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (10Stevemunene) [12:41:47] (03CR) 10Cathal Mooney: [C: 03+2] Add / update network definitions to include new esams ranges [puppet] - 10https://gerrit.wikimedia.org/r/948216 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [12:42:39] (03PS1) 10Jelto: gerrit: add blackbox check for json endpoint [puppet] - 10https://gerrit.wikimedia.org/r/948555 [12:42:54] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1006 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [12:43:13] (03CR) 10CI reject: [V: 04-1] gerrit: add blackbox check for json endpoint [puppet] - 10https://gerrit.wikimedia.org/r/948555 (owner: 10Jelto) [12:44:05] (03PS2) 10Jelto: gerrit: add blackbox check for json endpoint [puppet] - 10https://gerrit.wikimedia.org/r/948555 [12:45:56] (03CR) 10David Caro: [C: 03+2] labweb: use a valid host for the probes [puppet] - 10https://gerrit.wikimedia.org/r/948102 (owner: 10David Caro) [12:47:48] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1007 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [12:50:13] (03CR) 10Jelto: [C: 03+1] "lgtm, I added one additional blackbox check for the json endpoint in Ibc71e33b4384fef4cbc23bf0a4f13e0399a08afc. 
I'm not sure about the his" [puppet] - 10https://gerrit.wikimedia.org/r/948552 (owner: 10Filippo Giunchedi) [12:53:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:52] 10SRE, 10Performance-Team, 10Wiki-Loves-Monuments: Provide nocookie domain for thumb.php (generic or for Commons) - https://phabricator.wikimedia.org/T344153 (10Nux) [12:56:05] (03PS1) 10Ayounsi: network/data.yaml add esams prefixes [puppet] - 10https://gerrit.wikimedia.org/r/948559 (https://phabricator.wikimedia.org/T343214) [12:56:33] (03PS1) 10Jbond: service::catalog: Add config-master to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/948560 (https://phabricator.wikimedia.org/T341717) [12:57:16] 10SRE, 10Performance-Team, 10Wiki-Loves-Monuments: Provide nocookie domain for thumb.php (generic or for Commons) - https://phabricator.wikimedia.org/T344153 (10Nux) [12:57:50] (03CR) 10Filippo Giunchedi: icinga: remove obsolete gerrit checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948552 (owner: 10Filippo Giunchedi) [12:57:56] (03PS3) 10Jelto: miscweb: add wikiworkshop and reasearch-landing-page to staging wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/948539 (https://phabricator.wikimedia.org/T334511) [12:58:38] (03CR) 10Jelto: miscweb: add wikiworkshop and reasearch-landing-page to staging wikikube (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/948539 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto) [12:59:24] 10SRE, 10Performance-Team, 10Wiki-Loves-Monuments: Provide nocookie domain for thumb.php (generic or for Commons) - https://phabricator.wikimedia.org/T344153 (10Nux) [13:00:04] (03CR) 10Cathal Mooney: [C: 03+1] "Not sure how I missed those thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/948559 (https://phabricator.wikimedia.org/T343214) (owner: 10Ayounsi) [13:00:05] RoanKattouw, 
Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230814T1300). [13:00:05] xSavitar: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] o/ [13:00:36] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:01:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:01:38] I can deploy if no one is around in this window [13:03:23] (03PS9) 10D3r1ck01: wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930798 [13:04:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by derick@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930798 (owner: 10D3r1ck01) [13:05:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:21] (03Merged) 10jenkins-bot: wmf-config: 
Remove wgContentTranslationDefaultParsoidClient cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930798 (owner: 10D3r1ck01) [13:05:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:05:51] (03CR) 10Ayounsi: [C: 03+2] network/data.yaml add esams prefixes [puppet] - 10https://gerrit.wikimedia.org/r/948559 (https://phabricator.wikimedia.org/T343214) (owner: 10Ayounsi) [13:06:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:06:13] (03PS2) 10Jbond: service::catalog: Add config-master to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/948560 (https://phabricator.wikimedia.org/T341717) [13:06:29] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10Jhancock.wm) @fgiunchedi the first breaker on ps2-b6-codfw tripped. I've restarted that breaker and will observe the rack for a few days to see if it happens again. as of right now, it looks like all the servers are back on redundant... 
[13:06:54] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10Jhancock.wm) a:03Jhancock.wm [13:07:27] (03PS1) 10Jbond: config-master: add new discovery record for config-master [dns] - 10https://gerrit.wikimedia.org/r/948562 (https://phabricator.wikimedia.org/T341717) [13:08:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:19] !log derick@deploy1002 Backport cancelled. [13:08:23] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344100 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm part of T344101. monitoring from there [13:08:56] * xSavitar sees unexpected commits [13:09:04] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344099 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm related to T344101. monitoring from there. [13:10:07] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344098 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm related to T344101. monitoring from there. [13:10:46] !log derick@deploy1002 Started scap: Backport for [[gerrit:930798|wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup]] [13:11:01] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344097 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm related to T344101. monitoring from there. [13:11:11] xSavitar: all okay? [13:11:35] TheresNoTime, yes all okay! [13:12:03] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344096 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm related to T344101. monitoring from there. [13:12:16] Just saw some commit pile which was unexpected but guess I shouldn't worry about those? 
:) [13:12:27] Though I see the config change I'm deploying :) [13:12:32] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:12:53] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344095 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm primary failure at T344101. [13:13:48] (03CR) 10Herron: [C: 03+2] thanos-fe: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/946559 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron) [13:14:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:59] (03PS1) 10Andrew Bogott: wmcs-backup: be a little better about tracking image_id vs ceph_id [puppet] - 10https://gerrit.wikimedia.org/r/948565 (https://phabricator.wikimedia.org/T344065) [13:15:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:58] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:18:28] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:25] !log derick@deploy1002 d3r1ck01 and derick: Backport for [[gerrit:930798|wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:20:26] !log derick@deploy1002 d3r1ck01 and derick: Continuing 
with sync [13:21:44] (03PS2) 10Andrew Bogott: wmcs-backup: be a little better about tracking image_id vs ceph_id [puppet] - 10https://gerrit.wikimedia.org/r/948565 (https://phabricator.wikimedia.org/T344065) [13:22:52] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/947401 [13:24:36] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-backup: be a little better about tracking image_id vs ceph_id [puppet] - 10https://gerrit.wikimedia.org/r/948565 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [13:26:08] 10ops-codfw, 10serviceops-radar, 10Maps (Maps-data): ManagementSSHDown - https://phabricator.wikimedia.org/T344110 (10Jhancock.wm) @Joe @Jgiannelos maps2010 doesn't look to be up at all. the network link and mgmt link are down and the front panel indicates that the server is off. anyway you can confirm? [13:26:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:26:38] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:27:42] !log derick@deploy1002 Finished scap: Backport for [[gerrit:930798|wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup]] (duration: 16m 56s) [13:27:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:30:35] done with 
deployment! [13:30:38] PROBLEM - Check systemd state on kubernetes2009 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:08] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:31:38] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:31:46] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2009 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:32:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:33:06] (03PS12) 10David Caro: WIP: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [13:33:52] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:35:48] (03CR) 10CI reject: [V: 04-1] WIP: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [13:37:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:40:13] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2012.codfw.wmnet with OS bullseye [13:40:28] PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:40:48] PROBLEM - Maps HTTPS on maps1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:40:56] PROBLEM - Maps HTTPS on maps1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:40:56] PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:41:12] PROBLEM - Maps HTTPS on maps1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:41:16] PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:41:36] PROBLEM - Maps HTTPS on maps1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:41:36] PROBLEM - Maps HTTPS on maps1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:41:36] PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:42:04] PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:42:04] PROBLEM - Maps HTTPS on maps1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:43:50] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): volatile: We need to configure the volatile endpoint on puppetserveres - https://phabricator.wikimedia.org/T341056 (10jbond) 05Open→03In progress [13:44:05] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [13:44:37] (03PS1) 10David Caro: toolforge: add deployer module with the secrets [puppet] - 10https://gerrit.wikimedia.org/r/948566 (https://phabricator.wikimedia.org/T334585) [13:44:56] (03PS1) 10Jbond: puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) [13:45:37] (03CR) 10CI reject: [V: 04-1] puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [13:45:42] (03PS2) 10David Caro: toolforge: add deployer module with the secrets [puppet] - 10https://gerrit.wikimedia.org/r/948566 (https://phabricator.wikimedia.org/T334585) [13:45:46] (03PS2) 10Jbond: puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) [13:46:25] (03CR) 10CI reject: [V: 04-1] puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [13:46:30] (03PS1) 10Stevemunene: Grant Kara Payne shell 
access [puppet] - 10https://gerrit.wikimedia.org/r/948568 (https://phabricator.wikimedia.org/T342546) [13:47:18] (03PS3) 10David Caro: toolforge: add deployer module with the secrets [puppet] - 10https://gerrit.wikimedia.org/r/948566 (https://phabricator.wikimedia.org/T334585) [13:48:00] RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 8.848 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:48:04] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE, 10Patch-For-Review: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (10Stevemunene) [13:49:16] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 98692048 and 81 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:49:41] (03PS3) 10Jbond: puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) [13:50:09] (03CR) 10CI reject: [V: 04-1] toolforge: add deployer module with the secrets [puppet] - 10https://gerrit.wikimedia.org/r/948566 (https://phabricator.wikimedia.org/T334585) (owner: 10David Caro) [13:50:11] 10SRE, 10Data-Platform-SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM request for WMDE Airflow - https://phabricator.wikimedia.org/T342424 (10Stevemunene) created the vm with `sudo cookbook sre.ganeti.makevm --vcpus 4 --memory 8 --disk 100 --network analytics --os buster --cluster eqiad --g... 
[13:50:15] (03CR) 10CI reject: [V: 04-1] puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [13:51:17] (03PS4) 10Jbond: puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) [13:51:50] 10SRE, 10Data-Platform-SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM request for WMDE Airflow - https://phabricator.wikimedia.org/T342424 (10Stevemunene) 05Open→03Resolved [13:52:16] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:52:32] PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:53:06] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1016.eqiad.wmnet with OS bullseye [13:54:12] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=drmrs%20prometheus/ops&var-cluster=upload&var-origin=kartotherian.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHi [13:54:39] checking [13:54:45] <_joe_> godog: it's maps [13:55:05] _joe_: hah, do you know if expected/known ? [13:55:15] !incidents [13:55:15] You're not allowed to perform this action. 
[13:55:18] (03CR) 10Eevans: [C: 03+2] restbase: Upgrade restbase2013 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/944959 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [13:55:18] <_joe_> godog: definitely not expected [13:55:19] :( [13:55:21] <_joe_> !incidents [13:55:21] godog: I found out earlier that something is up, looking into it [13:55:21] 3946 (UNACKED) ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet drmrs) [13:55:21] 3945 (RESOLVED) [14x] ProbeDown sre (probes/service esams) [13:55:22] 3944 (RESOLVED) [4x] ProbeDown sre (probes/service esams) [13:55:34] <_joe_> !ack 3946 [13:55:34] 3946 (ACKED) ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet drmrs) [13:55:40] <_joe_> effie: I think this is something else btw [13:55:42] thank you [13:55:58] PROBLEM - MariaDB Replica Lag: pc3 on pc2013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 692.66 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:56:14] <_joe_> so what I know up to now, then I have a meeting coming up, is that the latency for kartotherian exploded in both datacenters [13:56:18] something else is indeed going [13:56:20] PROBLEM - MariaDB Replica Lag: pc1 on pc2011 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 740.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:56:24] ok I will lookinto it [13:56:26] <_joe_> https://grafana.wikimedia.org/d/000000030/service-kartotherian?orgId=1&refresh=5m&from=now-1h&to=now [13:56:47] ok, also looking into it [13:58:34] I am onsite where maps2009 is. server appears to be completely offline. no output on physical console and no lights on ports/front panel. 
[13:59:13] (ATSBackendErrorsHigh) firing: (5) ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:59:29] JennH: [13:59:56] thank you so much, please let us know who things progres, we are not sure if the current errors are related to this server [14:01:09] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2012.codfw.wmnet with reason: host reimage [14:01:21] effie: thoughts and/or leads to find the smoking gun ? [14:01:25] PROBLEM - MariaDB Replica Lag: pc2 on pc2012 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 575.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:01:33] godog: stoll looking [14:01:39] ack [14:02:46] (03PS2) 10Stevemunene: Dummy db for new wmde airflow [labs/private] - 10https://gerrit.wikimedia.org/r/940936 (https://phabricator.wikimedia.org/T340648) [14:03:00] (03CR) 10Stevemunene: [C: 03+2] Dummy db for new wmde airflow [labs/private] - 10https://gerrit.wikimedia.org/r/940936 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [14:03:08] (03CR) 10Stevemunene: [V: 03+2 C: 03+2] Dummy db for new wmde airflow [labs/private] - 10https://gerrit.wikimedia.org/r/940936 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [14:03:28] (03CR) 10Stevemunene: [C: 03+2] Add dummy keytabs for new an-airflow1007 [labs/private] - 10https://gerrit.wikimedia.org/r/940937 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [14:03:30] (03CR) 10Stevemunene: [V: 03+2 C: 03+2] Add dummy keytabs for new an-airflow1007 [labs/private] - 10https://gerrit.wikimedia.org/r/940937 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [14:03:38] effie: looks like kartho can't talk to tegola-vector-tiles.svc ? 
[14:03:44] (03PS2) 10Stevemunene: Add dummy keytabs for new an-airflow1007 [labs/private] - 10https://gerrit.wikimedia.org/r/940937 (https://phabricator.wikimedia.org/T340648) [14:03:49] or rather, tegola is 503'ing [14:03:50] (03CR) 10Stevemunene: [V: 03+2] Add dummy keytabs for new an-airflow1007 [labs/private] - 10https://gerrit.wikimedia.org/r/940937 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [14:04:01] godog: we are working on it [14:04:10] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2012.codfw.wmnet with reason: host reimage [14:04:30] who's "we" ? :) [14:04:59] !log upgrading Cassandra to 4.1.1, restbase2013-{a,b,c} — T339298 [14:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:04] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298 [14:06:01] (03CR) 10Ssingh: [C: 03+1] config-master: add new discovery record for config-master [dns] - 10https://gerrit.wikimedia.org/r/948562 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [14:06:53] effie: I don't want to step on anyone's toes or muddy the waters, though where is the investigation happening? [14:07:09] (03CR) 10Ssingh: [V: 03+1 C: 03+2] bird: create dummy anycast-prefix files [puppet] - 10https://gerrit.wikimedia.org/r/948158 (owner: 10Jbond) [14:07:27] in my terminal :p [14:08:10] ok, I'll stop pestering you with questions, though please do let me know how I can help [14:09:45] (03CR) 10Ssingh: "Thanks for the patch. Let's hold off on merging this until after the knams migration this week, just to be safe, if that's fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [14:09:47] RECOVERY - MariaDB Replica Lag: pc2 on pc2012 is OK: OK slave_sql_lag Replication lag: 34.83 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:11:38] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:47] RECOVERY - MariaDB Replica Lag: pc3 on pc2013 is OK: OK slave_sql_lag Replication lag: 15.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:13:32] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum1002.eqiad.wmnet with OS bookworm [14:16:01] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1016.eqiad.wmnet with reason: host reimage [14:16:26] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/948571 [14:16:38] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:43] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:16:57] ^ expected [14:17:09] (03CR) 10David Caro: [V: 03+1] dns::dotls: expose and gather haproxy internal metrics (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [14:17:17] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:17:37] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage - 
https://phabricator.wikimedia.org/T342247 (10bking) a:05bking→03None [14:18:25] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage - https://phabricator.wikimedia.org/T342247 (10bking) @BTullis I unassigned this from myself as I'm not actively working on it. I'm guessing it can probably be closed, but leaving that up to you. [14:19:07] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1016.eqiad.wmnet with reason: host reimage [14:21:30] (03CR) 10Stevemunene: [C: 03+2] airflow-wmde: Add a postgresql database and user for airflow wmde [puppet] - 10https://gerrit.wikimedia.org/r/940961 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [14:22:59] (03PS4) 10David Caro: toolforge: add deployer module with the secrets [puppet] - 10https://gerrit.wikimedia.org/r/948566 (https://phabricator.wikimedia.org/T334585) [14:24:48] (03CR) 10Ssingh: [C: 03+1] service::catalog: Add config-master to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/948560 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [14:25:40] (03CR) 10Fabfur: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/948195 (https://phabricator.wikimedia.org/T344073) (owner: 10Ssingh) [14:25:44] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@ee544cb]: (no justification provided) [14:25:54] !log jgiannelos@deploy1002 deploy aborted: (no justification provided) (duration: 00m 10s) [14:26:01] (03CR) 10Ssingh: [C: 03+2] learn.wiki: update A records [dns] - 10https://gerrit.wikimedia.org/r/948195 (https://phabricator.wikimedia.org/T344073) (owner: 10Ssingh) [14:26:04] (03PS2) 10Ssingh: learn.wiki: update A records [dns] - 10https://gerrit.wikimedia.org/r/948195 (https://phabricator.wikimedia.org/T344073) [14:26:13] (03CR) 10Ssingh: [V: 03+2] learn.wiki: update A records [dns] - 10https://gerrit.wikimedia.org/r/948195 (https://phabricator.wikimedia.org/T344073) (owner: 10Ssingh) [14:26:30] !log running authdns-update 
for CR 948195 [14:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:37] !log running authdns-update for CR 948195: T344073 [14:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:40] T344073: additional DNS changes for WikiLearn - https://phabricator.wikimedia.org/T344073 [14:26:57] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum1002.eqiad.wmnet with reason: host reimage [14:27:37] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@ee544cb]: (no justification provided) [14:27:38] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@ee544cb]: (no justification provided) (duration: 00m 01s) [14:28:18] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:34] RECOVERY - MariaDB Replica Lag: pc1 on pc2011 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:30:04] (03Abandoned) 10Ssingh: hiera: temporarily remove v4 IP for ns2 from authdns_addrs [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh) [14:30:33] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@ee544cb] (eqiad): (no justification provided) [14:30:34] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@ee544cb] (eqiad): (no justification provided) (duration: 00m 00s) [14:30:47] (03PS1) 10Eevans: restbase: Upgrade restbase1016 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948574 (https://phabricator.wikimedia.org/T339298) [14:30:49] (03PS1) 10Eevans: restbase: Upgrade restbase1019 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948575 (https://phabricator.wikimedia.org/T339298) [14:30:51] (03PS1) 10Eevans: restbase: Upgrade restbase1020 to Cassandra 4.1.1 [puppet] - 
10https://gerrit.wikimedia.org/r/948576 (https://phabricator.wikimedia.org/T339298) [14:30:53] (03PS1) 10Eevans: restbase: Upgrade restbase1021 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948577 (https://phabricator.wikimedia.org/T339298) [14:30:55] (03PS1) 10Eevans: restbase: Upgrade restbase1028 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948578 (https://phabricator.wikimedia.org/T339298) [14:30:57] (03PS1) 10Eevans: restbase: Upgrade restbase1031 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948579 (https://phabricator.wikimedia.org/T339298) [14:30:59] 10SRE, 10DNS, 10Traffic: additional DNS changes for WikiLearn - https://phabricator.wikimedia.org/T344073 (10ssingh) 05Open→03Resolved a:03ssingh The requested records have been updated. Thanks! [14:31:30] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum1002.eqiad.wmnet with reason: host reimage [14:33:10] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@ee544cb] (eqiad): (no justification provided) [14:33:14] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@ee544cb] (eqiad): (no justification provided) (duration: 00m 03s) [14:33:53] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/948574 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [14:34:33] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@ee544cb] (eqiad): (no justification provided) [14:34:33] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@ee544cb] (eqiad): (no justification provided) (duration: 00m 00s) [14:35:00] <_joe_> effie, godog any news? 
[14:35:08] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:35:11] <_joe_> I'm just out of the meeting [14:35:18] _joe_: -serviceops [14:40:54] 10SRE, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Urbanecm) [14:41:13] 10SRE, 10Stewards-and-global-tools, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Urbanecm) [14:42:17] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1016.eqiad.wmnet with OS bullseye [14:44:05] 10SRE, 10Stewards-and-global-tools, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Urbanecm) [14:45:17] 10SRE, 10Stewards-and-global-tools, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Urbanecm) [14:45:30] (03PS1) 10Herron: Revert "thanos-fe: switch to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/948124 [14:45:35] (03CR) 10Jbond: [C: 03+1] Revert "thanos-fe: switch to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/948124 (owner: 10Herron) [14:46:09] (03CR) 10Herron: [C: 03+2] Revert "thanos-fe: switch to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/948124 (owner: 10Herron) [14:46:21] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE, 10Patch-For-Review: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (10jbond) [14:47:04] RECOVERY - BFD status on cr2-eqiad is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:47:11] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum1002.eqiad.wmnet with OS bookworm [14:47:42] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:48:33] (03CR) 10Jbond: [C: 03+1] "change lgtm but it doesn't grant any access, you will also need to add the user to analytics-wmde-users" [puppet] - 10https://gerrit.wikimedia.org/r/948568 (https://phabricator.wikimedia.org/T342546) (owner: 10Stevemunene) [14:50:42] RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 9.286 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:51:00] RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.423 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:51:02] RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 4.957 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:51:10] RECOVERY - Maps HTTPS on maps1010 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 3.330 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:51:12] RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 1.187 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:51:22] RECOVERY - Maps HTTPS on maps1006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 1.381 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:51:22] RECOVERY - Maps HTTPS on maps1005 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 7.835 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:51:31] (03CR) 10Stevemunene: Grant Kara Payne shell access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948568 (https://phabricator.wikimedia.org/T342546) (owner: 10Stevemunene) [14:51:34] RECOVERY - Maps HTTPS on maps1007 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.470 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:52:04] RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 0.368 second response time 
https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:52:08] RECOVERY - Maps HTTPS on maps1008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.164 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:53:33] <_joe_> jayme: see? [14:53:36] RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 1.195 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:53:59] I'll make a note to ditch these spammy alerts [14:54:11] <_joe_> I mean, they were useful [14:54:26] sure, they can be both [14:54:29] (03PS1) 10Herron: thanos-fe: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/948125 (https://phabricator.wikimedia.org/T343987) [14:54:38] _joe_: yes,yes I wasn't trying to argue. I just thought I missed something [14:54:54] <_joe_> godog: yeah I was just saying we should aggregate not ditch them :) [14:55:28] ^^ is patch to retry later after addressing the tegola certs [14:55:38] 10SRE, 10serviceops-radar, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) [14:55:56] yeah the probe failure is indeed the aggregated version [14:56:18] <_joe_> herron: do you mind initiating the incident report? 
[14:56:36] <_joe_> maps was down for 90 minutes [14:56:49] _joe_: sure [14:57:03] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2012.codfw.wmnet with OS bullseye [14:59:13] (ATSBackendErrorsHigh) resolved: (5) ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:59:22] (03PS5) 10Jbond: puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) [14:59:55] (03PS1) 10Ssingh: hiera: enable single backend on esams (post knams migration) [puppet] - 10https://gerrit.wikimedia.org/r/948581 (https://phabricator.wikimedia.org/T288106) [15:00:26] (03PS1) 10Marostegui: Revert "es2025: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/948586 [15:01:24] (03CR) 10Marostegui: [C: 03+2] Revert "es2025: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/948586 (owner: 10Marostegui) [15:01:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 1%: Repooling after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P50572 and previous config saved to /var/cache/conftool/dbconfig/20230814-150154-root.json [15:02:14] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Marostegui) I am repooling this host [15:10:15] (03PS2) 10Ssingh: hiera: enable single backend on esams and switch to F4-U hardware config [puppet] - 10https://gerrit.wikimedia.org/r/948581 (https://phabricator.wikimedia.org/T288106) [15:10:30] (03CR) 10Fabfur: [C: 03+1] "Compared with the ulsfo values, these looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/948581 (https://phabricator.wikimedia.org/T288106) (owner: 10Ssingh) [15:16:26] (03PS6) 10Jbond: puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) [15:16:38] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1093.eqiad.wmnet with OS bullseye [15:16:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 3%: Repooling after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P50574 and previous config saved to /var/cache/conftool/dbconfig/20230814-151659-root.json [15:16:59] (03CR) 10CI reject: [V: 04-1] puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [15:17:28] (03PS7) 10Jbond: puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) [15:24:49] (03PS1) 10Jbond: P:puppetserver: add support for extra_mounts [puppet] - 10https://gerrit.wikimedia.org/r/948607 (https://phabricator.wikimedia.org/T341056) [15:24:51] (03PS1) 10Jbond: puppetserver: add volatile file mount [puppet] - 10https://gerrit.wikimedia.org/r/948608 (https://phabricator.wikimedia.org/T341056) [15:25:29] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/948607 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [15:25:37] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/948608 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [15:26:48] (03PS2) 10Jbond: airflow-wmde: Add Kara Payne to analytics-wmde [puppet] - 10https://gerrit.wikimedia.org/r/940863 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [15:27:35] (03PS3) 10Jbond: airflow-wmde: Add Kara Payne to analytics-wmde 
[puppet] - 10https://gerrit.wikimedia.org/r/940863 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [15:27:41] (03CR) 10CI reject: [V: 04-1] airflow-wmde: Add Kara Payne to analytics-wmde [puppet] - 10https://gerrit.wikimedia.org/r/940863 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [15:28:09] (03PS2) 10Eevans: restbase: Upgrade restbase1016 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948574 (https://phabricator.wikimedia.org/T339298) [15:28:11] (03PS2) 10Eevans: restbase: Upgrade restbase1019 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948575 (https://phabricator.wikimedia.org/T339298) [15:28:13] (03PS2) 10Eevans: restbase: Upgrade restbase1020 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948576 (https://phabricator.wikimedia.org/T339298) [15:28:15] (03PS2) 10Eevans: restbase: Upgrade restbase1021 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948577 (https://phabricator.wikimedia.org/T339298) [15:28:17] (03PS2) 10Eevans: restbase: Upgrade restbase1028 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948578 (https://phabricator.wikimedia.org/T339298) [15:28:19] (03PS2) 10Eevans: restbase: Upgrade restbase1031 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948579 (https://phabricator.wikimedia.org/T339298) [15:28:21] (03CR) 10Jbond: [C: 03+1] Grant Kara Payne shell access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948568 (https://phabricator.wikimedia.org/T342546) (owner: 10Stevemunene) [15:28:28] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/940863 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [15:28:54] (03PS1) 10Jgiannelos: Force image rebuild [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948609 [15:29:17] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1093.eqiad.wmnet with reason: host reimage 
[15:29:29] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 [15:29:35] T343124: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 [15:29:36] (03CR) 10Jgiannelos: [C: 03+2] Force image rebuild [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948609 (owner: 10Jgiannelos) [15:29:39] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/948574 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [15:30:05] jan_drewniak: Your horoscope predicts another unfortunate Wikimedia Portals Update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230814T1530). [15:30:13] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 43s) [15:30:17] (03CR) 10CI reject: [V: 04-1] Force image rebuild [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948609 (owner: 10Jgiannelos) [15:30:32] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 5%: Repooling after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P50575 and previous config saved to /var/cache/conftool/dbconfig/20230814-153203-root.json [15:32:18] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1093.eqiad.wmnet with reason: host reimage [15:32:51] (03CR) 10Eevans: [C: 03+2] restbase: Upgrade restbase1016 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948574 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [15:36:24] !log upgrading Cassandra to 4.1.1, restbase1016-{a,b,c} — T339298 
[15:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:27] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298 [15:38:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:07] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:38:19] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [15:40:07] (03PS2) 10Bking: wdqs.data-transfer: Keep downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T340793) [15:41:45] (03PS1) 10Sharvaniharan: Add reading list schema config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948611 [15:42:17] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:56] (03CR) 10CI reject: [V: 04-1] wdqs.data-transfer: Keep downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T340793) (owner: 10Bking) [15:43:04] (03CR) 10Bking: wdqs.data-transfer: Keep downtime (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T340793) (owner: 10Bking) [15:43:46] (03PS6) 10Sohom Datta: Enable EditInSequence on all wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883 (https://phabricator.wikimedia.org/T308098) [15:45:24] (03PS1) 10Jgiannelos: Dependencies maintenance [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948613 [15:45:44] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts 
T343124 [15:45:53] T343124: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 [15:45:59] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 15s) [15:46:11] (03CR) 10Filippo Giunchedi: "I'm not well versed in how this is different than probing gerrit.w.o like profile::gerrit::proxy does, if it isn't then IMHO we should kee" [puppet] - 10https://gerrit.wikimedia.org/r/948555 (owner: 10Jelto) [15:47:03] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [15:47:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 10%: Repooling after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P50576 and previous config saved to /var/cache/conftool/dbconfig/20230814-154708-root.json [15:47:15] (03PS2) 10Jgiannelos: Dependencies maintenance [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948613 [15:48:18] (03PS2) 10David Caro: openstack: use the haproxy internal stat for alerts [alerts] - 10https://gerrit.wikimedia.org/r/948098 (https://phabricator.wikimedia.org/T343885) [15:49:56] (03CR) 10CI reject: [V: 04-1] openstack: use the haproxy internal stat for alerts [alerts] - 10https://gerrit.wikimedia.org/r/948098 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [15:50:40] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:53:30] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename cr3-knams to cr2-esams - cmooney@cumin1001" [15:55:37] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1093.eqiad.wmnet with OS bullseye [15:58:44] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1094.eqiad.wmnet with OS bullseye [15:59:03] 10SRE, 10Data-Platform-SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 3 VMs 
requested for Zookeeper - https://phabricator.wikimedia.org/T343715 (10bking) 05Open→03Resolved [16:00:59] !log sukhe@cumin2002 START - Cookbook sre.dns.wipe-cache cr2-esams.wikimedia.org on all recursors [16:01:02] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cr2-esams.wikimedia.org on all recursors [16:02:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 25%: Repooling after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P50577 and previous config saved to /var/cache/conftool/dbconfig/20230814-160213-root.json [16:04:11] (03PS1) 10Bking: flink-zk: add cluster info for codfw [puppet] - 10https://gerrit.wikimedia.org/r/948615 (https://phabricator.wikimedia.org/T341792) [16:06:29] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/948125 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron) [16:11:23] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1094.eqiad.wmnet with reason: host reimage [16:13:37] (03PS2) 10Sharvaniharan: Config changes for new Android schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942707 [16:13:47] (03Abandoned) 10Sharvaniharan: Add reading list schema config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948611 (owner: 10Sharvaniharan) [16:14:07] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1094.eqiad.wmnet with reason: host reimage [16:15:22] (03PS1) 10Filippo Giunchedi: pipeline: switch CI to bookworm [alerts] - 10https://gerrit.wikimedia.org/r/948618 [16:15:52] (03CR) 10Filippo Giunchedi: [C: 03+1] thanos-fe: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/948125 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron) [16:16:54] 10SRE-OnFire, 10Discovery-Search (Current work), 10Sustainability: WDQS: Document procedure for switching between Kubernetes and Yarn Streaming Updater - 
https://phabricator.wikimedia.org/T337801 (10bking) 05Open→03Resolved Looks good, thanks for writing this down. [16:17:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 50%: Repooling after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P50578 and previous config saved to /var/cache/conftool/dbconfig/20230814-161718-root.json [16:18:27] (03CR) 10Filippo Giunchedi: [C: 03+2] pipeline: switch CI to bookworm [alerts] - 10https://gerrit.wikimedia.org/r/948618 (owner: 10Filippo Giunchedi) [16:18:59] (03PS3) 10David Caro: openstack: use the haproxy internal stat for alerts [alerts] - 10https://gerrit.wikimedia.org/r/948098 (https://phabricator.wikimedia.org/T343885) [16:21:03] (03CR) 10Andrew Bogott: [C: 03+1] openstack: use the haproxy internal stat for alerts [alerts] - 10https://gerrit.wikimedia.org/r/948098 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [16:22:16] (03CR) 10David Caro: [C: 03+2] openstack: use the haproxy internal stat for alerts [alerts] - 10https://gerrit.wikimedia.org/r/948098 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [16:28:19] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename cr3-knams to cr2-esams - cmooney@cumin1001" [16:28:19] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:28:21] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:28:41] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:32:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 
75%: Repooling after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P50579 and previous config saved to /var/cache/conftool/dbconfig/20230814-163222-root.json [16:34:50] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/948615 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [16:35:26] (03PS1) 10BCornwall: trafficserver: Use svc urls for eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/948624 (https://phabricator.wikimedia.org/T326657) [16:37:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1094.eqiad.wmnet with OS bullseye [16:40:57] (03PS1) 10Urbanecm: NewcomerTasksLogFactory: Use getName(), not getDbKey() [extensions/GrowthExperiments] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948588 (https://phabricator.wikimedia.org/T344163) [16:42:32] (03PS1) 10Cathal Mooney: Remove include for reverse /24 range formerly used for esams tunnels [dns] - 10https://gerrit.wikimedia.org/r/948626 (https://phabricator.wikimedia.org/T329219) [16:42:36] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:43:24] (03CR) 10Urbanecm: GrowthExperiments: enable add a link in 12th round of wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) (owner: 10Sergio Gimeno) [16:43:32] (03CR) 10CI reject: [V: 04-1] Remove include for reverse /24 range formerly used for esams tunnels [dns] - 10https://gerrit.wikimedia.org/r/948626 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney) [16:43:36] (03CR) 10Urbanecm: [C: 03+1] GrowthExperiments: enable add a link in 12th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) (owner: 10Sergio Gimeno) [16:43:48] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883 
(https://phabricator.wikimedia.org/T308098) (owner: 10Sohom Datta) [16:47:15] jouncebot: nowandnext [16:47:15] No deployments scheduled for the next 0 hour(s) and 12 minute(s) [16:47:15] In 0 hour(s) and 12 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230814T1700) [16:47:15] In 0 hour(s) and 12 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230814T1700) [16:47:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 100%: Repooling after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P50580 and previous config saved to /var/cache/conftool/dbconfig/20230814-164727-root.json [16:47:33] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42891/console" [puppet] - 10https://gerrit.wikimedia.org/r/948624 (https://phabricator.wikimedia.org/T326657) (owner: 10BCornwall) [16:47:38] okay, 12 minutes is too short. [16:48:38] 10ops-knams, 10DC-Ops, 10Traffic: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10RobH) [16:50:24] (03PS2) 10Cathal Mooney: Remove include for reverse /24 range formerly used for esams tunnels [dns] - 10https://gerrit.wikimedia.org/r/948626 (https://phabricator.wikimedia.org/T329219) [16:50:35] 10ops-knams, 10DC-Ops, 10Traffic: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10RobH) [16:52:11] (03CR) 10David Caro: [C: 03+1] "Tested in toolsbeta" [puppet] - 10https://gerrit.wikimedia.org/r/948566 (https://phabricator.wikimedia.org/T334585) (owner: 10David Caro) [16:52:23] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2006 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [16:52:30] oh [16:52:31] hmm [16:52:59] that you? [16:53:16] (03CR) 10Herron: [C: 03+1] trafficserver: Use svc urls for eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/948624 (https://phabricator.wikimedia.org/T326657) (owner: 10BCornwall) [16:53:19] well, this check is my doing yes but I didn't cause it here :] [16:53:23] so looking shortly on what changed [16:53:32] ntp_peers probably but not sure why not [16:54:50] s/why not/why now [16:55:22] (03CR) 10Ssingh: [C: 03+1] Remove include for reverse /24 range formerly used for esams tunnels [dns] - 10https://gerrit.wikimedia.org/r/948626 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney) [16:55:42] (03CR) 10Cathal Mooney: [C: 03+2] Remove include for reverse /24 range formerly used for esams tunnels [dns] - 10https://gerrit.wikimedia.org/r/948626 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney) [16:56:26] 10ops-knams, 10DC-Ops, 10Traffic: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10ssingh) [16:57:08] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:58:26] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:59:26] ah [16:59:27] https://puppetboard.wikimedia.org/report/dns1004.wikimedia.org/9ac4e629a27091dc56c4388fad15d8fa861f289c [16:59:30] +restrict 185.15.59.0 mask 255.255.255.0 notrap nomodify noquery nopeer [16:59:33] restrict 185.15.58.0 mask 255.255.255.0 notrap nomodify noquery nopeer [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230814T1700) [17:00:05] ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230814T1700). 
[17:00:27] so we can skip restarting ntp for now but we should and will restart it before we bring up the new dns hosts in esams/knams [17:00:58] going to downtime these for a few hours [17:01:50] (03PS5) 10David Caro: toolforge: add deployer module with the secrets [puppet] - 10https://gerrit.wikimedia.org/r/948566 (https://phabricator.wikimedia.org/T334585) [17:02:43] 10ops-codfw, 10serviceops-radar, 10Maps (Maps-data): ManagementSSHDown - https://phabricator.wikimedia.org/T344110 (10Jhancock.wm) after attempting to boot the server I believe it is a bad motherboard. server will not power up even with minimum configuration. PSUs/PDU are working/have green lights. No networ... [17:02:58] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1095.eqiad.wmnet with OS bullseye [17:07:04] (03PS6) 10David Caro: toolforge: add deployer module with the secrets [puppet] - 10https://gerrit.wikimedia.org/r/948566 (https://phabricator.wikimedia.org/T334585) [17:11:28] (03PS1) 10Cathal Mooney: Additon of new devices in new esams racks [homer/public] - 10https://gerrit.wikimedia.org/r/948630 (https://phabricator.wikimedia.org/T329219) [17:13:42] (03CR) 10Cathal Mooney: [C: 03+2] Additon of new devices in new esams racks [homer/public] - 10https://gerrit.wikimedia.org/r/948630 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney) [17:14:16] (03Merged) 10jenkins-bot: Additon of new devices in new esams racks [homer/public] - 10https://gerrit.wikimedia.org/r/948630 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney) [17:15:39] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1095.eqiad.wmnet with reason: host reimage [17:18:11] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1095.eqiad.wmnet with reason: host reimage [17:18:57] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable AddLink backend 13th round of wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/948631 (https://phabricator.wikimedia.org/T308138) [17:28:22] (03PS1) 10Cathal Mooney: Rename mr devices at Amsterdam POP sites [homer/public] - 10https://gerrit.wikimedia.org/r/948635 (https://phabricator.wikimedia.org/T329219) [17:30:29] (03CR) 10Ayounsi: [C: 03+1] Rename mr devices at Amsterdam POP sites [homer/public] - 10https://gerrit.wikimedia.org/r/948635 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney) [17:37:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:39:02] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [17:39:40] (03PS2) 10Sergio Gimeno: GrowthExperiments: enable add a link in 12th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) [17:39:49] (03CR) 10Sergio Gimeno: GrowthExperiments: enable add a link in 12th round of wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) (owner: 10Sergio Gimeno) [17:41:14] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [17:41:57] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1095.eqiad.wmnet with OS bullseye [17:43:57] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Volunteer NDA for RhinosF1 - https://phabricator.wikimedia.org/T341272 (10KFrancis) Hello all, the NDA has been signed. Please proceed to next steps for access. 
[17:45:31] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10User-RhinosF1: Volunteer NDA for RhinosF1 - https://phabricator.wikimedia.org/T341272 (10RhinosF1) 05Open→03Resolved a:03RhinosF1 Per @LSobanski's comments, closing and moving to a next part for security access [17:57:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:29] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:14:07] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:17:05] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:23:25] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [18:26:12] well that's weird eh [18:26:14] dns3001 and 3002? 
[18:26:25] how did those make a comeback [18:27:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10Jclark-ctr) [18:31:57] (03CR) 10JHathaway: [C: 03+1] flink-zk: add cluster info for codfw [puppet] - 10https://gerrit.wikimedia.org/r/948615 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [18:37:05] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merge cp3081 and cp3079 - sukhe@cumin2002" [18:38:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merge cp3081 and cp3079 - sukhe@cumin2002" [18:38:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:43:03] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3081.mgmt.esams.wmnet with reboot policy FORCED [18:43:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:15] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp3081.mgmt.esams.wmnet with reboot policy FORCED [18:45:15] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3081.mgmt.esams.wmnet with reboot policy FORCED [18:45:26] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp3081.mgmt.esams.wmnet with reboot policy FORCED [18:56:29] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:28] (03CR) 10Bking: [C: 03+2] flink-zk: add cluster info for codfw [puppet] - 10https://gerrit.wikimedia.org/r/948615 (https://phabricator.wikimedia.org/T341792) 
(owner: 10Bking) [19:01:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:05:31] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:07:22] (03PS1) 10Ssingh: network: update data.yaml: swap esams/knams [puppet] - 10https://gerrit.wikimedia.org/r/948646 (https://phabricator.wikimedia.org/T329219) [19:08:20] (03CR) 10Eevans: [C: 03+2] restbase: Upgrade restbase2014 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/944960 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [19:08:39] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42892/console" [puppet] - 10https://gerrit.wikimedia.org/r/948646 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [19:08:56] (03PS1) 10Jbond: puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 [19:11:11] !log upgrading Cassandra to 4.1.1, restbase2014-{a,b,c} — T339298 [19:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:18] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298 [19:11:26] (03CR) 10Ssingh: [V: 03+1 C: 03+2] network: update data.yaml: swap esams/knams [puppet] - 10https://gerrit.wikimedia.org/r/948646 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [19:11:39] (03CR) 10CI reject: [V: 04-1] puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (owner: 10Jbond) [19:16:35] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3081.mgmt.esams.wmnet with reboot policy FORCED [19:21:06] (03CR) 10Eevans: [C: 03+2] restbase: Upgrade restbase2019 to Cassandra 4.1.1 
[puppet] - 10https://gerrit.wikimedia.org/r/944961 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [19:21:11] (03CR) 10Urbanecm: [C: 04-1] [tests] Ensure each config has at most one value per wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947365 (owner: 10Urbanecm) [19:23:12] (03PS1) 10Ssingh: site.pp: add new cp notes in esams [puppet] - 10https://gerrit.wikimedia.org/r/948649 (https://phabricator.wikimedia.org/T344174) [19:24:23] !log upgrading Cassandra to 4.1.1, restbase2019-{a,b,c} — T339298 [19:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:27] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298 [19:25:59] RECOVERY - Check systemd state on kubernetes2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:29:10] (03CR) 10JHathaway: [C: 04-1] "just a minor newline issue" [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [19:30:03] (03CR) 10JHathaway: [C: 03+1] puppetserver: add volatile file mount (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/948608 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [19:31:30] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3079.mgmt.esams.wmnet with reboot policy FORCED [19:31:33] (03CR) 10Eevans: [C: 03+2] restbase: Upgrade restbase2021 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/944962 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [19:32:59] (03CR) 10JHathaway: [C: 03+1] P:puppetserver: add support for extra_mounts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948607 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [19:34:30] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3081.mgmt.esams.wmnet with reboot policy FORCED 
[19:34:35] !log upgrading Cassandra to 4.1.1, restbase2021-{a,b,c} — T339298 [19:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:40] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298 [19:35:15] (03CR) 10Ssingh: [C: 03+2] site.pp: add new cp notes in esams [puppet] - 10https://gerrit.wikimedia.org/r/948649 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [19:36:15] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2009 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:37:14] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3081'] [19:37:48] (03PS1) 10Ssingh: cp3081: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948651 (https://phabricator.wikimedia.org/T327438) [19:38:21] !log robh@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp3081'] [19:38:38] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3081'] [19:40:05] (03PS2) 10Ssingh: cp3081: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948651 (https://phabricator.wikimedia.org/T327438) [19:41:23] PROBLEM - Check systemd state on dse-k8s-worker1007 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:05] (03CR) 10Eevans: [C: 03+2] restbase: Upgrade restbase2024 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/944963 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [19:43:04] (03PS2) 10Jbond: puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 [19:43:07] !log upgrading Cassandra to 4.1.1, restbase2024-{a,b,c} — T339298 [19:43:09] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:10] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298 [19:44:15] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:44:53] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3081'] [19:45:07] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3079.mgmt.esams.wmnet with reboot policy FORCED [19:45:23] (03PS1) 10Ssingh: cp3079: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948652 (https://phabricator.wikimedia.org/T327438) [19:45:35] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3079'] [19:45:44] (03CR) 10CI reject: [V: 04-1] puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (owner: 10Jbond) [19:47:22] 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10RobH) [19:48:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:49:13] (03PS1) 10Eevans: restbase: upgrade to Cassandra 4.1.1, codfw/row C (5 hosts) [puppet] - 10https://gerrit.wikimedia.org/r/948654 (https://phabricator.wikimedia.org/T339298) [19:50:11] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:56] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3079'] [19:53:46] (03CR) 10Eevans: [C: 03+2] 
restbase: upgrade to Cassandra 4.1.1, codfw/row C (5 hosts) [puppet] - 10https://gerrit.wikimedia.org/r/948654 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [19:54:39] (03CR) 10Fabfur: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/948651 (https://phabricator.wikimedia.org/T327438) (owner: 10Ssingh) [19:54:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:54:47] 10SRE, 10ops-knams, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10RobH) [19:55:17] (03CR) 10Fabfur: [C: 03+1] "OK" [puppet] - 10https://gerrit.wikimedia.org/r/948652 (https://phabricator.wikimedia.org/T327438) (owner: 10Ssingh) [19:56:15] (03CR) 10Ssingh: [C: 03+2] cp3081: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948651 (https://phabricator.wikimedia.org/T327438) (owner: 10Ssingh) [19:57:02] !log upgrading Cassandra to 4.1.1, restbase20[15,16,20,22,25]-{a,b,c} (codfw/row C) — T339298 [19:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:13] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298 [19:57:53] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3081.esams.wmnet with OS bullseye [19:58:05] jouncebot: nowandnext [19:58:05] No deployments scheduled for the next 0 hour(s) and 1 minute(s) [19:58:05] In 0 hour(s) and 1 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230814T2000) [19:58:12] (03CR) 10Urbanecm: [C: 03+2] NewcomerTasksLogFactory: Use getName(), not getDbKey() [extensions/GrowthExperiments] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948588 (https://phabricator.wikimedia.org/T344163) (owner: 10Urbanecm) 
[20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: Dear deployers, time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230814T2000). [20:00:05] sharvani_ and Urbanecm: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] i can deploy today [20:00:15] sharvani_: hi, are you around? [20:00:25] probably best I don't deploy from a field [20:00:39] field? [20:00:52] (an actual field?) [20:00:59] https://events.ccc.de/camp/2023/infos/index.html :P [20:01:14] enjoy! [20:01:17] PROBLEM - Check whether ferm is active by checking the default input chain on dse-k8s-worker1007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:02:07] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:03:15] sharvani_: hi, are you around for your deployment? 
[20:06:37] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:25] (03PS1) 10Eevans: restbase: upgrade to Cassandra 4.1.1, codfw/row D (6 hosts) [puppet] - 10https://gerrit.wikimedia.org/r/948657 (https://phabricator.wikimedia.org/T339298) [20:08:05] RECOVERY - Check systemd state on dse-k8s-worker1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:37] sharvani_: last ping ^^ [20:13:30] (03CR) 10Eevans: [C: 03+2] restbase: upgrade to Cassandra 4.1.1, codfw/row D (6 hosts) [puppet] - 10https://gerrit.wikimedia.org/r/948657 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [20:13:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948588 (https://phabricator.wikimedia.org/T344163) (owner: 10Urbanecm) [20:16:25] (03Merged) 10jenkins-bot: NewcomerTasksLogFactory: Use getName(), not getDbKey() [extensions/GrowthExperiments] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948588 (https://phabricator.wikimedia.org/T344163) (owner: 10Urbanecm) [20:16:41] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:948588|NewcomerTasksLogFactory: Use getName(), not getDbKey() (T344163)]] [20:16:45] T344163: AddLink feature ignore 'maxTasksPerDay' rule - https://phabricator.wikimedia.org/T344163 [20:17:10] !log upgrading Cassandra to 4.1.1, restbase20[12,17-18,23,26-27]-{a,b,c} (codfw/row D) — T339298 [20:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:13] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298 [20:18:08] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:948588|NewcomerTasksLogFactory: 
Use getName(), not getDbKey() (T344163)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:18:48] !log urbanecm@deploy1002 urbanecm: Continuing with sync [20:19:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:22:20] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T344135 (10phaultfinder) [20:24:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:25:49] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:948588|NewcomerTasksLogFactory: Use getName(), not getDbKey() (T344163)]] (duration: 09m 08s) [20:25:59] T344163: AddLink feature ignore 'maxTasksPerDay' rule - https://phabricator.wikimedia.org/T344163 [20:26:42] Hi Marten... sorry I was around but took my eyes out for a bit... are you still around for deployment .. [20:26:55] hi sharvani_. yup, just finished with my patch. 
[20:27:04] (03PS3) 10Urbanecm: Config changes for new Android schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942707 (owner: 10Sharvaniharan) [20:27:07] (03CR) 10Urbanecm: [C: 03+2] Config changes for new Android schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942707 (owner: 10Sharvaniharan) [20:27:12] Thank you so much [20:27:38] (03CR) 10Eevans: [C: 03+2] restbase: Upgrade restbase1019 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948575 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [20:27:49] (03Merged) 10jenkins-bot: Config changes for new Android schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942707 (owner: 10Sharvaniharan) [20:28:04] (03CR) 10Eevans: [C: 03+2] restbase: Upgrade restbase1020 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948576 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [20:28:20] (03CR) 10Eevans: [C: 03+2] restbase: Upgrade restbase1021 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948577 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [20:28:46] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:942707|Config changes for new Android schema]] [20:28:57] (03CR) 10Eevans: [C: 03+2] restbase: Upgrade restbase1028 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948578 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [20:29:10] (03CR) 10Eevans: [C: 03+2] restbase: Upgrade restbase1031 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/948579 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [20:30:12] !log urbanecm@deploy1002 urbanecm and sharvaniharan: Backport for [[gerrit:942707|Config changes for new Android schema]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:31:47] RECOVERY - Check 
whether ferm is active by checking the default input chain on dse-k8s-worker1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:33:46] !log upgrading Cassandra to 4.1.1, restbase10[19-21,28,31]-{a,b,c} (eqiad/row A) — T339298 [20:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:49] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298 [20:34:14] sharvani_: is it possible to test the patch on the debug servers? [20:34:17] or should i proceed? [20:34:32] yes.. just did on 1002 and it is working! [20:34:39] please process [20:34:44] proceed* [20:35:49] awesome, proceeding [20:35:49] !log urbanecm@deploy1002 urbanecm and sharvaniharan: Continuing with sync [20:41:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:41:59] 10SRE, 10ops-knams, 10DC-Ops: Relocate one of the mx480 from esams to knams - https://phabricator.wikimedia.org/T342198 (10Papaul) 05Open→03Resolved Complete [20:42:09] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10Papaul) [20:42:22] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:942707|Config changes for new Android schema]] (duration: 13m 36s) [20:42:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:42:55] sharvani_: should be live [20:42:57] anything else? [20:43:09] perfect! thank you, that's it for me. [20:43:21] Sorry for being late to the deployment. 
[20:44:21] :) [20:45:29] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:45:54] (03PS1) 10Eevans: restbase: upgrade to Cassandra 4.1.1, eqiad/row B (6 hosts) [puppet] - 10https://gerrit.wikimedia.org/r/948661 (https://phabricator.wikimedia.org/T339298) [20:45:57] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10Papaul) [20:47:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:47:38] (03CR) 10Fabfur: [C: 03+2] cp3079: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948652 (https://phabricator.wikimedia.org/T327438) (owner: 10Ssingh) [20:53:00] (03PS1) 10Ssingh: hiera: common: add knams to cache-text/upload [puppet] - 10https://gerrit.wikimedia.org/r/948662 [20:54:40] (03CR) 10Ssingh: [C: 03+2] hiera: common: add knams to cache-text/upload [puppet] - 10https://gerrit.wikimedia.org/r/948662 (owner: 10Ssingh) [20:54:43] (03CR) 10Fabfur: [C: 03+1] "seems ok to me" [puppet] - 10https://gerrit.wikimedia.org/r/948662 (owner: 10Ssingh) [21:00:04] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230814T2100). 
[21:02:18] (03CR) 10Urbanecm: [C: 03+1] GrowthExperiments: enable add a link in 12th round of wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) (owner: 10Sergio Gimeno) [21:02:32] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948631 (https://phabricator.wikimedia.org/T308138) (owner: 10Sergio Gimeno) [21:06:42] !log robh@cumin1001 START - Cookbook sre.dns.netbox [21:08:57] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3081.esams.wmnet with OS bullseye [21:09:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3081.esams.wmnet with OS bullseye [21:10:18] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new hosts in by27 - robh@cumin1001" [21:11:05] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new hosts in by27 - robh@cumin1001" [21:11:05] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:11:43] (03PS2) 10Ssingh: cp3079: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/948652 (https://phabricator.wikimedia.org/T327438) [21:11:58] preparing to do a security deploy [21:18:03] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3077.mgmt.esams.wmnet with reboot policy FORCED [21:19:16] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3075.mgmt.esams.wmnet with reboot policy FORCED [21:19:48] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3073.mgmt.esams.wmnet with reboot policy FORCED [21:21:11] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3071.mgmt.esams.wmnet with reboot policy FORCED [21:21:33] !log robh@cumin1001 START - Cookbook sre.hosts.provision for 
host cp3069.mgmt.esams.wmnet with reboot policy FORCED [21:21:56] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp3067.mgmt.esams.wmnet with reboot policy FORCED [21:22:32] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host dns3003.mgmt.esams.wmnet with reboot policy FORCED [21:23:21] (03CR) 10Eevans: [C: 03+2] restbase: upgrade to Cassandra 4.1.1, eqiad/row B (6 hosts) [puppet] - 10https://gerrit.wikimedia.org/r/948661 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [21:27:43] !log upgrading Cassandra to 4.1.1, restbase10[17,22-24,29,32]-{a,b,c} (eqiad/row B) — T339298 [21:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:47] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298 [21:33:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:35:23] (03PS1) 10Ahmon Dancy: Use gitlab-settings v1.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/948665 [21:35:38] !log security deploy for T341529 [21:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:58] (03PS2) 10Ahmon Dancy: Class gitlab: Use gitlab-settings v1.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/948665 [21:37:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:37:24] 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10ATsay-WMF) [21:37:28] !log robh@cumin1001 END (PASS) - 
Cookbook sre.hosts.provision (exit_code=0) for host cp3077.mgmt.esams.wmnet with reboot policy FORCED [21:37:31] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3075.mgmt.esams.wmnet with reboot policy FORCED [21:38:00] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3073.mgmt.esams.wmnet with reboot policy FORCED [21:38:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:39:32] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3071.mgmt.esams.wmnet with reboot policy FORCED [21:39:45] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3069.mgmt.esams.wmnet with reboot policy FORCED [21:40:28] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns3003.mgmt.esams.wmnet with reboot policy FORCED [21:40:41] (03PS1) 10Eevans: restbase: upgrade to Cassandra 4.1.1, eqiad/row D (6 hosts) [puppet] - 10https://gerrit.wikimedia.org/r/948666 (https://phabricator.wikimedia.org/T339298) [21:42:15] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp3067.mgmt.esams.wmnet with reboot policy FORCED [21:42:23] (03CR) 10Eevans: [C: 03+2] restbase: upgrade to Cassandra 4.1.1, eqiad/row D (6 hosts) [puppet] - 10https://gerrit.wikimedia.org/r/948666 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [21:43:39] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3005.mgmt.esams.wmnet with reboot policy FORCED [21:43:42] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3007.mgmt.esams.wmnet with reboot policy FORCED [21:44:59] 10SRE, 10SRE-Access-Requests: Requesting access to deployment 
for fkaelin - https://phabricator.wikimedia.org/T343957 (10colewhite) [21:46:04] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3077'] [21:46:23] !log upgrading Cassandra to 4.1.1, restbase10[18,25-27,30,33]-{a,b,c} (eqiad/row D) — T339298 [21:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:26] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298 [21:46:30] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3075'] [21:46:46] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3073'] [21:47:05] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3071'] [21:47:40] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3069'] [21:47:58] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3067'] [21:48:15] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns3003'] [21:50:50] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10colewhite) - L3 not yet signed. - LDAP user does not exist. Fabian should try logging in to [[ https://wikitech.wikimedia.org/ | wikitech ]] and verify the email address, if required.... 
[21:53:08] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3071'] [21:53:37] (03PS6) 10BCornwall: pybal: Make check conform to the Nagios plugin API [puppet] - 10https://gerrit.wikimedia.org/r/933398 (https://phabricator.wikimedia.org/T322377) (owner: 10Jbond) [21:53:52] (03CR) 10BCornwall: [C: 03+1] pybal: Make check conform to the Nagios plugin API (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933398 (https://phabricator.wikimedia.org/T322377) (owner: 10Jbond) [21:54:18] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns3003'] [21:54:39] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3069'] [21:54:44] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3005.mgmt.esams.wmnet with reboot policy FORCED [21:54:51] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3077'] [21:54:53] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3073'] [21:54:59] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3067'] [21:55:03] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti3005'] [21:55:24] !log robh@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ganeti3005'] [21:56:28] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3081.esams.wmnet with OS bullseye [21:56:36] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti3005'] [21:59:22] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade 
firmware for hosts ['cp3075'] [22:01:52] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3007.mgmt.esams.wmnet with reboot policy FORCED [22:02:12] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti3005'] [22:05:34] 10SRE, 10Wikimedia-Mailing-lists: Shut down two en-arbcom mailing lists (audit, appeals-en) - https://phabricator.wikimedia.org/T344112 (10Barkeep49) Thanks. The two groups are arbcom-audit (or perhaps arbcom-en-audit, I have never been subscribed so I don't know if it happened before or after the en appellati... [22:09:30] !log robh@cumin1001 START - Cookbook sre.dns.netbox [22:09:40] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (10colewhite) [22:11:05] 10SRE, 10Wikimedia-Mailing-lists: Shut down two en-arbcom mailing lists (audit, appeals-en) - https://phabricator.wikimedia.org/T344112 (10Ladsgroup) https://lists.wikimedia.org/postorius/lists/arbcom-audit.lists.wikimedia.org/ doesn't exist nor https://lists.wikimedia.org/postorius/lists/arbcom-en-audit.lists... 
[22:16:43] (03PS1) 10Ssingh: Revert "hiera: common: add knams to cache-text/upload" [puppet] - 10https://gerrit.wikimedia.org/r/948589 [22:18:23] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs3009 - robh@cumin1001" [22:19:08] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs3009 - robh@cumin1001" [22:19:08] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:19:54] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:20:10] (03PS1) 10Ssingh: realm: update subnets for knams migration [puppet] - 10https://gerrit.wikimedia.org/r/948670 (https://phabricator.wikimedia.org/T329219) [22:20:30] PROBLEM - cassandra-a CQL 10.64.48.234:9042 on restbase1030 is CRITICAL: connect to address 10.64.48.234 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [22:20:32] PROBLEM - cassandra-a SSL 10.64.48.234:7001 on restbase1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:20:38] PROBLEM - cassandra-a service on restbase1030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:21:24] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:21:46] (03CR) 10Ssingh: [C: 03+2] Revert "hiera: common: add knams to cache-text/upload" [puppet] - 10https://gerrit.wikimedia.org/r/948589 (owner: 10Ssingh) [22:21:49] I've got that ^^^ [22:22:04] thanks urandom [22:22:08] er, weird 
[22:22:08] RECOVERY - cassandra-a service on restbase1030 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:22:10] I got that [22:22:18] oh restbase haha [22:22:19] sorry [22:22:26] I mean, I did it, so it's only fair :) [22:22:27] I thought you meant puppet merge :) [22:22:30] true, you get this one [22:23:24] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host lvs3009.mgmt.esams.wmnet with reboot policy FORCED [22:23:30] RECOVERY - cassandra-a CQL 10.64.48.234:9042 on restbase1030 is OK: TCP OK - 0.000 second response time on 10.64.48.234 port 9042 https://phabricator.wikimedia.org/T93886 [22:23:34] RECOVERY - cassandra-a SSL 10.64.48.234:7001 on restbase1030 is OK: SSL OK - Certificate restbase1030-a valid until 2024-08-30 21:39:16 +0000 (expires in 381 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:26:09] (03CR) 10Ssingh: [C: 03+2] realm: update subnets for knams migration [puppet] - 10https://gerrit.wikimedia.org/r/948670 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [22:26:39] 10SRE, 10ops-knams, 10DC-Ops, 10Traffic: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10RobH) [22:27:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3081.esams.wmnet with OS bullseye [22:32:01] (NodeTextfileStale) resolved: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:35:40] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs3009.mgmt.esams.wmnet with reboot policy FORCED [22:35:56] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs3009'] [22:36:21] 10SRE, 10SRE-Access-Requests: 
Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (10colewhite) [22:39:38] 10SRE, 10SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (10colewhite) [22:41:28] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs3009'] [22:51:30] (03PS7) 10BCornwall: pybal: Make check conform to the Nagios plugin API [puppet] - 10https://gerrit.wikimedia.org/r/933398 (https://phabricator.wikimedia.org/T322377) (owner: 10Jbond) [22:52:37] (03CR) 10BCornwall: [V: 03+1 C: 03+1] "Works as expected on lvs4010. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/933398 (https://phabricator.wikimedia.org/T322377) (owner: 10Jbond) [22:52:39] (03CR) 10BCornwall: [V: 03+1 C: 03+2] pybal: Make check conform to the Nagios plugin API [puppet] - 10https://gerrit.wikimedia.org/r/933398 (https://phabricator.wikimedia.org/T322377) (owner: 10Jbond) [22:55:18] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: Consider confirming the hostname by user input when running the reimaging cookbook - https://phabricator.wikimedia.org/T332202 (10BCornwall) 05Stalled→03Declined Due to lack of interest I'm going to just decline/abandon. Please re-open... [22:55:42] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:56:16] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10colewhite) a:03ATsay-WMF Hi! Currently there is no ldap account associated with your work email. First thing to try is to create a [[ https://wiki... 
[22:56:16] PROBLEM - cassandra-a CQL 10.64.48.234:9042 on restbase1030 is CRITICAL: connect to address 10.64.48.234 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [22:56:20] PROBLEM - cassandra-a SSL 10.64.48.234:7001 on restbase1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:56:34] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10colewhite) [22:56:44] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3081.esams.wmnet with OS bullseye [22:57:14] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:57:22] what the what? [22:57:28] * urandom is investigating [22:57:46] RECOVERY - cassandra-a CQL 10.64.48.234:9042 on restbase1030 is OK: TCP OK - 0.000 second response time on 10.64.48.234 port 9042 https://phabricator.wikimedia.org/T93886 [22:57:52] RECOVERY - cassandra-a SSL 10.64.48.234:7001 on restbase1030 is OK: SSL OK - Certificate restbase1030-a valid until 2024-08-30 21:39:16 +0000 (expires in 381 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:59:36] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Ricki Jay (WMDE) - https://phabricator.wikimedia.org/T343700 (10colewhite) @RickiJay-WMDE does not appear to have an NDA on file. CC: @KFrancis [23:01:39] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for Mpossoupe - https://phabricator.wikimedia.org/T343432 (10colewhite) 05Open→03Resolved a:03colewhite This appears to be resolved. If not quite true, please feel free to reopen. 
[23:15:26] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10ATsay-WMF) Got it -- does this work? https://wikitech.wikimedia.org/wiki/User:Amy_T [23:21:14] (03PS1) 10Cwhite: admin: add amyt to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/948686 (https://phabricator.wikimedia.org/T344199) [23:24:43] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:24:44] PROBLEM - Check systemd state on wdqs2012 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:29:40] (03PS1) 10BCornwall: Release 0.36-2 for Bookworm [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/948672 (https://phabricator.wikimedia.org/T342154) [23:30:20] PROBLEM - cassandra-a CQL 10.64.48.234:9042 on restbase1030 is CRITICAL: connect to address 10.64.48.234 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [23:30:28] PROBLEM - cassandra-a SSL 10.64.48.234:7001 on restbase1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [23:30:34] PROBLEM - cassandra-a service on restbase1030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:31:20] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:31:37] !log 
eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1030.eqiad.wmnet [23:33:06] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:12] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:30] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:36] (03CR) 10BCornwall: [V: 03+1] "lintian and piuparts are happy" [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/948672 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [23:39:24] RECOVERY - cassandra-a service on restbase1030 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:40:14] RECOVERY - cassandra-a CQL 10.64.48.234:9042 on restbase1030 is OK: TCP OK - 0.009 second response time on 10.64.48.234 port 9042 https://phabricator.wikimedia.org/T93886 [23:40:16] RECOVERY - cassandra-a SSL 10.64.48.234:7001 on restbase1030 is OK: SSL OK - Certificate restbase1030-a valid until 2024-08-30 21:39:16 +0000 (expires in 381 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [23:40:21] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1030.eqiad.wmnet [23:47:18] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:48:46] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:50:42] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Ricki Jay (WMDE) - https://phabricator.wikimedia.org/T343700 (10KFrancis) Please provide Ricki Jay's email address and I will start processing this request. You may send it to kfrancis@wikimedia.org if you prefer not to post it here.