[00:04:56] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: dispatch-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:30] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:18:12] (CR) Eevans: [C: +1] swift: storage schema for larger disks_by_path backends (1 comment) [puppet] - https://gerrit.wikimedia.org/r/911290 (https://phabricator.wikimedia.org/T335275) (owner: MVernon)
[00:36:30] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[00:39:22] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/911387
[00:39:26] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/911387 (owner: TrainBranchBot)
[00:45:58] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:57:56] (CR) CI reject: [V: -1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/911387 (owner: TrainBranchBot)
[01:00:03] (CR) Dzahn: [C: +2] "This was reverted because it broke things for code-search in cloud. It said it can't contact gerrit-replica.
But the IPs on the gerrit-rep" [puppet] - https://gerrit.wikimedia.org/r/909794 (owner: Dzahn)
[01:07:33] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T335327 (phaultfinder)
[01:20:19] (CR) Dzahn: [C: +2] "I don't see the change yet in https://os-reports.wikimedia.org/buster.html - is this auto-generated once per day or so?" [puppet] - https://gerrit.wikimedia.org/r/908644 (https://phabricator.wikimedia.org/T327068) (owner: Dzahn)
[01:50:30] (NodeTextfileStale) firing: Stale textfile for sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:00:06] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230425T0200)
[02:02:20] (PS1) Andrew Bogott: Remove unused role and profile for wmcs project- and home- nfs servers [puppet] - https://gerrit.wikimedia.org/r/911424 (https://phabricator.wikimedia.org/T333477)
[02:04:01] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on mw2432:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw2432 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:55] (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.6 [core] (wmf/1.41.0-wmf.6) - https://gerrit.wikimedia.org/r/911388
(https://phabricator.wikimedia.org/T330212)
[02:07:59] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.6 [core] (wmf/1.41.0-wmf.6) - https://gerrit.wikimedia.org/r/911388 (https://phabricator.wikimedia.org/T330212) (owner: TrainBranchBot)
[02:15:52] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:21:01] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (phaultfinder)
[02:24:16] (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.6 [core] (wmf/1.41.0-wmf.6) - https://gerrit.wikimedia.org/r/911388 (https://phabricator.wikimedia.org/T330212) (owner: TrainBranchBot)
[02:25:30] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[02:26:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:46] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (phaultfinder)
[02:34:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on db2185:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=db2185 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[02:41:01] ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335295 (phaultfinder)
[02:41:03] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (phaultfinder)
[03:00:06] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230425T0300)
[03:00:45] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (phaultfinder)
[03:01:00] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (phaultfinder)
[03:01:19] (PS1) TrainBranchBot: testwikis wikis to 1.41.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/911427 (https://phabricator.wikimedia.org/T330212)
[03:01:21] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.41.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/911427 (https://phabricator.wikimedia.org/T330212) (owner: TrainBranchBot)
[03:02:04] (Merged) jenkins-bot: testwikis wikis to 1.41.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/911427 (https://phabricator.wikimedia.org/T330212) (owner: TrainBranchBot)
[03:02:27] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.41.0-wmf.6 refs T330212
[03:02:34] T330212: 1.41.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T330212
[03:50:33] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.41.0-wmf.6 refs T330212 (duration: 48m 05s)
[03:50:39] T330212: 1.41.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T330212
[03:52:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[03:52:41] !log mwpresync@deploy2002 Pruned MediaWiki: 1.41.0-wmf.4 (duration: 02m 06s)
[04:09:18] (KubernetesAPILatency) firing: (2) High Kubernetes API
latency (DELETE certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:14:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:36:30] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:31:24] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:46:20] (CR) Ayounsi: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/910083 (https://phabricator.wikimedia.org/T335027) (owner: Cwhite)
[05:49:02] SRE, Infrastructure-Foundations, netops, observability, Patch-For-Review: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (ayounsi) Read only, there are already some "prometheus-*-expoter" images in https://docker-registry.wikimedia.org/ so it might just be a matt...
[05:50:30] (NodeTextfileStale) firing: Stale textfile for sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230425T0600)
[06:00:05] kormat, marostegui, and Amir1: Dear deployers, time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230425T0600).
[06:04:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on mw2432:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw2432 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[06:11:12] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:11:30] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:11:36] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4557
[06:11:53] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4557
[06:12:13] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 46887
[06:12:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46887
[06:13:05] SRE,
Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Marostegui)
[06:15:52] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[06:16:45] ops-codfw, DBA, DC-Ops: db2185 alerting for power supply redundancy - https://phabricator.wikimedia.org/T335331 (Marostegui)
[06:17:17] ops-codfw, DBA, DC-Ops: db2185 alerting for power supply redundancy - https://phabricator.wikimedia.org/T335331 (Marostegui) p:Triage→Medium
[06:25:26] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[06:25:30] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[06:25:46] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (phaultfinder)
[06:27:01] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (phaultfinder)
[06:27:33] (PS1) Marostegui: switchover-tmpl.py: Replace zarcillo host [software] - https://gerrit.wikimedia.org/r/911693 (https://phabricator.wikimedia.org/T334455)
[06:28:19] (CR) Marostegui: [C: +2] switchover-tmpl.py: Replace zarcillo host [software] - https://gerrit.wikimedia.org/r/911693 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui)
[06:28:52] (Merged) jenkins-bot: switchover-tmpl.py: Replace zarcillo host [software] -
https://gerrit.wikimedia.org/r/911693 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui)
[06:34:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on db2185:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=db2185 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[06:45:47] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (phaultfinder)
[06:45:49] ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335295 (phaultfinder)
[06:54:40] (PS1) Jelto: gitlab: add backup type failover [puppet] - https://gerrit.wikimedia.org/r/911759 (https://phabricator.wikimedia.org/T330771)
[06:57:58] (CR) Jelto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40814/console" [puppet] - https://gerrit.wikimedia.org/r/911759 (https://phabricator.wikimedia.org/T330771) (owner: Jelto)
[07:00:06] Amir1, Urbanecm, and taavi: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230425T0700).
[07:00:06] No Gerrit patches in the queue for this window AFAICS.
[07:01:01] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (phaultfinder)
[07:05:46] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (phaultfinder)
[07:06:07] (PS1) Muehlenhoff: Remove remaining obsolete nodejs images only used on Stretch [docker-images/production-images] - https://gerrit.wikimedia.org/r/911761 (https://phabricator.wikimedia.org/T335282)
[07:20:12] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs.
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:21:46] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:23:36] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:23:56] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:30:33] (CR) Ayounsi: Expose additional link information to Homer templates in wmf-netbox.py (2 comments) [software/homer/deploy] - https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313) (owner: Cathal Mooney)
[07:36:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[07:37:00] SRE, Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm
[07:42:38] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:43:00] RECOVERY - BFD status on cr2-eqiad is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:52:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[07:53:39] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm
[07:53:44] SRE, Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by
jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Do...
[08:04:53] (PS1) Muehlenhoff: Exclude more stretch images [puppet] - https://gerrit.wikimedia.org/r/911764 (https://phabricator.wikimedia.org/T335282)
[08:05:01] (PS2) Michael Große: Beta-Wikidata: Enable Labels in Wikidata edit summaries [mediawiki-config] - https://gerrit.wikimedia.org/r/911311 (https://phabricator.wikimedia.org/T327062)
[08:05:57] (CR) Michael Große: Beta-Wikidata: Enable Labels in Wikidata edit summaries (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/911311 (https://phabricator.wikimedia.org/T327062) (owner: Michael Große)
[08:06:26] (PS1) Jelto: gitlab: add ferm rule for certbot on WMCS [puppet] - https://gerrit.wikimedia.org/r/911765 (https://phabricator.wikimedia.org/T335161)
[08:09:15] (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40816/console" [puppet] - https://gerrit.wikimedia.org/r/911765 (https://phabricator.wikimedia.org/T335161) (owner: Jelto)
[08:12:47] (CR) Alexandros Kosiaris: [C: +1] "Good point. Let's see what we can do about annotating the image (I am not fond of forcing the name of images)." [puppet] - https://gerrit.wikimedia.org/r/911764 (https://phabricator.wikimedia.org/T335282) (owner: Muehlenhoff)
[08:17:00] SRE, Data-Services, Traffic: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (Marostegui) Stalled→Resolved I am going to close it for now, Chris, please reopen if you feel there's still work pending here!
[08:17:28] SRE, serviceops: Remove jessie and stretch-based images from our image registry - https://phabricator.wikimedia.org/T335333 (MoritzMuehlenhoff)
[08:19:57] (CR) Muehlenhoff: Exclude more stretch images (1 comment) [puppet] - https://gerrit.wikimedia.org/r/911764 (https://phabricator.wikimedia.org/T335282) (owner: Muehlenhoff)
[08:22:56] (CR) Alexandros Kosiaris: wikifunctions: Add AppArmor profile usage (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: Alexandros Kosiaris)
[08:24:27] (PS1) Slyngshede: P:idm Compare IDM redis master on FQDN [puppet] - https://gerrit.wikimedia.org/r/911767
[08:25:38] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40817/console" [puppet] - https://gerrit.wikimedia.org/r/911767 (owner: Slyngshede)
[08:32:55] (CR) MVernon: [C: +2] swift: storage schema for larger disks_by_path backends [puppet] - https://gerrit.wikimedia.org/r/911290 (https://phabricator.wikimedia.org/T335275) (owner: MVernon)
[08:36:30] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:41:47] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:42:51] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:43:16] (CR) Lucas Werkmeister (WMDE): [C: +1] Beta-Wikidata: Enable Labels in Wikidata edit summaries (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/911311 (https://phabricator.wikimedia.org/T327062) (owner: Michael Große)
[08:43:29] !log mvernon@cumin1001
START - Cookbook sre.hosts.reboot-single for host thanos-be1003.eqiad.wmnet
[08:45:31] (CR) Slyngshede: [V: +1 C: +2] P:idm Compare IDM redis master on FQDN [puppet] - https://gerrit.wikimedia.org/r/911767 (owner: Slyngshede)
[08:50:59] RECOVERY - Disk space on thanos-be1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops
[08:50:59] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:16] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1003.eqiad.wmnet
[08:57:27] RECOVERY - BFD status on cr2-eqiad is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:58:29] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:04:08] SRE, Infrastructure-Foundations, serviceops: Annotate images in our registry with OS (and OS version) - https://phabricator.wikimedia.org/T335337 (MoritzMuehlenhoff)
[09:04:31] (PS1) Slyngshede: C:idm::deployment enable AUX ldap schemas [puppet] - https://gerrit.wikimedia.org/r/911770
[09:06:57] SRE-swift-storage: Bring ms-be107[2-5] into the rings - https://phabricator.wikimedia.org/T335279 (MatthewVernon)
[09:06:59] SRE-swift-storage: Bring ms-be207[0-3] into the rings - https://phabricator.wikimedia.org/T335278 (MatthewVernon)
[09:07:01] SRE-swift-storage: Q4 ms backend refresh work (KR) - https://phabricator.wikimedia.org/T335270 (MatthewVernon)
[09:07:03] SRE-swift-storage, Patch-For-Review: Create new storage scheme entries for larger disks_by_path swift backends - https://phabricator.wikimedia.org/T335275 (MatthewVernon) Open→Resolved
[09:07:54]
(CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40818/console" [puppet] - https://gerrit.wikimedia.org/r/911770 (owner: Slyngshede)
[09:09:39] (CR) Muehlenhoff: Exclude more stretch images (1 comment) [puppet] - https://gerrit.wikimedia.org/r/911764 (https://phabricator.wikimedia.org/T335282) (owner: Muehlenhoff)
[09:09:43] (CR) Muehlenhoff: [C: +2] Exclude more stretch images [puppet] - https://gerrit.wikimedia.org/r/911764 (https://phabricator.wikimedia.org/T335282) (owner: Muehlenhoff)
[09:13:53] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40819/console" [puppet] - https://gerrit.wikimedia.org/r/911298 (owner: Slyngshede)
[09:14:04] (PS1) MVernon: swift: add new backends to swift::storagehosts [puppet] - https://gerrit.wikimedia.org/r/911771 (https://phabricator.wikimedia.org/T335278)
[09:15:30] (CR) Slyngshede: [V: +1 C: +2] C:idm:deployment Enable SSH keymanagement.
[puppet] - https://gerrit.wikimedia.org/r/911298 (owner: Slyngshede)
[09:17:11] (Abandoned) Slyngshede: C:idm::deployment enable AUX ldap schemas [puppet] - https://gerrit.wikimedia.org/r/911770 (owner: Slyngshede)
[09:18:02] (CR) Marostegui: [C: +1] swift: add new backends to swift::storagehosts [puppet] - https://gerrit.wikimedia.org/r/911771 (https://phabricator.wikimedia.org/T335278) (owner: MVernon)
[09:18:32] (CR) MVernon: [C: +2] swift: add new backends to swift::storagehosts [puppet] - https://gerrit.wikimedia.org/r/911771 (https://phabricator.wikimedia.org/T335278) (owner: MVernon)
[09:25:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:27:19] PROBLEM - Host ms-be1074 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:21] PROBLEM - Host ms-be2072 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:23] PROBLEM - Host ms-be2073 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:27] PROBLEM - Host ms-be1073 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:29] PROBLEM - Host ms-be1075 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:31] PROBLEM - Host ms-be2070 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:41] PROBLEM - Host ms-be1072 is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:07] RECOVERY - Host ms-be1073 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[09:28:13] RECOVERY - Host ms-be1072 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
[09:28:15] RECOVERY - Host ms-be2073 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms
[09:28:15] RECOVERY - Host ms-be1075 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[09:28:15] RECOVERY - Host ms-be1074 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[09:28:23] RECOVERY - Host ms-be2072 is
UP: PING OK - Packet loss = 0%, RTA = 33.17 ms
[09:28:43] RECOVERY - Host ms-be2070 is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms
[09:30:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:30:13] PROBLEM - Check systemd state on ms-be2073 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-objects1.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:30:15] PROBLEM - Check systemd state on ms-be1075 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-objects18.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:35] PROBLEM - Check systemd state on ms-be1073 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-objects20.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:49] PROBLEM - Check systemd state on ms-be1074 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-objects1.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:33:07] PROBLEM - Check systemd state on ms-be2072 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-objects22.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:36] !log upgrade php-excimer on remaining mediawiki hosts to 1.0.2-1+wmf3+buster1 (which rebases Excimer to 1.1.1) T332964
[09:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:43] T332964: Upgrade php-excimer package from 1.0.4 to 1.1.1 - https://phabricator.wikimedia.org/T332964
[09:43:54] PROBLEM - Check systemd state on ms-be1073 is CRITICAL: CRITICAL - starting: Late
bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:45:31] RECOVERY - Check systemd state on ms-be1073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:46:46] (PS5) Joal: Refactor dumps::web::fetches::analytics::job [puppet] - https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167)
[09:52:47] SRE, Infrastructure-Foundations, serviceops: Annotate images in our registry with OS (and OS version) - https://phabricator.wikimedia.org/T335337 (JMeybohm) The initial idea (at least for production-images) was to not care (versioning wise) about the underlying OS version. This makes it more easy do...
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230425T1000)
[10:04:00] (CR) Vgutierrez: [C: +2] varnish: Allow disabling port 80 [puppet] - https://gerrit.wikimedia.org/r/907824 (https://phabricator.wikimedia.org/T322774) (owner: Vgutierrez)
[10:04:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on mw2432:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw2432 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[10:07:34] SRE, Infrastructure-Foundations, serviceops, Datacenter-Switchover: sre.discovery.datacenter breaks on services not in "production" state - https://phabricator.wikimedia.org/T335341 (Clement_Goubert)
[10:09:08] (CR) Clément Goubert: "This change is ready for review."
[cookbooks] - https://gerrit.wikimedia.org/r/911777 (https://phabricator.wikimedia.org/T335341) (owner: Clément Goubert)
[10:15:17] (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/911777 (https://phabricator.wikimedia.org/T335341) (owner: Clément Goubert)
[10:15:51] (CR) Clément Goubert: [C: +2] sre.discovery.datacenter: Exclude device-analytics [cookbooks] - https://gerrit.wikimedia.org/r/911777 (https://phabricator.wikimedia.org/T335341) (owner: Clément Goubert)
[10:15:52] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[10:18:41] (Merged) jenkins-bot: sre.discovery.datacenter: Exclude device-analytics [cookbooks] - https://gerrit.wikimedia.org/r/911777 (https://phabricator.wikimedia.org/T335341) (owner: Clément Goubert)
[10:21:08] !log installing libxml2 security updates on bullseye
[10:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:47] PROBLEM - Host ms-be1074 is DOWN: PING CRITICAL - Packet loss = 100%
[10:22:22] RECOVERY - Host ms-be1074 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[10:22:53] PROBLEM - Check systemd state on ms-be1074 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:24:31] RECOVERY - Check systemd state on ms-be1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:25:30] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[10:25:43] PROBLEM - Host ms-be1075 is DOWN: PING CRITICAL - Packet loss = 100%
[10:26:01] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (phaultfinder)
[10:26:23] RECOVERY - Host ms-be1075 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[10:26:49] PROBLEM - Check systemd state on ms-be1075 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:28:25] RECOVERY - Check systemd state on ms-be1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:28:29] RECOVERY - Check systemd state on ms-be2073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:28:39] PROBLEM - Host ms-be2072 is DOWN: PING CRITICAL - Packet loss = 100%
[10:29:31] RECOVERY - Host ms-be2072 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms
[10:29:41] PROBLEM - Host ms-be2073 is DOWN: PING CRITICAL - Packet loss = 100%
[10:29:45] PROBLEM - Check systemd state on ms-be2072 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:46] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (10phaultfinder) [10:31:51] RECOVERY - Host ms-be2073 is UP: PING OK - Packet loss = 0%, RTA = 33.13 ms [10:33:01] RECOVERY - Check systemd state on ms-be2072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:17] (03PS1) 10Jbond: team-dcops: Add or clause for older node-exporter versions [alerts] - 10https://gerrit.wikimedia.org/r/911778 (https://phabricator.wikimedia.org/T333007) [10:36:44] (03CR) 10Alexandros Kosiaris: [C: 03+1] sre.discovery.datacenter: Exclude device-analytics [cookbooks] - 10https://gerrit.wikimedia.org/r/911777 (https://phabricator.wikimedia.org/T335341) (owner: 10Clément Goubert) [10:41:45] (03PS1) 10MVernon: swift: add new nodes, drain old nodes from the rings [puppet] - 10https://gerrit.wikimedia.org/r/911779 (https://phabricator.wikimedia.org/T335278) [10:42:08] (03PS1) 10Clément Goubert: sre.switchdc.mediawiki: Add mw-api-int to mediawiki services [cookbooks] - 10https://gerrit.wikimedia.org/r/911780 (https://phabricator.wikimedia.org/T327920) [10:42:53] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/911779 (https://phabricator.wikimedia.org/T335278) (owner: 10MVernon) [10:43:09] (03PS2) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [10:45:16] (03PS1) 10Jgiannelos: proton: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/911781 [10:45:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/911780 (https://phabricator.wikimedia.org/T327920) (owner: 10Clément Goubert) [10:46:00] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335295 (10phaultfinder) [10:46:03] 10ops-eqsin: 
ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [10:46:06] (03CR) 10Clément Goubert: [C: 03+2] sre.switchdc.mediawiki: Add mw-api-int to mediawiki services [cookbooks] - 10https://gerrit.wikimedia.org/r/911780 (https://phabricator.wikimedia.org/T327920) (owner: 10Clément Goubert) [10:48:23] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Add mw-api-int to mediawiki services [cookbooks] - 10https://gerrit.wikimedia.org/r/911780 (https://phabricator.wikimedia.org/T327920) (owner: 10Clément Goubert) [10:50:12] (03PS2) 10Jbond: dumps::distribution::ferm: update to resolve hosts in puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) [10:50:26] (03PS3) 10Jbond: dumps::distribution::ferm: update to resolve hosts in puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) [10:52:35] !log cgoubert@cumin1001 conftool action : set/weight=25; selector: dc=codfw,cluster=jobrunner,service=canary [10:52:46] !log cgoubert@cumin1001 conftool action : set/weight=25; selector: dc=codfw,cluster=videoscaler,service=canary [10:53:06] (03PS3) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [10:53:20] 10SRE, 10Infrastructure-Foundations, 10serviceops: Annotate images in our registry with OS (and OS version) - https://phabricator.wikimedia.org/T335337 (10MoritzMuehlenhoff) >>! In T335337#8803990, @JMeybohm wrote: > The initial idea (at least for production-images) was to not care (naming wise) about the un... 
[10:54:31] (03CR) 10Jgiannelos: [C: 03+2] proton: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/911781 (owner: 10Jgiannelos) [10:55:24] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40825/console" [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [10:56:41] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/911779 (https://phabricator.wikimedia.org/T335278) (owner: 10MVernon) [10:56:55] !log cgoubert@cumin1001 conftool action : set/weight=20; selector: name=mw2411.codfw.wmnet [10:57:04] (03PS4) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [10:57:19] !log cgoubert@cumin1001 conftool action : set/weight=20; selector: name=mw2410.codfw.wmnet [10:57:21] 10SRE-swift-storage: Document the process for making new-style storage nodes - https://phabricator.wikimedia.org/T335274 (10MatthewVernon) 05Open→03Resolved Done. 
[10:57:24] 10SRE-swift-storage: Q4 ms backend refresh work (KR) - https://phabricator.wikimedia.org/T335270 (10MatthewVernon) [10:57:54] !log cgoubert@cumin1001 conftool action : set/weight=20; selector: name=mw2395.codfw.wmnet [10:58:09] !log cgoubert@cumin1001 conftool action : set/weight=20; selector: name=mw2394.codfw.wmnet [10:59:27] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40826/console" [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [11:02:21] (03Merged) 10jenkins-bot: proton: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/911781 (owner: 10Jgiannelos) [11:05:47] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (10phaultfinder) [11:06:00] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (10phaultfinder) [11:09:02] (03PS1) 10Muehlenhoff: docker-report: Exclude more stretch base images [puppet] - 10https://gerrit.wikimedia.org/r/911784 (https://phabricator.wikimedia.org/T335282) [11:19:03] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:20:39] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:21:44] 10SRE, 10MediaWiki-Authentication-and-authorization, 10MediaWiki-User-login-and-signup, 10MediaWiki-extensions-CentralAuth, and 2 others: Account creation attempt on mobile Wikipedia domain leads user to desktop Special:CentralLogin/complete, often in logged-out st... 
- https://phabricator.wikimedia.org/T335125 [11:28:48] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:23] (03PS1) 10Jbond: httpd: always use systemd [puppet] - 10https://gerrit.wikimedia.org/r/911847 (https://phabricator.wikimedia.org/T331706) [11:33:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40828/console" [puppet] - 10https://gerrit.wikimedia.org/r/911847 (https://phabricator.wikimedia.org/T331706) (owner: 10Jbond) [11:40:41] !log jmm@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-thanos-fe rolling restart_daemons on A:thanos-fe [11:41:28] (03CR) 10Jbond: gerrit: make the lfs data path configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/911363 (https://phabricator.wikimedia.org/T333143) (owner: 10Dzahn) [11:44:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.o11y.roll-restart-reboot-thanos-fe (exit_code=0) rolling restart_daemons on A:thanos-fe [11:45:08] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10User-MoritzMuehlenhoff: Annotate images in our registry with OS (and OS version) - https://phabricator.wikimedia.org/T335337 (10MoritzMuehlenhoff) [11:45:30] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Thanks for this!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [11:46:09] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: Alertmanager rule for network interface errors? 
- https://phabricator.wikimedia.org/T335350 (10cmooney) p:05Triage→03Low [11:52:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [11:53:06] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: Alertmanager rule for network interface errors? - https://phabricator.wikimedia.org/T335350 (10ayounsi) FYI we do alert on those on the network side, see "Inbound interface errors" and "Outbound interface errors" on https://librenms.wikimedia.... [11:53:19] (03PS1) 10Ladsgroup: beta: Set externallinks migration stage to READ_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911850 (https://phabricator.wikimedia.org/T335343) [11:57:37] (Nonwrite HTTP requests with primary DB writes alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+writes+alert [12:00:30] (03CR) 10Ladsgroup: [C: 03+2] beta: Set externallinks migration stage to READ_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911850 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup) [12:01:26] (03Merged) 10jenkins-bot: beta: Set externallinks migration stage to READ_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911850 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup) [12:02:44] rebased [12:07:35] (03PS29) 10KartikMistry: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [12:17:37] (Nonwrite HTTP requests with primary DB writes alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+writes+alert [12:23:56] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: document puppet/netbox/hiera interaction - https://phabricator.wikimedia.org/T311304 (10jbond) 05Open→03Resolved updated, please re-open if 
anything needs adding/clarifying [12:24:04] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10observability, and 2 others: Puppet: get data (row, rack, site, and other information) from Netbox - https://phabricator.wikimedia.org/T229397 (10jbond) [12:28:10] (03PS1) 10Slyngshede: P:IDM Enable Wikimedia Global Account linking. [puppet] - 10https://gerrit.wikimedia.org/r/911852 [12:35:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, but best to sync up with ServiceOps before merging, so that merging this doesn't interfere with the DC switchback work." [puppet] - 10https://gerrit.wikimedia.org/r/911847 (https://phabricator.wikimedia.org/T331706) (owner: 10Jbond) [12:36:30] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:45:18] (03PS1) 10Ottomata: Add dumny keytab an-web1001.eqiad.wmnet/analytics.keyta [labs/private] - 10https://gerrit.wikimedia.org/r/911855 (https://phabricator.wikimedia.org/T317167) [12:47:37] 10SRE, 10SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (10PPenloglou-WMF) [12:48:58] 10SRE, 10SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (10Marostegui) [12:49:19] 10SRE, 10SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (10Marostegui) [12:50:42] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add dumny keytab an-web1001.eqiad.wmnet/analytics.keyta [labs/private] - 10https://gerrit.wikimedia.org/r/911855 (https://phabricator.wikimedia.org/T317167) (owner: 10Ottomata) [12:50:56] 10SRE, 10SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - 
https://phabricator.wikimedia.org/T335353 (10Marostegui) @DBu-WMF we need your approval for this [12:51:23] 10SRE, 10SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (10Marostegui) [12:52:03] 10SRE, 10SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (10PPenloglou-WMF) Dear SRE Team, Apologies in advance for any missed steps or actions on my behalf. This is my first time requesting access of this sort and I have been... [13:00:04] Deploy window UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230425T1300) [13:00:04] No Gerrit patches in the queue for this window AFAICS. [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230425T1300) [13:02:17] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [13:03:23] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [13:03:46] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply [13:04:48] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply [13:05:17] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply [13:05:19] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply [13:05:23] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply [13:06:14] (03PS1) 10Ottomata: Add dummy stats@an-web1001 keytab [labs/private] - 10https://gerrit.wikimedia.org/r/911858 [13:06:58] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [13:14:00] (PowerSupply) resolved: (2) Power Supply - PS Redundancy - issue on mw2432:9290 - 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw2432 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [13:15:27] RECOVERY - IPMI Sensor Status on mw2432 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:21:50] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: db2185 alerting for power supply redundancy - https://phabricator.wikimedia.org/T335331 (10Jhancock.wm) found with power cord not plugged in all the way. replaced and secured. logging into idrac, all alerts have cleared. [13:22:32] (03PS5) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [13:25:26] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40832/console" [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [13:25:26] 10SRE, 10SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (10Marostegui) Waiting for the SSH key confirmation out of band [13:25:35] 10SRE, 10SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (10Marostegui) p:05Triage→03Medium [13:26:31] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2009.codfw.wmnet with reason: attempting WDQS stack on bullseye [13:26:44] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2009.codfw.wmnet with reason: attempting WDQS stack on bullseye [13:26:47] 10SRE, 10ops-codfw, 10DBA, 
10DC-Ops: db2185 alerting for power supply redundancy - https://phabricator.wikimedia.org/T335331 (10Marostegui) 05Open→03Resolved a:03Jhancock.wm Thank you! The alert recovered too! [13:28:43] (03PS6) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [13:30:35] 10SRE, 10SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (10Marostegui) uidNumber: 42507 [13:30:54] (03PS7) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [13:30:57] !log bking@cumin1001 transfer.py wdqs2009.codfw.wmnet:/srv/wdqs wdqs2022.codfw.wmnet:/srv/wdqs [13:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:22] (03PS1) 10DCausse: rdf-streaming-updater@staging: upgrade to flink 1.16.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/911861 (https://phabricator.wikimedia.org/T334244) [13:32:36] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) [13:32:54] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) [13:33:02] (03PS1) 10Slyngshede: IDM: Add placeholders for mediawiki OAuth [labs/private] - 10https://gerrit.wikimedia.org/r/911862 [13:33:15] !log cgoubert@deploy2002 Locking from deployment [ALL REPOSITORIES]: Datacenter Service Switchback - T335015 [13:33:20] T335015: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 [13:33:45] 10SRE, 10SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - 
https://phabricator.wikimedia.org/T335353 (10Marostegui) Checked: The ssh key isn't being used on WMCS [13:33:49] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [13:33:59] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) 05Open→03In progress p:05Triage→03High [13:35:16] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) [13:35:51] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater@staging: upgrade to flink 1.16.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/911861 (https://phabricator.wikimedia.org/T334244) (owner: 10DCausse) [13:39:14] (03PS8) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [13:40:36] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40835/console" [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [13:41:24] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) [13:42:05] (03CR) 10Clément Goubert: Revert "wmnet: Switch deployment CNAMEs to codfw" [dns] - 10https://gerrit.wikimedia.org/r/909873 (https://phabricator.wikimedia.org/T335015) (owner: 10Clément Goubert) [13:42:09] (03CR) 10Clément Goubert: Revert "Switch deployment server to deploy2002.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/909874 (https://phabricator.wikimedia.org/T335015) 
(owner: 10Clément Goubert) [13:42:36] (03PS4) 10Clément Goubert: Revert "wmnet: Switch deployment CNAMEs to codfw" [dns] - 10https://gerrit.wikimedia.org/r/909873 (https://phabricator.wikimedia.org/T335015) [13:42:42] (03CR) 10KartikMistry: Add new self hosted machinetranslation service (MinT) (0312 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [13:43:14] 10SRE, 10SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (10MoritzMuehlenhoff) The peopleweb hosts are granted access to by the all-users catchup (like the bastions), so adding the SSH key, but no further group membership is su... [13:44:46] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) [13:46:19] (03PS9) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [13:47:15] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-worker1002.eqiad.wmnet [13:47:21] 10SRE, 10SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (10Marostegui) [13:48:27] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:37] (03PS10) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [13:53:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-worker1002.eqiad.wmnet [13:53:38] (03PS1) 
10Vgutierrez: hiera: Disable http->https in varnish on cp4044,cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/911867 (https://phabricator.wikimedia.org/T322774) [13:53:40] (03PS1) 10Vgutierrez: hiera: Enable http->https in haproxy on cp4044,cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/911868 [13:53:46] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) [13:55:07] (03PS11) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [13:56:16] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40839/console" [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:00:02] !log Starting Datacenter Services Switchback - T335015 [14:00:05] claime: I, the Bot under the Fountain, call upon thee, The Deployer, to do Datacenter Switchback - Services deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230425T1400). 
[14:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:09] (03CR) 10Alexandros Kosiaris: Add new self hosted machinetranslation service (MinT) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [14:00:16] T335015: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 [14:00:43] (03CR) 10Cory Massaro: wikifunctions: Add AppArmor profile usage (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [14:01:00] !log cgoubert@cumin1001 START - Cookbook sre.discovery.datacenter depool all services in codfw: Datacenter Services Switchback - T335015 [14:01:12] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in codfw: Datacenter Services Sw... [14:02:09] !log herron@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging1001.eqiad.wmnet with OS bullseye [14:04:25] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.discovery.datacenter (exit_code=93) depool all services in codfw: Datacenter Services Switchback - T335015 [14:04:29] !log cgoubert@cumin1001 START - Cookbook sre.discovery.datacenter depool all services in codfw: Datacenter Services Switchback - T335015 [14:04:35] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in codfw: Datacenter Services Sw... 
[14:04:41] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in codfw: Datacenter Services Sw... [14:04:45] claime: error ? [14:04:56] akosiaris: It doesn't raise an error [14:04:57] can I help ? [14:05:02] It just stops trying to check the DNS [14:05:27] Since it's idempotent, I just kill it and restart [14:05:30] ok [14:05:52] But I already ran into this the other day when I repooled from the switch update, and didn't have time to chase down what's actually happening [14:05:53] (03PS12) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [14:07:41] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40841/console" [puppet] - 10https://gerrit.wikimedia.org/r/911868 (owner: 10Vgutierrez) [14:08:42] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40840/console" [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:10:15] (03PS2) 10Vgutierrez: hiera: Disable http->https in varnish on cp4044,cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/911867 (https://phabricator.wikimedia.org/T322774) [14:10:18] (03PS2) 10Vgutierrez: hiera: Enable http->https in haproxy on cp4044,cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/911868 [14:10:19] (03PS1) 10Vgutierrez: haproxy: Fix http_redirection_port templating [puppet] - 10https://gerrit.wikimedia.org/r/911871 (https://phabricator.wikimedia.org/T322774) [14:10:21] (03CR) 10CDanis: [C: 03+2] Set NEL 'success_fraction: 1.0' on HTTP responses for measurement domains [puppet] - 
10https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608) (owner: 10Jameel Kaisar) [14:11:03] (03PS3) 10CDanis: Handle h/2 coalescing issue for measurement domains [puppet] - 10https://gerrit.wikimedia.org/r/908790 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [14:11:15] (03CR) 10CDanis: [C: 03+1] Handle h/2 coalescing issue for measurement domains [puppet] - 10https://gerrit.wikimedia.org/r/908790 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [14:11:51] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40842/console" [puppet] - 10https://gerrit.wikimedia.org/r/911871 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [14:11:59] (03CR) 10CDanis: [C: 03+2] Handle h/2 coalescing issue for measurement domains [puppet] - 10https://gerrit.wikimedia.org/r/908790 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [14:13:18] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40843/console" [puppet] - 10https://gerrit.wikimedia.org/r/911868 (owner: 10Vgutierrez) [14:14:11] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:48] (03PS1) 10Herron: services: add kafka-logging100[12] to network rules and broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/911872 (https://phabricator.wikimedia.org/T326419) [14:15:52] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:16:23] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1001.eqiad.wmnet 
with reason: host reimage [14:16:34] (03PS2) 10DCausse: rdf-streaming-updater@staging: upgrade to flink 1.16.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/911861 (https://phabricator.wikimedia.org/T334244) [14:17:15] (03PS1) 10Alexandros Kosiaris: admin_ng: Make vim-ale happy [deployment-charts] - 10https://gerrit.wikimedia.org/r/911873 [14:17:17] (03PS1) 10Alexandros Kosiaris: admin_ng: Create machinetranslation namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/911874 (https://phabricator.wikimedia.org/T331505) [14:17:54] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add dummy stats@an-web1001 keytab [labs/private] - 10https://gerrit.wikimedia.org/r/911858 (owner: 10Ottomata) [14:18:49] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.discovery.datacenter (exit_code=93) depool all services in codfw: Datacenter Services Switchback - T335015 [14:18:54] T335015: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 [14:18:58] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in codfw: Datacenter Services Sw... 
[14:19:11] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1001.eqiad.wmnet with reason: host reimage [14:19:26] Same issue, restarting [14:19:26] (03CR) 10BBlack: [C: 03+1] haproxy: Fix http_redirection_port templating [puppet] - 10https://gerrit.wikimedia.org/r/911871 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [14:19:31] !log cgoubert@cumin1001 START - Cookbook sre.discovery.datacenter depool all services in codfw: Datacenter Services Switchback - T335015 [14:19:35] (03CR) 10BBlack: [C: 03+1] hiera: Disable http->https in varnish on cp4044,cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/911867 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [14:19:38] (18 services left) [14:19:42] (03CR) 10BBlack: [C: 03+1] hiera: Enable http->https in haproxy on cp4044,cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/911868 (owner: 10Vgutierrez) [14:19:44] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in codfw: Datacenter Services Sw... 
[14:21:09] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] haproxy: Fix http_redirection_port templating [puppet] - 10https://gerrit.wikimedia.org/r/911871 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [14:21:15] (03PS1) 10Ottomata: Move an-web1001 keytabs to proper directory [labs/private] - 10https://gerrit.wikimedia.org/r/911875 (https://phabricator.wikimedia.org/T317167) [14:21:25] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Move an-web1001 keytabs to proper directory [labs/private] - 10https://gerrit.wikimedia.org/r/911875 (https://phabricator.wikimedia.org/T317167) (owner: 10Ottomata) [14:21:57] Ottomata: Add dummy stats@an-web1001 keytab (1dc2f9e) --> that's pending [14:22:17] (regarding puppet-merge) [14:22:39] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40845/console" [puppet] - 10https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167) (owner: 10Joal) [14:24:20] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Refactor dumps::web::fetches::analytics::job [puppet] - 10https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167) (owner: 10Joal) [14:24:59] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in codfw: Datacenter Services Switchback - T335015 [14:25:05] T335015: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 [14:25:08] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) Encountering some intermittent lockups of the cookbook {P47279} Restarting the cookbook is idempotent, doing that pend... 
[14:25:13] !log cgoubert@cumin1001 START - Cookbook sre.discovery.datacenter status all services in all: None - None [14:25:16] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None [14:25:18] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in codfw: Datacenter Services Sw... [14:25:28] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) [14:25:30] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [14:26:20] !log All services pooled in eqiad, all depooled in codfw, proceeding with repooling active/active services in codfw - T335015 [14:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:38] !log cgoubert@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: Datacenter Services Switchback - T335015 [14:26:51] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: Datacenter... 
[14:27:05] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) [14:30:48] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (10phaultfinder) [14:32:01] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (10phaultfinder) [14:34:48] (03PS1) 10Marostegui: data.yaml: Add Panagiotis Penloglou [puppet] - 10https://gerrit.wikimedia.org/r/911877 (https://phabricator.wikimedia.org/T335353) [14:35:03] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1001.eqiad.wmnet with OS bullseye [14:35:06] (03CR) 10Marostegui: [C: 04-2] "Waiting for manager approval" [puppet] - 10https://gerrit.wikimedia.org/r/911877 (https://phabricator.wikimedia.org/T335353) (owner: 10Marostegui) [14:35:35] (03CR) 10CI reject: [V: 04-1] data.yaml: Add Panagiotis Penloglou [puppet] - 10https://gerrit.wikimedia.org/r/911877 (https://phabricator.wikimedia.org/T335353) (owner: 10Marostegui) [14:36:05] !log herron@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging1002.eqiad.wmnet with OS bullseye [14:38:12] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (10Marostegui) @PPenloglou-WMF when does your contract expire? I need the date for the access patch :) [14:39:52] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) [14:40:31] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (10PPenloglou-WMF) Hey @Marostegui ! My contract expires June 30 2023. 
Would you like me to ping you here if this is updated in the future? [14:40:44] (03PS13) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [14:41:14] (03CR) 10DCausse: [C: 04-1] "depends on: https://gitlab.wikimedia.org/repos/search-platform/flink-rdf-streaming-updater/-/merge_requests/9" [deployment-charts] - 10https://gerrit.wikimedia.org/r/911861 (https://phabricator.wikimedia.org/T334244) (owner: 10DCausse) [14:42:34] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: Datacenter... [14:43:20] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in codfw: Datacenter Services Switchback - T335015 [14:43:24] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40846/console" [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:43:26] (03PS2) 10Marostegui: data.yaml: Add Panagiotis Penloglou [puppet] - 10https://gerrit.wikimedia.org/r/911877 (https://phabricator.wikimedia.org/T335353) [14:43:26] T335015: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 [14:43:48] !log All active/active services repooled in codfw - T335015 [14:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:05] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (10Marostegui) >>! In T335353#8804827, @PPenloglou-WMF wrote: > Hey @Marostegui ! 
My contract expires June 30 2023. > Would you like me to ping you... [14:44:08] Aborted because of rest-gateway that is in service_setup, but all A/A repooled [14:44:10] (03CR) 10CI reject: [V: 04-1] data.yaml: Add Panagiotis Penloglou [puppet] - 10https://gerrit.wikimedia.org/r/911877 (https://phabricator.wikimedia.org/T335353) (owner: 10Marostegui) [14:44:32] Proceeding with deployment server switch [14:44:37] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) [14:44:41] !log Switch deployment server back to eqiad - T335015 [14:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:59] (03CR) 10Clément Goubert: [C: 03+2] Revert "wmnet: Switch deployment CNAMEs to codfw" [dns] - 10https://gerrit.wikimedia.org/r/909873 (https://phabricator.wikimedia.org/T335015) (owner: 10Clément Goubert) [14:45:10] (03PS3) 10Marostegui: data.yaml: Add Panagiotis Penloglou [puppet] - 10https://gerrit.wikimedia.org/r/911877 (https://phabricator.wikimedia.org/T335353) [14:45:35] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.123`. 
Pre-deploy tests passing on canary `wdqs1003` [14:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:50] !log Running authdns-update - T335015 [14:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:23] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) [14:46:50] (03CR) 10Clément Goubert: [C: 03+2] Revert "Switch deployment server to deploy2002.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/909874 (https://phabricator.wikimedia.org/T335015) (owner: 10Clément Goubert) [14:47:30] (03PS1) 10Alexandros Kosiaris: Add machinetranslation tokens [labs/private] - 10https://gerrit.wikimedia.org/r/911878 (https://phabricator.wikimedia.org/T331505) [14:48:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/911877 (https://phabricator.wikimedia.org/T335353) (owner: 10Marostegui) [14:48:22] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:55] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1002.eqiad.wmnet with reason: host reimage [14:49:51] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) [14:50:32] !log bking@deploy1002 Started deploy [wdqs/wdqs@0e051d8]: 0.3.123 [14:50:46] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [14:51:45] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1002.eqiad.wmnet with reason: host reimage [14:52:55] (03PS1) 10David Caro: k8s: Allow loading relative paths on 
kubeconfig certs [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/911880 [14:53:11] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:20] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) [14:54:16] That systemd error will be checked afterwards [14:54:27] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:34] !log cgoubert@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Datacenter Service Switchback - T335015 (duration: 81m 19s) [14:54:40] T335015: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 [14:55:22] Hmm [14:55:40] scap deployment fails from deploy1002 [14:55:46] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) [14:58:10] !log bking@deploy1002 Finished deploy [wdqs/wdqs@0e051d8]: 0.3.123 (duration: 07m 38s) [14:58:13] imagecatalog's database is read-only [14:58:40] (03PS14) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [14:59:09] And scap sync-files fails with [14:59:11] Your configuration specifies to merge with the ref 'refs/heads/master' [14:59:13] from the remote, but no such ref was fetched. [14:59:15] 14:58:51 sync-file failed: Command 'sudo -u mwbuilder /usr/local/bin/update-mediawiki-tools-release' returned non-zero exit status 1. 
[14:59:23] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [14:59:26] !log btullis@cumin1001 Added views for new wiki: guwwikinews T334408 [14:59:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [14:59:31] T334408: Prepare and check storage layer for guwwikinews - https://phabricator.wikimedia.org/T334408 [14:59:34] From the first try : [14:59:36] From https://gitlab.wikimedia.org/repos/releng/release [14:59:38] 1fc04f0..a4f8339 main -> origin/main [14:59:40] Your configuration specifies to merge with the ref 'refs/heads/master' [14:59:52] It looks like deploy1002 wasn't updated to checkout from the right branch [14:59:53] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40847/console" [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [15:00:15] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [15:00:18] !log btullis@cumin1001 Added views for new wiki: kcgwiktionary T334739 [15:00:18] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [15:00:27] T334739: Prepare and check storage layer for kcgwiktionary - https://phabricator.wikimedia.org/T334739 [15:00:51] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [15:00:54] !log btullis@cumin1001 Added views for new wiki: fatwiki T335018 [15:00:54] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [15:01:00] T335018: Prepare and check storage layer for fatwiki - https://phabricator.wikimedia.org/T335018 [15:01:03] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:17] claime: ugh [15:01:25] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [15:01:27] !log btullis@cumin1001 Added views for 
new wiki: ckbwiktionary T331834 [15:01:27] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [15:01:31] akosiaris: Switching origins manually [15:01:35] T331834: Prepare and check storage layer for ckbwiktionary - https://phabricator.wikimedia.org/T331834 [15:01:52] log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [15:01:56] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [15:01:59] !log btullis@cumin1001 Added views for new wiki: vewikimedia T330704 [15:01:59] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [15:01:59] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [15:02:06] T330704: Prepare and check storage layer for vewikimedia - https://phabricator.wikimedia.org/T330704 [15:02:09] akosiaris: scap sync-files is working rn [15:02:09] claime: ah master to main ? [15:02:12] akosiaris: yeah [15:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:24] well, it had to bite us somehow [15:02:25] Still have an issue with imagecatalog [15:02:33] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [15:02:37] anyway, happy you figured it out so quickly [15:02:57] what issue with the imagecatalog? [15:03:33] I think it's permissions on the sqlite db [15:03:40] Apr 25 14:57:44 deploy1002 imagecatalog[22506]: sqlite3.OperationalError: attempt to write a readonly database [15:04:39] ehmm, what? 
[15:05:27] !log cgoubert@deploy1002 Started deploy [restbase/deploy@a08f56d]: (no justification provided) [15:05:51] PROBLEM - ircecho bot process on irc2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [15:05:57] let me try to restart that one [15:06:01] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (10phaultfinder) [15:06:10] akosiaris: all yours, I'm testing deployments [15:06:23] 👍 [15:08:07] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1002.eqiad.wmnet with OS bullseye [15:08:12] (03CR) 10David Caro: [C: 04-1] profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [15:09:01] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:20] claime: fixed, but trying to figure out what happened. the sqlite file ended up owned by mwbuilder, not imagecatalog [15:09:48] akosiaris: puppet being overeager? [15:10:14] yeah, running it to see if that's the reason [15:10:19] well, apparently .. 
not [15:10:45] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (10phaultfinder) [15:11:18] we have an interesting discrepancy of gids/uids for imagecatalog across deploy hosts [15:11:23] (03PS1) 10Herron: kafka-logging: add kafka-logging100[12] with node ids 100[12] [puppet] - 10https://gerrit.wikimedia.org/r/911883 (https://phabricator.wikimedia.org/T326419) [15:11:26] and somehow also ownerships [15:14:39] (03CR) 10Herron: [C: 03+2] kafka-logging: add kafka-logging100[12] with node ids 100[12] [puppet] - 10https://gerrit.wikimedia.org/r/911883 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [15:15:24] (03PS4) 10Alexandros Kosiaris: thanos-fe: proper insetup Puppet roles to machine [puppet] - 10https://gerrit.wikimedia.org/r/906023 [15:15:26] (03PS1) 10Alexandros Kosiaris: machinetranslation: deployment_server stanzas [puppet] - 10https://gerrit.wikimedia.org/r/911886 (https://phabricator.wikimedia.org/T331505) [15:15:28] (03PS1) 10Alexandros Kosiaris: services_proxy: Add machinetranslation [puppet] - 10https://gerrit.wikimedia.org/r/911887 (https://phabricator.wikimedia.org/T331505) [15:15:44] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add machinetranslation tokens [labs/private] - 10https://gerrit.wikimedia.org/r/911878 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [15:15:54] (03PS15) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [15:16:14] (03CR) 10Alexandros Kosiaris: [C: 03+1] admin_ng: Make vim-ale happy [deployment-charts] - 10https://gerrit.wikimedia.org/r/911873 (owner: 10Alexandros Kosiaris) [15:16:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin_ng: Make vim-ale happy [deployment-charts] - 10https://gerrit.wikimedia.org/r/911873 (owner: 10Alexandros Kosiaris) [15:16:42] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin_ng: Create machinetranslation namespace 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/911874 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [15:17:07] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40848/console" [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [15:17:17] claime: Is it safe for me to deploy an updated version of scap? [15:17:33] dancy: Still running two scap deploys rn [15:17:42] dancy: In a few minutes [15:17:45] ok. Please ping me. [15:17:49] for sure [15:17:49] I'll prep in the meantime [15:17:51] ack [15:18:33] !log cgoubert@deploy1002 Finished deploy [restbase/deploy@a08f56d]: (no justification provided) (duration: 13m 06s) [15:18:59] !log Restoring restbase-async to codfw only - T335015 [15:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:04] T335015: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 [15:19:06] !log cgoubert@cumin2002 START - Cookbook sre.discovery.service-route depool restbase-async in eqiad: T335015 [15:19:08] !log cgoubert@cumin2002 START - Cookbook sre.dns.wipe-cache restbase-async.discovery.wmnet on all recursors [15:19:11] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase-async.discovery.wmnet on all recursors [15:19:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [15:19:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] "LGTM, merging! Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [15:19:44] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) [15:21:50] !log cgoubert@deploy1002 Synchronized README: check the deployment server after switchback - T335015 (duration: 19m 55s) [15:21:58] dancy: all yours [15:22:18] !log Datacenter Service Switchback concluded - T335015 [15:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:32] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on irc2002.wikimedia.org with reason: Non-functional, WIP for Bullseye update [15:22:44] (03Merged) 10jenkins-bot: admin_ng: Make vim-ale happy [deployment-charts] - 10https://gerrit.wikimedia.org/r/911873 (owner: 10Alexandros Kosiaris) [15:22:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on irc2002.wikimedia.org with reason: Non-functional, WIP for Bullseye update [15:22:57] 10SRE, 10SRE-Unowned, 10Wikimedia-IRC-RC-Server: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a3bb7d5d-06b6-47e0-986b-d299e4bb9639) set by jmm@cumin2002 for 2 days, 0:00:00 on 1 host(s) and their services... 
[15:22:59] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [15:23:08] (03Merged) 10jenkins-bot: admin_ng: Create machinetranslation namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/911874 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [15:23:38] (03PS5) 10KartikMistry: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 [15:24:03] Notice: /Stage[main]/Deployment::Deployment_server/File[/srv/deployment]/owner: owner changed 'mcrouter' to 'trebuchet' (corrective) [15:24:03] Notice: /Stage[main]/Imagecatalog/File[/srv/deployment/imagecatalog]/owner: owner changed 'helm' to 'imagecatalog' (corrective) [15:24:03] Notice: /Stage[main]/Imagecatalog/File[/srv/deployment/imagecatalog]/group: group changed 'helm' to 'imagecatalog' (corrective) [15:24:05] wow [15:24:07] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) 05In progress→03Resolved [15:24:08] that's on deploy2002 [15:24:10] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in eqiad: T335015 [15:24:15] how did that happen [15:24:16] T335015: 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 [15:24:35] akosiaris: when was that? [15:24:43] and trebuchet? it must be ... 5 years now that we don't have trebuchet? 
[15:24:46] claime: just about now [15:24:52] O_O [15:25:10] (03CR) 10Raymond Ndibe: profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [15:25:33] (03Merged) 10jenkins-bot: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [15:25:38] !log dancy@deploy1002 Installing scap version "4.50.0" for 592 hosts [15:25:39] akosiaris: There are still quite a few references to the trebuchet user in puppet [15:26:09] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:27:28] (03PS16) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [15:27:31] I 'll have a look at puppetboard tomorrow to see puppet run history for the deploy hosts. Those diffs, especially for imagecatalog are unexpected [15:27:35] !log btullis@cumin1001 Added views for new wiki: azwikimedia T330442 [15:27:35] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [15:27:41] T330442: Prepare and check storage layer for azwikimedia - https://phabricator.wikimedia.org/T330442 [15:28:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks! 
merging" [puppet] - 10https://gerrit.wikimedia.org/r/906023 (owner: 10Alexandros Kosiaris) [15:28:49] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40849/console" [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [15:28:51] !log update cr2-eqsin BBIX interface [15:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:34] !log dancy@deploy1002 Installing scap version "4.50.0" for 1 hosts [15:30:45] !log dancy@deploy1002 Installation of scap version "4.50.0" completed for 1 hosts [15:31:24] !log akosiaris@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:31:34] (03PS17) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [15:31:40] !log akosiaris@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:31:59] !log akosiaris@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:32:05] claime: I'm done. Thanks! [15:32:49] !log akosiaris@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:33:05] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:33:47] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:34:05] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[15:34:42] ^ the cr2-eqsin error is expected and should clear as soon as the other side configured their side - https://phabricator.wikimedia.org/T327284 [15:34:57] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40850/console" [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [15:35:31] (03PS1) 10Herron: kafka-logging: assign kafka::logging role to kafka-logging100[12] [puppet] - 10https://gerrit.wikimedia.org/r/911888 (https://phabricator.wikimedia.org/T326419) [15:35:42] 10SRE, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install X - https://phabricator.wikimedia.org/T334505 (10Jhancock.wm) [15:36:27] (03CR) 10Herron: [C: 03+2] kafka-logging: assign kafka::logging role to kafka-logging100[12] [puppet] - 10https://gerrit.wikimedia.org/r/911888 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [15:36:59] (03PS1) 10Alexandros Kosiaris: mc-wf100[12]: Add memcached role [puppet] - 10https://gerrit.wikimedia.org/r/911889 (https://phabricator.wikimedia.org/T313965) [15:38:12] (03CR) 10Alexandros Kosiaris: [C: 03+2] mc-wf100[12]: Add memcached role [puppet] - 10https://gerrit.wikimedia.org/r/911889 (https://phabricator.wikimedia.org/T313965) (owner: 10Alexandros Kosiaris) [15:39:33] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover: sre.discovery.datacenter should support only moving the active/passive services to the other datacenter - https://phabricator.wikimedia.org/T335364 (10Clement_Goubert) [15:39:41] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:39:59] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[15:41:13] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (10Clement_Goubert) 05In progress→03Resolved [15:41:29] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [15:41:37] (03Abandoned) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [15:42:27] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (10Clement_Goubert) 05Open→03Resolved [15:42:37] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [15:42:48] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover: sre.discovery.datacenter should support switching the active/passive services to the other datacenter - https://phabricator.wikimedia.org/T335364 (10Clement_Goubert) [15:42:58] cmutt [15:43:04] oops :) [15:46:12] cmutt: command not found [15:47:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [15:47:49] uh [15:48:12] * jhathaway looking [15:48:37] peering link [15:48:59] 
https://librenms.wikimedia.org/graphs/to=1682437500/id=22276/type=port_bits/from=1682415900/ [15:49:15] * brett is here [15:49:31] no https traffic though [15:49:54] AS396982 Google LLC [15:49:58] https://bgp.he.net/AS396982 [15:50:02] yep, was going to say [15:50:05] why outbound? [15:50:11] maybe sucking up images? [15:50:16] bblack: they're probably fetching big files [15:50:29] yes [15:50:32] src IP is upload-lb [15:50:34] huge responses at a low rps rate? [15:51:07] yeah.. https://grafana.wikimedia.org/d/oMIu2XI4z/data-transfer-rates?orgId=1&var-site=codfw&var-min_step=2m&var-cluster=All [15:51:18] https://w.wiki/6dLC [15:51:25] we're flatlining at 10G [15:51:56] we can requestctl block them for now? to save others [15:52:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [15:52:34] UA is Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.3 Safari/605.1.15 [15:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:55:52] (03CR) 10Ahmon Dancy: "+1 to not requiring users to perform an extra step or know about the magic parameter." 
[debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/893708 (owner: 10Lucas Werkmeister (WMDE)) [15:57:24] (03CR) 10David Caro: [C: 04-1] profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [15:58:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:00:07] jbond and rzl: Dear deployers, time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230425T1600). [16:00:07] No Gerrit patches in the queue for this window AFAICS. [16:03:03] (03PS1) 10Clément Goubert: sre.discovery.datacenter: exclude services not in production [cookbooks] - 10https://gerrit.wikimedia.org/r/911894 (https://phabricator.wikimedia.org/T335341) [16:04:29] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: sre.discovery.datacenter breaks on services not in "production" state - https://phabricator.wikimedia.org/T335341 (10Clement_Goubert) p:05Triage→03Medium a:03Clement_Goubert [16:05:39] (03CR) 10CI reject: [V: 04-1] sre.discovery.datacenter: exclude services not in production [cookbooks] - 10https://gerrit.wikimedia.org/r/911894 (https://phabricator.wikimedia.org/T335341) (owner: 10Clément Goubert) [16:06:28] (03PS2) 10Clément Goubert: sre.discovery.datacenter: exclude services not in production [cookbooks] - 10https://gerrit.wikimedia.org/r/911894 (https://phabricator.wikimedia.org/T335341) [16:07:30] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - 
https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:07:30] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:08:41] (03CR) 10CI reject: [V: 04-1] sre.discovery.datacenter: exclude services not in production [cookbooks] - 10https://gerrit.wikimedia.org/r/911894 (https://phabricator.wikimedia.org/T335341) (owner: 10Clément Goubert) [16:12:25] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (10DBu-WMF) By the way, I am @PPenloglou-WMF manager and I approve this request. [16:14:56] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:15:42] (03PS3) 10Clément Goubert: sre.discovery.datacenter: exclude services not in production [cookbooks] - 10https://gerrit.wikimedia.org/r/911894 (https://phabricator.wikimedia.org/T335341) [16:16:52] (03CR) 10Dzahn: "Thank you! Just to be sure, is the order of preference ascending or descending?" 
[puppet] - 10https://gerrit.wikimedia.org/r/911363 (https://phabricator.wikimedia.org/T333143) (owner: 10Dzahn) [16:17:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:17:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:17:53] (03CR) 10Dzahn: "oooooh.. I had seen that change but it didn't come to mind in relation to certbot. this makes a lot of sense.. good find!" [puppet] - 10https://gerrit.wikimedia.org/r/911765 (https://phabricator.wikimedia.org/T335161) (owner: 10Jelto) [16:18:12] second page acked [16:18:33] (03CR) 10Dzahn: [C: 03+1] gitlab: add ferm rule for certbot on WMCS [puppet] - 10https://gerrit.wikimedia.org/r/911765 (https://phabricator.wikimedia.org/T335161) (owner: 10Jelto) [16:19:26] (03PS4) 10Clément Goubert: sre.discovery.datacenter: exclude services not in production [cookbooks] - 10https://gerrit.wikimedia.org/r/911894 (https://phabricator.wikimedia.org/T335341) [16:19:32] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/911759 (https://phabricator.wikimedia.org/T330771) (owner: 10Jelto) [16:22:30] (03PS5) 10Hokwelum: make dumpsdata1006 the xmlfallback host [puppet] - 10https://gerrit.wikimedia.org/r/908995 (https://phabricator.wikimedia.org/T325232) [16:22:32] (03PS1) 10Hokwelum: WME refresh token api uses a post request [puppet] - 10https://gerrit.wikimedia.org/r/911897 [16:23:03] (03CR) 10CI reject: [V: 04-1] WME refresh token api uses a post request [puppet] - 10https://gerrit.wikimedia.org/r/911897 (owner: 10Hokwelum) [16:23:07] (03CR) 10Clément Goubert: [C: 03+1] "Would like to have @volans' opinion 
on this" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/893708 (owner: 10Lucas Werkmeister (WMDE)) [16:32:30] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:32:30] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:33:44] 10SRE, 10Wikimedia-Mailing-lists: Create arbcom-ru@wikimedia.org - https://phabricator.wikimedia.org/T262525 (10AntiAryan) Hi, is it russian arbcom? [16:36:30] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:36:48] (03Abandoned) 10BCornwall: service: Set LVS default scheduler to Maglev (mh) [software/spicerack] - 10https://gerrit.wikimedia.org/r/911349 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [16:49:54] (03PS1) 10Ottomata: java/openjdk-11 - base on debian bullsye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/911905 [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230425T1700) [17:02:21] (03PS3) 10BCornwall: lvs: Switch text/upload 'sh' schedulers to 'mh' [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) [17:04:59] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40857/console" [puppet] - 10https://gerrit.wikimedia.org/r/911350 
(https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [17:05:34] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul) Waiting on traffic team to come up with a plan on how to install the new serves. In the pass, what we did was to decommission one server re-use the cables... [17:11:44] (03PS2) 10Hokwelum: WME refresh token api uses a post request [puppet] - 10https://gerrit.wikimedia.org/r/911897 [17:12:22] (03CR) 10CI reject: [V: 04-1] WME refresh token api uses a post request [puppet] - 10https://gerrit.wikimedia.org/r/911897 (owner: 10Hokwelum) [17:12:48] (03PS3) 10Hokwelum: WME refresh token api uses a post request [puppet] - 10https://gerrit.wikimedia.org/r/911897 (https://phabricator.wikimedia.org/T335368) [17:13:13] (03CR) 10CI reject: [V: 04-1] WME refresh token api uses a post request [puppet] - 10https://gerrit.wikimedia.org/r/911897 (https://phabricator.wikimedia.org/T335368) (owner: 10Hokwelum) [17:21:02] (03PS4) 10Ottomata: Install flink operator in wikikube staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/904226 (https://phabricator.wikimedia.org/T333464) [17:21:04] (03CR) 10Ottomata: Install flink operator in wikikube staging-eqiad (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904226 (https://phabricator.wikimedia.org/T333464) (owner: 10Ottomata) [17:21:22] (03PS5) 10Ottomata: Install flink operator in wikikube staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/904226 (https://phabricator.wikimedia.org/T333464) [17:31:01] 10SRE, 10Wikimedia-Mailing-lists: Create arbcom-ru@wikimedia.org - https://phabricator.wikimedia.org/T262525 (10Dzahn) This was the ticket to create a list for Russian Arbcom. But it's not the place where you reach Russian Arbcom. You should do that via the email address in the comment above. 
[17:32:44] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:33:08] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:34:10] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:34:36] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:34:39] (03PS2) 10Herron: services: add kafka-logging100[12] to network rules and broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/911872 (https://phabricator.wikimedia.org/T326419) [17:36:23] (03PS4) 10Hokwelum: WME refresh token api uses a post request [puppet] - 10https://gerrit.wikimedia.org/r/911897 (https://phabricator.wikimedia.org/T335368) [17:39:41] (03PS1) 10Ottomata: hdfs_rsync job fixes [puppet] - 10https://gerrit.wikimedia.org/r/911913 (https://phabricator.wikimedia.org/T317167) [17:42:10] (03PS2) 10Ottomata: hdfs_rsync job fixes [puppet] - 10https://gerrit.wikimedia.org/r/911913 (https://phabricator.wikimedia.org/T317167) [17:43:09] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40861/console" [puppet] - 10https://gerrit.wikimedia.org/r/911913 (https://phabricator.wikimedia.org/T317167) (owner: 10Ottomata) [17:44:11] (03CR) 10Joal: hdfs_rsync job fixes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/911913 (https://phabricator.wikimedia.org/T317167) (owner: 10Ottomata) [17:49:24] (03PS3) 10Ottomata: hdfs_rsync job fixes [puppet] - 10https://gerrit.wikimedia.org/r/911913 (https://phabricator.wikimedia.org/T317167) [17:49:44] RECOVERY - Router interfaces on cr1-codfw is 
OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:50:03] (03CR) 10Ottomata: hdfs_rsync job fixes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/911913 (https://phabricator.wikimedia.org/T317167) (owner: 10Ottomata) [17:50:09] (03CR) 10Ottomata: [C: 03+2] hdfs_rsync job fixes [puppet] - 10https://gerrit.wikimedia.org/r/911913 (https://phabricator.wikimedia.org/T317167) (owner: 10Ottomata) [17:50:16] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:52:46] (03PS1) 10Ottomata: hdfs_rsync - ensure old renamed systemd timers and script are absent [puppet] - 10https://gerrit.wikimedia.org/r/911915 (https://phabricator.wikimedia.org/T317167) [17:53:32] (03CR) 10Ottomata: [C: 03+2] hdfs_rsync - ensure old renamed systemd timers and script are absent [puppet] - 10https://gerrit.wikimedia.org/r/911915 (https://phabricator.wikimedia.org/T317167) (owner: 10Ottomata) [17:55:19] (03PS1) 10Ottomata: hdfs_rsync - Remove absented [puppet] - 10https://gerrit.wikimedia.org/r/911916 (https://phabricator.wikimedia.org/T317167) [18:00:05] jeena and jnuche: Dear deployers, time to do the MediaWiki train - Utc-7+Utc-0 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230425T1800). 
[18:05:00] (03CR) 10Ottomata: [C: 03+2] hdfs_rsync - Remove absented [puppet] - 10https://gerrit.wikimedia.org/r/911916 (https://phabricator.wikimedia.org/T317167) (owner: 10Ottomata) [18:10:42] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911919 (https://phabricator.wikimedia.org/T330212) [18:10:44] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911919 (https://phabricator.wikimedia.org/T330212) (owner: 10TrainBranchBot) [18:11:11] first run of "scap train" \o/ [18:11:50] (03CR) 10Eevans: [C: 03+1] swift: add new nodes, drain old nodes from the rings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/911779 (https://phabricator.wikimedia.org/T335278) (owner: 10MVernon) [18:11:52] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911919 (https://phabricator.wikimedia.org/T330212) (owner: 10TrainBranchBot) [18:12:02] nice! [18:14:05] yay! [18:15:52] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:16:57] (03PS5) 10Hokwelum: WME refresh token api uses a post request [puppet] - 10https://gerrit.wikimedia.org/r/911897 (https://phabricator.wikimedia.org/T335368) [18:18:47] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.6 refs T330212 [18:18:55] T330212: 1.41.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T330212 [18:19:17] (03PS5) 10Dzahn: gerrit: make the lfs data path configurable [puppet] - 10https://gerrit.wikimedia.org/r/911363 (https://phabricator.wikimedia.org/T333143) [18:20:08] (03CR) 10Dzahn: "I am going to put the default into hieradata/common/profile. 
but since almost everything else is not there.. how could the rspec tests eve" [puppet] - 10https://gerrit.wikimedia.org/r/911363 (https://phabricator.wikimedia.org/T333143) (owner: 10Dzahn) [18:25:30] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:25:37] (03PS1) 10Dzahn: gerrit: move hieradata from role/common to common/profile [puppet] - 10https://gerrit.wikimedia.org/r/911920 [18:26:29] (03CR) 10Dzahn: "Does this mean that basically EVERYTHING currently under hieradata/role/common should move to common/profile ?" [puppet] - 10https://gerrit.wikimedia.org/r/911363 (https://phabricator.wikimedia.org/T333143) (owner: 10Dzahn) [18:30:32] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/output/911363/40865/" [puppet] - 10https://gerrit.wikimedia.org/r/911363 (https://phabricator.wikimedia.org/T333143) (owner: 10Dzahn) [18:31:01] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (10phaultfinder) [18:36:46] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (10phaultfinder) [18:39:05] (03CR) 10Jbond: "LGTM, bar the comment inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/911894 (https://phabricator.wikimedia.org/T335341) (owner: 10Clément Goubert) [18:39:38] !log bking@deploy1002 Started deploy [wdqs/wdqs@0e051d8]: 0.3.123 [18:51:01] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335295 (10phaultfinder) [18:51:03] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [18:57:07] !log bking@deploy1002 Finished deploy [wdqs/wdqs@0e051d8]: 0.3.123 (duration: 17m 29s) [19:10:47] 10ops-knams: ManagementSSHDown - 
https://phabricator.wikimedia.org/T335299 (10phaultfinder) [19:11:01] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (10phaultfinder) [19:23:57] !log bking@cumin1001 finishing WDQS deploy...restarting `wdqs-categories` across lvs-managed hosts [19:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:08] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T335327 (10Papaul) 05Open→03Resolved a:03Papaul [19:46:16] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on wdqs2006.codfw.wmnet with reason: attempting WDQS stack on bullseye [19:46:30] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on wdqs2006.codfw.wmnet with reason: attempting WDQS stack on bullseye [19:46:47] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for wdqs2009.codfw.wmnet [19:46:48] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs2009.codfw.wmnet [19:48:05] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for wdqs2006.codfw.wmnet [19:48:06] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs2006.codfw.wmnet [19:48:36] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on wdqs2012.codfw.wmnet with reason: attempting WDQS stack on bullseye [19:48:50] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on wdqs2012.codfw.wmnet with reason: attempting WDQS stack on bullseye [19:52:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [19:56:12] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10Jdforrester-WMF) 
[20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230425T2000). [20:00:04] No Gerrit patches in the queue for this window AFAICS. [20:02:11] (03PS1) 10Andrew Bogott: rabbitmq: add a single-purpose metric to detect network partition [puppet] - 10https://gerrit.wikimedia.org/r/911927 (https://phabricator.wikimedia.org/T335304) [20:02:36] (03CR) 10CI reject: [V: 04-1] rabbitmq: add a single-purpose metric to detect network partition [puppet] - 10https://gerrit.wikimedia.org/r/911927 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [20:03:14] (03PS2) 10Andrew Bogott: rabbitmq: add a single-purpose metric to detect network partition [puppet] - 10https://gerrit.wikimedia.org/r/911927 (https://phabricator.wikimedia.org/T335304) [20:05:22] (03CR) 10CI reject: [V: 04-1] rabbitmq: add a single-purpose metric to detect network partition [puppet] - 10https://gerrit.wikimedia.org/r/911927 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [20:07:52] (03PS3) 10Andrew Bogott: rabbitmq: add a single-purpose metric to detect network partition [puppet] - 10https://gerrit.wikimedia.org/r/911927 (https://phabricator.wikimedia.org/T335304) [20:08:53] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [20:10:21] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [20:36:30] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:38:53] 10SRE-OnFire, 10Traffic, 10conftool, 10serviceops, 10Sustainability (Incident Followup): Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (10BCornwall) @bblack and @cdanis: Could the ticket title/description be updat... [20:41:08] (03CR) 10Dzahn: [C: 03+2] "no change on prod host: https://puppet-compiler.wmflabs.org/output/911363/40865/gerrit1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/911363 (https://phabricator.wikimedia.org/T333143) (owner: 10Dzahn) [20:41:20] 10SRE, 10Infrastructure-Foundations, 10Traffic: Set NEL `success_fraction: 1.0` on HTTP responses for measurement domains - https://phabricator.wikimedia.org/T334608 (10BCornwall) 05Open→03Resolved Thanks for doing that! [20:41:22] 10SRE, 10Infrastructure-Foundations, 10Traffic: Serve an HTTP response for measurement domains directly from Varnish - https://phabricator.wikimedia.org/T332028 (10BCornwall) [20:47:38] (03PS3) 10Dzahn: gerrit: relocate LFS data [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [20:48:49] (03CR) 10Dzahn: "amended. now it's just a Hiera change and making it the default. 
I think that we can do this after switching to gerrit1003 which will have" [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [21:00:11] (03CR) 10Dzahn: [V: 03+1 C: 03+2] site: add gerrit prod role to gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/910049 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [21:00:12] 10SRE, 10Traffic: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (10BCornwall) 05Stalled→03In progress [21:04:00] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on gerrit1003.wikimedia.org with reason: setup [21:04:13] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gerrit1003.wikimedia.org with reason: setup [21:06:55] !log adding production gerrit role to new machine gerrit1003 - monitoring downtimed - but it has a service IP that is going to be added by this and cant be downtimed ? (Bug: T326368) [21:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:02] T326368: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 [21:10:41] (03PS1) 10Eevans: cassandra_dev: Upgrade cluster to 'dev' version (3.11.14) [puppet] - 10https://gerrit.wikimedia.org/r/911934 (https://phabricator.wikimedia.org/T335383) [21:12:02] !log once again running into T257317 when applying gerrit role to new hardware [21:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:08] T257317: scap deploy --init on deployment server fails on first puppet run - https://phabricator.wikimedia.org/T257317 [21:12:28] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/911934 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [21:14:03] (03PS2) 10Eevans: cassandra_dev: Upgrade cluster to 'dev' version (3.11.14) [puppet] - 10https://gerrit.wikimedia.org/r/911934 
(https://phabricator.wikimedia.org/T335383) [21:14:07] !log gerrit1003 - manually replacing deploy2002 with deploy1002 in /srv/deployment/gerrit/gerrit-cache/.config to fix initial scap deployment T257317 T326368 [21:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:16] T326368: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 [21:14:29] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/911934 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [21:17:02] !log gerrit1003 - mv /srv/gerrit/plugins/lfs /srv/gerrit/data/ T333143 T326368 [21:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:08] T333143: Move Gerrit data out of root partition - https://phabricator.wikimedia.org/T333143 [21:19:20] !log gerrit1003 - chown -R gerrit2:gerrit2 /srv/gerrit T333143 T326368 [21:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:27] T326368: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 [21:27:28] 10SRE, 10serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10Dzahn) https://gerrit-new.wikimedia.org/r/ [21:27:50] 10SRE, 10serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10Dzahn) [21:28:05] 10SRE, 10serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10Dzahn) [21:33:58] (03PS1) 10Ryan Kemper: wdqs: try alternative slo query approach [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/911936 (https://phabricator.wikimedia.org/T328306) [21:37:52] (03CR) 10Ryan Kemper: "Preview dashboard: https://grafana.wikimedia.org/dashboard/snapshot/Nm3w6jgEby6iIAdNb3hnXy7J62PwN81V?orgId=1" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/911936 (https://phabricator.wikimedia.org/T328306) (owner: 10Ryan Kemper) 
[21:39:53] (03PS1) 10Ryan Kemper: nit: fix missing space in desc [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/911939 [21:40:30] !log gerrit1003 - chown -R gerrit2:gerrit2 /var/lib/gerrit2/review_site/ - T326368 [21:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:36] T326368: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 [21:46:19] (03PS2) 10Ryan Kemper: wdqs: try alternative slo query approach [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/911936 (https://phabricator.wikimedia.org/T328306) [21:48:57] (03CR) 10Ryan Kemper: "Previous patch wasn't calculating the actual error budget. Here's the preview for PS2 which does display it as an error budget: https://gr" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/911936 (https://phabricator.wikimedia.org/T328306) (owner: 10Ryan Kemper) [21:53:21] (03PS1) 10Ebernhardson: search: Report age of titlesuggest indices to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/911940 (https://phabricator.wikimedia.org/T327199) [21:53:57] (03CR) 10CI reject: [V: 04-1] search: Report age of titlesuggest indices to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/911940 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson) [21:54:07] (03PS1) 10Dzahn: gerrit: make configurable whether service is running [puppet] - 10https://gerrit.wikimedia.org/r/911941 (https://phabricator.wikimedia.org/T326368) [21:54:15] (03PS3) 10Ryan Kemper: wdqs: try alternative slo query approach [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/911936 (https://phabricator.wikimedia.org/T328306) [21:54:35] (03CR) 10CI reject: [V: 04-1] gerrit: make configurable whether service is running [puppet] - 10https://gerrit.wikimedia.org/r/911941 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [22:04:01] (03PS4) 10Ryan Kemper: wdqs: try alternative slo query approach [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/911936 
(https://phabricator.wikimedia.org/T328306) [22:04:47] (03PS1) 10Jcrespo: Create new script to read recent logs and update backups metadata [software/mediabackups] - 10https://gerrit.wikimedia.org/r/911943 (https://phabricator.wikimedia.org/T327157) [22:05:11] (03PS2) 10Dzahn: gerrit: make configurable whether service is running [puppet] - 10https://gerrit.wikimedia.org/r/911941 (https://phabricator.wikimedia.org/T326368) [22:06:31] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (gerrit1003), Fresh: 124 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [22:07:19] (03CR) 10CI reject: [V: 04-1] gerrit: make configurable whether service is running [puppet] - 10https://gerrit.wikimedia.org/r/911941 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [22:15:52] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:18:38] (03CR) 10Ryan Kemper: "Okay, I'm happy with the resulting dashboard (this one was from PS4 which was mainly making the query look slightly cleaner):" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/911936 (https://phabricator.wikimedia.org/T328306) (owner: 10Ryan Kemper) [22:21:04] (03PS3) 10Dzahn: gerrit: make configurable whether service is running [puppet] - 10https://gerrit.wikimedia.org/r/911941 (https://phabricator.wikimedia.org/T326368) [22:23:14] (03CR) 10CI reject: [V: 04-1] gerrit: make configurable whether service is running [puppet] - 10https://gerrit.wikimedia.org/r/911941 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [22:25:30] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - 
https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [22:31:49] (03PS1) 10Ebernhardson: search: Add alert based on age of titlesuggest indices [alerts] - 10https://gerrit.wikimedia.org/r/911945 (https://phabricator.wikimedia.org/T327199) [22:33:02] (03CR) 10Dzahn: "Can't help the feeling that rspec tests make our life harder but we don't get much value from them." [puppet] - 10https://gerrit.wikimedia.org/r/911941 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [22:33:15] (03CR) 10CI reject: [V: 04-1] search: Add alert based on age of titlesuggest indices [alerts] - 10https://gerrit.wikimedia.org/r/911945 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson) [22:33:17] (03PS2) 10Ebernhardson: search: Report age of titlesuggest indices to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/911940 (https://phabricator.wikimedia.org/T327199) [22:33:53] (03CR) 10CI reject: [V: 04-1] search: Report age of titlesuggest indices to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/911940 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson) [22:34:32] (03PS4) 10Dzahn: gerrit: make configurable whether service is running [puppet] - 10https://gerrit.wikimedia.org/r/911941 (https://phabricator.wikimedia.org/T326368) [22:35:47] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (10phaultfinder) [22:36:44] (03PS3) 10Ebernhardson: search: Report age of titlesuggest indices to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/911940 (https://phabricator.wikimedia.org/T327199) [22:37:01] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (10phaultfinder) [22:39:23] (03CR) 10Ebernhardson: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/911945 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson) [22:40:59] (03PS2) 10Ebernhardson: search: Add 
alert based on age of titlesuggest indices [alerts] - 10https://gerrit.wikimedia.org/r/911945 (https://phabricator.wikimedia.org/T327199) [22:44:39] 10ops-ulsfo, 10DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (10Dzahn) [22:51:01] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335295 (10phaultfinder) [22:55:47] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [22:55:49] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335295 (10phaultfinder) [22:59:23] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:00:49] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:04:14] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 229.8k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [23:07:06] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 125 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [23:11:01] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (10phaultfinder) [23:14:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 211.2k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [23:15:47] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (10phaultfinder) 
[23:16:51] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [23:27:28] (03PS2) 10Cwhite: remove strict ecs version gate [puppet] - 10https://gerrit.wikimedia.org/r/906702 [23:35:18] (03PS1) 10EoghanGaffney: [gitlab/failover] Rename host flags [cookbooks] - 10https://gerrit.wikimedia.org/r/911951 [23:40:52] (03PS4) 10Raymond Ndibe: profile:toolforge:harbor: setup blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) [23:41:14] (03CR) 10Raymond Ndibe: profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [23:41:29] (03CR) 10Raymond Ndibe: profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [23:42:00] (03PS1) 10Jdrewniak: Set Vector 2022 as default skin on Polish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911952 (https://phabricator.wikimedia.org/T335311) [23:42:55] (03PS1) 10BryanDavis: Remove jessie and stretch image configuration [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/911953 [23:52:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps