[00:01:25] (PS1) Dzahn: gerrit: disable monitoring for gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/919244 (https://phabricator.wikimedia.org/T326368)
[00:01:48] (PS1) Papaul: Add cloudswift100[1-2] to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/919245 (https://phabricator.wikimedia.org/T289882)
[00:05:04] (PS1) Dzahn: gerrit: add parameter service_ensure, set to stopped on gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/919246 (https://phabricator.wikimedia.org/T326368)
[00:07:24] (CR) CI reject: [V: -1] gerrit: add parameter service_ensure, set to stopped on gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/919246 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[00:09:12] (CR) Dzahn: "the goal is to allow re-enabling puppet on gerrit1001 WITHOUT starting the gerrit service." [puppet] - https://gerrit.wikimedia.org/r/919246 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[00:11:14] (CR) Dzahn: "well I see there is still an issue here, will amend" [puppet] - https://gerrit.wikimedia.org/r/919246 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[00:12:58] SRE, ops-eqiad, DC-Ops, serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (Dzahn)
[00:13:10] SRE, Gerrit, Release-Engineering-Team, serviceops-collab, Patch-For-Review: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn) In progress→Resolved Service is implemented on gerrit1003. It is now the production server behind gerrit.wikimedia.org...
[00:15:26] PROBLEM - Host prometheus3001 is DOWN: PING CRITICAL - Packet loss = 100%
[00:21:34] ^ I know denisse was decom'ing that
[00:22:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on prometheus3001.esams.wmnet with reason: maintenance
[00:22:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on prometheus3001.esams.wmnet with reason: maintenance
[00:22:19] denisse: I put a 1 day downtime on prometheus1003
[00:23:39] mutante: Thanks!!
[00:23:58] ok, ack:) cu later.. going afk
[00:24:13] thanks for reviews earlier. those are merged
[00:28:47] SRE, Wikimedia-Mailing-lists: mailman3 discard_held_messages systemd script apparently failing since 2023-03-26 - https://phabricator.wikimedia.org/T336555 (MarcoAurelio)
[00:31:24] See you!! ^^
[00:32:21] !log manually removing prometheus4001.ulsfo.wmnet from the Ganeti master after a failed step in the decommission cookbook - T335585
[00:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:25] T335585: Decommission prometheus4001 - https://phabricator.wikimedia.org/T335585
[00:34:52] PROBLEM - Host prometheus4001 is DOWN: PING CRITICAL - Packet loss = 100%
[00:37:55] (JobUnavailable) firing: (2) Reduced availability for job envoy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:39:18] ops-ulsfo, DC-Ops, decommission-hardware, SRE Observability (FY2022/2023-Q4): Decommission prometheus4001 - https://phabricator.wikimedia.org/T335585 (andrea.denisse) In progress→Open a:andrea.denisse→None
[00:39:32] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/919193
[00:39:36] (CR) TrainBranchBot: [C: +2] Branch commit for 
wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/919193 (owner: TrainBranchBot)
[00:42:04] (PS4) Andrea Denisse: prometheus: Decommission prometheus5001 in eqsin [puppet] - https://gerrit.wikimedia.org/r/913251 (https://phabricator.wikimedia.org/T335587)
[00:42:39] (CR) Andrea Denisse: [C: +2] prometheus: Decommission prometheus5001 in eqsin [puppet] - https://gerrit.wikimedia.org/r/913251 (https://phabricator.wikimedia.org/T335587) (owner: Andrea Denisse)
[00:42:55] (JobUnavailable) firing: (2) Reduced availability for job envoy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:44:00] !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus5001.eqsin.wmnet
[00:48:24] !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[00:49:33] (CR) Andrea Denisse: prometheus: Decommission prometheus6001 in drmrs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/913256 (https://phabricator.wikimedia.org/T335588) (owner: Andrea Denisse)
[00:50:26] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus5001.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - denisse@cumin1001"
[00:51:05] (PS4) Andrea Denisse: prometheus: Decommission prometheus6001 in drmrs [puppet] - https://gerrit.wikimedia.org/r/913256 (https://phabricator.wikimedia.org/T335588)
[00:51:30] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus5001.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - denisse@cumin1001"
[00:51:31] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[00:51:31] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus5001.eqsin.wmnet
[00:53:54] (CR) Andrea Denisse: [C: +2] prometheus: Decommission prometheus6001 in drmrs [puppet] - https://gerrit.wikimedia.org/r/913256 (https://phabricator.wikimedia.org/T335588) (owner: Andrea Denisse)
[00:56:43] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/919193 (owner: TrainBranchBot)
[00:57:38] !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus6001.drmrs.wmnet
[01:00:25] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:46] !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[01:07:09] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus6001.drmrs.wmnet decommissioned, removing all IPs except the asset tag one - denisse@cumin1001"
[01:07:55] (JobUnavailable) firing: (3) Reduced availability for job envoy in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:08:28] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus6001.drmrs.wmnet decommissioned, removing all IPs except the asset tag one - denisse@cumin1001"
[01:08:28] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:08:29] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus6001.drmrs.wmnet
[01:08:36] ops-drmrs, DC-Ops, decommission-hardware, Patch-For-Review, SRE 
Observability (FY2022/2023-Q4): Decommission prometheus6001 - https://phabricator.wikimedia.org/T335588 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by denisse@cumin1001 for hosts: `prometheus6001.drmrs.wmnet` -...
[01:08:40] ops-drmrs, DC-Ops, decommission-hardware, Patch-For-Review, SRE Observability (FY2022/2023-Q4): Decommission prometheus6001 - https://phabricator.wikimedia.org/T335588 (andrea.denisse) In progress→Open a:andrea.denisse→None
[02:07:55] (JobUnavailable) firing: (5) Reduced availability for job envoy in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:22:55] (JobUnavailable) firing: (5) Reduced availability for job envoy in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:23:33] (JobUnavailable) firing: (5) Reduced availability for job envoy in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:23:33] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[02:38:18] (CR) Cwhite: [C: +1] role::webperf::profiling_tools: add redis instance for arclamp [puppet] - 
https://gerrit.wikimedia.org/r/919164 (https://phabricator.wikimedia.org/T327277) (owner: Herron)
[02:39:15] (CR) Cwhite: [C: +1] profile::arclamp::redis: introduce/move arclamp redis config to profile [puppet] - https://gerrit.wikimedia.org/r/919163 (https://phabricator.wikimedia.org/T327277) (owner: Herron)
[02:49:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:54:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:44:42] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:46:06] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:07:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:12:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:12:44] SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (Marostegui) I would go for the flea power drain, but I'd let @jcrespo decide as this is a backup source.
[05:21:43] (PS1) Marostegui: install_server: Do not reimage db1218 [puppet] - https://gerrit.wikimedia.org/r/919258
[05:22:32] (CR) Marostegui: [C: +2] install_server: Do not reimage db1218 [puppet] - https://gerrit.wikimedia.org/r/919258 (owner: Marostegui)
[05:23:16] (CR) Giuseppe Lavagetto: [C: +2] base.meta.pod_annotations: support annotations for prometheus scraping [deployment-charts] - https://gerrit.wikimedia.org/r/919055 (https://phabricator.wikimedia.org/T271822) (owner: Giuseppe Lavagetto)
[05:23:57] (Merged) jenkins-bot: base.meta.pod_annotations: support annotations for prometheus scraping [deployment-charts] - https://gerrit.wikimedia.org/r/919055 (https://phabricator.wikimedia.org/T271822) (owner: Giuseppe Lavagetto)
[05:26:02] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: add listeners to the tls fixture [deployment-charts] - https://gerrit.wikimedia.org/r/919156 (owner: Giuseppe Lavagetto)
[05:27:00] (Merged) jenkins-bot: mediawiki: add listeners to the tls fixture [deployment-charts] - https://gerrit.wikimedia.org/r/919156 (owner: Giuseppe Lavagetto)
[05:31:22] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,dc=eqiad,name=mw1461.eqiad.wmnet
[05:32:33] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,dc=eqiad,name=mw1458.eqiad.wmnet
[05:32:51] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,dc=eqiad,name=mw1466.eqiad.wmnet
[05:33:06] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: 
cluster=jobrunner,dc=eqiad,name=mw1495.eqiad.wmnet
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230512T0600)
[06:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:19:08] (PS1) Giuseppe Lavagetto: jobrunner: reduce max_requests_per_connection to 100 [puppet] - https://gerrit.wikimedia.org/r/919262 (https://phabricator.wikimedia.org/T336554)
[06:23:33] (JobUnavailable) firing: (3) Reduced availability for job envoy in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:23:33] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[06:39:20] SRE, Traffic, serviceops, Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (Joe) p:Triage→High
[06:44:14] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Robert Timm (WMDE) - https://phabricator.wikimedia.org/T336435 (roti_WMDE) Thanks a lot!
[06:55:54] SRE-swift-storage, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 12): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (gmodena) >>! In T330693#8846250, @Eevans wrote: > Ok, this is setup and has b... 
[06:56:50] (CR) Ayounsi: installserver: enable ZTP for network devices (2 comments) [puppet] - https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230512T0700)
[07:04:37] SRE, Traffic, serviceops, Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (Joe)
[07:05:04] (PS1) Volans: installserver: rename temporary juniper ZTP passwd [labs/private] - https://gerrit.wikimedia.org/r/919263 (https://phabricator.wikimedia.org/T336485)
[07:05:41] (CR) Volans: "replies inline" [puppet] - https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[07:09:16] SRE, DC-Ops, Infrastructure-Foundations, netops: Access port speed <= 100Mbps False posatives - https://phabricator.wikimedia.org/T336511 (ayounsi) The issue is that the check is run from the switch side, and for the switch the port is up `Physical interface: ge-3/0/22, Enabled, Physical link is...
[07:10:07] (CR) Ayounsi: [C: +1] installserver: rename temporary juniper ZTP passwd [labs/private] - https://gerrit.wikimedia.org/r/919263 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[07:10:13] SRE, Traffic, serviceops, Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (Joe) My idea for implementing this is as follows: - Create a benthos container - Add a release containing a `Deployment` with N replic... 
[07:16:39] (PS1) Volans: validators: fix creating a new device [software/netbox-extras] - https://gerrit.wikimedia.org/r/919266 (https://phabricator.wikimedia.org/T336547)
[07:17:25] (CR) Volans: "Tested on netbox-next: https://netbox-next.wikimedia.org/dcim/devices/4642/" [software/netbox-extras] - https://gerrit.wikimedia.org/r/919266 (https://phabricator.wikimedia.org/T336547) (owner: Volans)
[07:18:37] (CR) Volans: [V: +2 C: +2] installserver: rename temporary juniper ZTP passwd [labs/private] - https://gerrit.wikimedia.org/r/919263 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[07:20:11] (PS7) Volans: installserver: enable ZTP for network devices [puppet] - https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485)
[07:25:32] (CR) Ayounsi: [C: +1] validators: fix creating a new device [software/netbox-extras] - https://gerrit.wikimedia.org/r/919266 (https://phabricator.wikimedia.org/T336547) (owner: Volans)
[07:26:25] (CR) Volans: [C: +2] validators: fix creating a new device [software/netbox-extras] - https://gerrit.wikimedia.org/r/919266 (https://phabricator.wikimedia.org/T336547) (owner: Volans)
[07:26:57] (Merged) jenkins-bot: validators: fix creating a new device [software/netbox-extras] - https://gerrit.wikimedia.org/r/919266 (https://phabricator.wikimedia.org/T336547) (owner: Volans)
[07:27:10] (CR) Ayounsi: installserver: enable ZTP for network devices (1 comment) [puppet] - https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[07:27:11] !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox-canary
[07:27:33] !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox-canary
[07:28:11] !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox
[07:32:52] (PS1) Elukey: 
role::syslog::centralserver: tune benthos config [puppet] - https://gerrit.wikimedia.org/r/919268 (https://phabricator.wikimedia.org/T331801)
[07:32:55] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 20940
[07:34:13] (CR) Elukey: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41155/console" [puppet] - https://gerrit.wikimedia.org/r/919268 (https://phabricator.wikimedia.org/T331801) (owner: Elukey)
[07:35:30] (CR) Volans: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/919268 (https://phabricator.wikimedia.org/T331801) (owner: Elukey)
[07:35:35] (CR) Elukey: [V: +1] "To be merged (in case) next week :)" [puppet] - https://gerrit.wikimedia.org/r/919268 (https://phabricator.wikimedia.org/T331801) (owner: Elukey)
[07:42:50] (PS1) Majavah: Disable Graph (again) [mediawiki-config] - https://gerrit.wikimedia.org/r/919269 (https://phabricator.wikimedia.org/T336556)
[07:43:14] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/919269 (https://phabricator.wikimedia.org/T336556) (owner: Majavah)
[07:43:22] (CR) Zabe: [C: +1] Disable Graph (again) [mediawiki-config] - https://gerrit.wikimedia.org/r/919269 (https://phabricator.wikimedia.org/T336556) (owner: Majavah)
[07:43:24] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 20940
[07:44:07] (Merged) jenkins-bot: Disable Graph (again) [mediawiki-config] - https://gerrit.wikimedia.org/r/919269 (https://phabricator.wikimedia.org/T336556) (owner: Majavah)
[07:44:51] !log taavi@deploy1002 Started scap: Backport for [[gerrit:919269|Disable Graph (again) (T336556)]]
[07:45:15] !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox
[07:45:18] !log ayounsi@cumin1001 
START - Cookbook sre.network.peering with action 'configure' for AS: 13335
[07:46:25] !log taavi@deploy1002 taavi: Backport for [[gerrit:919269|Disable Graph (again) (T336556)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[07:47:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 13335
[07:54:21] (CR) Elukey: [V: +1 C: -1] "https://github.com/benthosdev/benthos/issues/1806#issuecomment-1545329948 :( :(" [puppet] - https://gerrit.wikimedia.org/r/919268 (https://phabricator.wikimedia.org/T331801) (owner: Elukey)
[07:56:35] (CR) Jelto: "looks mostly good, typo in httpbb path. Waiting with deploy until we got an answer in T336301." [puppet] - https://gerrit.wikimedia.org/r/918594 (https://phabricator.wikimedia.org/T336301) (owner: Dzahn)
[07:57:21] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:919269|Disable Graph (again) (T336556)]] (duration: 12m 29s)
[07:59:35] <_joe_> !log restarting envoyproxy on mw1439 to rebalance connections (see T336554)
[07:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:38] T336554: Repool jobrunners and videoscalers - https://phabricator.wikimedia.org/T336554
[08:00:36] <_joe_> !log do it also on mw1438
[08:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:08] <_joe_> !log restarting envoy on all jobrunners pooled in the jobrunner cluster T336554
[08:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:46] SRE, Infrastructure-Foundations, LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (akosiaris) Hi, couple of comments from my side. The plan above has a couple of caveats that unless more clearly detailed make it unfeasible. Specifically, as @jhathaway says... 
[08:15:34] (PS1) Slyngshede: Reconnect handling reworked. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/919273
[08:15:53] (CR) CI reject: [V: -1] Reconnect handling reworked. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/919273 (owner: Slyngshede)
[08:17:28] (PS1) Hashar: Reset the cached skin in RequestContext::setUser() [core] (wmf/1.41.0-wmf.8) - https://gerrit.wikimedia.org/r/919178 (https://phabricator.wikimedia.org/T336504)
[08:18:06] the train blocker got fixed with the above ^ (which I cherry picked from `master`)
[08:18:16] I will get it merged then roll the train at some point this morning
[08:18:59] (CR) Jelto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41156/console" [puppet] - https://gerrit.wikimedia.org/r/919211 (https://phabricator.wikimedia.org/T334958) (owner: Brennen Bearnes)
[08:23:24] (CR) Hashar: "Can you get it masked instead? ;) That will ensure it is not unexpectedly started (like one doing a service restart by mistake)." [puppet] - https://gerrit.wikimedia.org/r/919246 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[08:24:22] (CR) Btullis: [C: +2] Add configs to spark-defaults.conf to enable Iceberg. 
[puppet] - https://gerrit.wikimedia.org/r/914928 (https://phabricator.wikimedia.org/T335721) (owner: Xcollazo)
[08:25:00] (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:31:16] (CR) Jelto: [V: +1 C: +2] gitlab: block auto created users [puppet] - https://gerrit.wikimedia.org/r/919211 (https://phabricator.wikimedia.org/T334958) (owner: Brennen Bearnes)
[08:36:05] (CR) Hashar: [C: +2] Reset the cached skin in RequestContext::setUser() [core] (wmf/1.41.0-wmf.8) - https://gerrit.wikimedia.org/r/919178 (https://phabricator.wikimedia.org/T336504) (owner: Hashar)
[08:37:20] (CR) Btullis: [C: +1] "Looks good, thanks. As I understand it, mjolnir is no longer using python 3.7 on Hadoop, now that it has been migrated to Airflow." [puppet] - https://gerrit.wikimedia.org/r/917813 (owner: Muehlenhoff)
[08:38:30] (CR) Btullis: [C: +1] "Nice, thanks John." 
[puppet] - https://gerrit.wikimedia.org/r/919059 (owner: Jbond)
[08:51:29] (Merged) jenkins-bot: Reset the cached skin in RequestContext::setUser() [core] (wmf/1.41.0-wmf.8) - https://gerrit.wikimedia.org/r/919178 (https://phabricator.wikimedia.org/T336504) (owner: Hashar)
[08:52:30] backporting
[08:52:30] !log hashar@deploy1002 Started scap: Backport for [[gerrit:919178|Reset the cached skin in RequestContext::setUser() (T336504)]]
[08:52:34] T336504: Transcluding Special:Prefixindex can force the default skin - https://phabricator.wikimedia.org/T336504
[08:52:46] \o/
[08:54:00] !log hashar@deploy1002 hashar: Backport for [[gerrit:919178|Reset the cached skin in RequestContext::setUser() (T336504)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[08:57:49] (PS2) Slyngshede: Reconnect handling reworked. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/919273
[08:59:23] (PS8) Volans: installserver: enable ZTP for network devices [puppet] - https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485)
[08:59:25] (PS1) Volans: install_server: simplify DHCP config [puppet] - https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485)
[08:59:27] (PS1) Volans: install_server: convert dhcpd.conf to template [puppet] - https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485)
[09:00:09] (CR) CI reject: [V: -1] install_server: simplify DHCP config [puppet] - https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[09:00:21] (CR) CI reject: [V: -1] install_server: convert dhcpd.conf to template [puppet] - https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[09:01:07] (PS2) Volans: install_server: simplify DHCP config [puppet] - https://gerrit.wikimedia.org/r/919276 
(https://phabricator.wikimedia.org/T336485)
[09:01:09] (PS2) Volans: install_server: convert dhcpd.conf to template [puppet] - https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485)
[09:01:18] (CR) Volans: installserver: enable ZTP for network devices (1 comment) [puppet] - https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[09:01:58] (CR) CI reject: [V: -1] install_server: convert dhcpd.conf to template [puppet] - https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[09:04:04] (PS3) Volans: install_server: convert dhcpd.conf to template [puppet] - https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485)
[09:04:45] (CR) CI reject: [V: -1] install_server: convert dhcpd.conf to template [puppet] - https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[09:07:11] (PS4) Volans: install_server: convert dhcpd.conf to template [puppet] - https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485)
[09:08:50] (PS4) Slyngshede: Sphinx: Start work on documentation [software/bitu] - https://gerrit.wikimedia.org/r/908769
[09:08:54] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:08:57] (CR) Slyngshede: Sphinx: Start work on documentation (5 comments) [software/bitu] - https://gerrit.wikimedia.org/r/908769 (owner: Slyngshede)
[09:08:58] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:919178|Reset the cached skin in RequestContext::setUser() (T336504)]] (duration: 16m 27s)
[09:09:02] T336504: Transcluding Special:Prefixindex can force the default skin - https://phabricator.wikimedia.org/T336504
[09:10:25] (PS1) TrainBranchBot: 
group2 wikis to 1.41.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/919278 (https://phabricator.wikimedia.org/T330214)
[09:10:27] (CR) TrainBranchBot: [C: +2] group2 wikis to 1.41.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/919278 (https://phabricator.wikimedia.org/T330214) (owner: TrainBranchBot)
[09:11:33] (Merged) jenkins-bot: group2 wikis to 1.41.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/919278 (https://phabricator.wikimedia.org/T330214) (owner: TrainBranchBot)
[09:13:06] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,dc=eqiad,name=mw146[7-9].eqiad.wmnet
[09:13:22] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,dc=eqiad,name=mw1494.eqiad.wmnet
[09:15:56] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:18:26] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.8 refs T330214
[09:18:30] T330214: 1.41.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T330214
[09:20:12] https://phabricator.wikimedia.org/T336529 still happens
[09:20:20] but apparently does not have much impact
[09:21:23] (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[09:21:42] (PS5) Volans: install_server: convert dhcpd.conf to template [puppet] - https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485)
[09:22:12] (CR) Arturo Borrero Gonzalez: [C: +2] trove: bind on all interfaces [puppet] - https://gerrit.wikimedia.org/r/918570 (owner: Majavah)
[09:22:23] (CR) CI reject: [V: -1] install_server: convert dhcpd.conf to template [puppet] - 
https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[09:26:23] (PS6) Volans: install_server: convert dhcpd.conf to template [puppet] - https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485)
[09:27:06] (CR) CI reject: [V: -1] install_server: convert dhcpd.conf to template [puppet] - https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[09:29:18] (PS7) Volans: install_server: convert dhcpd.conf to template [puppet] - https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485)
[09:33:09] (CR) Kamila Součková: [C: +1] "I looked at https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service , did not check everything but it does make some sense" [dns] - https://gerrit.wikimedia.org/r/917306 (https://phabricator.wikimedia.org/T335505) (owner: Hnowlan)
[09:35:35] (CR) Jbond: [C: +1] "lgtm, minor nit inline" [puppet] - https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[09:39:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2131', diff saved to https://phabricator.wikimedia.org/P48205 and previous config saved to /var/cache/conftool/dbconfig/20230512-093950-root.json
[09:40:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2131.codfw.wmnet with reason: Maintenance
[09:40:13] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[09:40:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2131.codfw.wmnet with reason: Maintenance
[09:40:40] (PS9) Volans: installserver: enable ZTP for network devices [puppet] - https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485)
[09:40:42] (PS3) Volans: install_server: simplify 
DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485) [09:40:44] (03PS8) 10Volans: install_server: convert dhcpd.conf to template [puppet] - 10https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) [09:40:55] (03CR) 10Volans: "addressed comments" [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [09:41:22] (03CR) 10CI reject: [V: 04-1] install_server: simplify DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [09:42:08] (03CR) 10CI reject: [V: 04-1] install_server: convert dhcpd.conf to template [puppet] - 10https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [09:42:24] (03CR) 10David Caro: wmcs::firewall: add a way to block addresses in wmcs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918410 (owner: 10Jbond) [09:42:56] (03CR) 10CI reject: [V: 04-1] installserver: enable ZTP for network devices [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [09:45:34] (03PS10) 10Volans: installserver: enable ZTP for network devices [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) [09:45:36] (03PS4) 10Volans: install_server: simplify DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485) [09:45:38] (03PS9) 10Volans: install_server: convert dhcpd.conf to template [puppet] - 10https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) [09:45:40] (03PS1) 10Volans: install_server: remove mgmt subnet already managed [puppet] - 10https://gerrit.wikimedia.org/r/919282 (https://phabricator.wikimedia.org/T336485) [09:49:04] (03CR) 10Jbond: [C: 03+1] "lgtm minor nit" [puppet] - 10https://gerrit.wikimedia.org/r/919277 
(https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [09:49:18] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "the patch / idea itself looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/918410 (owner: 10Jbond) [09:49:38] (03CR) 10Jbond: [C: 03+1] installserver: enable ZTP for network devices [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [09:49:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48206 and previous config saved to /var/cache/conftool/dbconfig/20230512-094941-root.json [09:49:45] (03CR) 10Ayounsi: [C: 03+1] install_server: remove mgmt subnet already managed [puppet] - 10https://gerrit.wikimedia.org/r/919282 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [09:50:01] (03CR) 10Jbond: [C: 03+1] install_server: remove mgmt subnet already managed [puppet] - 10https://gerrit.wikimedia.org/r/919282 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [09:51:43] (03CR) 10Ayounsi: [C: 03+1] install_server: convert dhcpd.conf to template [puppet] - 10https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [09:54:17] (03CR) 10Ayounsi: [C: 03+1] install_server: simplify DHCP config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [10:04:38] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48208 and previous config saved to /var/cache/conftool/dbconfig/20230512-100446-root.json [10:05:04] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Access port 
speed <= 100Mbps False posatives - https://phabricator.wikimedia.org/T336511 (10jbond) WARNING: wild speculation > Is it possible that the server turns its interfaces off when the server is off? i guess if it has wake on lan, or some ty... [10:06:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [10:07:06] (03CR) 10Ayounsi: [C: 03+1] "time to test it :)" [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [10:07:07] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (10Marostegui) @Jclark-ctr Jaime is out today, but given that the mariadb service on the server is stopped, let's do the power drain. Please proc... [10:07:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [10:07:12] (03PS5) 10Volans: install_server: simplify DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485) [10:07:14] (03PS10) 10Volans: install_server: convert dhcpd.conf to template [puppet] - 10https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) [10:07:16] (03PS2) 10Volans: install_server: remove mgmt subnet already managed [puppet] - 10https://gerrit.wikimedia.org/r/919282 (https://phabricator.wikimedia.org/T336485) [10:07:37] (03CR) 10Volans: "addressed comment" [puppet] - 10https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [10:07:46] (03CR) 10Volans: "addressed comments" [puppet] - 10https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [10:07:48] (03CR) 10CI reject: [V: 04-1] install_server: simplify DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/919276 
(https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [10:08:02] (03CR) 10CI reject: [V: 04-1] install_server: convert dhcpd.conf to template [puppet] - 10https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [10:08:12] (03CR) 10CI reject: [V: 04-1] install_server: remove mgmt subnet already managed [puppet] - 10https://gerrit.wikimedia.org/r/919282 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [10:08:31] 10SRE, 10ops-codfw, 10DBA, 10database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (10Marostegui) @Jhancock.wm Jaime is out today, but the server is OFF (or unreachable), so please go ahead and replace the DIMM today if you can. Please leave... [10:08:40] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:08:59] (03PS80) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [10:09:12] (03CR) 10Jbond: [C: 03+2] puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [10:09:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db2184.codfw.wmnet with reason: Maintenance [10:10:06] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10Marostegui) @Jhancock.wm Jaime is out today, but I have stopped mariadb on it and powere it off, so please go ahead and replace the DIMM today if you can. Please leave the server UP once you're... 
[10:10:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2184.codfw.wmnet with reason: Maintenance [10:10:50] (03PS1) 10Btullis: Remove python-is-python3 package from hadoop bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/919286 (https://phabricator.wikimedia.org/T336281) [10:11:48] (03PS6) 10Volans: install_server: simplify DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485) [10:11:50] (03PS11) 10Volans: install_server: convert dhcpd.conf to template [puppet] - 10https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) [10:11:52] (03PS3) 10Volans: install_server: remove mgmt subnet already managed [puppet] - 10https://gerrit.wikimedia.org/r/919282 (https://phabricator.wikimedia.org/T336485) [10:12:47] (03CR) 10CI reject: [V: 04-1] install_server: convert dhcpd.conf to template [puppet] - 10https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [10:12:59] (03CR) 10CI reject: [V: 04-1] install_server: remove mgmt subnet already managed [puppet] - 10https://gerrit.wikimedia.org/r/919282 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [10:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:15:23] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/919286 (https://phabricator.wikimedia.org/T336281) (owner: 10Btullis) [10:15:52] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [10:16:02] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/919276 
(https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [10:16:33] (03PS12) 10Volans: install_server: convert dhcpd.conf to template [puppet] - 10https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) [10:16:35] (03PS4) 10Volans: install_server: remove mgmt subnet already managed [puppet] - 10https://gerrit.wikimedia.org/r/919282 (https://phabricator.wikimedia.org/T336485) [10:17:38] (03CR) 10Btullis: [C: 03+2] Remove python-is-python3 package from hadoop bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/919286 (https://phabricator.wikimedia.org/T336281) (owner: 10Btullis) [10:17:42] (03CR) 10Volans: "check experimenal" [puppet] - 10https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [10:17:51] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/919282 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [10:19:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48209 and previous config saved to /var/cache/conftool/dbconfig/20230512-101950-root.json [10:21:40] (03CR) 10JMeybohm: [V: 03+1] Make kubernetes::clusters the central place for k8s config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [10:22:18] (03PS2) 10Arturo Borrero Gonzalez: wikimediacloud.org: move openstack.codfw1dev.wikimediacloud.org to new VIP [dns] - 10https://gerrit.wikimedia.org/r/918525 (https://phabricator.wikimedia.org/T332153) [10:23:33] (JobUnavailable) firing: (3) Reduced availability for job envoy in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:23:33] (Access port speed <= 100Mbps) firing: Alert for device 
asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [10:24:14] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:25:20] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:26:32] (03CR) 10Jbond: "thanks see response inline" [puppet] - 10https://gerrit.wikimedia.org/r/918410 (owner: 10Jbond) [10:26:35] (03PS14) 10Jbond: wmcs::firewall: add a way to block addresses in wmcs [puppet] - 10https://gerrit.wikimedia.org/r/918410 [10:26:37] (03PS6) 10Jbond: firewall: Remove kafka_brokers_analytics [puppet] - 10https://gerrit.wikimedia.org/r/919059 [10:26:39] (03PS9) 10Jbond: profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) [10:26:41] (03PS9) 10Jbond: firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 [10:26:43] (03PS10) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [10:27:28] (03CR) 10CI reject: [V: 04-1] profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [10:28:53] (03CR) 10Jbond: [C: 03+1] "thanks, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [10:31:18] (03PS81) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [10:34:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48210 and previous 
config saved to /var/cache/conftool/dbconfig/20230512-103455-root.json [10:34:59] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41157/console" [puppet] - 10https://gerrit.wikimedia.org/r/919063 (https://phabricator.wikimedia.org/T277445) (owner: 10Herron) [10:35:38] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/919063 (https://phabricator.wikimedia.org/T277445) (owner: 10Herron) [10:36:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs::firewall: add a way to block addresses in wmcs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918410 (owner: 10Jbond) [10:38:46] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10cmooney) One question did arise to me, I'll mention it here but not sure we need to focus on it, at least initially. Shou... [10:41:33] (03PS82) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [10:41:35] (03PS1) 10Jbond: httpyaml: replace URI.escape [puppet] - 10https://gerrit.wikimedia.org/r/919291 (https://phabricator.wikimedia.org/T330490) [10:41:52] (03CR) 10Jbond: puppetserver: add puppetserver module (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [10:43:47] (03CR) 10Jbond: [C: 03+2] wmcs::firewall: add a way to block addresses in wmcs [puppet] - 10https://gerrit.wikimedia.org/r/918410 (owner: 10Jbond) [10:43:51] (03CR) 10Jbond: [C: 03+2] firewall: Remove kafka_brokers_analytics [puppet] - 10https://gerrit.wikimedia.org/r/919059 (owner: 10Jbond) [10:43:55] (03CR) 10Majavah: [C: 03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/918410 (owner: 10Jbond) [10:44:31] (03PS10) 10Jbond: profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) [10:44:39] (03PS10) 10Jbond: firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 [10:44:46] (03PS11) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [10:45:30] (03CR) 10CI reject: [V: 04-1] firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [10:45:41] (03PS1) 10Arturo Borrero Gonzalez: network: introduce cloud-private-b1-codfw subnet [puppet] - 10https://gerrit.wikimedia.org/r/919292 (https://phabricator.wikimedia.org/T324992) [10:45:50] (03PS83) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [10:48:11] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. Sorry my oversight should have added it here when we provisioned it." 
[puppet] - 10https://gerrit.wikimedia.org/r/919292 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [10:48:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] network: introduce cloud-private-b1-codfw subnet [puppet] - 10https://gerrit.wikimedia.org/r/919292 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [10:50:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48211 and previous config saved to /var/cache/conftool/dbconfig/20230512-105000-root.json [10:50:50] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [10:51:20] (03PS2) 10Jbond: httpyaml: replace URI.escape [puppet] - 10https://gerrit.wikimedia.org/r/919291 (https://phabricator.wikimedia.org/T330490) [10:51:26] (03CR) 10Jbond: [C: 03+2] puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [10:58:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:03:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:03:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:04:28] (03CR) 10JMeybohm: [C: 03+1] miscweb: add annualreport release to miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [11:05:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48212 and previous config saved to /var/cache/conftool/dbconfig/20230512-110505-root.json [11:08:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:08:50] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:51] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Volans) I've seen that option and decided that was not relevant for new host's ztp, but lmk if we need it too. The general... [11:10:57] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) [11:11:31] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: cloudgw: review security policy for edge network - https://phabricator.wikimedia.org/T336368 (10aborrero) 05Open→03Resolved Fixed! 
thanks [11:13:34] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:17:04] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10cmooney) >>! In T336485#8847055, @Volans wrote: > The general usage for that seems to me more for a "reimage" concept of u... [11:20:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48213 and previous config saved to /var/cache/conftool/dbconfig/20230512-112010-root.json [11:20:25] (03PS6) 10Jelto: miscweb: add annualreport release to miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171) [11:21:43] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10ayounsi) It would be useful during the initial provisioning to have the device running the Junos version we want on day 1.... 
[11:23:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:25:07] (03PS1) 10Jbond: cloudlb: faile hard if lookup fails [puppet] - 10https://gerrit.wikimedia.org/r/919297 [11:25:41] (03CR) 10CI reject: [V: 04-1] cloudlb: faile hard if lookup fails [puppet] - 10https://gerrit.wikimedia.org/r/919297 (owner: 10Jbond) [11:28:15] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [11:33:13] (03PS2) 10Jbond: cloudlb: fail hard if lookup fails [puppet] - 10https://gerrit.wikimedia.org/r/919297 [11:35:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48215 and previous config saved to /var/cache/conftool/dbconfig/20230512-113514-root.json [11:35:28] (03CR) 10CI reject: [V: 04-1] cloudlb: fail hard if lookup fails [puppet] - 10https://gerrit.wikimedia.org/r/919297 (owner: 10Jbond) [11:38:18] (03PS1) 10Arturo Borrero Gonzalez: network: data: add cloud codfw1dev 185.15.57.24/29 [puppet] - 10https://gerrit.wikimedia.org/r/919298 (https://phabricator.wikimedia.org/T324992) [11:38:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:39:30] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/919298 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [11:41:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] network: data: add cloud codfw1dev 185.15.57.24/29 [puppet] - 10https://gerrit.wikimedia.org/r/919298 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [11:45:20] (03CR) 10Jelto: [C: 03+2] miscweb: add annualreport release to miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [11:47:35] (03Merged) 10jenkins-bot: miscweb: add annualreport release to miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [11:50:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:55:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:56:53] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [11:58:01] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [12:02:46] 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q4: install disks in frmon1002 - https://phabricator.wikimedia.org/T336569 (10RobH) [12:03:09] 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q4: install disks in frmon1002 - https://phabricator.wikimedia.org/T336569 (10RobH) [12:03:42] 10SRE, 10Infrastructure-Foundations, 10netops, 
10Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney) ######Routing issue I hit an issue with the new spines in that the overlay loopback address was not reachable when they... [12:04:54] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:27] (03PS1) 10Jelto: miscweb: set lower resource requests for annual report microsite [deployment-charts] - 10https://gerrit.wikimedia.org/r/919299 (https://phabricator.wikimedia.org/T300171) [12:06:18] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:09:09] (03PS1) 10Majavah: ferm::service: allow passing array of hosts [puppet] - 10https://gerrit.wikimedia.org/r/919300 [12:10:31] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41161/console" [puppet] - 10https://gerrit.wikimedia.org/r/919300 (owner: 10Majavah) [12:22:44] (03CR) 10Jelto: [C: 03+2] miscweb: set lower resource requests for annual report microsite [deployment-charts] - 10https://gerrit.wikimedia.org/r/919299 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [12:23:40] (03Merged) 10jenkins-bot: miscweb: set lower resource requests for annual report microsite [deployment-charts] - 10https://gerrit.wikimedia.org/r/919299 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [12:25:00] (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:26:06] 10SRE, 10SRE-tools, 
10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Volans) Is it ok to start testing without it? Based on how we want the workflow to go we would need a change in Spicerack... [12:26:14] (03PS3) 10Volans: dhcp: expand support for hostname based match [software/spicerack] - 10https://gerrit.wikimedia.org/r/919052 (https://phabricator.wikimedia.org/T336485) [12:26:23] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [12:29:16] (03CR) 10Ayounsi: [C: 03+1] dhcp: expand support for hostname based match [software/spicerack] - 10https://gerrit.wikimedia.org/r/919052 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [12:36:07] (03PS1) 10Jelto: miscweb: fix image name for annual report microsite [deployment-charts] - 10https://gerrit.wikimedia.org/r/919304 (https://phabricator.wikimedia.org/T300171) [12:37:14] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [12:37:55] (JobUnavailable) firing: (3) Reduced availability for job envoy in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:40:43] (03CR) 10Jelto: [C: 03+2] miscweb: fix image name for annual report microsite [deployment-charts] - 10https://gerrit.wikimedia.org/r/919304 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [12:41:28] (03Merged) 10jenkins-bot: miscweb: fix image name for annual report microsite [deployment-charts] - 10https://gerrit.wikimedia.org/r/919304 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [12:42:22] 10SRE, 10Traffic, 10serviceops, 10Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (10fgiunchedi) >>! 
In T324200#8846715, @Joe wrote: > My idea for implementing this is as follows: > - Create a benthos container > - Add... [12:42:55] (JobUnavailable) firing: (3) Reduced availability for job envoy in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:45:32] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [12:46:22] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [12:47:55] (JobUnavailable) firing: (5) Reduced availability for job envoy in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:52:34] I'm looking into some of these failures btw ^ [12:57:55] (JobUnavailable) resolved: (2) Reduced availability for job envoy in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:06:08] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:06:24] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:09:04] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:11:10] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:18:35] (03PS1) 10Ssingh: varnish: bump size of varnish shared memory log to 160M (eqiad) [puppet] - 
10https://gerrit.wikimedia.org/r/919327 (https://phabricator.wikimedia.org/T253093) [13:19:48] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41162/console" [puppet] - 10https://gerrit.wikimedia.org/r/919327 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [13:20:17] (03CR) 10Ssingh: [V: 03+1 C: 03+2] varnish: bump size of varnish shared memory log to 160M (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/919327 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [13:20:54] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10ayounsi) sgtm as it's an additional feature and to prevent scope creep but might be worth looking at implementing it soone... [13:22:45] !log sudo cumin -b1 -s1200 'A:cp and A:eqiad' 'varnish-frontend-restart': T253093 [13:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:49] 10SRE, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1110.eqiad.wmnet - https://phabricator.wikimedia.org/T335011 (10Ladsgroup) [13:22:50] T253093: varnish-frontend-fetcherr: Assert error in vslc_vtx_next, 100% CPU usage - https://phabricator.wikimedia.org/T253093 [13:24:45] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::arclamp::redis: introduce/move arclamp redis config to profile [puppet] - 10https://gerrit.wikimedia.org/r/919163 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [13:24:51] (03CR) 10Filippo Giunchedi: [C: 03+1] role::webperf::profiling_tools: add redis instance for arclamp [puppet] - 10https://gerrit.wikimedia.org/r/919164 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [13:29:14] (03CR) 10Filippo Giunchedi: "LGTM! 
just a couple of minor nits" [alerts] - 10https://gerrit.wikimedia.org/r/918547 (https://phabricator.wikimedia.org/T320620) (owner: 10Cwhite) [13:30:07] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:31:18] (03CR) 10Hashar: [C: 03+1] gerrit: disable monitoring for gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/919244 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [13:35:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:38:25] (03PS1) 10Filippo Giunchedi: prometheus: don't fail on unknown blackbox probe type [puppet] - 10https://gerrit.wikimedia.org/r/919331 (https://phabricator.wikimedia.org/T320620) [13:38:29] !log disable puppet on A:dns-rec to merge CR 919067 [13:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:08] (03CR) 10Ssingh: [C: 03+2] hiera: fix dns*.yaml resolving nameservers [puppet] - 10https://gerrit.wikimedia.org/r/919067 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [13:42:28] RECOVERY - Host db2139 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms [13:45:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:46:26] _joe_: how's the jobrunner's queue looking? 
we got some flap in the alert ^^^ [13:46:48] <_joe_> volans: I suppose it's just more videoscaling [13:46:55] ok [13:47:01] side effect [13:47:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:47:54] <_joe_> volans: https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?orgId=1&viewPanel=10 [13:47:59] <_joe_> so yeah [13:48:22] git it [13:48:28] *got it (off by one) [13:51:22] <_joe_> tbh the thing that really worries me is that most hosts, even the ones NOT doing videoscaling, are at 75% of cpu usage [13:51:44] <_joe_> I assume this is a partial consequence of moving the parsoid warming jobs [13:52:41] right [13:53:02] <_joe_> let me take a look at the parsoid cluster as well [13:53:14] 10SRE, 10ops-codfw, 10DBA, 10database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (10Jhancock.wm) @Marostegui DIMM_B6 has been replaced. the server is UP and I can reach it via idrac and ping the IP. Do you want to leave this ticket open for... 
[13:53:45] <_joe_> yeah cpu usage went down a lot there [13:53:52] <_joe_> we should probably move some servers over [13:53:55] <_joe_> next week though [13:54:25] <_joe_> volans: some servers dedicated to videoscaling are completely offloaded, as in they have no active connections [13:54:31] <_joe_> so I don't get the probe failure [13:54:42] !log enable puppet and run agent in A:dns-rec: done deploying CR 919067 [13:54:44] <_joe_> any idea what the probe calls? [13:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:01] (03Abandoned) 10Ssingh: hiera: decommission dns2001 [puppet] - 10https://gerrit.wikimedia.org/r/917365 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [13:57:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:59:01] _joe_: not off the top of my head, I can check [13:59:07] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:59:30] <_joe_> I am asking because AFAICS jobs are being processed correctly [14:02:43] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (10ayounsi) An alternative (or complement) here would be to go the gNMI way, probably through gNMIc https://github.com/openconfig/gnmic https://www.youtube.com/w... 
[14:04:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:04:26] <_joe_> volans: I think the best course of action right now is to leave things as-is [14:04:37] ack [14:04:41] <_joe_> or at worst we downtime the videoscaler probe [14:04:55] <_joe_> but we're not being paged, as I don't think these probes should [14:05:12] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:01] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10Jhancock.wm) @Marostegui DIMM has been replaced. Server is up. everything looks green. lmk if you want to keep this one open for observation or close the ticket. returning bad DIMM tracking: 398... [14:07:49] (03PS1) 10Ssingh: hiera: decommission dns2001 [puppet] - 10https://gerrit.wikimedia.org/r/919340 (https://phabricator.wikimedia.org/T335777) [14:09:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:09:23] _joe_: I think it should call path: /w/health-check.php but I might be wrong [14:10:13] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Papaul) I do agree that we can also have the Junos image for upgrade during the process. Our first goal here was to have... 
[14:10:17] (03PS1) 10Majavah: P:opesntack::pdns: remove manual forwarding rules for wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/919341 (https://phabricator.wikimedia.org/T336566) [14:10:50] (03CR) 10Ebernhardson: [C: 03+1] Obsolete profile::python37 [puppet] - 10https://gerrit.wikimedia.org/r/917813 (owner: 10Muehlenhoff) [14:11:32] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41163/console" [puppet] - 10https://gerrit.wikimedia.org/r/919341 (https://phabricator.wikimedia.org/T336566) (owner: 10Majavah) [14:13:00] (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove dns2001 from anycast_neighbors (host decom) [homer/public] - 10https://gerrit.wikimedia.org/r/917364 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [14:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:13:41] !log homer "cr*-codfw*" commit "Gerrit: 917364 remove to-be decommissioned host dns2001": T335777 [14:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:45] T335777: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 [14:14:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:14:26] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:14:44] (03CR) 10Papaul: [C: 03+2] Add cloudswift100[1-2] to site.pp and netboot.cfg 
[puppet] - 10https://gerrit.wikimedia.org/r/919245 (https://phabricator.wikimedia.org/T289882) (owner: 10Papaul) [14:14:57] <_joe_> cdanis, denisse ^^ these failed probes to jobrunners/videoscalers are caused by pressure by a lot of jobs (including parsoid re-parses) and video transcoding [14:15:08] <_joe_> My suggestion would be not to intervene and let the storm pass [14:15:11] _joe_: yeah, a recurring problem [14:15:13] <_joe_> jobs are being executed [14:15:17] (03PS1) 10Andrew Bogott: Openstack galera/mariadb grants: allow access via haproxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/919342 [14:15:19] ACK, thanks! [14:15:20] <_joe_> I think we sometimes overreact [14:15:38] a probe is kind of a bad fit for monitoring this [14:15:43] (03CR) 10CI reject: [V: 04-1] Openstack galera/mariadb grants: allow access via haproxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/919342 (owner: 10Andrew Bogott) [14:15:53] <_joe_> but if you want to play with load balancers, I suggest you first merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/919262 [14:15:55] !log [done] homer "cr*-codfw*" commit "Gerrit: 917364 remove to-be decommissioned host dns2001": T335777 [14:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:00] <_joe_> because we have a big issue with load balancing [14:16:06] (03CR) 10Ssingh: [C: 03+2] hiera: decommission dns2001 [puppet] - 10https://gerrit.wikimedia.org/r/919340 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [14:16:41] I'm tempted to merge that, can't imagine it making things worse [14:16:51] I'm also tempted to downtime both jobrunner and videoscaler probes for the weekend tbh _joe_ [14:16:53] thoughts? 
[14:17:12] <_joe_> cdanis: I said so some minutes ago [14:17:15] <_joe_> so I agree :D [14:17:25] cool :) [14:18:13] (03PS2) 10Andrew Bogott: Openstack galera/mariadb grants: allow access via haproxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/919342 [14:19:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:19:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:opesntack::pdns: remove manual forwarding rules for wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/919341 (https://phabricator.wikimedia.org/T336566) (owner: 10Majavah) [14:21:17] (03CR) 10Andrea Denisse: [C: 03+1] mwlog: rotate api.log hourly [puppet] - 10https://gerrit.wikimedia.org/r/919063 (https://phabricator.wikimedia.org/T277445) (owner: 10Herron) [14:21:22] (03PS3) 10Andrew Bogott: Openstack galera/mariadb grants: allow access via haproxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/919342 [14:22:01] (03PS4) 10Arturo Borrero Gonzalez: Openstack galera/mariadb grants: allow access via haproxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/919342 (https://phabricator.wikimedia.org/T324992) (owner: 10Andrew Bogott) [14:23:33] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [14:23:38] (03CR) 10Majavah: "I'm trying to think of a scenario where a service would be connecting directly not via HAProxy - are there any? 
If not, we can drop grants" [puppet] - 10https://gerrit.wikimedia.org/r/919342 (https://phabricator.wikimedia.org/T324992) (owner: 10Andrew Bogott) [14:24:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns2001.wikimedia.wmnet [14:24:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:24:31] (03PS5) 10Andrew Bogott: Openstack galera/mariadb grants: allow access via haproxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/919342 [14:24:34] PROBLEM - PHP7 rendering on mw1466 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:25:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:26:00] RECOVERY - PHP7 rendering on mw1466 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 1.858 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:27:11] _joe_: yes to the best of my knowledge that thing is calling https://jobrunner.svc.eqiad.wmnet/w/health-check.php (but the config has the IP) [14:27:40] same for the videoscaler [14:28:31] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ssingh) [14:29:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:27] !log 
sukhe@cumin2002 START - Cookbook sre.dns.netbox [14:30:28] (03CR) 10Andrew Bogott: Openstack galera/mariadb grants: allow access via haproxy nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919342 (owner: 10Andrew Bogott) [14:30:46] (03PS11) 10Arturo Borrero Gonzalez: templates: add 20.172.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) [14:32:20] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (10Jclark-ctr) @Marostegui i performed flea power drain it is powering up right now [14:34:39] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns2001.wikimedia.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [14:35:28] (03PS6) 10Andrew Bogott: Openstack galera/mariadb grants: allow access via haproxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/919342 (https://phabricator.wikimedia.org/T324992) [14:35:49] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns2001.wikimedia.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [14:35:49] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:35:50] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dns2001.wikimedia.wmnet [14:35:59] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns2001.wikimedia.wmnet` - dns2001.wikimedia.wmnet (**FAIL**) - Down... 
[14:36:43] !log silencing jobrunner/videoscaler probes for the weekend [14:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:22] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [14:38:58] !log silencing jobrunner/videoscaler probes for the weekend -- silence ID 21903b52-047b-43d9-94be-908a4b92b5a7 [14:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:31] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:40:07] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:40:45] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy Bloom-560m model on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/919345 (https://phabricator.wikimedia.org/T333861) [14:44:22] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:45:57] (03PS1) 10Arturo Borrero Gonzalez: wikimedia.cloud: delegate svc.wikimedia.cloud to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/919346 [14:46:22] (03CR) 10Majavah: [C: 03+1] wikimedia.cloud: delegate svc.wikimedia.cloud to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/919346 (owner: 10Arturo Borrero Gonzalez) [14:46:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimedia.cloud: delegate svc.wikimedia.cloud to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/919346 (owner: 10Arturo Borrero Gonzalez) [14:46:40] (03CR) 10Arturo Borrero Gonzalez: [V: 
03+2 C: 03+2] wikimedia.cloud: delegate svc.wikimedia.cloud to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/919346 (owner: 10Arturo Borrero Gonzalez) [14:47:25] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/919291 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:47:34] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "Reference: https://phabricator.wikimedia.org/T336581" [dns] - 10https://gerrit.wikimedia.org/r/919346 (owner: 10Arturo Borrero Gonzalez) [14:49:40] (03CR) 10Dzahn: "for this to remove monitoring I also need to run puppet on gerrit1001.. so first I have to do the other change that masks the service" [puppet] - 10https://gerrit.wikimedia.org/r/919244 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [14:56:14] (03CR) 10Andrew Bogott: [C: 03+2] Openstack galera/mariadb grants: allow access via haproxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/919342 (https://phabricator.wikimedia.org/T324992) (owner: 10Andrew Bogott) [14:56:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns2001.wikimedia.org [15:01:20] !log sukhe@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS reimaging in codfw, blocking deploys T326767 [15:01:26] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [15:01:44] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [15:02:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:02:54] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dns2001.wikimedia.org [15:03:07] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns2001.wikimedia.org` - dns2001.wikimedia.org (**FAIL**) - //Unable... 
[15:04:50] (03CR) 10DCausse: "lgtm! left few questions/nits" [deployment-charts] - 10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata) [15:06:26] (03PS7) 10Volans: install_server: simplify DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485) [15:06:28] (03PS13) 10Volans: install_server: convert dhcpd.conf to template [puppet] - 10https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) [15:06:30] (03PS5) 10Volans: install_server: remove mgmt subnet already managed [puppet] - 10https://gerrit.wikimedia.org/r/919282 (https://phabricator.wikimedia.org/T336485) [15:06:32] PROBLEM - PHP7 jobrunner on mw1466 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [15:06:45] (03PS1) 10Majavah: openstack: db: use dnsquery::lookup [puppet] - 10https://gerrit.wikimedia.org/r/919349 [15:07:17] 10SRE, 10Observability-Logging: Logrotate fails for: "$FILE No such file or directory" - https://phabricator.wikimedia.org/T153940 (10Dzahn) [15:07:58] RECOVERY - PHP7 jobrunner on mw1466 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 4.810 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [15:08:03] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2012.codfw.wmnet with OS bullseye [15:08:13] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye [15:08:19] (03CR) 10Andrew Bogott: [C: 03+2] openstack: db: use dnsquery::lookup [puppet] - 10https://gerrit.wikimedia.org/r/919349 (owner: 10Majavah) [15:08:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "please get jbond to +1 as well. 
He has a pretty good understanding of the function internals" [puppet] - 10https://gerrit.wikimedia.org/r/919349 (owner: 10Majavah) [15:09:01] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs2012.codfw.wmnet with OS bullseye [15:09:11] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye executed w... [15:09:33] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Volans) Do we want to hardcode that in the dhcp settings? Or better to pass it dynamically to the cookbook? Based on that... [15:12:45] (03CR) 10Volans: [C: 03+2] dhcp: expand support for hostname based match [software/spicerack] - 10https://gerrit.wikimedia.org/r/919052 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [15:15:27] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Volans) [15:16:49] (03Merged) 10jenkins-bot: dhcp: expand support for hostname based match [software/spicerack] - 10https://gerrit.wikimedia.org/r/919052 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [15:18:06] PROBLEM - PHP7 rendering on mw1466 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:18:24] 10SRE, 10Commons, 10WMF-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10ssingh) Hi, following up on this to check if this issue still persists? 
[15:19:23] 10SRE, 10Discovery-Search, 10Datacenter-Switchover: Warn when CirrusSearch is not configured to use local DC for an extended time - https://phabricator.wikimedia.org/T204135 (10Dzahn) [15:19:34] RECOVERY - PHP7 rendering on mw1466 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 5.403 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:20:06] 10SRE, 10Datacenter-Switchover: Find a way to verify mediawiki-config IPs ahead of datacenter switchovers - https://phabricator.wikimedia.org/T163354 (10Dzahn) [15:20:17] (03PS3) 10Arturo Borrero Gonzalez: wikimediacloud.org: move openstack.codfw1dev.wikimediacloud.org to new VIP [dns] - 10https://gerrit.wikimedia.org/r/918525 (https://phabricator.wikimedia.org/T332153) [15:20:37] 10SRE, 10serviceops, 10Datacenter-Switchover: Find a way to verify mediawiki-config IPs ahead of datacenter switchovers - https://phabricator.wikimedia.org/T163354 (10Dzahn) [15:21:02] 10SRE, 10MediaWiki-Parser, 10Traffic: Varnish 503 errors on page with large number of flag icons. 
- https://phabricator.wikimedia.org/T267804 (10Dzahn) [15:21:47] 10SRE, 10Infrastructure-Foundations, 10netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367 (10Dzahn) [15:22:19] (03PS1) 10EoghanGaffney: Add nginx logs for docker-registry host to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/919350 (https://phabricator.wikimedia.org/T322579) [15:23:43] (03CR) 10Ayounsi: [C: 03+1] install_server: simplify DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [15:25:24] (03PS1) 10EoghanGaffney: Send nginx and docker-registry logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/919351 (https://phabricator.wikimedia.org/T322579) [15:26:43] (03PS2) 10EoghanGaffney: Send nginx and docker-registry logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/919351 (https://phabricator.wikimedia.org/T322579) [15:27:10] (03PS1) 10Arturo Borrero Gonzalez: cloudservices: codfw1dev: enable cloud-private subnet [puppet] - 10https://gerrit.wikimedia.org/r/919352 (https://phabricator.wikimedia.org/T324992) [15:34:39] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "on hold: cloudservices boxes are connected to the wiki-prod asw switch and need to be reconnected to cloudsw for this to work. BUT they al" [puppet] - 10https://gerrit.wikimedia.org/r/919352 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [15:38:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudswift1001.eqiad.wmnet with OS bullseye [15:38:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with... 
[15:42:35] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoyproxy: Fix most validation errors in the `good` build_envoy_config tests [puppet] - 10https://gerrit.wikimedia.org/r/773642 (https://phabricator.wikimedia.org/T304660) (owner: 10RLazarus) [15:43:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Allow basic validation of envoy config in CI [puppet] - 10https://gerrit.wikimedia.org/r/915745 (https://phabricator.wikimedia.org/T304660) (owner: 10JMeybohm) [15:43:45] (03PS1) 10David Caro: d/changelog: prepare release 0.98 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) [15:44:28] (03PS1) 10Ssingh: Revert "O:traffic: Add lvs::kernel_config during insetup to allow reimages" [puppet] - 10https://gerrit.wikimedia.org/r/919308 [15:44:55] (03CR) 10CI reject: [V: 04-1] Revert "O:traffic: Add lvs::kernel_config during insetup to allow reimages" [puppet] - 10https://gerrit.wikimedia.org/r/919308 (owner: 10Ssingh) [15:45:02] (03CR) 10Majavah: d/changelog: prepare release 0.98 (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [15:45:13] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoy: Move upstream HTTP config into the new HttpProtocolOptions message [puppet] - 10https://gerrit.wikimedia.org/r/916498 (https://phabricator.wikimedia.org/T303230) (owner: 10JMeybohm) [15:45:16] (03PS2) 10Ssingh: Revert "O:traffic: Add lvs::kernel_config during insetup to allow reimages" [puppet] - 10https://gerrit.wikimedia.org/r/919308 [15:45:24] (03CR) 10CI reject: [V: 04-1] d/changelog: prepare release 0.98 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [15:45:38] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoyproxy: Add python 3.11 to tox [puppet] - 10https://gerrit.wikimedia.org/r/916499 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) 
[15:46:19] (03CR) 10David Caro: d/changelog: prepare release 0.98 (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [15:46:55] (03CR) 10Ssingh: [C: 03+2] Revert "O:traffic: Add lvs::kernel_config during insetup to allow reimages" [puppet] - 10https://gerrit.wikimedia.org/r/919308 (owner: 10Ssingh) [15:46:57] (03CR) 10BBlack: [C: 03+1] "This stanza can't work correctly with the normal base image ferm stuff. For now we just accept the tradeoff that LVSes have to be fully r" [puppet] - 10https://gerrit.wikimedia.org/r/919308 (owner: 10Ssingh) [15:48:09] (03PS2) 10David Caro: d/changelog: prepare release 0.98 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) [15:48:14] (03CR) 10David Caro: "looking at the failed tests :/" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [15:48:20] (03CR) 10David Caro: d/changelog: prepare release 0.98 (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [15:49:55] (03CR) 10CI reject: [V: 04-1] d/changelog: prepare release 0.98 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [15:50:24] (03CR) 10Majavah: d/changelog: prepare release 0.98 (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [15:51:25] (03CR) 10Ssingh: [C: 03+2] Revert "Revert "lvs2012: commission new LVS host (codfw hardware refresh)"" [puppet] - 10https://gerrit.wikimedia.org/r/919126 (owner: 10Ssingh) [15:52:28] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 13): Storage request: swift s3 bucket for 
mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10JArguello-WMF) [15:52:31] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (10aborrero) [15:53:25] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2012.codfw.wmnet with OS bullseye [15:54:05] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2012.codfw.wmnet with OS bullseye [15:54:37] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2012.codfw.wmnet with OS bullseye [15:55:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye [15:55:28] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (10cmooney) [15:55:35] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) [15:55:51] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye executed w... 
[15:56:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye [15:56:19] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (10aborrero) p:05Triage→03Medium [15:57:34] (03CR) 10Jbond: openstack: db: use dnsquery::lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919349 (owner: 10Majavah) [15:59:18] (03PS1) 10BBlack: insetup::traffic: properly handle LVS case [puppet] - 10https://gerrit.wikimedia.org/r/919358 (https://phabricator.wikimedia.org/T336428) [16:00:11] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs2012.codfw.wmnet with OS bullseye [16:00:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye executed w... 
[16:00:34] (03PS1) 10Dzahn: gerrit: allow masking the service and do so on gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/919359 [16:01:22] (03Abandoned) 10Jbond: cloudlb: fail hard if lookup fails [puppet] - 10https://gerrit.wikimedia.org/r/919297 (owner: 10Jbond) [16:01:48] (03CR) 10Dzahn: "ACK, I have https://gerrit.wikimedia.org/r/c/operations/puppet/+/919359 instead now" [puppet] - 10https://gerrit.wikimedia.org/r/919246 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [16:02:01] (03Abandoned) 10Dzahn: gerrit: add parameter service_ensure, set to stopped on gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/919246 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [16:02:15] (03PS2) 10Dzahn: gerrit: allow masking the service and do so on gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/919359 (https://phabricator.wikimedia.org/T326368) [16:02:34] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (10aborrero) a:03Papaul [16:02:42] (03PS3) 10David Caro: d/changelog: prepare release 0.98 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) [16:03:16] (03CR) 10David Caro: "As we did not pin the version, my tox env had the old one and locally did not fail." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [16:03:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2012.codfw.wmnet with OS bullseye [16:04:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye [16:04:41] (03CR) 10BBlack: [C: 03+2] insetup::traffic: properly handle LVS case [puppet] - 10https://gerrit.wikimedia.org/r/919358 (https://phabricator.wikimedia.org/T336428) (owner: 10BBlack) [16:06:31] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) Current idea that has gained some momentum as part of {T297596} and {T324992}: * hook the cloudser... [16:06:39] (03CR) 10Volans: "Can't those hosts just use role::insetup_noferm that AFAIK was left aside from the team ones exactly for this use case?" 
[puppet] - 10https://gerrit.wikimedia.org/r/919358 (https://phabricator.wikimedia.org/T336428) (owner: 10BBlack) [16:06:53] (03PS2) 10Arturo Borrero Gonzalez: cloudservices: codfw1dev: enable cloud-private subnet [puppet] - 10https://gerrit.wikimedia.org/r/919352 (https://phabricator.wikimedia.org/T324992) [16:07:06] (03CR) 10Dzahn: [V: 03+1] "see compiler https://puppet-compiler.wmflabs.org/output/919359/41172/" [puppet] - 10https://gerrit.wikimedia.org/r/919359 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [16:08:01] (03PS4) 10Mforns: ::analytics::refinery::job::druid_load: remove remaining jobs [puppet] - 10https://gerrit.wikimedia.org/r/906667 (https://phabricator.wikimedia.org/T334095) [16:08:14] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs2012.codfw.wmnet with OS bullseye [16:08:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye executed w... 
[16:08:24] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2012.codfw.wmnet with OS bullseye [16:08:39] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye [16:10:10] (03PS3) 10Dzahn: gerrit: allow masking the service and do so on gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/919359 (https://phabricator.wikimedia.org/T326368) [16:12:12] (03PS1) 10Andrew Bogott: wmcs-novastats-dnsleaks: distinguish system and project sessions [puppet] - 10https://gerrit.wikimedia.org/r/919362 [16:13:39] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-dnsleaks: distinguish system and project sessions [puppet] - 10https://gerrit.wikimedia.org/r/919362 (owner: 10Andrew Bogott) [16:13:43] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (10Volans) >>! In T336428#8847879, @gerritbot wrote: > Change 919358 **merged** by BBlack: > %%%[operations/puppet@production] insetup::... [16:14:43] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10cmooney) >>! In T336485#8847630, @Volans wrote: > Do we want to hardcode that in the dhcp settings? Or better to pass it d... [16:15:19] (03CR) 10FNegri: "can we keep the changes to the code in a separate commit? usually the "prepare release" commit modifies the changelog and nothing else." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [16:16:24] 10SRE, 10Wikimedia-Mailing-lists: mailman3 discard_held_messages systemd script apparently failing since 2023-03-26 - https://phabricator.wikimedia.org/T336555 (10Dzahn) Thanks for reporting. I don't think there is a any concern here since lists1003 isn't the production server yet. It is a new machine that is... [16:16:59] (03PS3) 10Arturo Borrero Gonzalez: cloudservices: codfw1dev: enable cloud-private subnet [puppet] - 10https://gerrit.wikimedia.org/r/919352 (https://phabricator.wikimedia.org/T324992) [16:18:21] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 (10Dzahn) T336555 has been opened about alerts related to lists1003. Seems like expected though since this is still WIP. [16:19:42] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (10cmooney) >>! In T336428#8844635, @ssingh wrote: > I think the more probable cause is the switch issue @cmooney fixed above and while... [16:23:30] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2012.codfw.wmnet with reason: host reimage [16:23:41] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10Andrew) This plan sounds OK to me. We could also move the recursors onto VMs, at which point they'd need to... 
[16:25:00] (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:26:24] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on lists1003.wikimedia.org with reason: maintenance [16:26:37] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on lists1003.wikimedia.org with reason: maintenance [16:26:51] a ticket was opened about that alert as well ^ [16:26:57] but I see that isnt yet the prod server [16:27:17] so downtimed it over the weekend and left a comment on tickets [16:28:27] (03PS4) 10Arturo Borrero Gonzalez: cloudservices: codfw1dev: enable cloud-private subnet [puppet] - 10https://gerrit.wikimedia.org/r/919352 (https://phabricator.wikimedia.org/T324992) [16:28:55] 10SRE, 10Wikimedia-Mailing-lists: mailman3 discard_held_messages systemd script apparently failing since 2023-03-26 - https://phabricator.wikimedia.org/T336555 (10Dzahn) ` 16:25 <+jinxer-wm> (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_fil... 
[16:29:06] 10SRE, 10Wikimedia-Mailing-lists: mailman3 discard_held_messages systemd script apparently failing since 2023-03-26 - https://phabricator.wikimedia.org/T336555 (10Dzahn) p:05Triage→03Low [16:30:03] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2012.codfw.wmnet with reason: host reimage [16:32:25] (03PS1) 10Arturo Borrero Gonzalez: wikimedia.cloud: add cloudservices200[4/5]-dev cloud-private address [dns] - 10https://gerrit.wikimedia.org/r/919363 (https://phabricator.wikimedia.org/T307357) [16:33:58] 10SRE, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10Dzahn) Would it be right or wrong to tag this with Infrastructure-Foundations (or Infrastructure-Security)? I am just asking because part of clinic duty for ticket... [16:34:35] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Revisit use of swap and related kernel settings - https://phabricator.wikimedia.org/T266118 (10Dzahn) [16:34:46] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudswift1001.eqiad.wmnet with OS bullseye [16:34:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS... [16:40:43] 10SRE, 10serviceops: Better handling of memcached service - https://phabricator.wikimedia.org/T255132 (10Dzahn) [16:41:14] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (10Marostegui) Thanks John - I can reach the server just fine. 
[16:41:16] 10SRE, 10Infrastructure-Foundations: slapd fails to restart sometimes - https://phabricator.wikimedia.org/T269394 (10Dzahn) [16:43:07] 10SRE, 10ops-codfw, 10DBA, 10database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (10Marostegui) Thanks @Jhancock.wm. I can reach the host. I prefer to leave the ticket open, as Jaime owns this server, I want him to decide when he feels comf... [16:43:26] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10Marostegui) Thanks @Jhancock.wm. I can reach the host. I prefer to leave the ticket open, as Jaime owns this server, I want him to decide when he feels comfortable closing it. Thanks a lot for y... [16:44:51] 10SRE: x509-bundle as used by envoy::tlsproxy fails on single certificate file - https://phabricator.wikimedia.org/T283001 (10Dzahn) Any opinions on which SRE subteam should be tagged with this? serviceops, infra-foundations? [16:45:39] (03PS1) 10AikoChou: ml-services: add RevertRisk Wikidata model to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/919364 (https://phabricator.wikimedia.org/T333125) [16:47:44] 10SRE: allow non-roots to pool/depool certain DNS Discovery services - https://phabricator.wikimedia.org/T250557 (10Dzahn) Any opinions on which SRE subteam should be tagged? Is the pooling tooling (;p) software, serviceops or infra foundations to you? Just asking because the new clinic duty dashboard asks us to... [16:48:28] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2012.codfw.wmnet with OS bullseye [16:48:38] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye completed:... 
[16:52:13] (03CR) 10BryanDavis: wikimedia.cloud: delegate svc.wikimedia.cloud to designate @ eqiad1 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/919346 (owner: 10Arturo Borrero Gonzalez) [16:52:23] (03PS3) 10Dzahn: microsites: change rewrite rule for https://transparency.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/918594 (https://phabricator.wikimedia.org/T336301) [16:52:33] 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: Some httpbb checks are flapping - https://phabricator.wikimedia.org/T336590 (10Volans) p:05Triage→03Medium [16:53:41] (03CR) 10Dzahn: microsites: change rewrite rule for https://transparency.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918594 (https://phabricator.wikimedia.org/T336301) (owner: 10Dzahn) [16:59:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q4: install disks in frmon1002 - https://phabricator.wikimedia.org/T336569 (10Dwisehaupt) [16:59:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q4: install disks in frmon1002 - https://phabricator.wikimedia.org/T336569 (10Dwisehaupt) We are holding off on setting up this machine until the disks are in. Go ahead and install the SSDs when they arrive. 
[17:00:47] 10SRE, 10serviceops: Log the real X-Client-IP in apache mediawiki logs - https://phabricator.wikimedia.org/T246348 (10Dzahn) [17:01:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 243.1k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [17:01:46] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] wikimedia.cloud: delegate svc.wikimedia.cloud to designate @ eqiad1 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/919346 (owner: 10Arturo Borrero Gonzalez) [17:03:30] (03PS1) 10Dzahn: logstash_checker.py: remove trusty-specific hacks [puppet] - 10https://gerrit.wikimedia.org/r/919365 (https://phabricator.wikimedia.org/T216380) [17:04:58] 10SRE, 10Infrastructure-Foundations, 10Traffic: Reimaging cookbok should force a Puppet run on the Icinga host - https://phabricator.wikimedia.org/T336593 (10ssingh) [17:05:28] 10SRE, 10Infrastructure-Foundations, 10Traffic: Reimaging cookbok should force a Puppet run on the Icinga host - https://phabricator.wikimedia.org/T336593 (10ssingh) p:05Triage→03Low [17:05:56] (03PS4) 10Ssingh: depool codfw (emergency patch, do not merge) [dns] - 10https://gerrit.wikimedia.org/r/914343 (https://phabricator.wikimedia.org/T335777) [17:07:29] ^ not actually depooling but just in case [17:07:51] we had some fun with the new lvs host and while everything _looks OK_, need to be sure when I remove the MED and send it traffic [17:08:31] 10SRE, 10Infrastructure-Foundations, 10User-Kormat: debdeploy skipped hosts and assumed they're up to date(?) 
- https://phabricator.wikimedia.org/T268735 (10Dzahn) [17:09:10] 10SRE, 10Release-Engineering-Team, 10ci-test-error: tox-docker CI test doesn't pick up overrides for pylint - https://phabricator.wikimedia.org/T281347 (10Dzahn) [17:09:51] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add new LVS host lvs2012 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/917924 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [17:10:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudswift1001.eqiad.wmnet with OS bullseye [17:10:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with... [17:11:34] 10SRE, 10User-MoritzMuehlenhoff, 10User-jbond: Investigate GID allocation for system users - https://phabricator.wikimedia.org/T235163 (10Dzahn) Should this be tagged Infra-foundations, infra-security or none of these? 
[17:11:50] !log homer "cr*-codfw*" commit "Gerrit: 917924 add new LVS host lvs2012": T326767 [17:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:54] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [17:13:12] 10SRE, 10Infrastructure-Foundations: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (10Dzahn) [17:16:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 202.6k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [17:17:00] (03CR) 10Ssingh: [C: 03+2] hiera: remove BGP MED override for lvs2012 [puppet] - 10https://gerrit.wikimedia.org/r/917926 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [17:17:32] (03PS2) 10Ssingh: hiera: remove BGP MED override for lvs2012 [puppet] - 10https://gerrit.wikimedia.org/r/917926 (https://phabricator.wikimedia.org/T326767) [17:19:30] (03CR) 10Ssingh: [V: 03+2] hiera: remove BGP MED override for lvs2012 [puppet] - 10https://gerrit.wikimedia.org/r/917926 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [17:21:45] !log restart pybal on lvs2012 to pick up bgp med change: T326767 [17:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:50] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [17:27:43] !log set routing-options static route 208.80.153.240/28 [high-traffic2, codfw] next-hop 10.192.16.140: T326767 [17:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:50] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 
[17:31:55] !log sukhe@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS reimaging in codfw, blocking deploys T326767 (duration: 150m 34s) [17:52:38] (03PS1) 10Ssingh: ntp/codfw: point to dns2003 temporarily [dns] - 10https://gerrit.wikimedia.org/r/919388 (https://phabricator.wikimedia.org/T334049) [17:58:36] (03CR) 10Ssingh: [C: 03+2] ntp/codfw: point to dns2003 temporarily [dns] - 10https://gerrit.wikimedia.org/r/919388 (https://phabricator.wikimedia.org/T334049) (owner: 10Ssingh) [17:59:04] !log running authdns-update for CR 919388 [17:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:12] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudswift1001'] [18:02:49] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 13): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Eevans) 05Open→03Resolved [18:02:58] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 13): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Eevans) >>! In T330693#8846701, @gmodena wrote: >>>! In T330693#8846250, @Eev... [18:04:53] (03CR) 10Ssingh: [C: 03+2] "For posterity, the task ID is wrong here. 
It should be T326688" [dns] - 10https://gerrit.wikimedia.org/r/919388 (https://phabricator.wikimedia.org/T334049) (owner: 10Ssingh) [18:05:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudswift1001'] [18:08:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudswift1001.eqiad.wmnet with OS bullseye [18:08:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with... [18:08:15] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudswift1001.eqiad.wmnet with OS bullseye [18:08:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS... [18:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:13:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudswift1001.eqiad.wmnet with OS bullseye [18:13:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with... 
[18:13:10] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudswift1001.eqiad.wmnet with OS bullseye [18:13:18] 10SRE, 10Infrastructure-Foundations, 10Traffic: Reimaging cookbok should force a Puppet run on the Icinga host - https://phabricator.wikimedia.org/T336593 (10Volans) The reimage cookbook calls the downtime one with the `--force-puppet` flag, see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookb... [18:13:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS... [18:17:31] (03CR) 10Dzahn: gerrit: relocate LFS data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [18:18:06] (03CR) 10Dzahn: "I thought this was already merged! double checking" [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [18:23:33] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [18:28:39] (03PS5) 10Dzahn: gerrit: relocate LFS data [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [18:29:33] (03CR) 10Dzahn: "ah yes, it was. I just rebased this and as you can see it is reduced to a one-liner. 
just the part that manages the old directory" [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [18:30:34] (03CR) 10Dzahn: "most of this was done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/911363" [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [18:31:22] (03PS6) 10Dzahn: gerrit: stop managing /srv/gerrit/plugins/lfs [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [18:34:46] (03PS7) 10Dzahn: gerrit: stop managing /srv/gerrit/plugins/lfs [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [18:36:04] (03CR) 10Dzahn: "[gerrit1003:~] $ ls /srv/gerrit/plugins/lfs/" [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [18:40:27] (03PS1) 10Mazevedo: Add stream config for mobile apps schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919372 [18:58:17] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (10Southparkfan) >>! In T336428#8844099, @cmooney wrote: > Ok I think I see what the issue is. Looking at the [[ https://www.kernel.org... [19:15:14] (03PS21) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [19:22:15] 10SRE, 10Infrastructure-Foundations, 10Traffic: Reimaging cookbok should force a Puppet run on the Icinga host - https://phabricator.wikimedia.org/T336593 (10ssingh) >>! In T336593#8848181, @Volans wrote: > The reimage cookbook calls the downtime one with the `--force-puppet` flag, see https://gerrit.wikimed... 
[19:22:40] 10SRE, 10Infrastructure-Foundations, 10Traffic: Reimaging cookbook not forcing a Puppet agent run on lvs2011, lvs2012 - https://phabricator.wikimedia.org/T336593 (10ssingh) [19:42:35] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10colewhite) [19:49:22] (03PS1) 10Dzahn: gerrit: remove gerrit1001 as a source host for migrations [puppet] - 10https://gerrit.wikimedia.org/r/919400 (https://phabricator.wikimedia.org/T336427) [19:51:19] (03PS1) 10Dzahn: gerrit: remove gerrit1001 from ssh_allowed hosts and acme_chief [puppet] - 10https://gerrit.wikimedia.org/r/919401 (https://phabricator.wikimedia.org/T336427) [19:55:53] (03PS2) 10Dzahn: gerrit: remove gerrit1001 from ssh_allowed hosts and acme_chief [puppet] - 10https://gerrit.wikimedia.org/r/919401 (https://phabricator.wikimedia.org/T336427) [19:56:17] (03CR) 10CI reject: [V: 04-1] gerrit: remove gerrit1001 from ssh_allowed hosts and acme_chief [puppet] - 10https://gerrit.wikimedia.org/r/919401 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [19:56:29] (03PS3) 10Dzahn: gerrit: remove gerrit1001 from ssh_allowed hosts and acme_chief [puppet] - 10https://gerrit.wikimedia.org/r/919401 (https://phabricator.wikimedia.org/T336427) [19:59:34] (03PS1) 10Dzahn: gerrit: add gerrit1003 to hosts using KexAlgo ecdh-sha2-nistp521 for ssh [puppet] - 10https://gerrit.wikimedia.org/r/919402 (https://phabricator.wikimedia.org/T326368) [19:59:54] (03PS4) 10Samtar: InitialiseSettings: Set wgWatchersMaxAge=30days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919023 (https://phabricator.wikimedia.org/T336250) (owner: 10Sarah Mukuti) [20:01:25] (03PS1) 10Dzahn: gerrit: remove gerrit1001 from .ssh/config [puppet] - 10https://gerrit.wikimedia.org/r/919403 (https://phabricator.wikimedia.org/T336427) [20:02:45] (03CR) 10Dzahn: "history is https://phabricator.wikimedia.org/T315942" [puppet] - 
10https://gerrit.wikimedia.org/r/919402 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [20:03:34] (03CR) 10Dzahn: "remove now or don't start any decom until grace period over" [puppet] - 10https://gerrit.wikimedia.org/r/919403 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [20:05:38] (03CR) 10Dzahn: "it seems sure that 1001 will not be the source of a migration again" [puppet] - 10https://gerrit.wikimedia.org/r/919400 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [20:08:07] !log gerrit1001 - systemctl mask gerrit T326368 [20:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:11] T326368: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 [20:26:32] (03PS1) 10Dzahn: gerrit: add gerrit1003 SSH host key known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/919405 (https://phabricator.wikimedia.org/T326368) [20:29:47] (03PS1) 10Dzahn: site: remove gerrit1001 from gerrit role, rm hiera host data [puppet] - 10https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427) [20:31:55] (03PS1) 10Dzahn: openstack: remove old Gerrit IP from cloudgw [puppet] - 10https://gerrit.wikimedia.org/r/919408 (https://phabricator.wikimedia.org/T336427) [20:33:37] (03PS2) 10Cwhite: team-sre: add openapi/swagger alerts [alerts] - 10https://gerrit.wikimedia.org/r/918547 (https://phabricator.wikimedia.org/T320620) [20:34:32] (03CR) 10Cwhite: team-sre: add openapi/swagger alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/918547 (https://phabricator.wikimedia.org/T320620) (owner: 10Cwhite) [20:36:50] (03PS2) 10Cwhite: prometheus: don't fail on unknown blackbox probe type [puppet] - 10https://gerrit.wikimedia.org/r/919331 (https://phabricator.wikimedia.org/T320620) (owner: 10Filippo Giunchedi) [20:44:17] (03CR) 10TChin: Add flink-app default log config and use it in page_content_change (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) (owner: 10TChin) [21:05:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudswift1001.eqiad.wmnet with OS buster [21:05:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with... [21:05:58] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudswift1001.eqiad.wmnet with OS buster [21:06:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS... 
[21:06:29] (CR) EoghanGaffney: [C: +1] gitlab: run backup sync and restore twice daily [puppet] - https://gerrit.wikimedia.org/r/918427 (https://phabricator.wikimedia.org/T316935) (owner: Jelto)
[21:07:21] (CR) EoghanGaffney: [C: +1] gitlab: make sure letsencrypt extention is disabled [puppet] - https://gerrit.wikimedia.org/r/919022 (https://phabricator.wikimedia.org/T336476) (owner: Jelto)
[21:12:08] (PS3) Cwhite: prometheus: don't fail on unknown blackbox probe type [puppet] - https://gerrit.wikimedia.org/r/919331 (https://phabricator.wikimedia.org/T320620) (owner: Filippo Giunchedi)
[21:21:27] (CR) EoghanGaffney: [C: +1] gerrit: allow masking the service and do so on gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/919359 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[21:27:14] SRE, Gerrit, Release-Engineering-Team, serviceops-collab, Patch-For-Review: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn)
[21:30:42] (PS8) Krinkle: Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015)
[21:33:45] (PS4) Cwhite: prometheus: don't fail on unknown blackbox probe type [puppet] - https://gerrit.wikimedia.org/r/919331 (https://phabricator.wikimedia.org/T320620) (owner: Filippo Giunchedi)
[22:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:18:13] (PS1) Krinkle: webperf: Expose /excimer/ingest/ to internal requests only [puppet] - https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015)
[22:18:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudswift1001.eqiad.wmnet with OS bullseye
[22:18:40] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudswift1001.eqiad.wmnet with O...
[22:20:28] (PS2) Krinkle: [WIP] webperf: Fix /excimer/ingest/ restriction [puppet] - https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015)
[22:23:33] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[22:26:09] (PS3) Krinkle: webperf: Expose /excimer/ingest/ to internal requests only [puppet] - https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015)
[22:26:36] (CR) Cwhite: "Appears to add an empty item to probes-service_catalog_private.yaml for each swagger probe definition: https://puppet-compiler.wmflabs.org" [puppet] - https://gerrit.wikimedia.org/r/919331 (https://phabricator.wikimedia.org/T320620) (owner: Filippo Giunchedi)
[22:27:25] (PS4) Krinkle: webperf: Expose /excimer/ingest/ to internal requests only [puppet] - https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015)
[22:31:52] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[22:32:16] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudswift1001.eqiad.wmnet with OS bullseye
[22:32:24] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bu...
[22:33:47] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update cloudswift ip address - pt1979@cumin2002"
[22:34:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update cloudswift ip address - pt1979@cumin2002"
[22:34:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:35:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudswift1001.eqiad.wmnet with OS bullseye
[22:35:28] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudswift1001.eqiad.wmnet with O...
[22:45:48] (PS5) Krinkle: webperf: Expose /excimer/ingest/ to internal requests only [puppet] - https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015)
[22:46:03] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " (due to missing physical file for old image e... - https://phabricator.wikimedia.org/T244567
[22:52:52] (PS6) Krinkle: webperf: Expose /excimer/ingest/ to internal requests only [puppet] - https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015)
[22:54:40] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul) a:Papaul→Jhancock.wm @Jhancock.wm was trying to install the OS on cloudswitf1001 and the server was not getting DHC...
[22:57:50] (PS7) Krinkle: webperf: Fix /excimer/ POST restriction [puppet] - https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015)
[22:59:44] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudswift1001.eqiad.wmnet with OS bullseye
[22:59:53] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bu...
[23:00:37] (CR) Krinkle: "Confirmed in Beta, cherry-picked there via its puppetmaster." [puppet] - https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[23:04:00] (PS8) Krinkle: webperf: Fix /excimer/ POST restriction [puppet] - https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015)
[23:12:22] (PS9) Krinkle: Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015)
[23:13:03] (CR) CI reject: [V: -1] Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[23:14:20] (PS10) Krinkle: Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015)
[23:20:51] (PS11) Krinkle: Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015)
[23:21:29] (CR) CI reject: [V: -1] Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[23:24:34] (CR) Krinkle: "Testing this on mwdebug yields a warning:" [mediawiki-config] - https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[23:24:44] (PS12) Krinkle: Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015)
[23:34:27] (PS13) Krinkle: Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015)
[23:35:50] (PS14) Krinkle: Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015)
[23:50:08] (PS1) Krinkle: webperf: Fix excimer_mysql_user typo [puppet] - https://gerrit.wikimedia.org/r/919422 (https://phabricator.wikimedia.org/T291015)
[23:51:33] (CR) Krinkle: Profiler: Implement "Excimer UI" option for WikimediaDebug (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[23:52:33] (CR) Krinkle: "POST requestst to performance.discovery.wmnet/excimer/ingest/ currently fail as attempted via https://gerrit.wikimedia.org/r/c/operations/" [puppet] - https://gerrit.wikimedia.org/r/919422 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[23:54:51] (PS1) Dwisehaupt: Add donorpreferences, delete dash CNAMEs [dns] - https://gerrit.wikimedia.org/r/919423 (https://phabricator.wikimedia.org/T335793)
[23:55:04] (PS15) Krinkle: Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015)
[23:58:17] (PS16) Krinkle: Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015)