[00:20:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:21:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:22:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49853 bytes in 7.729 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:23:13] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.347 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:39:14] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/906696
[00:39:20] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/906696 (owner: TrainBranchBot)
[00:56:17] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/906696 (owner: TrainBranchBot)
[01:08:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:13:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:18:11] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:24:21] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:49:41] (NodeTextfileStale) firing: (2) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:14:43] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:19:43] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:23:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:28:03] (ProbeDown) resolved: (2) Service
centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:16:07] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:21:27] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:39:09] (PS1) Marostegui: es1022: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/906803 (https://phabricator.wikimedia.org/T333961)
[05:39:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46157 and previous config saved to /var/cache/conftool/dbconfig/20230410-053919-root.json
[05:40:39] (CR) Marostegui: [C: +2] es1022: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/906803 (https://phabricator.wikimedia.org/T333961) (owner: Marostegui)
[05:43:26] (PS1) Marostegui: instances.yaml: Add db1207 to dbctl [puppet] - https://gerrit.wikimedia.org/r/906804 (https://phabricator.wikimedia.org/T326669)
[05:43:51] (CR) Marostegui: [C: +2] instances.yaml: Add db1207 to dbctl [puppet] - https://gerrit.wikimedia.org/r/906804 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[05:45:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1207 to dbctl T326669', diff saved to https://phabricator.wikimedia.org/P46158 and previous config saved to /var/cache/conftool/dbconfig/20230410-054504-marostegui.json
[05:45:09] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[05:45:33] !log marostegui@cumin1001 dbctl commit
(dc=all): 'db1207 (re)pooling @ 1%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46159 and previous config saved to /var/cache/conftool/dbconfig/20230410-054532-root.json
[05:45:52] SRE, Abstract Wikipedia team, Anti-Harassment, Cloud-Services, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (Physikerwelt)
[05:46:36] (PS1) Marostegui: db1207: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/906805 (https://phabricator.wikimedia.org/T326669)
[05:47:14] (CR) Marostegui: [C: +2] db1207: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/906805 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[05:48:39] ops-codfw, Data-Persistence-Backup: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (Marostegui) @Jhancock.wm this is a backup source host, so probably best to coordinate with @jcrespo so it can be powered off whenever there's not a backup running.
[05:50:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1161 T334080', diff saved to https://phabricator.wikimedia.org/P46160 and previous config saved to /var/cache/conftool/dbconfig/20230410-055005-marostegui.json
[05:50:10] T334080: Move db1183 to s5 - https://phabricator.wikimedia.org/T334080
[05:53:33] (PS1) Marostegui: mariadb: Move db1183 to s5 [puppet] - https://gerrit.wikimedia.org/r/906806 (https://phabricator.wikimedia.org/T334080)
[05:54:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46162 and previous config saved to /var/cache/conftool/dbconfig/20230410-055424-root.json
[05:54:45] PROBLEM - MariaDB Replica IO: s5 on db1154 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1161.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Can't connect to MySQL server on db1161.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:54:46] (CR) Marostegui: [C: +2] mariadb: Move db1183 to s5 [puppet] - https://gerrit.wikimedia.org/r/906806 (https://phabricator.wikimedia.org/T334080) (owner: Marostegui)
[05:55:07] ^ me
[06:00:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1207 (re)pooling @ 2%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46163 and previous config saved to /var/cache/conftool/dbconfig/20230410-060037-root.json
[06:00:42] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[06:08:01] PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1040.67 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:09:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: Repooling', diff saved to
https://phabricator.wikimedia.org/P46164 and previous config saved to /var/cache/conftool/dbconfig/20230410-060929-root.json
[06:15:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1207 (re)pooling @ 3%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46165 and previous config saved to /var/cache/conftool/dbconfig/20230410-061541-root.json
[06:15:47] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[06:24:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46166 and previous config saved to /var/cache/conftool/dbconfig/20230410-062434-root.json
[06:28:46] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:29:58] RECOVERY - MariaDB Replica IO: s5 on db1154 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:30:02] RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:30:34] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:30:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1207 (re)pooling @ 4%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46167 and previous config saved to /var/cache/conftool/dbconfig/20230410-063046-root.json
[06:30:51] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[06:32:13] (PS1) Marostegui: instances.yaml: Add db1220 to dbctl [puppet] - https://gerrit.wikimedia.org/r/906899 (https://phabricator.wikimedia.org/T326669)
[06:32:42] (CR) Marostegui: [C: +2] instances.yaml: Add db1220 to dbctl [puppet] - https://gerrit.wikimedia.org/r/906899
(https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[06:34:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1220 to dbctl T326669', diff saved to https://phabricator.wikimedia.org/P46168 and previous config saved to /var/cache/conftool/dbconfig/20230410-063458-marostegui.json
[06:35:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1220 (re)pooling @ 1%: Pooling', diff saved to https://phabricator.wikimedia.org/P46169 and previous config saved to /var/cache/conftool/dbconfig/20230410-063534-root.json
[06:36:01] (PS1) Marostegui: db1220: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/906900 (https://phabricator.wikimedia.org/T326669)
[06:36:28] (CR) Marostegui: [C: +2] db1220: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/906900 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[06:37:36] (PS1) Gerrit maintenance bot: mariadb: Promote db1179 to x1 master [puppet] - https://gerrit.wikimedia.org/r/906697 (https://phabricator.wikimedia.org/T334374)
[06:38:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 12 hosts with reason: Primary switchover x1 T334374
[06:38:51] T334374: Switchover x1 master (db1103 -> db1179) - https://phabricator.wikimedia.org/T334374
[06:39:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 12 hosts with reason: Primary switchover x1 T334374
[06:39:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1179 with weight 0 T334374', diff saved to https://phabricator.wikimedia.org/P46170 and previous config saved to /var/cache/conftool/dbconfig/20230410-063916-root.json
[06:39:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46171 and previous config saved to /var/cache/conftool/dbconfig/20230410-063939-root.json
[06:41:26] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12
AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:41:54] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:43:18] (PS27) KartikMistry: Add new self hosted machinetranslation service (MinT) [deployment-charts] - https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505)
[06:43:42] (PS2) KartikMistry: WIP: Add MinT support to cxserver [deployment-charts] - https://gerrit.wikimedia.org/r/905579
[06:45:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1207 (re)pooling @ 5%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46172 and previous config saved to /var/cache/conftool/dbconfig/20230410-064551-root.json
[06:45:56] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[06:50:07] (CR) Marostegui: [C: +2] mariadb: Promote db1179 to x1 master [puppet] - https://gerrit.wikimedia.org/r/906697 (https://phabricator.wikimedia.org/T334374) (owner: Gerrit maintenance bot)
[06:50:25] !log Starting x1 eqiad failover from db1103 to db1179 - T334374
[06:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:29] T334374: Switchover x1 master (db1103 -> db1179) - https://phabricator.wikimedia.org/T334374
[06:50:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1220 (re)pooling @ 2%: Pooling', diff saved to https://phabricator.wikimedia.org/P46173 and previous config saved to /var/cache/conftool/dbconfig/20230410-065039-root.json
[06:50:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1179 to x1 primary T334374', diff saved to https://phabricator.wikimedia.org/P46174 and previous config saved to /var/cache/conftool/dbconfig/20230410-065047-root.json
[06:51:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1103 T334374', diff saved to https://phabricator.wikimedia.org/P46175 and previous
config saved to /var/cache/conftool/dbconfig/20230410-065149-marostegui.json
[06:52:30] (PS3) KartikMistry: Add MinT support to cxserver [deployment-charts] - https://gerrit.wikimedia.org/r/905579
[06:54:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46176 and previous config saved to /var/cache/conftool/dbconfig/20230410-065443-root.json
[06:58:08] (PS1) Marostegui: mariadb: Decommission db1101 [puppet] - https://gerrit.wikimedia.org/r/906901 (https://phabricator.wikimedia.org/T331381)
[06:58:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1101.eqiad.wmnet
[07:00:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1207 (re)pooling @ 10%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46177 and previous config saved to /var/cache/conftool/dbconfig/20230410-070056-root.json
[07:01:01] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:02:30] (CR) Marostegui: [C: +2] mariadb: Decommission db1101 [puppet] - https://gerrit.wikimedia.org/r/906901 (https://phabricator.wikimedia.org/T331381) (owner: Marostegui)
[07:03:32] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[07:05:31] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1101.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[07:05:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1220 (re)pooling @ 3%: Pooling', diff saved to https://phabricator.wikimedia.org/P46178 and previous config saved to /var/cache/conftool/dbconfig/20230410-070544-root.json
[07:09:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1101.eqiad.wmnet decommissioned, removing
all IPs except the asset tag one - marostegui@cumin1001"
[07:09:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:09:07] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db1101.eqiad.wmnet
[07:09:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46179 and previous config saved to /var/cache/conftool/dbconfig/20230410-070948-root.json
[07:10:33] ops-eqiad, DBA, decommission-hardware: decommission db1101.eqiad.wmnet - https://phabricator.wikimedia.org/T331381 (Marostegui) a:Marostegui→Jclark-ctr
[07:12:33] ops-eqiad, DBA, decommission-hardware: decommission db1101.eqiad.wmnet - https://phabricator.wikimedia.org/T331381 (Marostegui) @Volans any advice on how to proceed with the above error? I have been searching on wikitech but I haven't found any follow-up to that.
[07:16:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1207 (re)pooling @ 25%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46180 and previous config saved to /var/cache/conftool/dbconfig/20230410-071600-root.json
[07:16:06] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:16:43] ops-eqiad, DBA, decommission-hardware: decommission db1101.eqiad.wmnet - https://phabricator.wikimedia.org/T331381 (Marostegui) a:Jclark-ctr→None @Jclark-ctr don't proceed yet with this decommissioning until we've clarified the error.
[07:17:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1109 T326669', diff saved to https://phabricator.wikimedia.org/P46181 and previous config saved to /var/cache/conftool/dbconfig/20230410-071747-marostegui.json
[07:20:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1220 (re)pooling @ 4%: Pooling', diff saved to https://phabricator.wikimedia.org/P46183 and previous config saved to /var/cache/conftool/dbconfig/20230410-072048-root.json
[07:20:53] (PS1) Gerrit maintenance bot: mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/906698 (https://phabricator.wikimedia.org/T334375)
[07:22:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1163', diff saved to https://phabricator.wikimedia.org/P46184 and previous config saved to /var/cache/conftool/dbconfig/20230410-072206-marostegui.json
[07:31:02] (PS1) Majavah: cr-cloud: remove clouddb_return term [homer/public] - https://gerrit.wikimedia.org/r/907132 (https://phabricator.wikimedia.org/T303663)
[07:31:04] (PS1) Majavah: cr-cloud: remove labstore term [homer/public] - https://gerrit.wikimedia.org/r/907133 (https://phabricator.wikimedia.org/T333477)
[07:31:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1207 (re)pooling @ 50%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46185 and previous config saved to /var/cache/conftool/dbconfig/20230410-073105-root.json
[07:31:11] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:31:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46186 and previous config saved to /var/cache/conftool/dbconfig/20230410-073112-root.json
[07:34:17] (PS1) Majavah: P:toolforge::checker: remove showmount check [puppet] - https://gerrit.wikimedia.org/r/907134
[07:34:19] (PS1) Majavah: hieradata: openstack: drop NAT exceptions for nfs-tools-project [puppet] -
https://gerrit.wikimedia.org/r/907135 (https://phabricator.wikimedia.org/T333477)
[07:35:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1220 (re)pooling @ 5%: Pooling', diff saved to https://phabricator.wikimedia.org/P46187 and previous config saved to /var/cache/conftool/dbconfig/20230410-073553-root.json
[07:36:35] (PS1) Majavah: wmnet: Remove nfs-tools-project.svc.eqiad [dns] - https://gerrit.wikimedia.org/r/907136 (https://phabricator.wikimedia.org/T333477)
[07:39:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46188 and previous config saved to /var/cache/conftool/dbconfig/20230410-073947-root.json
[07:43:37] (PS1) Marostegui: mariadb: Make db1220 x1 candidate [puppet] - https://gerrit.wikimedia.org/r/907137 (https://phabricator.wikimedia.org/T326669)
[07:46:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1207 (re)pooling @ 75%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46189 and previous config saved to /var/cache/conftool/dbconfig/20230410-074610-root.json
[07:46:15] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:46:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46190 and previous config saved to /var/cache/conftool/dbconfig/20230410-074617-root.json
[07:49:41] (NodeTextfileStale) firing: (2) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:49:43] (CR) Marostegui: [C: +2] mariadb: Make db1220 x1 candidate [puppet] - https://gerrit.wikimedia.org/r/907137 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[07:50:59] !log marostegui@cumin1001 dbctl commit
(dc=all): 'db1220 (re)pooling @ 10%: Pooling', diff saved to https://phabricator.wikimedia.org/P46191 and previous config saved to /var/cache/conftool/dbconfig/20230410-075058-root.json
[07:54:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P46192 and previous config saved to /var/cache/conftool/dbconfig/20230410-075451-root.json
[08:00:50] RECOVERY - Check systemd state on ms-be2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1207 (re)pooling @ 100%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46193 and previous config saved to /var/cache/conftool/dbconfig/20230410-080115-root.json
[08:01:20] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[08:01:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46194 and previous config saved to /var/cache/conftool/dbconfig/20230410-080121-root.json
[08:06:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1220 (re)pooling @ 25%: Pooling', diff saved to https://phabricator.wikimedia.org/P46195 and previous config saved to /var/cache/conftool/dbconfig/20230410-080603-root.json
[08:09:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P46196 and previous config saved to /var/cache/conftool/dbconfig/20230410-080956-root.json
[08:13:26] (CR) Arturo Borrero Gonzalez: [C: +1] "Please also collect +1 from David" [homer/public] - https://gerrit.wikimedia.org/r/907132 (https://phabricator.wikimedia.org/T303663) (owner: Majavah)
[08:13:56] (CR) Arturo Borrero Gonzalez: [C: +1] "Please also collect +1 from David and/or Andrew."
[homer/public] - https://gerrit.wikimedia.org/r/907133 (https://phabricator.wikimedia.org/T333477) (owner: Majavah)
[08:16:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46197 and previous config saved to /var/cache/conftool/dbconfig/20230410-081626-root.json
[08:18:10] (CR) Arturo Borrero Gonzalez: [C: +1] "Please @David or @Andrew confirm this is good to merge. Thanks!" [puppet] - https://gerrit.wikimedia.org/r/907135 (https://phabricator.wikimedia.org/T333477) (owner: Majavah)
[08:18:33] (CR) Arturo Borrero Gonzalez: [C: +1] P:toolforge::checker: remove showmount check [puppet] - https://gerrit.wikimedia.org/r/907134 (owner: Majavah)
[08:19:03] (PS1) Marostegui: db1103: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/907431 (https://phabricator.wikimedia.org/T332293)
[08:19:16] (CR) Arturo Borrero Gonzalez: [C: +1] O:wmcs::nfs: remove primary_backup::misc and related classes [puppet] - https://gerrit.wikimedia.org/r/906783 (https://phabricator.wikimedia.org/T301280) (owner: Majavah)
[08:19:36] (CR) Marostegui: [C: +2] db1103: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/907431 (https://phabricator.wikimedia.org/T332293) (owner: Marostegui)
[08:20:19] (PS3) Krinkle: Remove redundant wgOriginTrials and wgFeaturePolicyReportOnly settings [mediawiki-config] - https://gerrit.wikimedia.org/r/891314
[08:20:24] (CR) Krinkle: [C: +2] Remove redundant wgOriginTrials and wgFeaturePolicyReportOnly settings [mediawiki-config] - https://gerrit.wikimedia.org/r/891314 (owner: Krinkle)
[08:20:50] (PS1) Marostegui: install_server: Do not reimage db1207 [puppet] - https://gerrit.wikimedia.org/r/907432 (https://phabricator.wikimedia.org/T326669)
[08:21:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1220 (re)pooling @ 50%: Pooling', diff saved to
https://phabricator.wikimedia.org/P46198 and previous config saved to /var/cache/conftool/dbconfig/20230410-082108-root.json
[08:21:17] (CR) Arturo Borrero Gonzalez: [C: -1] openstack::util::envscript: allow caller to specify domain_ids (1 comment) [puppet] - https://gerrit.wikimedia.org/r/906777 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott)
[08:21:45] (Merged) jenkins-bot: Remove redundant wgOriginTrials and wgFeaturePolicyReportOnly settings [mediawiki-config] - https://gerrit.wikimedia.org/r/891314 (owner: Krinkle)
[08:22:14] (CR) Arturo Borrero Gonzalez: [C: +1] codfw1dev: set enforce_policy_scope and enforce_new_policy_defaults to false. [puppet] - https://gerrit.wikimedia.org/r/906775 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott)
[08:22:23] (CR) Arturo Borrero Gonzalez: [C: +1] Openstack envscripts: allow unsetting environment variables [puppet] - https://gerrit.wikimedia.org/r/906776 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott)
[08:22:26] PROBLEM - Check systemd state on alert2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync_check_icinga_contacts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:23:27] (CR) Arturo Borrero Gonzalez: [C: +1] Openstack envscripts.pp: create additional scripts for system and domain scope [puppet] - https://gerrit.wikimedia.org/r/906778 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott)
[08:25:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P46199 and previous config saved to /var/cache/conftool/dbconfig/20230410-082501-root.json
[08:26:44] (CR) Marostegui: [C: +2] install_server: Do not reimage db1207 [puppet] - https://gerrit.wikimedia.org/r/907432 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[08:29:08] (PS1) Marostegui:
mariadb: Productionize db1209 [puppet] - https://gerrit.wikimedia.org/r/907433 (https://phabricator.wikimedia.org/T326669)
[08:31:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46200 and previous config saved to /var/cache/conftool/dbconfig/20230410-083131-root.json
[08:36:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1220 (re)pooling @ 75%: Pooling', diff saved to https://phabricator.wikimedia.org/P46201 and previous config saved to /var/cache/conftool/dbconfig/20230410-083613-root.json
[08:38:32] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[08:38:56] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46202 and previous config saved to /var/cache/conftool/dbconfig/20230410-084006-root.json
[08:46:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46203 and previous config saved to /var/cache/conftool/dbconfig/20230410-084636-root.json
[08:48:08] SRE, Traffic: varnish-frontend-fetcherr: Assert error in vslc_vtx_next, 100% CPU usage - https://phabricator.wikimedia.org/T253093 (Aklapper) a:ema→None Removing inactive task assignee as this task got reopened after 3 years
[08:50:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1220 (re)pooling @ 100%: Pooling', diff saved to
https://phabricator.wikimedia.org/P46204 and previous config saved to /var/cache/conftool/dbconfig/20230410-085117-root.json
[08:55:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46205 and previous config saved to /var/cache/conftool/dbconfig/20230410-085511-root.json
[09:00:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46206 and previous config saved to /var/cache/conftool/dbconfig/20230410-090141-root.json
[09:05:53] (PS1) Marostegui: packages_wmf.pp: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/907438
[09:10:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46207 and previous config saved to /var/cache/conftool/dbconfig/20230410-091015-root.json
[09:13:19] (CR) Marostegui: [C: +2] mariadb: Productionize db1209 [puppet] - https://gerrit.wikimedia.org/r/907433 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[09:23:08] (PS1) Marostegui: db1211: Move db1211 to s8 [puppet] - https://gerrit.wikimedia.org/r/907440 (https://phabricator.wikimedia.org/T326669)
[09:24:13] (CR) Marostegui: [C: +2] db1211: Move db1211 to s8 [puppet] - https://gerrit.wikimedia.org/r/907440 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[09:25:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46209 and previous config saved to /var/cache/conftool/dbconfig/20230410-092520-root.json
[09:27:45] (PS1) Marostegui: mariadb: Install 10.6 on db1211 [puppet] -
10https://gerrit.wikimedia.org/r/907442 (https://phabricator.wikimedia.org/T326669) [09:28:27] (03CR) 10Marostegui: [C: 03+2] mariadb: Install 10.6 on db1211 [puppet] - 10https://gerrit.wikimedia.org/r/907442 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [09:31:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 1%: Pooling', diff saved to https://phabricator.wikimedia.org/P46210 and previous config saved to /var/cache/conftool/dbconfig/20230410-093149-root.json [09:33:50] 10SRE, 10Infrastructure-Foundations, 10Traffic: Manual upload of iDRAC EXE results in broken web interface - https://phabricator.wikimedia.org/T334146 (10jbond) @BCornwall i suspect this is T322419#8370970. the fix would be to run `racadm set IDRAC.WebServer.HostHeaderCheck 0` [09:37:13] (03CR) 10Jbond: [C: 03+1] "thanks for the info" [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [09:40:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46211 and previous config saved to /var/cache/conftool/dbconfig/20230410-094025-root.json [09:46:03] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/906687 (https://phabricator.wikimedia.org/T334158) (owner: 10Kevin Bazira) [09:46:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 2%: Pooling', diff saved to https://phabricator.wikimedia.org/P46212 and previous config saved to /var/cache/conftool/dbconfig/20230410-094654-root.json [09:48:14] hmm, seeing "Original error: upstream connect error or disconnect/reset before headers. reset reason: overflow " - anyone else with the same issue? 
[09:48:24] +1 [09:49:07] (ProbeDown) firing: (3) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:49:25] Now it's working (after ~2 minutes in which I could only see this message) [09:49:31] 10SRE, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1101.eqiad.wmnet - https://phabricator.wikimedia.org/T331381 (10Volans) >>! In T331381#8767588, @Marostegui wrote: > @Volans any advice on how to proceed with the above error? I have been searching on wikitech but I haven't found any follow... [09:50:07] (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:50:36] (03PS11) 10Ilias Sarantopoulos: ml-services: FastAPI chart using sextant for ores-legacy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) [09:50:40] Sigh [09:50:50] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [09:51:08] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1101.eqiad.wmnet - https://phabricator.wikimedia.org/T331381 (10Marostegui) a:03Jclark-ctr Thanks @volans. Yeah, the host is going to be decommissioned entirely, and won't ever come back. @Jclark-ctr per the above, you can proceed with the last o... 
[09:51:20] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1101.eqiad.wmnet - https://phabricator.wikimedia.org/T331381 (10Marostegui) [09:54:07] (ProbeDown) resolved: (3) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:55:07] (ProbeDown) resolved: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:55:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46213 and previous config saved to /var/cache/conftool/dbconfig/20230410-095530-root.json [09:55:39] (03PS12) 10Ilias Sarantopoulos: ml-services: FastAPI chart using sextant for ores-legacy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) [09:55:50] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [09:57:00] (03PS1) 10Marostegui: instances.yaml: Add db1183 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/907443 (https://phabricator.wikimedia.org/T334080) [09:57:36] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1183 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/907443 (https://phabricator.wikimedia.org/T334080) (owner: 10Marostegui) [09:58:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1183 to s5 
depooled T334080', diff saved to https://phabricator.wikimedia.org/P46214 and previous config saved to /var/cache/conftool/dbconfig/20230410-095846-marostegui.json [09:58:50] T334080: Move db1183 to s5 - https://phabricator.wikimedia.org/T334080 [09:59:57] (03PS1) 10Marostegui: db1183: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/907444 (https://phabricator.wikimedia.org/T334080) [10:00:30] (03CR) 10Ilias Sarantopoulos: ml-services: FastAPI chart using sextant for ores-legacy service (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [10:02:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 3%: Pooling', diff saved to https://phabricator.wikimedia.org/P46215 and previous config saved to /var/cache/conftool/dbconfig/20230410-100159-root.json [10:05:00] (03CR) 10Marostegui: [C: 03+2] db1183: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/907444 (https://phabricator.wikimedia.org/T334080) (owner: 10Marostegui) [10:05:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1183 (re)pooling @ 1%: Pooling T334080', diff saved to https://phabricator.wikimedia.org/P46216 and previous config saved to /var/cache/conftool/dbconfig/20230410-100528-root.json [10:05:33] T334080: Move db1183 to s5 - https://phabricator.wikimedia.org/T334080 [10:17:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 4%: Pooling', diff saved to https://phabricator.wikimedia.org/P46217 and previous config saved to /var/cache/conftool/dbconfig/20230410-101704-root.json [10:18:42] (03PS1) 10Marostegui: mariadb: Productionize db1211 [puppet] - 10https://gerrit.wikimedia.org/r/907445 (https://phabricator.wikimedia.org/T326669) [10:20:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1183 (re)pooling @ 2%: Pooling T334080', diff saved to https://phabricator.wikimedia.org/P46218 and previous config saved to 
/var/cache/conftool/dbconfig/20230410-102033-root.json [10:20:38] T334080: Move db1183 to s5 - https://phabricator.wikimedia.org/T334080 [10:23:36] (03PS1) 10Arturo Borrero Gonzalez: wmcs-k8s-node-upgrade.py: upgrade version defaults [puppet] - 10https://gerrit.wikimedia.org/r/907448 (https://phabricator.wikimedia.org/T286856) [10:23:38] (03PS2) 10Jon Harald Søby: Add blkwiki to wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906793 (https://phabricator.wikimedia.org/T334351) [10:24:31] are there no deployment windows today? re. [[wikitech:Deployments]] not showing this week yet [10:26:47] PROBLEM - Check systemd state on ms-be2069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 5%: Pooling', diff saved to https://phabricator.wikimedia.org/P46219 and previous config saved to /var/cache/conftool/dbconfig/20230410-103209-root.json [10:35:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1183 (re)pooling @ 3%: Pooling T334080', diff saved to https://phabricator.wikimedia.org/P46220 and previous config saved to /var/cache/conftool/dbconfig/20230410-103538-root.json [10:35:43] T334080: Move db1183 to s5 - https://phabricator.wikimedia.org/T334080 [10:47:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 10%: Pooling', diff saved to https://phabricator.wikimedia.org/P46221 and previous config saved to /var/cache/conftool/dbconfig/20230410-104714-root.json [10:50:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1183 (re)pooling @ 4%: Pooling T334080', diff saved to https://phabricator.wikimedia.org/P46222 and previous config saved to /var/cache/conftool/dbconfig/20230410-105043-root.json [10:50:48] T334080: Move db1183 to s5 - https://phabricator.wikimedia.org/T334080 [10:56:49] (03PS1) 10Arturo Borrero Gonzalez: 
wmcs-k8s-node-upgrade.py: reboot after upgrades [puppet] - 10https://gerrit.wikimedia.org/r/907449 [11:02:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 25%: Pooling', diff saved to https://phabricator.wikimedia.org/P46224 and previous config saved to /var/cache/conftool/dbconfig/20230410-110218-root.json [11:05:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1183 (re)pooling @ 5%: Pooling T334080', diff saved to https://phabricator.wikimedia.org/P46225 and previous config saved to /var/cache/conftool/dbconfig/20230410-110548-root.json [11:05:52] T334080: Move db1183 to s5 - https://phabricator.wikimedia.org/T334080 [11:08:16] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1211 [puppet] - 10https://gerrit.wikimedia.org/r/907445 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [11:15:28] (03CR) 10Majavah: [C: 03+1] wmcs-k8s-node-upgrade.py: upgrade version defaults [puppet] - 10https://gerrit.wikimedia.org/r/907448 (https://phabricator.wikimedia.org/T286856) (owner: 10Arturo Borrero Gonzalez) [11:16:25] (03CR) 10Majavah: wmcs-k8s-node-upgrade.py: reboot after upgrades (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907449 (owner: 10Arturo Borrero Gonzalez) [11:17:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 50%: Pooling', diff saved to https://phabricator.wikimedia.org/P46226 and previous config saved to /var/cache/conftool/dbconfig/20230410-111723-root.json [11:20:39] (03PS1) 10Marostegui: db1100: Remove candidate from s5 [puppet] - 10https://gerrit.wikimedia.org/r/907464 (https://phabricator.wikimedia.org/T329352) [11:20:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1183 (re)pooling @ 10%: Pooling T334080', diff saved to https://phabricator.wikimedia.org/P46227 and previous config saved to /var/cache/conftool/dbconfig/20230410-112052-root.json [11:20:57] T334080: Move db1183 to s5 - https://phabricator.wikimedia.org/T334080 [11:21:07] (03CR) 
10Marostegui: [C: 03+2] db1100: Remove candidate from s5 [puppet] - 10https://gerrit.wikimedia.org/r/907464 (https://phabricator.wikimedia.org/T329352) (owner: 10Marostegui) [11:21:50] (03CR) 10Jbond: [C: 04-1] "Thanks for the work I think the logic of this script looks good the minus -1 is because it doesn't follow the nrpe api spec[1]. That said" [puppet] - 10https://gerrit.wikimedia.org/r/906103 (https://phabricator.wikimedia.org/T333007) (owner: 10Cathal Mooney) [11:25:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1201 to clone db1224 T326669', diff saved to https://phabricator.wikimedia.org/P46228 and previous config saved to /var/cache/conftool/dbconfig/20230410-112524-marostegui.json [11:25:29] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [11:30:24] (03PS1) 10Marostegui: db1224: Shard s6 [puppet] - 10https://gerrit.wikimedia.org/r/907479 (https://phabricator.wikimedia.org/T326669) [11:30:49] (03CR) 10Marostegui: [C: 03+2] db1224: Shard s6 [puppet] - 10https://gerrit.wikimedia.org/r/907479 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [11:32:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 75%: Pooling', diff saved to https://phabricator.wikimedia.org/P46230 and previous config saved to /var/cache/conftool/dbconfig/20230410-113228-root.json [11:35:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1183 (re)pooling @ 25%: Pooling T334080', diff saved to https://phabricator.wikimedia.org/P46231 and previous config saved to /var/cache/conftool/dbconfig/20230410-113557-root.json [11:36:02] T334080: Move db1183 to s5 - https://phabricator.wikimedia.org/T334080 [11:36:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:36:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs-k8s-node-upgrade.py: upgrade version defaults [puppet] - 10https://gerrit.wikimedia.org/r/907448 (https://phabricator.wikimedia.org/T286856) (owner: 10Arturo Borrero Gonzalez) [11:39:31] (03PS2) 10Arturo Borrero Gonzalez: wmcs-k8s-node-upgrade.py: reboot after upgrades [puppet] - 10https://gerrit.wikimedia.org/r/907449 [11:39:39] (03CR) 10Arturo Borrero Gonzalez: wmcs-k8s-node-upgrade.py: reboot after upgrades (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907449 (owner: 10Arturo Borrero Gonzalez) [11:41:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:47:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 100%: Pooling', diff saved to https://phabricator.wikimedia.org/P46232 and previous config saved to /var/cache/conftool/dbconfig/20230410-114733-root.json [11:49:41] (NodeTextfileStale) firing: (2) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:51:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1183 (re)pooling @ 50%: Pooling T334080', diff saved to https://phabricator.wikimedia.org/P46233 and previous config saved to /var/cache/conftool/dbconfig/20230410-115102-root.json [11:51:07] T334080: Move db1183 to s5 - https://phabricator.wikimedia.org/T334080 [12:02:31] (03CR) 10Majavah: [C: 03+1] wmcs-k8s-node-upgrade.py: reboot after upgrades [puppet] - 10https://gerrit.wikimedia.org/r/907449 (owner: 10Arturo Borrero Gonzalez) 
[12:06:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1183 (re)pooling @ 75%: Pooling T334080', diff saved to https://phabricator.wikimedia.org/P46234 and previous config saved to /var/cache/conftool/dbconfig/20230410-120607-root.json [12:06:12] T334080: Move db1183 to s5 - https://phabricator.wikimedia.org/T334080 [12:15:28] (03PS1) 10Marostegui: mariadb: Productionize db1224 [puppet] - 10https://gerrit.wikimedia.org/r/907483 (https://phabricator.wikimedia.org/T326669) [12:19:55] (03CR) 10Jbond: [C: 03+1] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/906736 (https://phabricator.wikimedia.org/T333204) (owner: 10Filippo Giunchedi) [12:21:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1183 (re)pooling @ 100%: Pooling T334080', diff saved to https://phabricator.wikimedia.org/P46235 and previous config saved to /var/cache/conftool/dbconfig/20230410-122112-root.json [12:21:17] T334080: Move db1183 to s5 - https://phabricator.wikimedia.org/T334080 [12:22:28] (03PS2) 10Andrew Bogott: openstack::util::envscript: allow caller to specify domain_ids [puppet] - 10https://gerrit.wikimedia.org/r/906777 (https://phabricator.wikimedia.org/T330759) [12:22:30] (03PS2) 10Andrew Bogott: Openstack envscripts.pp: create additional scripts for system and domain scope [puppet] - 10https://gerrit.wikimedia.org/r/906778 (https://phabricator.wikimedia.org/T330759) [12:23:07] (03CR) 10CI reject: [V: 04-1] Openstack envscripts.pp: create additional scripts for system and domain scope [puppet] - 10https://gerrit.wikimedia.org/r/906778 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [12:24:15] (03CR) 10Jbond: [C: 04-1] Change check_eth script to work without filter on netdev names (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906103 (https://phabricator.wikimedia.org/T333007) (owner: 10Cathal Mooney) [12:35:24] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1224 [puppet] - 10https://gerrit.wikimedia.org/r/907483 
(https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [12:36:58] (03PS1) 10Marostegui: site.pp: Remove insetup role from db1224 [puppet] - 10https://gerrit.wikimedia.org/r/907484 (https://phabricator.wikimedia.org/T326669) [12:37:28] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup role from db1224 [puppet] - 10https://gerrit.wikimedia.org/r/907484 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [12:40:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46236 and previous config saved to /var/cache/conftool/dbconfig/20230410-124023-root.json [12:55:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46237 and previous config saved to /var/cache/conftool/dbconfig/20230410-125528-root.json [13:05:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs-k8s-node-upgrade.py: reboot after upgrades [puppet] - 10https://gerrit.wikimedia.org/r/907449 (owner: 10Arturo Borrero Gonzalez) [13:10:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46238 and previous config saved to /var/cache/conftool/dbconfig/20230410-131033-root.json [13:25:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46239 and previous config saved to /var/cache/conftool/dbconfig/20230410-132538-root.json [13:26:27] (03CR) 10Cparle: [C: 03+2] structured-data: Temporarily remove ImageSuggestionsPushFailure alert. [alerts] - 10https://gerrit.wikimedia.org/r/906744 (https://phabricator.wikimedia.org/T328789) (owner: 10Xcollazo) [13:28:39] (03Merged) 10jenkins-bot: structured-data: Temporarily remove ImageSuggestionsPushFailure alert. 
[alerts] - 10https://gerrit.wikimedia.org/r/906744 (https://phabricator.wikimedia.org/T328789) (owner: 10Xcollazo) [13:40:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10RobH) [13:40:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10RobH) [13:40:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46240 and previous config saved to /var/cache/conftool/dbconfig/20230410-134042-root.json [13:55:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46241 and previous config saved to /var/cache/conftool/dbconfig/20230410-135547-root.json [13:59:32] 10SRE, 10Abstract Wikipedia team, 10Anti-Harassment, 10Cloud-Services, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10Dbrant) [14:00:44] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:00:50] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:06:28] (03PS4) 10Jforrester: Remove obsolete Timeline configuration and fonts submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723652 (owner: 10Legoktm) [14:06:48] (03CR) 10Jforrester: [C: 03+1] "This should now be (very) safe to deploy. 
:-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723652 (owner: 10Legoktm) [14:07:28] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.730 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:10:36] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49853 bytes in 2.070 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:10:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46242 and previous config saved to /var/cache/conftool/dbconfig/20230410-141052-root.json [14:13:11] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10Jdforrester-WMF) [14:18:46] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10taavi) [14:22:08] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10ssastry) [14:29:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:19] (03CR) 10Andrew Bogott: openstack::util::envscript: allow caller to specify domain_ids (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906777 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [14:31:05] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - 
https://phabricator.wikimedia.org/T332953 (10taavi) [14:34:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:38:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10RobH) [14:38:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10RobH) [14:42:27] (03PS3) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade [puppet] - 10https://gerrit.wikimedia.org/r/906580 (https://phabricator.wikimedia.org/T321309) [14:43:31] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40561/console" [puppet] - 10https://gerrit.wikimedia.org/r/906580 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:43:33] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: set enforce_policy_scope and enforce_new_policy_defaults to false. 
[puppet] - 10https://gerrit.wikimedia.org/r/906775 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [14:46:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install pki-root1002 - https://phabricator.wikimedia.org/T334401 (10RobH) [14:46:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install pki-root1002 - https://phabricator.wikimedia.org/T334401 (10RobH) [14:47:12] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:47:20] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:48:42] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.254 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:48:50] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.466 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:50:38] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP [14:51:45] (03PS3) 10Xcollazo: structured-data: Add metric alert for section image suggestions. 
[alerts] - 10https://gerrit.wikimedia.org/r/905719 (https://phabricator.wikimedia.org/T328789) [14:52:51] !log disable puppet on A:lvs and A:ulsfo to merge 906580 [14:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:16] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: lvs/balancer: unify hiera post bullseye upgrade [puppet] - 10https://gerrit.wikimedia.org/r/906580 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:57:15] !log enable puppet on A:lvs and A:ulsfo to merge 906580 [14:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:38] (03PS3) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/906583 (https://phabricator.wikimedia.org/T321309) [15:00:32] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40562/console" [puppet] - 10https://gerrit.wikimedia.org/r/906583 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:03:54] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10bd808) [15:08:09] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10bd808) The main thing I can think of being a potential blocker for #toolhub and #wikimedia-developer-portal moving would be translatewiki.n... 
[15:08:55] (03CR) 10Herron: [C: 03+2] kafka-logging: stop kafka services on kafka-logging1001 [puppet] - 10https://gerrit.wikimedia.org/r/900336 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [15:14:45] (03PS1) 10Herron: kafka-logging1001: update role [puppet] - 10https://gerrit.wikimedia.org/r/907492 [15:15:37] (03CR) 10Herron: [C: 03+2] kafka-logging1001: update role [puppet] - 10https://gerrit.wikimedia.org/r/907492 (owner: 10Herron) [15:17:10] (03PS2) 10Jbond: P:netbox: add consumers for prefixes and net devices [puppet] - 10https://gerrit.wikimedia.org/r/906031 [15:18:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40564/console" [puppet] - 10https://gerrit.wikimedia.org/r/906031 (owner: 10Jbond) [15:21:19] (03PS2) 10Herron: kafka-logging: bring up kafka-logging1004 with node id 1004 [puppet] - 10https://gerrit.wikimedia.org/r/900337 (https://phabricator.wikimedia.org/T326419) [15:23:43] (03PS3) 10Jbond: spicerack: install python3-aiohttp [puppet] - 10https://gerrit.wikimedia.org/r/906066 [15:23:46] (03CR) 10Herron: [C: 03+2] kafka-logging: bring up kafka-logging1004 with node id 1004 [puppet] - 10https://gerrit.wikimedia.org/r/900337 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [15:24:10] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10MPhamWMF) [15:24:38] 10SRE, 10Infrastructure-Foundations, 10Traffic: Manual upload of iDRAC EXE results in broken web interface - https://phabricator.wikimedia.org/T334146 (10BCornwall) I'm aware of that workaround, but I'm not sure if that's related: Flashing the exe via the web uploader results in the broken web interface but... 
[15:24:49] (03PS3) 10Jbond: P:netbox:host rename netbox host to netbox device [puppet] - 10https://gerrit.wikimedia.org/r/906031 [15:24:53] (03PS1) 10Jbond: P:netbox: add consumers for prefixes and net devices [puppet] - 10https://gerrit.wikimedia.org/r/907493 [15:26:54] (03PS4) 10Jbond: P:netbox:host rename netbox host to netbox device [puppet] - 10https://gerrit.wikimedia.org/r/906031 [15:26:56] (03PS2) 10Jbond: P:netbox: add consumers for prefixes and net devices [puppet] - 10https://gerrit.wikimedia.org/r/907493 [15:28:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40565/console" [puppet] - 10https://gerrit.wikimedia.org/r/907493 (owner: 10Jbond) [15:29:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:30:25] ^ silenced [15:30:53] 10SRE, 10Traffic: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (10BCornwall) [15:31:53] 10SRE, 10Traffic: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (10BCornwall) [15:33:30] (03PS3) 10Jbond: P:netbox: add consumers for prefixes and net devices [puppet] - 10https://gerrit.wikimedia.org/r/907493 [15:37:01] (03PS5) 10Jbond: P:netbox:host rename netbox host to netbox device [puppet] - 10https://gerrit.wikimedia.org/r/906031 [15:37:03] (03PS4) 10Jbond: P:netbox: add consumers for prefixes and net devices [puppet] - 10https://gerrit.wikimedia.org/r/907493 [15:37:24] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [15:37:29] RECOVERY - Check systemd state on ms-be1061 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40568/console" [puppet] - 10https://gerrit.wikimedia.org/r/906031 (owner: 10Jbond) [15:44:16] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:netbox:host rename netbox host to netbox device (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906031 (owner: 10Jbond) [15:44:55] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10BCornwall) > Unless you're also planning to deploy changes, this sounds like you might be fine with restricted which would let you run maintenan... [15:45:01] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10BCornwall) 05In progress→03Stalled [15:46:14] !log Disable Puppet/PyBal on lvs6001 in preparation for reimaging - T321309 [15:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:19] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [15:47:29] 10SRE, 10LDAP-Access-Requests: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (10BCornwall) p:05Triage→03Medium [15:48:17] (03PS5) 10Jbond: P:netbox: add consumers for prefixes and net devices [puppet] - 10https://gerrit.wikimedia.org/r/907493 [15:49:41] (NodeTextfileStale) firing: (2) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:49:41] PROBLEM - PyBal backends health check on lvs6001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 
Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:49:50] (03CR) 10BCornwall: [C: 03+2] hiera: lvs/interfaces: update 6001 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906640 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [15:49:57] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:50:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:50:05] PROBLEM - pybal on lvs6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:52:21] PROBLEM - PyBal connections to etcd on lvs6001 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [15:52:43] ^ expected [15:53:34] !log centrallog1002:~# systemctl restart rsyslog [15:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:57:49] (03PS6) 10Jbond: P:netbox: add consumers for prefixes and net devices [puppet] - 10https://gerrit.wikimedia.org/r/907493 [16:00:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack::util::envscript: allow caller to specify domain_ids [puppet] - 10https://gerrit.wikimedia.org/r/906777 
(https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [16:05:21] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs6001.drmrs.wmnet with OS bullseye [16:05:27] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1061 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:05:31] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs6001.drmrs.wmnet with OS bullseye [16:23:00] ryankemper: Hi! I'm looking for some SRE help, since today week all our SREs are off. I'd like to remove a couple systemd timers that we are migrating to Airflow. I created 2 patches, one for absenting them, the other to remove them completely. Could you please review and merge if appropriate? Or else, can you point me to someone that can help? Thank you a lot! 
[16:23:07] https://gerrit.wikimedia.org/r/c/operations/puppet/+/906665 [16:23:12] https://gerrit.wikimedia.org/r/c/operations/puppet/+/906667 [16:27:54] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs6001.drmrs.wmnet with reason: host reimage [16:31:14] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs6001.drmrs.wmnet with reason: host reimage [16:48:19] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs6001.drmrs.wmnet with OS bullseye [16:48:22] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:48:32] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs6001.drmrs.wmnet with OS bullseye completed: - lvs6001 (**PASS**) - Downtimed on Icinga/Aler... 
[16:49:56] PROBLEM - Check systemd state on cp6015 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service,varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:17] er what happened here [16:50:56] oh ok, same as https://phabricator.wikimedia.org/T253093 [16:51:06] it should recover but yeah, I guess it's time to look into this more carefully [16:51:56] by "recover", I meant that the service restarts (Restart is set to always) [16:55:32] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [16:59:14] RECOVERY - Check systemd state on cp6015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:22] 10SRE, 10Traffic: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (10BCornwall) [17:04:17] (03PS1) 10BCornwall: hiera: lvs/interfaces: update lvs6002 iface name [puppet] - 10https://gerrit.wikimedia.org/r/907499 (https://phabricator.wikimedia.org/T321309) [17:19:07] (03CR) 10Ssingh: "Please double-check my comments:" [puppet] - 10https://gerrit.wikimedia.org/r/907499 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [17:21:05] (03PS2) 10BCornwall: hiera: lvs/interfaces: update lvs6002 iface name [puppet] - 10https://gerrit.wikimedia.org/r/907499 (https://phabricator.wikimedia.org/T321309) [17:21:09] (03CR) 10BCornwall: hiera: lvs/interfaces: update lvs6002 iface name (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/907499 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [17:25:04] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [17:26:20] (03CR) 10Ssingh: [C: 03+1] hiera: lvs/interfaces: update lvs6002 iface name [puppet] - 10https://gerrit.wikimedia.org/r/907499 
(https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [17:28:55] (03CR) 10BCornwall: [C: 03+2] hiera: lvs/interfaces: update lvs6002 iface name [puppet] - 10https://gerrit.wikimedia.org/r/907499 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [17:29:24] !log Disable Puppet/PyBal on lvs6002 in preparation for reimaging - T321309 [17:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:29] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [17:32:44] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:33:16] PROBLEM - PyBal backends health check on lvs6002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [17:33:34] PROBLEM - pybal on lvs6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:34:21] ^expected [17:36:20] PROBLEM - PyBal connections to etcd on lvs6002 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [17:40:52] (03PS1) 10Herron: kafka-logging: stop kafka service on kafka-logging1002 [puppet] - 10https://gerrit.wikimedia.org/r/907504 (https://phabricator.wikimedia.org/T326419) [17:40:54] (03PS1) 10Herron: kafka-logging: bring up kafka-logging1005 with node id 1005 [puppet] - 10https://gerrit.wikimedia.org/r/907505 (https://phabricator.wikimedia.org/T326419) [17:41:23] 10SRE, 10Infrastructure-Foundations, 10Traffic: Receive network latency reports from the browsers - https://phabricator.wikimedia.org/T334417 (10JameelKaisar) [17:41:29] (03PS1) 10Cwhite: logstash: decouple template_version and ecs.version [puppet] - 10https://gerrit.wikimedia.org/r/906701 
(https://phabricator.wikimedia.org/T292585) [17:44:27] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) >>! In T332953#8768454, @bd808 wrote: > The main thing I can think of being a potential blocker for #toolhub and #wikimedia-dev... [17:47:37] (03PS2) 10Cwhite: logstash: decouple template_version and ecs.version [puppet] - 10https://gerrit.wikimedia.org/r/906701 (https://phabricator.wikimedia.org/T292585) [18:02:37] (03PS1) 10Dzahn: trafficserver: remove map for iegreview.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/907507 (https://phabricator.wikimedia.org/T334415) [18:04:57] (03PS2) 10Dzahn: trafficserver: remove ma/config for iegreview.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/907507 (https://phabricator.wikimedia.org/T334415) [18:10:14] (03PS3) 10Majavah: openstack: puppet-enc: add foreign keys for hiera/role tables [puppet] - 10https://gerrit.wikimedia.org/r/906085 [18:10:16] (03PS3) 10Majavah: openstack: puppet-enc: add endpoint for deleting entire projects [puppet] - 10https://gerrit.wikimedia.org/r/906086 (https://phabricator.wikimedia.org/T334127) [18:10:18] (03PS3) 10Majavah: openstack: admin_scripts: properly remove old projects from enc [puppet] - 10https://gerrit.wikimedia.org/r/906087 (https://phabricator.wikimedia.org/T334127) [18:10:20] (03PS1) 10Dzahn: miscweb: remove iegreview profile from role/hiera/tests [puppet] - 10https://gerrit.wikimedia.org/r/907509 (https://phabricator.wikimedia.org/T334415) [18:10:22] (03CR) 10CI reject: [V: 04-1] openstack: puppet-enc: add foreign keys for hiera/role tables [puppet] - 10https://gerrit.wikimedia.org/r/906085 (owner: 10Majavah) [18:10:24] (03CR) 10CI reject: [V: 04-1] openstack: puppet-enc: add endpoint for deleting entire projects [puppet] - 10https://gerrit.wikimedia.org/r/906086 (https://phabricator.wikimedia.org/T334127) (owner: 10Majavah) 
[18:10:28] (03CR) 10CI reject: [V: 04-1] openstack: admin_scripts: properly remove old projects from enc [puppet] - 10https://gerrit.wikimedia.org/r/906087 (https://phabricator.wikimedia.org/T334127) (owner: 10Majavah) [18:11:11] (03PS4) 10Majavah: openstack: puppet-enc: add foreign keys for hiera/role tables [puppet] - 10https://gerrit.wikimedia.org/r/906085 [18:11:13] (03PS4) 10Majavah: openstack: puppet-enc: add endpoint for deleting entire projects [puppet] - 10https://gerrit.wikimedia.org/r/906086 (https://phabricator.wikimedia.org/T334127) [18:11:15] (03PS4) 10Majavah: openstack: admin_scripts: properly remove old projects from enc [puppet] - 10https://gerrit.wikimedia.org/r/906087 (https://phabricator.wikimedia.org/T334127) [18:11:24] (03CR) 10CI reject: [V: 04-1] openstack: puppet-enc: add endpoint for deleting entire projects [puppet] - 10https://gerrit.wikimedia.org/r/906086 (https://phabricator.wikimedia.org/T334127) (owner: 10Majavah) [18:11:26] (03CR) 10CI reject: [V: 04-1] openstack: admin_scripts: properly remove old projects from enc [puppet] - 10https://gerrit.wikimedia.org/r/906087 (https://phabricator.wikimedia.org/T334127) (owner: 10Majavah) [18:13:11] (03PS1) 10Cwhite: remove strict ecs version gate [puppet] - 10https://gerrit.wikimedia.org/r/906702 [18:13:22] (03CR) 10CI reject: [V: 04-1] remove strict ecs version gate [puppet] - 10https://gerrit.wikimedia.org/r/906702 (owner: 10Cwhite) [18:16:48] !log krinkle@deploy2002 Synchronized wmf-config/: (no justification provided) (duration: 587m 34s) [18:17:07] wtf [18:17:34] tsk /j [18:18:32] 587 minutes? [18:19:57] I started searching in my GNU screen yesterday after the deployment was effectively done [18:20:07] (it had just one more server to sync and to log to IRC) [18:20:15] I scrolled up and back down, but stayed within the ctrl-F search buffer thingy [18:20:24] apparently that prevented the process from continuing to execute? 
[18:20:24] oh, ha [18:20:25] maybe unrelated but deploy2002 seems to be failing puppet runs [18:20:34] so when I logged in now to exit my search buffer, the process suddenly finished [18:20:44] that seems Very Bad (TM) [18:21:03] looking at why puppet fails. and it is: [18:21:10] DNS lookup failed for kafka-logging1004.eqiad.wmnet [18:21:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:09] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs6002.drmrs.wmnet with OS bullseye [18:22:20] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs6002.drmrs.wmnet with OS bullseye [18:22:50] mutante: seems like that is caused by https://netbox.wikimedia.org/ipam/ip-addresses/11777/ not having the "DNS name" field set [18:23:14] (same thing probably with https://netbox.wikimedia.org/ipam/ip-addresses/11783/) [18:23:17] herron: ^ [18:23:27] (03PS1) 10Jdlrobson: Drop unused VectorPageTools feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907511 (https://phabricator.wikimedia.org/T332090) [18:24:19] taavi: thanks, yea, looks related to https://phabricator.wikimedia.org/T326419 or close to that [18:24:37] not in dns? hmm... 
looking at deploy2002 [18:25:11] it has a v4 entry, but not a v6 one since the netbox address objects I just linked don't have the 'DNS name' field set [18:25:24] Resolv::DNS::Resource::IN::AAAA [18:25:26] ^ missing v6 [18:25:40] ah, alright, I'll add that now [18:27:36] (03PS3) 10Dzahn: trafficserver: remove map/config for iegreview.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/907507 (https://phabricator.wikimedia.org/T334415) [18:30:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:20] 10SRE, 10Traffic: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (10BCornwall) [18:31:27] !log herron@cumin1001 START - Cookbook sre.dns.netbox [18:33:12] (03CR) 10Thcipriani: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/906086 (https://phabricator.wikimedia.org/T334127) (owner: 10Majavah) [18:33:22] !log herron@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add kafka-logging1004 ipv6 - herron@cumin1001" [18:34:27] !log herron@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add kafka-logging1004 ipv6 - herron@cumin1001" [18:34:27] !log herron@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:37:15] thanks taavi mutante, puppet it happy once again [18:37:18] is* [18:43:19] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs6002.drmrs.wmnet with reason: host reimage [18:46:59] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs6002.drmrs.wmnet with reason: host reimage [18:49:25] 10SRE, 10SRE-Access-Requests: Update SSH key for Mikhail Popov - https://phabricator.wikimedia.org/T334423 (10mpopov) [18:53:59] (03CR) 10Thcipriani: "recheck (restarted 
zuul-merger)" [puppet] - 10https://gerrit.wikimedia.org/r/906086 (https://phabricator.wikimedia.org/T334127) (owner: 10Majavah) [19:01:36] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10sbassett) [19:08:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:56] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs6002.drmrs.wmnet with OS bullseye [19:09:07] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs6002.drmrs.wmnet with OS bullseye completed: - lvs6002 (**WARN**) - Downtimed on Icinga/Aler... [19:13:51] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:14:00] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10sbassett) >>! In T332953#8769056, @thcipriani wrote: > - Tricky part: recreate mediawiki-i18n-check, only run on changes from l10nbot/local... 
[19:15:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:38] !log power-cycling mw2448 - down, no console output T334429 [19:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:43] T334429: mw2448 crashed - https://phabricator.wikimedia.org/T334429 [19:19:17] RECOVERY - Host mw2448 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:19:25] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@6d6f1ec]: (no justification provided) [19:19:36] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@6d6f1ec]: (no justification provided) (duration: 00m 11s) [19:19:53] PROBLEM - Check systemd state on mw2448 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:21] PROBLEM - puppet last run on mw2448 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:20:49] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:22:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs6002.drmrs.wmnet [19:22:52] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs6002.drmrs.wmnet [19:23:05] RECOVERY - Check systemd state on mw2448 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:24:30] mforns: I’m off on PTO today but back around tomorrow so will be able to assist then if you still need help! 
[19:25:03] !log mw2488 - scap pull - T334429 [19:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:08] T334429: mw2448 crashed - https://phabricator.wikimedia.org/T334429 [19:25:59] RECOVERY - puppet last run on mw2448 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:31:36] (03PS1) 10BCornwall: hiera: lvs/interfaces: update lvs3005 iface name [puppet] - 10https://gerrit.wikimedia.org/r/907519 (https://phabricator.wikimedia.org/T321309) [19:33:30] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:35:53] !log Disable Puppet/PyBal on lvs3005 in preparation for reimaging - T321309 [19:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:58] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [19:36:02] (03CR) 10Ssingh: [C: 03+1] hiera: lvs/interfaces: update lvs3005 iface name [puppet] - 10https://gerrit.wikimedia.org/r/907519 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [19:36:49] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:37:11] ^expected! 
[19:39:55] (03PS1) 10Dzahn: iegreview: remove blackbox::http monitoring [puppet] - 10https://gerrit.wikimedia.org/r/907520 (https://phabricator.wikimedia.org/T334415) [19:40:29] PROBLEM - pybal on lvs3005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:40:39] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [19:40:53] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:43:16] (03CR) 10Dzahn: [C: 03+2] iegreview: remove blackbox::http monitoring [puppet] - 10https://gerrit.wikimedia.org/r/907520 (https://phabricator.wikimedia.org/T334415) (owner: 10Dzahn) [19:43:49] PROBLEM - PyBal connections to etcd on lvs3005 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [19:48:22] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED [19:49:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Jclark-ctr) [19:49:41] (NodeTextfileStale) firing: (2) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:51:16] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirtlocal1001'] [19:51:49] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts 
['cloudvirtlocal1001'] [19:52:03] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirtlocal1002'] [19:52:25] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirtlocal1002'] [19:52:33] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirtlocal1003'] [19:53:14] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirtlocal1003'] [19:53:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Jclark-ctr) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230410T2000). [20:00:04] No Gerrit patches in the queue for this window AFAICS. [20:00:42] Indeed, no patches. 
:) [20:05:43] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED [20:06:57] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (10Jclark-ctr) [20:07:01] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED [20:07:27] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED [20:09:14] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED [20:15:08] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs3005.esams.wmnet with OS bullseye [20:15:18] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs3005.esams.wmnet with OS bullseye [20:19:48] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/906736 (https://phabricator.wikimedia.org/T333204) (owner: 10Filippo Giunchedi) [20:20:18] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/906581 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [20:20:45] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/907504 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [20:21:05] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/907505 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [20:33:34] (03PS3) 10Eevans: sessionstore: assign values to net.ipv4.conf.all.arp_{ignore,announce} [puppet] - 10https://gerrit.wikimedia.org/r/905746 (https://phabricator.wikimedia.org/T327954) [20:36:50] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3005.esams.wmnet with reason: host reimage [20:38:57] (03CR) 10Eevans: [C: 03+2] sessionstore: assign values to net.ipv4.conf.all.arp_{ignore,announce} [puppet] - 10https://gerrit.wikimedia.org/r/905746 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [20:40:29] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3005.esams.wmnet with reason: host reimage [20:50:15] (03PS1) 10Eevans: sessionstore: (intentionally) make native transport unreachable [puppet] - 10https://gerrit.wikimedia.org/r/907527 (https://phabricator.wikimedia.org/T327954) [20:51:43] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/907527 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [20:53:48] ryankemper: thank you and don't worry! [20:54:36] (03CR) 10Eevans: [C: 03+2] sessionstore: (intentionally) make native transport unreachable [puppet] - 10https://gerrit.wikimedia.org/r/907527 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [20:57:35] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [21:00:06] Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230410T2100) [21:02:47] Hey all - I have one quick security mitigation update for PrivateSettings.php to deploy. 
[21:04:55] !log restarting Cassandra, sessionstore1002-a — T327954 [21:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:00] T327954: session storage: dissonant cluster status after reboot (was: 'cannot achieve consistency level' errors) - https://phabricator.wikimedia.org/T327954 [21:06:26] !log restarting Cassandra, sessionstore1003-a — T327954 [21:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:54] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host sessionstore1001.eqiad.wmnet [21:10:35] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [21:13:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:13:41] !log Deployed updated security mitigation for T333140 [21:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:48] (03CR) 10BCornwall: [C: 03+2] hiera: lvs/interfaces: update lvs3005 iface name [puppet] - 10https://gerrit.wikimedia.org/r/907519 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [21:14:13] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs3005.esams.wmnet with OS bullseye [21:14:23] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs3005.esams.wmnet with OS bullseye executed with errors: - lvs3005 (**FAIL**) - Downtimed on... 
[21:14:29] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs3005.esams.wmnet with OS bullseye [21:14:39] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs3005.esams.wmnet with OS bullseye [21:18:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:21:51] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host sessionstore1001.eqiad.wmnet [21:22:26] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [21:27:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:31:08] !log restarting Cassandra, sessionstore1002-a — T327954 [21:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:13] T327954: session storage: dissonant cluster status after reboot (was: 'cannot achieve consistency level' errors) - https://phabricator.wikimedia.org/T327954 [21:31:45] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wi [21:32:45] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [21:32:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (5) wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:32:59] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3005.esams.wmnet with reason: host reimage [21:33:41] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host sessionstore1001.eqiad.wmnet [21:34:29] PROBLEM - cassandra-a CQL 10.64.0.144:9042 on sessionstore1001 is CRITICAL: connect to address 10.64.0.144 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [21:36:22] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3005.esams.wmnet with reason: host reimage [21:36:28] grrr... 
I've got that ^^ [21:37:12] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.144:9042 on sessionstore1001 is CRITICAL: connect to address 10.64.0.144 and port 9042: Connection refused eevans Disabled (T327954) https://phabricator.wikimedia.org/T93886 [21:40:26] 10SRE, 10Traffic: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (10BCornwall) [21:41:42] (SystemdUnitFailed) firing: wdqs-blazegraph.service Failed on wdqs1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:43:14] (03PS1) 10Eevans: Revert "sessionstore: (intentionally) make native transport unreachable" [puppet] - 10https://gerrit.wikimedia.org/r/906623 [21:43:53] (03CR) 10Eevans: [C: 03+2] Revert "sessionstore: (intentionally) make native transport unreachable" [puppet] - 10https://gerrit.wikimedia.org/r/906623 (owner: 10Eevans) [21:46:04] !log restarting Cassandra, sessionstore1001-a, to restore native transport settings — T327954 [21:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:08] T327954: session storage: dissonant cluster status after reboot (was: 'cannot achieve consistency level' errors) - https://phabricator.wikimedia.org/T327954 [21:46:11] 10SRE, 10Traffic: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (10BCornwall) @wiki_willy I'd love your advice on upgrading lvs1013-1016 NICs! These servers are r430s. I've been able to upgrade the [[ https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=rh05p&osco... 
[21:46:42] (SystemdUnitFailed) resolved: wdqs-blazegraph.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:46:52] (03PS1) 10Eevans: Revert "sessionstore: assign values to net.ipv4.conf.all.arp_{ignore,announce}" [puppet] - 10https://gerrit.wikimedia.org/r/906624 [21:48:04] RECOVERY - cassandra-a CQL 10.64.0.144:9042 on sessionstore1001 is OK: TCP OK - 0.034 second response time on 10.64.0.144 port 9042 https://phabricator.wikimedia.org/T93886 [21:49:37] (03CR) 10Eevans: [C: 03+2] Revert "sessionstore: assign values to net.ipv4.conf.all.arp_{ignore,announce}" [puppet] - 10https://gerrit.wikimedia.org/r/906624 (owner: 10Eevans) [21:53:14] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs3005.esams.wmnet with OS bullseye [21:53:25] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs3005.esams.wmnet with OS bullseye completed: - lvs3005 (**WARN**) - Downtimed on Icinga/Aler... 
[21:56:54] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:57:11] (03PS1) 10Jdlrobson: Deploy Vector 2022 on Welsh Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907539 (https://phabricator.wikimedia.org/T334279) [22:07:25] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10Michaelcochez) [22:12:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (4) wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:13:17] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [22:20:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:30] (03CR) 10Dzahn: [C: 03+2] "thank you for everything related to planet, Legoktm" [puppet] - 10https://gerrit.wikimedia.org/r/906763 (owner: 10Legoktm) [22:26:33] (03CR) 10Dzahn: [C: 03+2] trafficserver: remove map/config for iegreview.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/907507 (https://phabricator.wikimedia.org/T334415) (owner: 10Dzahn) [22:30:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:51:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:53:02] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on miscweb1002.eqiad.wmnet with reason: decom [22:53:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on miscweb1002.eqiad.wmnet with reason: decom [22:55:50] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts miscweb1002.eqiad.wmnet [22:56:12] (03PS1) 10Dzahn: remove miscweb1002->webserver-misc-apps [dns] - 10https://gerrit.wikimedia.org/r/907546 (https://phabricator.wikimedia.org/T334024) [22:58:56] 10SRE, 10Traffic: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (10wiki_willy) Hi @BCornwall - thanks for reaching out. I'm going to add @Papaul to the thread, for any input/suggestions that he might have on upgrading the firmware on these NICs >>! In T334259#8769600, @BCornwa... [22:59:16] (03PS1) 10Dzahn: site/miscweb: remove miscweb1002, switch rsync source to miscweb1003 [puppet] - 10https://gerrit.wikimedia.org/r/907547 (https://phabricator.wikimedia.org/T331896) [23:00:49] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [23:01:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:04:36] (03CR) 10Dzahn: [C: 03+2] remove miscweb1002->webserver-misc-apps [dns] - 10https://gerrit.wikimedia.org/r/907546 (https://phabricator.wikimedia.org/T334024) (owner: 10Dzahn) [23:06:15] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: miscweb1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1001" [23:07:38] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Dzahn) [23:07:43] !log
dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: miscweb1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1001" [23:07:43] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:07:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts miscweb1002.eqiad.wmnet [23:08:52] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Dzahn) [23:16:31] (03CR) 10Cwhite: [C: 03+1] kafka-logging: bring up kafka-logging1005 with node id 1005 [puppet] - 10https://gerrit.wikimedia.org/r/907505 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [23:16:52] (03CR) 10Cwhite: [C: 03+1] kafka-logging: stop kafka service on kafka-logging1002 [puppet] - 10https://gerrit.wikimedia.org/r/907504 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [23:17:46] (03CR) 10Cwhite: [C: 03+1] alertmanager: sink notifications for dev/test hosts [puppet] - 10https://gerrit.wikimedia.org/r/906736 (https://phabricator.wikimedia.org/T333204) (owner: 10Filippo Giunchedi) [23:29:17] (03CR) 10Dzahn: [C: 03+2] site/miscweb: remove miscweb1002, switch rsync source to miscweb1003 [puppet] - 10https://gerrit.wikimedia.org/r/907547 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [23:49:41] (NodeTextfileStale) firing: (2) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale