[00:05:06] !log disable puppet on dns4003 till we resolve the puppet failures
[00:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:53] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS5511/IPv4: Connect - Orange, AS5511/IPv6: Connect - Orange https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:35:53] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:40:35] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:55:37] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 57, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:11:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[01:17:05] RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:29:05] SRE, ops-eqiad, cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (Papaul) @andew if the server is not in production can i take a quick look at it
[01:38:45] (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:53:43] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[01:58:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[02:04:05] PROBLEM - SSH on db1113.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:05:21] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:08:45] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cloudvirt1023.eqiad.wmnet
[02:19:50] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudvirt1023.eqiad.wmnet
[02:20:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cloudvirt1023.eqiad.wmnet
[02:20:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudvirt1023.eqiad.wmnet
[02:21:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cloudvirt1023.eqiad.wmnet
[02:23:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[02:24:21] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:59:26] SRE, ops-eqiad, cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (Papaul) @ayounsi @cmooney looks like we are having a situation similar to https://phabricator.wikimedia.org/T303296. The server racked in B7 is sending request to the DHCP serv...
[02:59:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudvirt1023.eqiad.wmnet
[03:05:25] RECOVERY - SSH on db1113.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:06:37] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:09:01] PROBLEM - Check systemd state on mw1316 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:14:31] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:16:07] (PS2) DDesouza: Remove Research Incentive survey from arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/837695 (https://phabricator.wikimedia.org/T318328)
[03:25:35] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:00:23] RECOVERY - Check systemd state on mw1316 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:00:06] (PS1) Marostegui: es2030: Upgrade mariadb [puppet] - https://gerrit.wikimedia.org/r/838288
[05:00:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2030', diff saved to https://phabricator.wikimedia.org/P35352 and previous config saved to /var/cache/conftool/dbconfig/20221005-050018-root.json
[05:01:28] (CR) Marostegui: [C: +2] es2030: Upgrade mariadb [puppet] - https://gerrit.wikimedia.org/r/838288 (owner: Marostegui)
[05:09:37] (PS1) Marostegui: Revert "es2030: Upgrade mariadb" [puppet] - https://gerrit.wikimedia.org/r/838213
[05:09:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35353 and previous config saved to /var/cache/conftool/dbconfig/20221005-050944-root.json
[05:11:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:12:20] (CR) Marostegui: [C: +2] Revert "es2030: Upgrade mariadb" [puppet] - https://gerrit.wikimedia.org/r/838213 (owner: Marostegui)
[05:13:55] (PS1) Marostegui: control-mariadb-5.5,control-mysql-5.6: Remove them [software] - https://gerrit.wikimedia.org/r/838534
[05:16:44] (CR) Marostegui: [C: +2] control-mariadb-5.5,control-mysql-5.6: Remove them [software] - https://gerrit.wikimedia.org/r/838534 (owner: Marostegui)
[05:17:20] (Merged) jenkins-bot: control-mariadb-5.5,control-mysql-5.6: Remove them [software] - https://gerrit.wikimedia.org/r/838534 (owner: Marostegui)
[05:24:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35354 and previous config saved to /var/cache/conftool/dbconfig/20221005-052449-root.json
[05:32:59] (CR) Ayounsi: [C: +1] sites.yaml: add dns4003 to anycast_neighbors [homer/public] - https://gerrit.wikimedia.org/r/838239 (https://phabricator.wikimedia.org/T317247) (owner: Ssingh)
[05:33:48] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 62044
[05:39:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35355 and previous config saved to /var/cache/conftool/dbconfig/20221005-053954-root.json
[05:41:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:41:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:42:54] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.221 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:43:02] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:46:04] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (phab1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:50:42] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 62044
[05:51:14] RECOVERY - SSH on analytics1076.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:55:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35356 and previous config saved to /var/cache/conftool/dbconfig/20221005-055459-root.json
[05:58:12] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:09:00] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:10:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35357 and previous config saved to /var/cache/conftool/dbconfig/20221005-061004-root.json
[06:23:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[06:25:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35358 and previous config saved to /var/cache/conftool/dbconfig/20221005-062509-root.json
[06:27:14] (CR) Elukey: [V: +1 C: +2] Move kafka-logging1002's Kafka TLS config to PKI [puppet] - https://gerrit.wikimedia.org/r/838123 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[06:27:34] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on kafka-logging1002.eqiad.wmnet with reason: Kafka PKI upgrade
[06:27:47] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on kafka-logging1002.eqiad.wmnet with reason: Kafka PKI upgrade
[06:30:02] !log restart kafka on kafka-logging1002 to pick up the new cert+settings for PKI
[06:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:12] SRE, ops-eqiad, cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (ayounsi) @Papaul ping me when you're around and I can walk you through it. TLDR is: `cloudsw1-c8-eqiad# deactivate vlans cloud-hosts1-eqiad forwarding-options dhcp-security opt...
[06:31:18] (CR) Ayounsi: [V: +1 C: +2] Use license keys stored in Netbox instead of homer-private [homer/public] - https://gerrit.wikimedia.org/r/838188 (https://phabricator.wikimedia.org/T311008) (owner: Ayounsi)
[06:31:27] (CR) Ayounsi: [V: +1 C: +2] LibreNMS report: ignore licenses [software/netbox-extras] - https://gerrit.wikimedia.org/r/838187 (owner: Ayounsi)
[06:32:45] (Merged) jenkins-bot: Use license keys stored in Netbox instead of homer-private [homer/public] - https://gerrit.wikimedia.org/r/838188 (https://phabricator.wikimedia.org/T311008) (owner: Ayounsi)
[06:33:00] (Merged) jenkins-bot: LibreNMS report: ignore licenses [software/netbox-extras] - https://gerrit.wikimedia.org/r/838187 (owner: Ayounsi)
[06:34:36] (PS1) Elukey: Move kafka-logging1003 to the kafka PKI intermediate CA [puppet] - https://gerrit.wikimedia.org/r/838643 (https://phabricator.wikimedia.org/T300130)
[06:36:15] (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[06:36:48] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 5 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37445/console" [puppet] - https://gerrit.wikimedia.org/r/838643 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[06:40:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35359 and previous config saved to /var/cache/conftool/dbconfig/20221005-064014-root.json
[06:41:15] (PS1) Elukey: role::kafka::logging: final clean up after migrating to PKI [puppet] - https://gerrit.wikimedia.org/r/838650 (https://phabricator.wikimedia.org/T300130)
[06:41:15] (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[06:42:55] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37446/console" [puppet] - https://gerrit.wikimedia.org/r/838650 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[06:43:51] (CR) Muehlenhoff: netops::ripeatlas::cli: Switch to systemd::sysuser (1 comment) [puppet] - https://gerrit.wikimedia.org/r/837072 (owner: Muehlenhoff)
[06:43:57] (PS3) Muehlenhoff: netops::ripeatlas::cli: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837072
[06:44:16] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:44:43] (CR) Elukey: [V: +1] "The pcc bit to consider is:" [puppet] - https://gerrit.wikimedia.org/r/838650 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[06:55:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35360 and previous config saved to /var/cache/conftool/dbconfig/20221005-065519-root.json
[06:58:32] (CR) Muehlenhoff: [C: +2] netops::ripeatlas::cli: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837072 (owner: Muehlenhoff)
[07:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221005T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:39] (CR) Giuseppe Lavagetto: confd: export template status as Prometheus metrics (3 comments) [puppet] - https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) (owner: Filippo Giunchedi)
[07:01:23] (PS1) Muehlenhoff: Fix parameter [puppet] - https://gerrit.wikimedia.org/r/838668
[07:02:36] (CR) Muehlenhoff: [C: +2] Fix parameter [puppet] - https://gerrit.wikimedia.org/r/838668 (owner: Muehlenhoff)
[07:09:34] PROBLEM - SSH on db1113.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:09:35] SRE, serviceops: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 (elukey)
[07:11:23] (CR) Hashar: gerrit: decouple scap and daemon users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/832345 (https://phabricator.wikimedia.org/T317412) (owner: Hashar)
[07:13:22] (CR) Hashar: [C: +1] "Well done thanks :)" [puppet] - https://gerrit.wikimedia.org/r/833385 (owner: Hashar)
[07:18:44] (CR) Filippo Giunchedi: [C: +1] Move kafka-logging1003 to the kafka PKI intermediate CA [puppet] - https://gerrit.wikimedia.org/r/838643 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[07:19:07] (CR) Filippo Giunchedi: [C: +1] role::kafka::logging: final clean up after migrating to PKI [puppet] - https://gerrit.wikimedia.org/r/838650 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[07:21:57] (CR) DCausse: [C: +1] beta: Set shard count for commonswiki_file to 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/838272 (https://phabricator.wikimedia.org/T316711) (owner: Ebernhardson)
[07:34:39] SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) Open→Resolved I am going to tentatively consider this fixed. It's been a month since we repooled the hosts with...
[07:45:36] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:48:11] (CR) DCausse: [C: +1] cirrus: remove cross-dc poolcounter increases (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/838269 (owner: Ebernhardson)
[07:48:36] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:49:06] (CR) Elukey: [V: +1 C: +2] Move kafka-logging1003 to the kafka PKI intermediate CA [puppet] - https://gerrit.wikimedia.org/r/838643 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[07:49:58] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on kafka-logging1003.eqiad.wmnet with reason: Kafka PKI upgrade
[07:50:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on kafka-logging1003.eqiad.wmnet with reason: Kafka PKI upgrade
[07:52:55] (CR) DCausse: [C: +1] cirrus: Drop client side connect timeout config [mediawiki-config] - https://gerrit.wikimedia.org/r/838276 (https://phabricator.wikimedia.org/T143553) (owner: Ebernhardson)
[07:54:19] !log restart kafka on kafka-logging1003 to pick up new PKI TLS settings
[07:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:15] (PS2) Elukey: role::kafka::logging: final clean up after migrating to PKI [puppet] - https://gerrit.wikimedia.org/r/838650 (https://phabricator.wikimedia.org/T300130)
[07:55:42] (CR) Filippo Giunchedi: confd: export template status as Prometheus metrics (3 comments) [puppet] - https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) (owner: Filippo Giunchedi)
[07:57:06] (PS3) Filippo Giunchedi: confd: export template status as Prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272)
[07:57:08] (PS3) Filippo Giunchedi: confd: install and run confd_prometheus_metrics [puppet] - https://gerrit.wikimedia.org/r/838079 (https://phabricator.wikimedia.org/T319272)
[08:00:54] (CR) Filippo Giunchedi: confd: export template status as Prometheus metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) (owner: Filippo Giunchedi)
[08:03:37] (CR) Muehlenhoff: [C: +2] New cookbook to roll-restart (or roll-reboot) the eventschemas cluster(s) (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/836181 (owner: Muehlenhoff)
[08:05:44] (PS1) Ayounsi: Only apply the license stanza when needed [homer/public] - https://gerrit.wikimedia.org/r/838715 (https://phabricator.wikimedia.org/T311008)
[08:08:52] (CR) Ayounsi: [V: +1 C: +2] "Tested locally with an empty inventory and inventory with no licenses." [homer/public] - https://gerrit.wikimedia.org/r/838715 (https://phabricator.wikimedia.org/T311008) (owner: Ayounsi)
[08:09:35] (Merged) jenkins-bot: Only apply the license stanza when needed [homer/public] - https://gerrit.wikimedia.org/r/838715 (https://phabricator.wikimedia.org/T311008) (owner: Ayounsi)
[08:28:17] (Abandoned) Hashar: gerrit: disable automatic plugin handling [puppet] - https://gerrit.wikimedia.org/r/831913 (https://phabricator.wikimedia.org/T317412) (owner: Hashar)
[08:30:04] hoo: gettimeofday() says it's time for Wikibase client unexpectedUnconnectedPage page prop format conversion. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221005T0830)
[08:34:15] (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[08:39:15] (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[08:39:46] SRE, Traffic: Don't set cookies for api.wikimedia.org at the caching layer - https://phabricator.wikimedia.org/T260943 (hnowlan) Late responding on this one but thanks a lot for adding this feature!
[08:57:00] (PS1) Hoo man: Disable UnconnectedPagePagePropMigrationLegacyFormat for arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/838732
[08:57:57] (CR) Vgutierrez: [C: +2] vcl: stop overriding cache-control header for bad title errors [puppet] - https://gerrit.wikimedia.org/r/837742 (https://phabricator.wikimedia.org/T316932) (owner: Zabe)
[08:59:57] (Abandoned) Vgutierrez: Revert "Cache Badtitle 400s for 60s in varnish-fe" [puppet] - https://gerrit.wikimedia.org/r/769827 (owner: Legoktm)
[09:02:59] (CR) Hoo man: [C: +2] Disable UnconnectedPagePagePropMigrationLegacyFormat for arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/838732 (owner: Hoo man)
[09:04:01] (Merged) jenkins-bot: Disable UnconnectedPagePagePropMigrationLegacyFormat for arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/838732 (owner: Hoo man)
[09:05:54] RECOVERY - Confd vcl based reload on cp1086 is OK: reload-vcl successfully ran 0h, 2 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:06:55] !log reimport ganeti 3.0.1-1~bpo10+1 to component/ganeti3 (got removed alongside via a reprepro bug/misfeature when the bullseye component was removed)
[09:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:54] !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable UnconnectedPagePagePropMigrationLegacyFormat for arwiki (duration: 03m 49s)
[09:10:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:11:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[09:11:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:11:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:12:10] RECOVERY - SSH on db1113.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:15:02] (PS12) Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595)
[09:15:24] (CR) JMeybohm: [V: +1 C: +2] k8s: Align formatting along k8s module and profiles [puppet] - https://gerrit.wikimedia.org/r/838168 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[09:15:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:15:27] (CR) JMeybohm: [V: +1 C: +2] Remove unused mwautopull class [puppet] - https://gerrit.wikimedia.org/r/838169 (https://phabricator.wikimedia.org/T284628) (owner: JMeybohm)
[09:17:30] (PS13) Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595)
[09:20:05] !log restarting blazegraph on wdqs1014 (BlazegraphFreeAllocatorsDecreasingRapidly)
[09:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:54] !log upgrading ganeti/eqiad nodes to Ganeti 3 T311687
[09:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:58] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[09:23:38] SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[09:24:21] SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero) The problem described on {T319300} may block work on some servers, but we have plenty of others to migrate, so we should have enough work to do.
[09:25:41] (PS1) Hoo man: Disable UnconnectedPagePagePropMigrationLegacyFormat for ruwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/838746
[09:26:05] (CR) Hoo man: [C: +2] Disable UnconnectedPagePagePropMigrationLegacyFormat for ruwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/838746 (owner: Hoo man)
[09:26:51] (Merged) jenkins-bot: Disable UnconnectedPagePagePropMigrationLegacyFormat for ruwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/838746 (owner: Hoo man)
[09:27:42] PROBLEM - Host ps1-oe14-esams is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:42] PROBLEM - Host ps1-oe16-esams is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:30] PROBLEM - Host cp3052.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:30] PROBLEM - Host cp3051.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:30] PROBLEM - Host cp3053.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:32] PROBLEM - Host cp3054.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:32] PROBLEM - Host cp3050.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:32] PROBLEM - Host cp3065.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:36] oups
[09:28:44] RECOVERY - Host ps1-oe14-esams is UP: PING OK - Packet loss = 0%, RTA = 81.99 ms
[09:29:18] RECOVERY - Host ps1-oe16-esams is UP: PING OK - Packet loss = 0%, RTA = 81.98 ms
[09:30:20] PROBLEM - Host scs-oe16-esams is DOWN: PING CRITICAL - Packet loss = 100%
[09:30:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:31:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[09:31:11] !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable UnconnectedPagePagePropMigrationLegacyFormat for ruwikinews (duration: 03m 39s)
[09:31:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:31:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:32:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:32:58] RECOVERY - Host cp3065.mgmt is UP: PING OK - Packet loss = 0%, RTA = 81.69 ms
[09:33:06] SRE, Cloud-Services, Infrastructure-Foundations, netops: Routing loop for unused WMCS IPs in 185.15.56.0/24 - https://phabricator.wikimedia.org/T315956 (cmooney) Open→Resolved
[09:34:21] looking
[09:34:25] looks like mgmt issue
[09:34:30] cc topranks
[09:34:31] SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[09:34:40] seems so at first sight
[09:34:54] RECOVERY - Host cp3052.mgmt is UP: PING OK - Packet loss = 0%, RTA = 81.66 ms
[09:34:54] RECOVERY - Host cp3051.mgmt is UP: PING OK - Packet loss = 0%, RTA = 81.62 ms
[09:34:54] RECOVERY - Host cp3053.mgmt is UP: PING OK - Packet loss = 0%, RTA = 81.63 ms
[09:34:56] RECOVERY - Host cp3054.mgmt is UP: PING OK - Packet loss = 0%, RTA = 81.66 ms
[09:34:56] RECOVERY - Host cp3050.mgmt is UP: PING OK - Packet loss = 0%, RTA = 81.49 ms
[09:35:15] can someone check the calendar and maint-announce?
[09:35:20] looking
[09:35:50] SRE, Infrastructure-Foundations, netops: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (cmooney) a:cmooney Thanks @ayounsi. Yeah 9216 was default max I had used for the VXLAN stuff originally, but 9192 is more than enough to support a 9,000 byte IP packet and allow for the VXLA...
[09:35:51] XioNoX: not right now
[09:36:13] did something just reboot?
[09:36:44] RECOVERY - Host scs-oe16-esams is UP: PING OK - Packet loss = 0%, RTA = 81.43 ms
[09:36:49] SRE, Infrastructure-Foundations, netops: Validate new (anycast) IPv6 /48 announcement being accepted by transits - https://phabricator.wikimedia.org/T301900 (cmooney) Open→Resolved Thanks @ayounsi. I didn't finish checking every single one but it was accepted by all our major transits and is...
[09:37:39] interfaces on mr1-esams to msw-oe14-esams:47 and msw-oe16-esams:47 flapped 8min ago, so most likely the msw rebooted
[09:39:56] yeah bit odd... but seems to be ok now pings solid
[09:40:31] https://www.irccloud.com/pastebin/ALQ4ejqU/
[09:40:44] that's quite the flaps on multiple circuits/racks
[09:41:24] so Feed X from rack oe14/16, and feed Y from oe15
[09:41:41] that's why the mgmt switch on oe15 didn't go down
[09:42:11] everything critical have dual power supplies, that's what saved us
[09:42:18] hmm... I wonder are those feeds mixed up in the cabling perhaps?
[09:42:55] i.e. on FPC 5 are the two feeds in the alternate sockets than the rest of the devices?
[09:45:26] (03PS1) 10Ayounsi: Network MTU check, remove 9216 from allowlist [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/838755 (https://phabricator.wikimedia.org/T315838) [09:46:31] could be too [09:47:58] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:49:15] also not all the hosts in the same rack alerted [09:51:36] !log Ran extensions/Wikibase/client/maintenance/PopulateUnexpectedUnconnectedPagePageProp.php for all of arwiki [09:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:04] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all of ruwikinews [09:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:07] (03PS10) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) [10:00:38] (03PS2) 10Btullis: Add a new production image for spark version 3.3.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) [10:01:32] (03PS3) 10Btullis: Add a new production image for spark version 3.3.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) [10:03:10] (03PS4) 10Btullis: Add a new production image for spark version 3.3.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) [10:08:43] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "The patch is correct, I just have a UX question but feel free to merge the patch and we can change behaviour later" [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [10:09:00] (JobUnavailable) firing: Reduced availability for job 
pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:09:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] reqconfig: Add a default for git_repo and ensure its a Path [software/conftool] - 10https://gerrit.wikimedia.org/r/805338 (owner: 10Jbond) [10:10:49] (03CR) 10Ayounsi: [C: 03+2] "Self merge as trivial" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/838755 (https://phabricator.wikimedia.org/T315838) (owner: 10Ayounsi) [10:11:48] (03Merged) 10jenkins-bot: Network MTU check, remove 9216 from allowlist [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/838755 (https://phabricator.wikimedia.org/T315838) (owner: 10Ayounsi) [10:11:51] (03Merged) 10jenkins-bot: reqconfig: Add a default for git_repo and ensure its a Path [software/conftool] - 10https://gerrit.wikimedia.org/r/805338 (owner: 10Jbond) [10:17:52] RECOVERY - Host ripe-atlas-esams is UP: PING OK - Packet loss = 0%, RTA = 81.21 ms [10:19:40] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan) [10:20:53] (03CR) 10JMeybohm: [C: 03+1] admin: add thumbor namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824473 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [10:21:40] RECOVERY - Host ripe-atlas-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 81.74 ms [10:23:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [10:25:20] (03PS1) 10Jbond: 
P:openstack::base::neutron: dont use the legacy naming [puppet] - 10https://gerrit.wikimedia.org/r/838761 [10:26:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37447/console" [puppet] - 10https://gerrit.wikimedia.org/r/838761 (owner: 10Jbond) [10:27:48] 10SRE, 10Traffic, 10Upstream: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 (10Vgutierrez) ATS is supposed to perform a cache_sync_dir every 60 seconds per the undocumented config setting `proxy.config.cache.dir.sync_... [10:28:45] (03PS1) 10Hnowlan: changeprop: remove remaining blocklist entries [deployment-charts] - 10https://gerrit.wikimedia.org/r/838762 (https://phabricator.wikimedia.org/T274359) [10:30:58] (03PS2) 10Jbond: P:openstack::base::neutron: dont use the legacy naming [puppet] - 10https://gerrit.wikimedia.org/r/838761 [10:31:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37448/console" [puppet] - 10https://gerrit.wikimedia.org/r/838761 (owner: 10Jbond) [10:32:59] 10SRE, 10Traffic, 10Performance-Team (Radar), 10Upstream: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 (10Vgutierrez) [10:34:12] PROBLEM - Host cp2036 is DOWN: PING CRITICAL - Packet loss = 100% [10:34:44] (03PS1) 10Hoo man: Disable UnconnectedPagePagePropMigrationLegacyFormat for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838764 [10:35:11] (03PS3) 10Jbond: P:openstack::base::neutron: dont use the legacy naming [puppet] - 10https://gerrit.wikimedia.org/r/838761 [10:35:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37449/console" [puppet] - 10https://gerrit.wikimedia.org/r/838761 (owner: 10Jbond) 
[10:36:23] !log installing gdk-pixbuf security updates [10:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:30] (03PS4) 10Jbond: P:openstack::base::neutron: dont use the legacy naming [puppet] - 10https://gerrit.wikimedia.org/r/838761 [10:38:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37450/console" [puppet] - 10https://gerrit.wikimedia.org/r/838761 (owner: 10Jbond) [10:38:32] (03CR) 10Hoo man: [C: 03+2] Disable UnconnectedPagePagePropMigrationLegacyFormat for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838764 (owner: 10Hoo man) [10:39:49] (03Merged) 10jenkins-bot: Disable UnconnectedPagePagePropMigrationLegacyFormat for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838764 (owner: 10Hoo man) [10:42:02] 10SRE, 10Product-Infrastructure-Team-Backlog, 10WMDE-TechWish-Maintenance, 10serviceops, and 3 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (10awight) [10:43:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:44:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:44:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:44:17] uh... we lost cp2036? 
[10:44:37] !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable UnconnectedPagePagePropMigrationLegacyFormat for commonswiki (duration: 03m 51s) [10:44:50] (03PS5) 10Jbond: P:openstack::base::neutron: dont use the legacy naming [puppet] - 10https://gerrit.wikimedia.org/r/838761 (https://phabricator.wikimedia.org/T319300) [10:45:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:45:23] (03CR) 10Hnowlan: admin: add thumbor namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824473 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [10:46:26] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for commonswiki [10:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "genius" [puppet] - 10https://gerrit.wikimedia.org/r/838761 (https://phabricator.wikimedia.org/T319300) (owner: 10Jbond) [10:48:29] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2036.codfw.wmnet [10:50:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:51:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:51:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:51:10] (03PS4) 10Filippo Giunchedi: confd: export template status as Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) [10:51:12] (03PS4) 10Filippo Giunchedi: confd: install and run confd_prometheus_metrics [puppet] - 10https://gerrit.wikimedia.org/r/838079 (https://phabricator.wikimedia.org/T319272) [10:51:44] (03CR) 10CI reject: [V: 04-1] confd: export template status as Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) 
(owner: 10Filippo Giunchedi) [10:52:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:53:02] !log powercycle cp2036 - T319394 [10:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:06] T319394: cp2036 crashed on 2022-10-05 - https://phabricator.wikimedia.org/T319394 [10:53:57] (03CR) 10Jbond: [C: 03+2] P:openstack::base::neutron: dont use the legacy naming [puppet] - 10https://gerrit.wikimedia.org/r/838761 (https://phabricator.wikimedia.org/T319300) (owner: 10Jbond) [10:56:02] RECOVERY - Host cp2036 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms [10:59:00] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [11:01:25] !log repool cp2036 - T319394 [11:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:30] T319394: cp2036 crashed on 2022-10-05 - https://phabricator.wikimedia.org/T319394 [11:04:13] !log running "gnt-cluster upgrade --to 3.0" for ganeti/eqiad T311687 [11:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:18] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [11:04:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10cmooney) @Jclark-ctr there is a discrepancy with the port allocation here. Apologies I'd been working on some input validation in Netbox to prevent thi... 
[11:05:02] if eqsin mgmt alert it's because of me [11:06:24] looks like it didn't :) [11:06:35] (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron: l3_agent: more support for new vlan naming [puppet] - 10https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300) [11:06:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Vgutierrez) @wiki_willy could you help us prioritizing the remaining work on eqiad? this needs to be fixed ASAP [11:09:29] (03PS2) 10Arturo Borrero Gonzalez: openstack: neutron: l3_agent: more support for new vlan naming [puppet] - 10https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300) [11:09:54] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqiad_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:36] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/37453/" [puppet] - 10https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300) (owner: 10Arturo Borrero Gonzalez) [11:10:52] (03CR) 10Jbond: [C: 04-1] "lgtm but small clean up still needed" [puppet] - 10https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300) (owner: 10Arturo Borrero Gonzalez) [11:11:02] PROBLEM - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:11:59] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "Actually, found a problem. 
The sysctl expects interface/vlan syntax rather than interface.vlan, so need an additional consideration for th" [puppet] - 10https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300) (owner: 10Arturo Borrero Gonzalez) [11:12:02] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [11:15:15] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [11:16:22] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:18:34] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1006 is CRITICAL: CRITICAL: no netns defined? https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:20:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1029.eqiad.wmnet with OS bullseye [11:21:46] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:05] (03PS3) 10Arturo Borrero Gonzalez: openstack: neutron: l3_agent: more support for new vlan naming [puppet] - 10https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300) [11:22:24] RECOVERY - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:24:19] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "This PCC is better: https://puppet-compiler.wmflabs.org/pcc-worker1003/37454/" [puppet] - 10https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300) (owner: 10Arturo Borrero Gonzalez) [11:26:13] (03PS13) 10Jbond: reqconfig: add ip validation for 
ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) [11:27:20] (03CR) 10Arturo Borrero Gonzalez: P:terraform: add a new basic terraform module registry (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/834344 (https://phabricator.wikimedia.org/T317480) (owner: 10Majavah) [11:28:03] (03CR) 10Jbond: "updated" [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [11:28:33] (03CR) 10CI reject: [V: 04-1] reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [11:29:07] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300) (owner: 10Arturo Borrero Gonzalez) [11:29:45] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] openstack: neutron: l3_agent: more support for new vlan naming [puppet] - 10https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300) (owner: 10Arturo Borrero Gonzalez) [11:33:30] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1029.eqiad.wmnet with reason: host reimage [11:33:37] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1005.eqiad.wmnet with OS bullseye [11:33:38] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [11:33:41] (03PS14) 10Jbond: reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) [11:37:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1029.eqiad.wmnet with reason: host reimage [11:38:35] (03PS1) 10Arturo Borrero Gonzalez: cloudnet1006: don't use legacy naming for vlan NICs [puppet] - 10https://gerrit.wikimedia.org/r/838786 
(https://phabricator.wikimedia.org/T319300) [11:41:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/838786 (https://phabricator.wikimedia.org/T319300) (owner: 10Arturo Borrero Gonzalez) [11:42:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudnet1006: don't use legacy naming for vlan NICs [puppet] - 10https://gerrit.wikimedia.org/r/838786 (https://phabricator.wikimedia.org/T319300) (owner: 10Arturo Borrero Gonzalez) [11:47:44] (03PS7) 10Majavah: P:terraform: add a new basic terraform module registry [puppet] - 10https://gerrit.wikimedia.org/r/834344 (https://phabricator.wikimedia.org/T317480) [11:49:08] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1005.eqiad.wmnet with reason: host reimage [11:49:22] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage [11:49:48] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:49:48] (03CR) 10Majavah: P:terraform: add a new basic terraform module registry (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/834344 (https://phabricator.wikimedia.org/T317480) (owner: 10Majavah) [11:50:37] (03PS30) 10Jbond: C:varnish: Rate limit hotlinking dry-run [puppet] - 10https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) [11:51:07] (03CR) 10CI reject: [V: 04-1] C:varnish: Rate limit hotlinking dry-run [puppet] - 10https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond) [11:52:03] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1005.eqiad.wmnet with reason: host reimage [11:52:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1029.eqiad.wmnet with OS bullseye [11:53:23] !log fix MTU between eqiad core routers and 
cloudsw - T315838 [11:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:27] T315838: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 [11:54:44] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage [11:57:16] (03PS5) 10Filippo Giunchedi: confd: export template status as Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) [11:57:18] (03PS5) 10Filippo Giunchedi: confd: install and run confd_prometheus_metrics [puppet] - 10https://gerrit.wikimedia.org/r/838079 (https://phabricator.wikimedia.org/T319272) [12:02:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1030.eqiad.wmnet with OS bullseye [12:05:21] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (10Aklapper) [12:06:18] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering: Update librsvg to ≥2.42.3 (2.44.10) - https://phabricator.wikimedia.org/T193352 (10Aklapper) [12:06:50] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10Volans) p:05Triage→03Medium [12:07:28] (03PS1) 10Hoo man: Disable UnconnectedPagePagePropMigrationLegacyFormat for enwiki/zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838788 [12:10:18] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for MHorsey - https://phabricator.wikimedia.org/T318729 (10MHorsey-WMF) [12:13:58] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet1005.eqiad.wmnet with OS bullseye [12:15:15] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1030.eqiad.wmnet with reason: host reimage [12:16:41] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host 
cloudnet1006.eqiad.wmnet with OS bullseye [12:18:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1030.eqiad.wmnet with reason: host reimage [12:20:44] (03CR) 10Hoo man: [C: 03+2] Disable UnconnectedPagePagePropMigrationLegacyFormat for enwiki/zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838788 (owner: 10Hoo man) [12:21:28] (03Merged) 10jenkins-bot: Disable UnconnectedPagePagePropMigrationLegacyFormat for enwiki/zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838788 (owner: 10Hoo man) [12:28:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:28:49] !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable UnconnectedPagePagePropMigrationLegacyFormat for enwiki/zhwiki (duration: 03m 46s) [12:30:04] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [12:31:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1030.eqiad.wmnet with OS bullseye [12:32:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:32:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:33:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:37:46] 10SRE, 10Infrastructure-Foundations: Create an IDM for Wikimedia developer accounts - https://phabricator.wikimedia.org/T319405 (10MoritzMuehlenhoff) [12:40:00] 10SRE, 10Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (10MoritzMuehlenhoff) [12:41:34] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for enwiki [12:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team 
(Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [12:43:34] 10SRE, 10Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (10MoritzMuehlenhoff) [12:43:35] (03PS1) 10Arturo Borrero Gonzalez: cloudnet1003: decom host [puppet] - 10https://gerrit.wikimedia.org/r/838793 (https://phabricator.wikimedia.org/T316284) [12:45:42] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage [12:46:16] 10SRE, 10Infrastructure-Foundations: Initial Django project setup - https://phabricator.wikimedia.org/T319410 (10MoritzMuehlenhoff) [12:46:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1031.eqiad.wmnet with OS bullseye [12:47:00] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1031.eqiad.wmnet with OS bullseye [12:47:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1031.eqiad.wmnet with OS bullseye [12:48:39] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage [12:50:30] (03CR) 10Samtar: [C: 03+2] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/838762 (https://phabricator.wikimedia.org/T274359) (owner: 10Hnowlan) [12:52:59] (03CR) 10Filippo Giunchedi: "LGTM (though I haven't verified the same commits / system options are applied to opensearch and logstash). Adding Cole" [puppet] - 10https://gerrit.wikimedia.org/r/838253 (owner: 10Ryan Kemper) [12:53:31] hnowlan: https://gerrit.wikimedia.org/r/838762 will need manual deploying, right? Probably should have asked before +2ing.. 
[12:54:01] (03CR) 10Filippo Giunchedi: "LGTM (not voting though as I'm not sure enough)" [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319021) (owner: 10Bking) [12:54:10] (03Merged) 10jenkins-bot: changeprop: remove remaining blocklist entries [deployment-charts] - 10https://gerrit.wikimedia.org/r/838762 (https://phabricator.wikimedia.org/T274359) (owner: 10Hnowlan) [12:59:43] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1031.eqiad.wmnet with reason: host reimage [12:59:50] !log vgutierrez@apt1001:~$ sudo -i reprepro --component thirdparty/haproxy24 update buster-wikimedia # fetch HAProxy 2.4.19 [12:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:02] 10SRE, 10Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (10Volans) > A tentative initial name is Charon FYI It seems that's taken already in [[ https://pypi.org/search/?q=charon | PyPI ]] and there are similar ones in [[ https://packages.debian.org/search?keywor... [13:00:04] 10SRE, 10Infrastructure-Foundations: Evaluate Striker codebase - https://phabricator.wikimedia.org/T319415 (10MoritzMuehlenhoff) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221005T1300). [13:00:04] No Gerrit patches in the queue for this window AFAICS. [13:00:17] !log test HAProxy 2.4.19 in cp4026 && cp4032 [13:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:58] o/ [13:01:48] (03CR) 10Filippo Giunchedi: ats: Alert on high connection/request count (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [13:03:09] (03PS14) 10Slyngshede: Initial checkin. 
User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) [13:03:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1031.eqiad.wmnet with reason: host reimage [13:03:30] (03CR) 10Vgutierrez: "please note that we are no longer using ATS 8.x in production" [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [13:04:08] (looks like nothing to deploy indeed) [13:04:15] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for zhwiki [13:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:06] (03PS1) 10Filippo Giunchedi: Check annotations in alerting rules only [alerts] - 10https://gerrit.wikimedia.org/r/838797 [13:07:28] !log draining ganeti1012 T311687 [13:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:32] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [13:08:18] (03PS1) 10David Caro: ceph.wait_for_cluster_healthy: add elapsed time too [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/838799 (https://phabricator.wikimedia.org/T315339) [13:14:24] (03PS1) 10Hoo man: Disable UnconnectedPagePagePropMigrationLegacyFormat for enwiktionary/frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838801 [13:14:46] (03CR) 10Hoo man: [C: 03+2] Disable UnconnectedPagePagePropMigrationLegacyFormat for enwiktionary/frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838801 (owner: 10Hoo man) [13:15:41] (03Merged) 10jenkins-bot: Disable UnconnectedPagePagePropMigrationLegacyFormat for enwiktionary/frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838801 (owner: 10Hoo man) [13:18:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1031.eqiad.wmnet 
with OS bullseye [13:18:32] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:18:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:19:16] (03PS2) 10Giuseppe Lavagetto: mediawiki::php: use only php 7.4 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/838084 (https://phabricator.wikimedia.org/T318894) [13:19:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:19:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:19:58] (03CR) 10CI reject: [V: 04-1] mediawiki::php: use only php 7.4 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/838084 (https://phabricator.wikimedia.org/T318894) (owner: 10Giuseppe Lavagetto) [13:20:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:21:17] !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable UnconnectedPagePagePropMigrationLegacyFormat for enwiktionary/frwiki (duration: 03m 38s) [13:22:19] !log deploying fix for projectview dags on airflow [13:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:17] (03PS3) 10Giuseppe Lavagetto: mediawiki::php: use only php 7.4 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/838084 (https://phabricator.wikimedia.org/T318894) [13:23:42] (03PS31) 10Jbond: C:varnish: Rate limit hotlinking dry-run [puppet] - 10https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) [13:23:44] (03PS3) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/832621 (https://phabricator.wikimedia.org/T317799) [13:24:11] (03CR) 10CI reject: [V: 04-1] C:varnish: Rate limit hotlinking dry-run [puppet] - 10https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond) 
[13:24:27] (03CR) 10CI reject: [V: 04-1] C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/832621 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond) [13:25:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:26:09] (03CR) 10Jbond: C:varnish: Rate limit hotlinking dry-run (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond) [13:26:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10Jclark-ctr) @cmooney sorry Dac was not seated completely. all good now [13:26:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10fnegri) [13:26:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:26:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:27:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:30:34] 10SRE, 10Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (10fgiunchedi) > A tentative initial name is Charon, but we're happy to solicit further feedback via this task or the talk page of https://wikitech.wikimedia.org/wiki/Wikimedia_IDM Agreed with @volans re:... 
[13:32:38] TheresNoTime: no worries, I can handle it soon :) [13:33:55] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 16): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37456/console" [puppet] - 10https://gerrit.wikimedia.org/r/838084 (https://phabricator.wikimedia.org/T318894) (owner: 10Giuseppe Lavagetto) [13:36:54] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@f7a68c2]: (no justification provided) [13:37:06] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@f7a68c2]: (no justification provided) (duration: 00m 12s) [13:45:46] 10SRE, 10Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (10SLyngshede-WMF) There's also a Norse version https://en.wikipedia.org/wiki/M%C3%B3%C3%B0gu%C3%B0r [13:47:02] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (10Arnoldokoth) Hey @greg Yeah, Lisa G is your manager (confirmed on Namely). So will need approval from her (@Lgruwell-WMF ) as well as @Ottomata or @odimitrijevic [13:47:56] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (10Arnoldokoth) [13:49:12] 10SRE, 10SRE-Access-Requests: Remove old production ssh key for RelEng user - https://phabricator.wikimedia.org/T319274 (10Arnoldokoth) 05In progress→03Resolved [13:52:08] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) re: `MediaWiki EtcdConfig up-to-date` over the last 90d we got ~10 floods of varying intensity, ranging... 
[13:52:11] (03PS5) 10Btullis: Add a new production image for spark version 3.3.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) [13:52:14] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) [13:52:34] (03PS1) 10Jbond: P:bird::anycast: drop dependency [puppet] - 10https://gerrit.wikimedia.org/r/838804 [13:52:53] (03CR) 10Btullis: "It's worth noting that the upstream Dockerfile, on which this is based, has some additional steps that I have not included here, relating " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [13:53:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37457/console" [puppet] - 10https://gerrit.wikimedia.org/r/838804 (owner: 10Jbond) [13:55:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1032.eqiad.wmnet with OS bullseye [13:56:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (10Ottomata) Approved. [13:57:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (10Ottomata) Approved. 
[14:02:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (10Arnoldokoth) [14:03:41] (03PS2) 10Jbond: P:bird::anycast: drop dependency [puppet] - 10https://gerrit.wikimedia.org/r/838804 [14:04:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37458/console" [puppet] - 10https://gerrit.wikimedia.org/r/838804 (owner: 10Jbond) [14:05:38] !log mforns@deploy1002 Started deploy [analytics/refinery@7e16d2a]: Regular analytics weekly train [analytics/refinery@7e16d2a] [14:06:50] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:07:02] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:08:02] <_joe_> jouncebot: now and next [14:08:03] No deployments scheduled for the next 3 hour(s) and 51 minute(s) [14:08:07] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1032.eqiad.wmnet with reason: host reimage [14:09:00] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:33] ^ yeah, known [14:09:48] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::php: use only php 7.4 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/838084 (https://phabricator.wikimedia.org/T318894) (owner: 10Giuseppe Lavagetto) [14:09:56] (03CR) 10Ottomata: "Cool, ty!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/836181 (owner: 10Muehlenhoff) [14:11:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1032.eqiad.wmnet with reason: host reimage [14:13:51] (03CR) 10JMeybohm: [C: 04-1] "Sorry 😇" [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:15:11] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [14:15:14] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet1006.eqiad.wmnet with OS bullseye [14:15:28] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Jclark-ctr) [14:16:05] !log mforns@deploy1002 Finished deploy [analytics/refinery@7e16d2a]: Regular analytics weekly train [analytics/refinery@7e16d2a] (duration: 10m 27s) [14:16:22] !log mforns@deploy1002 Started deploy [analytics/refinery@7e16d2a] (thin): Regular analytics weekly train THIN [analytics/refinery@7e16d2a] [14:17:01] (03CR) 10JMeybohm: [C: 04-1] thumbor: new service chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:20:47] !log mforns@deploy1002 Finished deploy [analytics/refinery@7e16d2a] (thin): Regular analytics weekly train THIN [analytics/refinery@7e16d2a] (duration: 04m 24s) [14:23:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:23:33] (03PS2) 10Andrew Bogott: Make 
cloudnet100[56] into cloudnet nodes [puppet] - 10https://gerrit.wikimedia.org/r/835657 (https://phabricator.wikimedia.org/T316284) [14:23:37] (03CR) 10JMeybohm: [C: 04-1] thumbor: new service chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:26:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1032.eqiad.wmnet with OS bullseye [14:30:23] !log ongoing maintenance on msw1-eqiad [14:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:14] PROBLEM - Check systemd state on mw1434 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:42] _joe_: ^^ can I assume this is a race condition? [14:32:07] between the removal of the php7.2-fpm_check_restart and the icinga checks [14:34:25] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 8359 [14:34:56] PROBLEM - Check systemd state on mw2290 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:12] <_joe_> volans: yes [14:35:19] <_joe_> I hoped I would be fast enough to avoid that [14:35:20] (03PS3) 10Jbond: P:bird::anycast: drop dependency [puppet] - 10https://gerrit.wikimedia.org/r/838804 [14:35:36] (03PS2) 10Giuseppe Lavagetto: mediawiki::php: remove php 7.2 from the servers [puppet] - 10https://gerrit.wikimedia.org/r/838085 (https://phabricator.wikimedia.org/T318894) [14:36:26] <_joe_> volans: no it's the timer that is still triggered even if you undeclare it [14:36:31] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8359 [14:36:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37459/console" [puppet] - 10https://gerrit.wikimedia.org/r/838804 (owner: 10Jbond) [14:36:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php: remove php 7.2 from the servers [puppet] - 10https://gerrit.wikimedia.org/r/838085 (https://phabricator.wikimedia.org/T318894) (owner: 10Giuseppe Lavagetto) [14:36:49] ack, got it, thx [14:36:53] PROBLEM - Host asw2-d-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:36:57] RECOVERY - Check nf_conntrack usage in neutron netns on cloudnet1005 is OK: OK: everything is apparently fine https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:37:35] PROBLEM - Host ps1-e4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:37:35] PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:37:35] PROBLEM - Host ps1-e3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:37:41] PROBLEM - Host ps1-e2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:37:57] oh oh... XioNoX, topranks any work there related to these alerts? 
^^^ [14:38:09] volans: me [14:38:11] volans: yes [14:38:15] PROBLEM - Host ps1-e1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:38:30] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cloudnet1003.eqiad.wmnet with reason: decom [14:38:44] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cloudnet1003.eqiad.wmnet with reason: decom [14:38:51] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cloudnet1004.eqiad.wmnet with reason: decom [14:38:53] yeah the ps I expected [14:38:53] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:39:05] what I didn't expect were the asw2-d-eqiad / asw2-c-eqiad [14:39:05] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cloudnet1004.eqiad.wmnet with reason: decom [14:39:28] volans: it's their mgmt interfaces, those devices are L2 only [14:39:33] PROBLEM - Host asw2-b-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:39:33] PROBLEM - Host asw2-a-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:40:20] right, but looked scarier than it is [14:40:45] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:38:01] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:39:42] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/838852 (owner: 10RLazarus) [16:43:32] (03CR) 10Clément Goubert: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/838852 (owner: 10RLazarus) [16:44:02] RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 
2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:44:40] (03CR) 10RLazarus: [V: 03+1 C: 03+2] cumin2002: Add an hourly httpbb run against mw2271 [puppet] - 10https://gerrit.wikimedia.org/r/838852 (owner: 10RLazarus) [16:47:05] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10BTullis) Thanks for all your work on this @Andrew. I'm going to do a fleet-wide check to see if anything still references t... [16:47:50] jouncebot: now [16:47:50] No deployments scheduled for the next 1 hour(s) and 12 minute(s) [16:48:42] (03CR) 10Clare Ming: [C: 03+2] Enable Special:Contribute on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838281 (https://phabricator.wikimedia.org/T319240) (owner: 10Jdlrobson) [16:49:28] (03Merged) 10jenkins-bot: Enable Special:Contribute on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838281 (https://phabricator.wikimedia.org/T319240) (owner: 10Jdlrobson) [16:51:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838281 (https://phabricator.wikimedia.org/T319240) (owner: 10Jdlrobson) [16:53:57] !log deployed labs-only config [16:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:55:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:55:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:55:59] (03PS1) 10Btullis: Add a spark-on-k8s-operator production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) [16:56:57] !log mwdebug-deploy@deploy1002 
helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:57:36] (03PS2) 10Btullis: Add a spark-on-k8s-operator production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) [17:00:36] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10dcaro) [17:01:38] (03PS3) 10Btullis: Add a spark-operator production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) [17:01:48] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10ArielGlenn) Note that labstore1006 has some html dumps that didn't make it around to the other boxes, so please don't reimag... [17:03:09] (03CR) 10David Caro: [C: 03+2] ceph.wait_for_cluster_healthy: add elapsed time too [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/838799 (https://phabricator.wikimedia.org/T315339) (owner: 10David Caro) [17:04:17] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10dcaro) [17:04:39] RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns4003 is OK: OK: UP (pid=23976) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [17:05:04] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) the 40G port in cr1-eqiad to connect asw-c2 and asw-d2 are reqady ` papaul@re0.cr1-eqiad> show interfaces terse | match et-1/1/ et-1/1/0... 
[17:06:33] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 261, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:07:02] (03Merged) 10jenkins-bot: ceph.wait_for_cluster_healthy: add elapsed time too [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/838799 (https://phabricator.wikimedia.org/T315339) (owner: 10David Caro) [17:12:38] (03PS1) 10Dduvall: jwt_authorizer: Start service as configured owner/group [puppet] - 10https://gerrit.wikimedia.org/r/838859 (https://phabricator.wikimedia.org/T308501) [17:12:55] !log mforns@deploy1002 Started deploy [analytics/refinery@7e16d2a] (thin): Regular analytics weekly train THIN [analytics/refinery@7e16d2a] [17:14:43] (03CR) 10Ahmon Dancy: [C: 03+1] jwt_authorizer: Start service as configured owner/group [puppet] - 10https://gerrit.wikimedia.org/r/838859 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [17:17:19] !log mforns@deploy1002 Finished deploy [analytics/refinery@7e16d2a] (thin): Regular analytics weekly train THIN [analytics/refinery@7e16d2a] (duration: 04m 24s) [17:18:09] !log mforns@deploy1002 Started deploy [analytics/refinery@7e16d2a] (thin): Regular analytics weekly train THIN [analytics/refinery@7e16d2a] [17:18:16] (ThanosSidecarPrometheusDown) firing: Thanos Sidecar cannot connect to Prometheus - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarPrometheusDown [17:18:28] !log mforns@deploy1002 Finished deploy [analytics/refinery@7e16d2a] (thin): Regular analytics weekly train THIN [analytics/refinery@7e16d2a] (duration: 00m 18s) [17:18:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [17:20:21] !log mforns@deploy1002 Started deploy [analytics/refinery@7e16d2a] (thin): Regular analytics weekly train THIN [analytics/refinery@7e16d2a] [17:20:36] !log mforns@deploy1002 Finished deploy [analytics/refinery@7e16d2a] (thin): Regular analytics weekly train THIN [analytics/refinery@7e16d2a] (duration: 00m 14s) [17:22:11] RECOVERY - Check systemd state on dns4003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:16] (ThanosSidecarPrometheusDown) resolved: Thanos Sidecar cannot connect to Prometheus - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarPrometheusDown [17:28:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [17:29:21] RECOVERY - Host ps1-a6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [17:29:45] RECOVERY - Host ganeti1026.mgmt is UP: PING WARNING - Packet loss = 77%, RTA = 0.91 ms [17:29:45] RECOVERY - Host ganeti1030.mgmt is UP: PING WARNING - Packet loss = 77%, RTA = 0.89 ms [17:29:47] RECOVERY - Host ganeti1032.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 866.07 ms [17:29:47] RECOVERY - Host ps1-a7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [17:30:01] RECOVERY - Host an-db1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 48.11 ms [17:30:01] RECOVERY - Host an-master1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 56.04 ms [17:30:07] RECOVERY - Host an-worker1082.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [17:30:07] RECOVERY - Host an-worker1081.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.93 ms [17:30:08] RECOVERY - Host an-worker1103.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.25 ms [17:30:08] RECOVERY - Host an-worker1122.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.49 ms [17:30:09] RECOVERY - Host an-worker1123.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.32 ms [17:30:10] RECOVERY - Host clouddb1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 14.03 ms [17:30:15] RECOVERY - Host asw2-a-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [17:30:25] RECOVERY - Host aqs1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.53 ms [17:30:25] RECOVERY - Host backup1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.28 ms [17:30:29] RECOVERY - Host cloudmetrics1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.87 ms [17:30:29] RECOVERY - Host cloudmetrics1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.82 ms [17:30:31] RECOVERY - Host cp1077.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.67 ms 
[17:30:31] RECOVERY - Host cp1078.mgmt is UP: PING OK - Packet loss = 0%, RTA = 13.70 ms [17:30:33] RECOVERY - Host ms-be1040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.57 ms [17:30:33] RECOVERY - Host db1154.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.54 ms [17:30:33] RECOVERY - Host db1159.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.27 ms [17:30:33] RECOVERY - Host db1160.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.14 ms [17:30:33] RECOVERY - Host elastic1070.mgmt is UP: PING OK - Packet loss = 0%, RTA = 8.81 ms [17:30:33] RECOVERY - Host elastic1073.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.51 ms [17:30:34] RECOVERY - Host elastic1071.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.52 ms [17:30:34] RECOVERY - Host elastic1072.mgmt is UP: PING OK - Packet loss = 0%, RTA = 15.10 ms [17:30:41] RECOVERY - Host ms-be1051.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [17:30:41] RECOVERY - Host ms-be1060.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [17:30:45] RECOVERY - Host ms-fe1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [17:30:45] RECOVERY - Host mw1309.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [17:30:45] RECOVERY - Host mw1307.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [17:30:45] RECOVERY - Host mw1308.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [17:30:45] RECOVERY - Host mw1310.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [17:30:45] RECOVERY - Host mw1312.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [17:30:46] RECOVERY - Host mw1311.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [17:30:47] RECOVERY - Host ores1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.79 ms [17:30:47] RECOVERY - Host parse1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [17:30:47] RECOVERY - Host parse1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [17:30:55] RECOVERY - Host puppetmaster1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 6.98 ms [17:30:55] 
RECOVERY - Host prometheus1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 8.73 ms [17:30:55] RECOVERY - Host restbase-dev1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.30 ms [17:30:57] RECOVERY - Host restbase1031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.80 ms [17:30:57] RECOVERY - Host restbase1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.44 ms [17:31:01] RECOVERY - Host stat1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 18.39 ms [17:31:07] RECOVERY - Host thumbor1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.44 ms [17:31:13] RECOVERY - Host wdqs1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.20 ms [17:31:17] RECOVERY - Host kafka-main1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [17:32:05] (03Abandoned) 10Ryan Kemper: Revert "elastic: reduce master-eligibles for codfw back down to 2" [puppet] - 10https://gerrit.wikimedia.org/r/830923 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking) [17:33:01] RECOVERY - Host an-worker1139.mgmt is UP: PING OK - Packet loss = 0%, RTA = 6.60 ms [17:33:53] RECOVERY - Host db1116.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms [17:34:33] RECOVERY - Host krb1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [17:34:33] RECOVERY - Host kubernetes1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.99 ms [17:34:45] RECOVERY - Host lvs1013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [17:34:45] RECOVERY - Host lvs1017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [17:34:55] RECOVERY - Host dbproxy1013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.54 ms [17:34:55] RECOVERY - Host druid1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms [17:34:55] RECOVERY - Host mc1037.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.42 ms [17:34:55] RECOVERY - Host mc1038.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [17:35:03] RECOVERY - Host ganeti1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.49 ms [17:35:15] RECOVERY - Host db1115.mgmt 
is UP: PING OK - Packet loss = 0%, RTA = 9.32 ms [17:35:15] RECOVERY - Host db1096.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.70 ms [17:35:19] RECOVERY - Host dbprov1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.37 ms [17:35:55] (03CR) 10Ryan Kemper: [C: 03+2] "Resolving comment to get this out of "your turn" UI on gerrit" [cookbooks] - 10https://gerrit.wikimedia.org/r/823704 (https://phabricator.wikimedia.org/T315360) (owner: 10Ryan Kemper) [17:36:26] (03CR) 10Ryan Kemper: [C: 03+2] ryankemper: add tmux, vim, zsh conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/834369 (owner: 10Ryan Kemper) [17:40:15] (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [17:41:20] 10SRE, 10observability, 10Patch-For-Review, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (10bd808) [17:42:59] PROBLEM - SSH on mw1312.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:43:05] RECOVERY - AuthDNS-over-TLS Works on dns4003 is OK: OK: ns[012] kdig DoTLS check success https://wikitech.wikimedia.org/wiki/DNS [17:43:29] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns4003.wikimedia.org with OS buster [17:43:36] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns4003.wikimedia.org with OS buster [17:45:15] (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - 
TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [17:46:53] PROBLEM - Host 2620:0:863:1:198:35:26:7 is DOWN: PING CRITICAL - Packet loss = 100% [17:47:01] 10SRE, 10Infrastructure-Foundations, 10netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (10cmooney) I will reboot this tomorrow morning, Oct 6th at 08:00 and we can take it from there. [17:47:33] 13:46:53 <+icinga-wm> PROBLEM - Host 2620:0:863:1:198:35:26:7 is DOWN: PING CRITICAL - Packet loss = 100% [17:47:44] would have expected the cookbook to downtime it anyway, this is expected [17:48:05] sukhe: that's a separate host in Icinga terms [17:48:12] the hostname is '2620:0:863:1:198:35:26:7' [17:48:16] ah! fair [17:48:23] but I don't remember seeing it last time [17:48:26] you can though run the downtime cookbook with the option [17:48:26] or maybe I didn't look close enough [17:48:54] --force (see -h/--help for the explanation) [17:49:10] volans: I am seeing double, can 100% be just me :P [17:49:45] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:50:09] yeah mostly this is an issue with how we define these things in icinga [17:50:12] ^ expected [17:50:32] we do it on the IP for a reason, but we could also be creating some kind of dependency link so that downtiming the host affects it [17:50:42] (in some cases like this, anyways) [17:51:41] PROBLEM - Recursive DNS on 198.35.26.7 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:51:49] PROBLEM - Host 2620:0:863:1:198:35:26:7 is DOWN: PING CRITICAL - Packet loss = 100% [17:52:05] RECOVERY - Host cloudvirt1033.mgmt is UP: PING OK - Packet loss = 
0%, RTA = 12.15 ms [17:52:05] RECOVERY - Host ps1-c8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.53 ms [17:52:05] RECOVERY - Host cloudsw2-c8-eqiad.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [17:52:13] RECOVERY - Host cloudcephosd1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.14 ms [17:52:13] RECOVERY - Host an-tool1010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.62 ms [17:52:14] RECOVERY - Host cloudcephosd1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.05 ms [17:52:19] RECOVERY - Host cloudgw1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.93 ms [17:52:19] RECOVERY - Host cloudvirt1032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.37 ms [17:52:25] RECOVERY - Host cloudsw1-c8-eqiad.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [17:53:51] RECOVERY - Host db1131.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.47 ms [17:54:13] RECOVERY - Host deploy1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.16 ms [17:54:45] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:55:35] RECOVERY - Host cloudvirt1031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [17:55:53] RECOVERY - Host cloudcephosd1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.65 ms [17:55:58] RECOVERY - Host cloudvirt1035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 10.72 ms [17:56:07] RECOVERY - Host cloudcephmon1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.35 ms [17:56:08] RECOVERY - Host cloudcephosd1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [17:56:08] RECOVERY - Host cloudbackup1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 8.12 ms [17:56:08] RECOVERY - Host cloudcephosd1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.54 ms [17:56:09] RECOVERY - Host cloudcephosd1004.mgmt is UP: PING OK 
- Packet loss = 0%, RTA = 5.79 ms [17:56:10] RECOVERY - Host elastic1059.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.70 ms [17:56:11] RECOVERY - Host cloudcephosd1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 13.16 ms [17:56:11] RECOVERY - Host cloudcephosd1016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.18 ms [17:56:13] RECOVERY - Host cloudcephosd1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.86 ms [17:56:14] RECOVERY - Host cloudcephosd1017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.93 ms [17:56:15] RECOVERY - Host cloudcephosd1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [17:56:15] RECOVERY - Host cloudvirt1025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.86 ms [17:56:16] RECOVERY - Host cloudvirt1027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 8.58 ms [17:56:18] RECOVERY - Host cloudvirt1026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [17:56:18] RECOVERY - Host cloudvirt1034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.86 ms [17:56:19] RECOVERY - Host cloudnet1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [17:56:23] RECOVERY - Host ganeti1012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [17:56:25] RECOVERY - Host mw1408.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [17:56:25] RECOVERY - Host mw1409.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [17:56:25] RECOVERY - Host mw1412.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [17:56:25] RECOVERY - Host mw1410.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [17:56:25] RECOVERY - Host mw1411.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [17:56:26] RECOVERY - Host mw1413.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [17:59:45] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:00:04] ^demon and brennen: OwO what's this, a deployment window?? Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221005T1800). nyaa~ [18:00:04] ^demon and brennen: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221005T1800). [18:01:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4003.wikimedia.org with reason: host reimage [18:05:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4003.wikimedia.org with reason: host reimage [18:05:57] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:07:52] o/ [18:08:11] (03CR) 10Andrew Bogott: [C: 03+2] wmcs haproxy: prepare for IP and user-agent blocking [puppet] - 10https://gerrit.wikimedia.org/r/838265 (https://phabricator.wikimedia.org/T319313) (owner: 10Andrew Bogott) [18:14:37] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (10greg) Approval came in via email: > Quick approval needed for analytics-private data access > > Lisa Seitz Gruwell Wed, Oct 5, 2022 at... [18:17:16] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10cmooney) @aborrero Just something I noticed, you may already be aware in which case ignore. I was testing out an updated puppet to netbox import script... [18:17:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Jclark-ctr) @Vgutierrez Would these 2 changes work for what is needed? 
If not we would have to order replacement cables longer lengths to r... [18:17:34] (03PS1) 10Andrew Bogott: haproxy: correct name of ip blocklist file [puppet] - 10https://gerrit.wikimedia.org/r/838867 (https://phabricator.wikimedia.org/T319313) [18:18:07] !log train 1.40.0-wmf.4 (T314193) no current blockers, rolling train to group1 [18:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:12] T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193 [18:18:17] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838868 (https://phabricator.wikimedia.org/T314193) [18:18:18] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838868 (https://phabricator.wikimedia.org/T314193) (owner: 10TrainBranchBot) [18:18:30] (03CR) 10Andrew Bogott: [C: 03+2] haproxy: correct name of ip blocklist file [puppet] - 10https://gerrit.wikimedia.org/r/838867 (https://phabricator.wikimedia.org/T319313) (owner: 10Andrew Bogott) [18:19:05] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838868 (https://phabricator.wikimedia.org/T314193) (owner: 10TrainBranchBot) [18:22:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:23:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:23:41] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.4 refs T314193 [18:23:44] T314193: 1.40.0-wmf.4 deployment blockers - 
https://phabricator.wikimedia.org/T314193 [18:23:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:23:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:24:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:27:22] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.4 refs T314193 (duration: 03m 40s) [18:29:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:30:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:30:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:31:01] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns4003.wikimedia.org with OS buster [18:31:11] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4003.wikimedia.org with OS buster completed: - dns4003 (... 
[18:31:27] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10Performance-Team, 10Platform Engineering, 10Traffic-Icebox: Separate Cache-Control header for proxy and client - https://phabricator.wikimedia.org/T50835 (10Krinkle) p:05Medium→03Low [18:31:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:35:37] (03PS3) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [18:43:20] RECOVERY - SSH on mw1312.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:43:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Vgutierrez) @jclark-ctr as long as both lvs1017 and lvs1020 don't get connectivity from the same switch on a single row is ok. So those look... 
[18:45:44] RECOVERY - Host elastic1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.96 ms [18:46:34] RECOVERY - Host an-presto1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.20 ms [18:46:38] RECOVERY - Host elastic1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [18:47:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10serviceops-collab: Q2:rack/setup/install webperf1005.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Dzahn) [18:52:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10RobH) [18:52:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10RobH) [18:56:04] (03PS4) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [19:00:00] (03PS1) 10Andrew Bogott: haproxy add (commented-out) debug log line [puppet] - 10https://gerrit.wikimedia.org/r/838874 (https://phabricator.wikimedia.org/T319313) [19:07:06] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:14:15] 10SRE, 10SRE-swift-storage, 10Commons, 10ConfirmEdit (CAPTCHA extension), and 5 others: Make SwiftFileBackend::doStoreInternal defer the opening of file handles to stay in the concurrency limit - https://phabricator.wikimedia.org/T230245 (10aaron) a:05aaron→03None [19:19:16] 10SRE, 10API Platform: Block non-browser requests that use generic agents - https://phabricator.wikimedia.org/T319423 (10daniel) [19:20:21] 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10Dzahn) team membership 
confirmed per https://www.mediawiki.org/wiki/Platform_Engineering_Team/Data_Value_Stream --- @xc... [19:20:58] 10SRE, 10API Platform: Block non-browser requests that use generic agents - https://phabricator.wikimedia.org/T319423 (10daniel) [19:25:08] 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10xcollazo) Thank you @Dzahn! ( Side note: I have confirmed that we can make the list public if we choose to move it to Go... [19:27:39] (03PS1) 10AOkoth: admin: add mhorsey to ldap users [puppet] - 10https://gerrit.wikimedia.org/r/838881 (https://phabricator.wikimedia.org/T318729) [19:27:43] (03PS1) 10Ssingh: wikimedia.org: update CNAME for ntp.ulsfo to dns4003 [dns] - 10https://gerrit.wikimedia.org/r/838882 (https://phabricator.wikimedia.org/T317247) [19:28:20] 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10Dzahn) @xcollazo ITS can create the group and then give admin ship to your team so that you can self-manage it. 
[19:30:18] (03PS5) 10BCornwall: ats: Alert on high connection/request count [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) [19:30:20] (03CR) 10Dzahn: [C: 03+1] "lgtm, confirmed in Namely, has manager approval, nitpick: add that it's for the wmf group and not other LDAP groups" [puppet] - 10https://gerrit.wikimedia.org/r/838881 (https://phabricator.wikimedia.org/T318729) (owner: 10AOkoth) [19:30:29] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for MHorsey - https://phabricator.wikimedia.org/T318729 (10Arnoldokoth) 05Open→03In progress p:05Triage→03Medium [19:30:44] (03CR) 10AOkoth: [C: 03+2] admin: add mhorsey to ldap users [puppet] - 10https://gerrit.wikimedia.org/r/838881 (https://phabricator.wikimedia.org/T318729) (owner: 10AOkoth) [19:32:05] (03CR) 10BCornwall: [C: 03+1] wikimedia.org: update CNAME for ntp.ulsfo to dns4003 [dns] - 10https://gerrit.wikimedia.org/r/838882 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [19:34:18] (03PS6) 10BCornwall: ats: Alert on high connection/request count [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) [19:36:00] 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10xcollazo) @Dzahn: we discussed moving the list today and there was concern on whether we could make the content of the li... [19:39:46] 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10Dzahn) @xcollazo There are 2 possible routes you can go. Both result in your team being able to self-manage the list. a)... 
[19:43:15] (03CR) 10Andrew Bogott: [C: 03+2] haproxy add (commented-out) debug log line [puppet] - 10https://gerrit.wikimedia.org/r/838874 (https://phabricator.wikimedia.org/T319313) (owner: 10Andrew Bogott) [19:43:30] 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10xcollazo) Ack @Dzahn, thank you for the context and options! Will discuss with team and get back to you. [19:46:53] (03PS1) 10BCornwall: prometheus: Remove ATS 8-specific metrics [puppet] - 10https://gerrit.wikimedia.org/r/838886 [19:47:15] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for MHorsey - https://phabricator.wikimedia.org/T318729 (10Arnoldokoth) ` aokoth@mwmaint1002:~$ ldapsearch -x cn=wmf | grep "mhorsey" member: uid=mhorsey,ou=people,dc=wikimedia,dc=org ` This is now resolved. Feel free to close the ticket @MHorsey-WMF [19:51:25] (03CR) 10BCornwall: ats: Alert on high connection/request count (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [19:54:26] (03PS1) 10Jdlrobson: Move horizontal padding from .mw-body to .mw-page-container, improve .mw-page-container styles [skins/Vector] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838818 (https://phabricator.wikimedia.org/T317573) [19:54:48] (03CR) 10Jdlrobson: [C: 04-1] Move horizontal padding from .mw-body to .mw-page-container, improve .mw-page-container styles [skins/Vector] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838818 (https://phabricator.wikimedia.org/T317573) (owner: 10Jdlrobson) [19:55:08] (03PS2) 10Jdlrobson: EXPECTED VISUAL CHANGES IN WMF.4 [skins/Vector] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838818 (https://phabricator.wikimedia.org/T317573) [19:56:27] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37463/registry2004.codfw.wmnet/index.html" 
[puppet] - 10https://gerrit.wikimedia.org/r/838859 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [19:56:55] !log reedy@deploy1002 Started deploy [integration/docroot@09eb565]: T319461 and cleanup [19:56:58] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for "Stef Dunlap" - https://phabricator.wikimedia.org/T318626 (10Arnoldokoth) 05Open→03In progress p:05Triage→03Medium [19:56:59] T319461: Add "last updated" timestamp to test coverage index pages - https://phabricator.wikimedia.org/T319461 [19:57:05] !log reedy@deploy1002 Finished deploy [integration/docroot@09eb565]: T319461 and cleanup (duration: 00m 10s) [20:00:05] RoanKattouw, Urbanecm, cjming, and TheresNoTime: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221005T2000). [20:00:05] danisztls and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:22] (03PS1) 10AOkoth: admin: add kindrobot to wmf group [puppet] - 10https://gerrit.wikimedia.org/r/838894 (https://phabricator.wikimedia.org/T318626) [20:00:38] I can deploy! [20:01:26] i don't see danisztls here? 
[20:01:28] (03CR) 10Dzahn: [V: 03+1] "[cumin2002:~] $ sudo cumin 'C:jwt_authorizer' 'date'" [puppet] - 10https://gerrit.wikimedia.org/r/838859 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [20:02:19] hi danisztls [20:02:19] o/ [20:02:27] urbanecm: hi [20:03:18] !log registry* (4 servers) - disabling puppet, deploying gerrit:838859 - T308501 [20:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:23] T308501: Authenticate trusted runners for registry access against GitLab using temporary JSON Web Token - https://phabricator.wikimedia.org/T308501 [20:03:32] (03CR) 10Dzahn: [V: 03+1 C: 03+2] jwt_authorizer: Start service as configured owner/group [puppet] - 10https://gerrit.wikimedia.org/r/838859 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [20:03:48] (03PS7) 10Urbanecm: Deploy Research Incentive survey on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834044 (https://phabricator.wikimedia.org/T318331) (owner: 10DDesouza) [20:03:51] mutante: ty :) [20:03:53] (03PS3) 10DDesouza: Remove Research Incentive survey from arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837695 (https://phabricator.wikimedia.org/T318328) [20:03:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834044 (https://phabricator.wikimedia.org/T318331) (owner: 10DDesouza) [20:03:57] yay! 
[20:04:42] (03Merged) 10jenkins-bot: Deploy Research Incentive survey on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834044 (https://phabricator.wikimedia.org/T318331) (owner: 10DDesouza) [20:05:05] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:834044|Deploy Research Incentive survey on eswiki (T318331)]] [20:05:10] T318331: Deploy Research Incentive Survey on Spanish Wikipedia - https://phabricator.wikimedia.org/T318331 [20:05:13] (03CR) 10BBlack: [C: 03+1] wikimedia.org: update CNAME for ntp.ulsfo to dns4003 [dns] - 10https://gerrit.wikimedia.org/r/838882 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [20:05:14] dduvall: dancy: deployed on registry1003.. now others.. in progress [20:05:31] !log urbanecm@deploy1002 urbanecm and dani: Backport for [[gerrit:834044|Deploy Research Incentive survey on eswiki (T318331)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [20:05:37] * dduvall holds breath [20:05:43] * dancy twitches [20:05:47] danisztls: your patch is at mwdebug1001 (and others), can you check? [20:05:53] urbanecm: yes [20:06:42] okay, waiting :) [20:07:04] urbanecm: eswiki good, arwiki still seeing survey [20:07:13] so, sync? :) [20:07:33] yes [20:07:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:07:54] okay, doing :) [20:08:12] dduvall: dancy: [20:08:13] (4) registry[2003-2004].codfw.wmnet,registry[1003-1004].eqiad.wmnet [20:08:16] ----- OUTPUT of 'ps aux | grep jw...| cut -f1 -d " "' ----- [20:08:19] www-data [20:08:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:08:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:08:42] mutante: nice. and the ownership of `/var/run/nginx-auth/jwt.sock`? [20:08:48] one step closer. 
[20:09:12] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "after deploying the process runs as www-data on 4 hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/838859 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [20:09:23] (03CR) 10CI reject: [V: 04-1] EXPECTED VISUAL CHANGES IN WMF.4 [skins/Vector] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838818 (https://phabricator.wikimedia.org/T317573) (owner: 10Jdlrobson) [20:09:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:10:24] dduvall: www-data www-data [20:10:46] Let the buildings begin!! [20:10:47] maybe [20:10:48] excellent! thank you [20:10:56] i see new errors :) [20:11:05] that's always good :) [20:11:10] dzahn, https://meet.google.com/sut-zxhw-jqy ? [20:11:12] i'll head over to #wikimedia-gitlab [20:11:25] (03PS4) 10Urbanecm: Remove Research Incentive survey from arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837695 (https://phabricator.wikimedia.org/T318328) (owner: 10DDesouza) [20:11:30] (03CR) 10Urbanecm: [C: 03+2] Remove Research Incentive survey from arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837695 (https://phabricator.wikimedia.org/T318328) (owner: 10DDesouza) [20:11:57] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:834044|Deploy Research Incentive survey on eswiki (T318331)]] (duration: 06m 51s) [20:12:02] T318331: Deploy Research Incentive Survey on Spanish Wikipedia - https://phabricator.wikimedia.org/T318331 [20:12:43] first patch's deployed [20:12:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837695 (https://phabricator.wikimedia.org/T318328) (owner: 10DDesouza) [20:13:42] (03Merged) 10jenkins-bot: Remove Research Incentive survey from arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837695 (https://phabricator.wikimedia.org/T318328) (owner: 10DDesouza) [20:14:05] !log 
urbanecm@deploy1002 Started scap: Backport for [[gerrit:837695|Remove Research Incentive survey from arwiki (T318328)]] [20:14:09] T318328: Deploy Research Incentive Survey on Arabic Wikipedia - https://phabricator.wikimedia.org/T318328 [20:14:28] !log urbanecm@deploy1002 urbanecm and dani: Backport for [[gerrit:837695|Remove Research Incentive survey from arwiki (T318328)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:14:40] (03CR) 10Dzahn: [C: 03+1] "looks good to me, confirmed in Namely" [puppet] - 10https://gerrit.wikimedia.org/r/838894 (https://phabricator.wikimedia.org/T318626) (owner: 10AOkoth) [20:14:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:14:48] danisztls: can you check the second patch please? [20:14:53] urbanecm: sure [20:15:03] (03CR) 10Jforrester: [C: 03+1] Add registry.gitlab.com/security-products/**/* as allowed images [puppet] - 10https://gerrit.wikimedia.org/r/838194 (https://phabricator.wikimedia.org/T312961) (owner: 10SBassett) [20:15:07] urbanecm: good [20:15:16] okay, syncing [20:15:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:15:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:16:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:19:16] !log reedy@deploy1002 Started deploy [integration/docroot@a136ce6]: More minor cleanup [20:19:19] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:837695|Remove Research Incentive survey from arwiki (T318328)]] (duration: 05m 13s) [20:19:24] T318328: Deploy Research Incentive Survey on Arabic Wikipedia - https://phabricator.wikimedia.org/T318328 [20:19:43] danisztls: and the other patch's done! [20:19:54] urbanecm: thanks! [20:19:56] ebernhardson: hi, if you want to self-service, feel free to go ahead! 
[20:19:59] !log reedy@deploy1002 Finished deploy [integration/docroot@a136ce6]: More minor cleanup (duration: 00m 42s) [20:20:04] i can also deploy for you if you want me to. [20:21:17] Did it though? [20:21:31] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add dns4003 to anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/838239 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [20:21:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:22:06] !log reedy@deploy1002 Started deploy [integration/docroot@a136ce6]: More minor cleanup [20:22:37] !log reedy@deploy1002 Finished deploy [integration/docroot@a136ce6]: More minor cleanup (duration: 00m 31s) [20:22:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:22:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:22:50] How did it finish if it rolled back? [20:22:55] (03Merged) 10jenkins-bot: sites.yaml: add dns4003 to anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/838239 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [20:23:55] !log reedy@deploy1002 Started deploy [integration/docroot@a136ce6]: More minor cleanup [20:24:01] !log reedy@deploy1002 Finished deploy [integration/docroot@a136ce6]: More minor cleanup (duration: 00m 06s) [20:25:02] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:25:23] !log homer "cr*-ulsfo*" commit "Gerrit 838239: sites.yaml: add dns4003 to anycast_neighbors" [20:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:53] (03CR) 10AOkoth: [C: 03+2] admin: add kindrobot to wmf group [puppet] - 10https://gerrit.wikimedia.org/r/838894 (https://phabricator.wikimedia.org/T318626) (owner: 10AOkoth) [20:26:34] !log reedy@deploy1002 Started deploy 
[integration/docroot@a136ce6]: More minor cleanup [20:26:36] (03CR) 10Ssingh: [C: 03+2] wikimedia.org: update CNAME for ntp.ulsfo to dns4003 [dns] - 10https://gerrit.wikimedia.org/r/838882 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [20:26:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:26:45] !log reedy@deploy1002 Finished deploy [integration/docroot@a136ce6]: More minor cleanup (duration: 00m 10s) [20:27:22] !log running authdns-update for CR 838882 [20:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:20] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for "Stef Dunlap" - https://phabricator.wikimedia.org/T318626 (10Arnoldokoth) ` aokoth@mwmaint1002:~$ ldapsearch -x cn=wmf | grep "kindrobot" member: uid=kindrobot,ou=people,dc=wikimedia,dc=org ` This is resolved now. Feel free to close... [20:30:58] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:31:24] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:31:46] It's still on scap/sync/2022-08-19/0001 [20:32:09] https://sal.toolforge.org/log/FLHtsYIBa_6PSCT9m3mW [20:32:18] !log reedy@deploy1002 Started deploy [integration/docroot@a136ce6]: More minor cleanup [20:32:31] Bah, wrong channel. 
[20:33:23] !log reedy@deploy1002 Finished deploy [integration/docroot@a136ce6]: More minor cleanup (duration: 01m 05s) [20:34:35] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (10Arnoldokoth) [20:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:38:28] (03PS1) 10AOkoth: admin: add sstefanova to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/838900 (https://phabricator.wikimedia.org/T318807) [20:38:55] (03PS2) 10AOkoth: admin: add sstefanova to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/838900 (https://phabricator.wikimedia.org/T318807) [20:40:57] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (10Arnoldokoth) [20:41:35] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (10Arnoldokoth) [20:45:09] (03PS1) 10AOkoth: admin: add gjg to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/838902 (https://phabricator.wikimedia.org/T318873) [20:45:47] (03CR) 10Dzahn: [C: 03+1] "lgtm, has approval on ticket from ottomata" [puppet] - 10https://gerrit.wikimedia.org/r/838900 (https://phabricator.wikimedia.org/T318807) (owner: 10AOkoth) [20:46:02] (03CR) 10AOkoth: [C: 03+2] admin: add sstefanova to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/838900 (https://phabricator.wikimedia.org/T318807) (owner: 10AOkoth) [20:48:11] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 
(10Arnoldokoth) @Slst2020 This is resolved. Feel free to close the ticket if everything is good on your end. [20:48:21] (03PS2) 10AOkoth: admin: add gjg to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/838902 (https://phabricator.wikimedia.org/T318873) [20:49:34] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/838902 (https://phabricator.wikimedia.org/T318873) (owner: 10AOkoth) [21:01:25] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Dzahn) [21:02:30] (03PS1) 10Andrew Bogott: Openstack Keystone: Expose the Keystone public API [puppet] - 10https://gerrit.wikimedia.org/r/838903 (https://phabricator.wikimedia.org/T319312) [21:02:32] (03PS1) 10Andrew Bogott: Openstack Nova: Expose the Nova public API [puppet] - 10https://gerrit.wikimedia.org/r/838904 (https://phabricator.wikimedia.org/T319312) [21:02:34] (03PS1) 10Andrew Bogott: Openstack Glance: Expose the Glance public API [puppet] - 10https://gerrit.wikimedia.org/r/838905 (https://phabricator.wikimedia.org/T319312) [21:02:36] (03PS1) 10Andrew Bogott: Openstack Cinder: Expose the Cinder public API [puppet] - 10https://gerrit.wikimedia.org/r/838906 (https://phabricator.wikimedia.org/T319312) [21:02:38] (03PS1) 10Andrew Bogott: Openstack Neutron: Expose the Neutron public API [puppet] - 10https://gerrit.wikimedia.org/r/838907 (https://phabricator.wikimedia.org/T319312) [21:02:42] (03PS1) 10Andrew Bogott: Openstack Designate: Expose the Designate public API [puppet] - 10https://gerrit.wikimedia.org/r/838908 (https://phabricator.wikimedia.org/T319312) [21:03:07] (03CR) 10CI reject: [V: 04-1] Openstack Keystone: Expose the Keystone public API [puppet] - 10https://gerrit.wikimedia.org/r/838903 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [21:03:28] (03CR) 10CI reject: [V: 04-1] Openstack Nova: Expose the Nova public API [puppet] - 
10https://gerrit.wikimedia.org/r/838904 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [21:06:42] (03PS2) 10Andrew Bogott: Openstack Keystone: Expose the Keystone public API [puppet] - 10https://gerrit.wikimedia.org/r/838903 (https://phabricator.wikimedia.org/T319312) [21:06:44] (03PS2) 10Andrew Bogott: Openstack Nova: Expose the Nova public API [puppet] - 10https://gerrit.wikimedia.org/r/838904 (https://phabricator.wikimedia.org/T319312) [21:06:46] (03PS2) 10Andrew Bogott: Openstack Glance: Expose the Glance public API [puppet] - 10https://gerrit.wikimedia.org/r/838905 (https://phabricator.wikimedia.org/T319312) [21:06:48] (03PS2) 10Andrew Bogott: Openstack Cinder: Expose the Cinder public API [puppet] - 10https://gerrit.wikimedia.org/r/838906 (https://phabricator.wikimedia.org/T319312) [21:06:50] (03PS2) 10Andrew Bogott: Openstack Neutron: Expose the Neutron public API [puppet] - 10https://gerrit.wikimedia.org/r/838907 (https://phabricator.wikimedia.org/T319312) [21:06:52] (03PS2) 10Andrew Bogott: Openstack Designate: Expose the Designate public API [puppet] - 10https://gerrit.wikimedia.org/r/838908 (https://phabricator.wikimedia.org/T319312) [21:11:18] (03CR) 10Dzahn: "@claime Would like to chat about this one" [puppet] - 10https://gerrit.wikimedia.org/r/825753 (https://phabricator.wikimedia.org/T241207) (owner: 10Dzahn) [21:11:29] (03PS2) 10Dzahn: scap/dsh: remove parsoid service, replaced by parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/825753 (https://phabricator.wikimedia.org/T241207) [21:13:16] (03CR) 10AOkoth: [C: 03+2] admin: add gjg to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/838902 (https://phabricator.wikimedia.org/T318873) (owner: 10AOkoth) [21:15:06] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (10Arnoldokoth) @greg This is resolved. 
Feel free to close the ticket if everything is good on your side. [21:16:18] (03CR) 10Dzahn: [C: 03+2] gitlab_runner: enable unprivileged_userns_clone in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/835162 (https://phabricator.wikimedia.org/T307810) (owner: 10Jelto) [21:16:56] (03CR) 10Dzahn: [C: 03+2] "deployed since it's already cherry-picked and multiple +1s" [puppet] - 10https://gerrit.wikimedia.org/r/835162 (https://phabricator.wikimedia.org/T307810) (owner: 10Jelto) [21:18:12] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (10Arnoldokoth) Hey @Ottomata @karapayneWMDE Kindly approve. [21:18:25] (03CR) 10Dzahn: "thanks @akosiaris! @claime This is the one I would like to deploy first." [puppet] - 10https://gerrit.wikimedia.org/r/826884 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn) [21:19:35] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (10Arnoldokoth) [21:25:01] (03PS1) 10BCornwall: prometheus: Add records for ATS percent usage [puppet] - 10https://gerrit.wikimedia.org/r/838911 (https://phabricator.wikimedia.org/T292815) [21:26:06] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:27:58] (03PS1) 10Dzahn: vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) [21:28:33] (03CR) 10CI reject: [V: 04-1] vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) (owner: 10Dzahn) [21:29:50] (03PS2) 10Dzahn: vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) [21:30:58] 
(03CR) 10CI reject: [V: 04-1] vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) (owner: 10Dzahn) [21:33:55] (03PS3) 10Dzahn: vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) [21:39:36] (03PS1) 10Dzahn: lower TTL for gerrit,gerrit-replica from 600 to 300 [dns] - 10https://gerrit.wikimedia.org/r/838915 (https://phabricator.wikimedia.org/T315319) [21:41:00] !log dancy@deploy1002 Installing scap version "4.26.0" for 559 hosts [21:41:02] (03PS1) 10Dzahn: lower TTL for phabricator from 600 to 300 [dns] - 10https://gerrit.wikimedia.org/r/838916 (https://phabricator.wikimedia.org/T315319) [21:41:17] !log dancy@deploy1002 Installation of scap version "4.26.0" completed for 559 hosts [21:45:28] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (10Ottomata) Approved! 
[21:47:50] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1002/37464/otrs1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) (owner: Dzahn)
[22:15:14] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:17:02] !log dancy@deploy1002 Installing scap version "4.27.0" for 559 hosts
[22:17:19] !log dancy@deploy1002 Installation of scap version "4.27.0" completed for 559 hosts
[22:17:52] !log dancy@deploy1002 Started deploy [integration/docroot@a136ce6]: (no justification provided)
[22:18:03] !log dancy@deploy1002 Finished deploy [integration/docroot@a136ce6]: (no justification provided) (duration: 00m 10s)
[22:19:20] !log reedy@deploy1002 Started deploy [integration/docroot@a136ce6]: Cleanup and timestamps
[22:19:43] !log reedy@deploy1002 deploy aborted: Cleanup and timestamps (duration: 00m 22s)
[22:21:11] !log reedy@deploy1002 Started deploy [integration/docroot@a136ce6]: (no justification provided)
[22:21:17] !log reedy@deploy1002 Finished deploy [integration/docroot@a136ce6]: (no justification provided) (duration: 00m 06s)
[22:21:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:23:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[22:26:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:27:14] !log reedy@deploy1002 Started deploy [integration/docroot@a136ce6]: Cleanup and timestamps
[22:27:22] !log reedy@deploy1002 Finished deploy [integration/docroot@a136ce6]: Cleanup and timestamps (duration: 00m 07s)
[22:31:22] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:34:02] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:41:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:41:16] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (greg) Open→Resolved a:Arnoldokoth Looks to be working, thanks @Arnoldokoth !
[22:46:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:29:24] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:48:24] (PS1) BryanDavis: php74: add many TTF fonts [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/838939 (https://phabricator.wikimedia.org/T310435)
[23:59:46] (CR) BryanDavis: [C: +2] "In local testing this change adds about 400MiB to the uncompressed image size. That's about a 50% increase in total size, but I think that" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/838939 (https://phabricator.wikimedia.org/T310435) (owner: BryanDavis)