[00:06:59] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:05] !log disabling Lumen AS3356 BGP session over IPv4 on cr3-ulsfo to assess effect on currently broken routing to ulsfo. [00:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:04] (03CR) 10Ahmon Dancy: [C: 03+1] check_mw_versions.py: Fix problem induced by recent scap changes [puppet] - 10https://gerrit.wikimedia.org/r/767242 (https://phabricator.wikimedia.org/T302832) (owner: 10Ahmon Dancy) [00:15:31] (03CR) 10Krinkle: "It is my current understanding and expectation that if I change something in -staging on the deployment host, lock it, and sync this no-wh" [puppet] - 10https://gerrit.wikimedia.org/r/767242 (https://phabricator.wikimedia.org/T302832) (owner: 10Ahmon Dancy) [00:15:40] !log Re-enabling Lumen AS3356 BGP session over IPv4 on cr3-ulsfo to assess effect on currently broken routing to ulsfo. 
[00:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [00:32:53] (Traffic bill over quota) firing: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org [00:35:48] (03PS1) 10Ebernhardson: query_service: Include scheme and host in X-redirect-url [puppet] - 10https://gerrit.wikimedia.org/r/767259 [00:38:12] (03PS2) 10Ebernhardson: query_service: Include scheme and host in X-redirect-url [puppet] - 10https://gerrit.wikimedia.org/r/767259 [00:39:03] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/767259 (owner: 10Ebernhardson) [00:52:53] (Traffic bill over quota) resolved: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org [01:31:23] RECOVERY - Disk space on centrallog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops [01:37:32] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:37:39] PROBLEM - Confd vcl based reload on cp6009 is CRITICAL: reload-vcl failed to run since 4h, 29 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [01:40:41] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:52:59] PROBLEM - SSH on thumbor2003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:37:49] PROBLEM - Confd vcl based reload on cp6009 is CRITICAL: reload-vcl failed to run since 5h, 29 minutes. https://wikitech.wikimedia.org/wiki/Varnish [02:42:10] (ProbeHttpFailed) firing: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [02:43:05] PROBLEM - traffic_server tls process restarted on cp6009 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6009&var-layer=tls [02:51:29] 10SRE, 10SRE Observability (FY2021/2022-Q3): SLO dashboard refinements - https://phabricator.wikimedia.org/T302842 (10lmata) [02:51:51] 10SRE, 10SRE Observability (FY2021/2022-Q3): SLO dashboard refinements - https://phabricator.wikimedia.org/T302842 (10lmata) p:05Triage→03Medium [02:52:09] 10SRE, 10SRE Observability (FY2021/2022-Q3): SLO dashboard refinements - https://phabricator.wikimedia.org/T302842 (10lmata) a:03herron [02:54:29] 10SRE, 10SRE Observability (FY2021/2022-Q3): SLO dashboard refinements - https://phabricator.wikimedia.org/T302842 (10lmata) Hi @RLazarus, Will discuss with @herron and address the feedback with any notes. Thanks! 
[03:00:54] (03CR) 10TsepoThoabala: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499) (owner: 10STran) [03:44:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [03:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [03:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [03:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [03:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T300992)', diff saved to https://phabricator.wikimedia.org/P21632 and previous config saved to /var/cache/conftool/dbconfig/20220302-034454-ladsgroup.json [03:44:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance [03:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:57] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [03:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance [03:44:59] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:45:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1104 (T302185)', diff saved to https://phabricator.wikimedia.org/P21633 and previous config saved to /var/cache/conftool/dbconfig/20220302-034502-ladsgroup.json [03:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:45:05] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [03:47:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T300992)', diff saved to https://phabricator.wikimedia.org/P21634 and previous config saved to /var/cache/conftool/dbconfig/20220302-034715-ladsgroup.json [03:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:48:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1104.eqiad.wmnet with OS bullseye [03:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1104.eqiad.wmnet with reason: host reimage [03:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1104.eqiad.wmnet with reason: host reimage [04:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:02:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P21635 and previous config saved to /var/cache/conftool/dbconfig/20220302-040220-ladsgroup.json [04:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:15:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1104.eqiad.wmnet with OS bullseye [04:15:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P21636 and previous config saved to /var/cache/conftool/dbconfig/20220302-041725-ladsgroup.json [04:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:20:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T302185)', diff saved to https://phabricator.wikimedia.org/P21637 and previous config saved to /var/cache/conftool/dbconfig/20220302-042012-ladsgroup.json [04:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:20:15] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [04:25:07] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Accidentally unsubscribed everyone from open-glam mailing list - https://phabricator.wikimedia.org/T302816 (10Ladsgroup) Unfortunately, I don't think I can get it back from the binlog because the removal queries are not like list_id = 'open-glam.lists.wikim... 
[04:32:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T300992)', diff saved to https://phabricator.wikimedia.org/P21638 and previous config saved to /var/cache/conftool/dbconfig/20220302-043229-ladsgroup.json [04:32:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [04:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:32:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [04:32:33] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [04:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:32:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [04:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:32:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [04:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:33:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [04:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:33:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [04:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:33:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21639 and previous 
config saved to /var/cache/conftool/dbconfig/20220302-043313-ladsgroup.json [04:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:34:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21640 and previous config saved to /var/cache/conftool/dbconfig/20220302-043433-ladsgroup.json [04:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:35:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P21641 and previous config saved to /var/cache/conftool/dbconfig/20220302-043516-ladsgroup.json [04:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:42:28] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:42:48] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:45:10] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:46:32] PROBLEM - Number of messages locally queued by purged for processing on cp6009 is CRITICAL: cluster=cache_text instance=cp6009 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6009 [04:49:02] RECOVERY - Number of messages locally queued by purged 
for processing on cp6009 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6009 [04:49:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P21642 and previous config saved to /var/cache/conftool/dbconfig/20220302-044938-ladsgroup.json [04:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:50:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P21643 and previous config saved to /var/cache/conftool/dbconfig/20220302-045021-ladsgroup.json [04:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:44] (03PS1) 10Ladsgroup: db1101: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/767279 (https://phabricator.wikimedia.org/T302185) [05:04:01] (03PS2) 10Ladsgroup: db1101: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/767279 (https://phabricator.wikimedia.org/T302185) [05:04:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P21644 and previous config saved to /var/cache/conftool/dbconfig/20220302-050442-ladsgroup.json [05:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:03] (03CR) 10Ladsgroup: [C: 03+2] db1101: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/767279 (https://phabricator.wikimedia.org/T302185) (owner: 10Ladsgroup) [05:05:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T302185)', diff saved to https://phabricator.wikimedia.org/P21645 and previous config saved to /var/cache/conftool/dbconfig/20220302-050526-ladsgroup.json [05:05:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:29] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [05:16:03] 10SRE, 10WMF-General-or-Unknown, 10WMF-Legal, 10Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10Ladsgroup) I hereby license all my existing contributions to the operations/puppet under the Apache 2.0 license [05:18:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [05:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [05:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T302185)', diff saved to https://phabricator.wikimedia.org/P21646 and previous config saved to /var/cache/conftool/dbconfig/20220302-051853-ladsgroup.json [05:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:56] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [05:19:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21647 and previous config saved to /var/cache/conftool/dbconfig/20220302-051947-ladsgroup.json [05:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:51] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [05:20:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T302185)', diff saved to https://phabricator.wikimedia.org/P21648 and previous config saved to 
/var/cache/conftool/dbconfig/20220302-052033-ladsgroup.json [05:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1101.eqiad.wmnet with OS bullseye [05:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:48] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (phab1001), Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:32:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1101.eqiad.wmnet with reason: host reimage [05:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1101.eqiad.wmnet with reason: host reimage [05:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [05:46:26] PROBLEM - Number of messages locally queued by purged for processing on cp6009 is CRITICAL: cluster=cache_text instance=cp6009 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6009 [05:48:24] RECOVERY - Number of messages locally queued by purged for processing on cp6009 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6009 [05:48:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1101.eqiad.wmnet with OS bullseye [05:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T302185)', diff saved to https://phabricator.wikimedia.org/P21649 and previous config saved to /var/cache/conftool/dbconfig/20220302-055419-ladsgroup.json [05:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:23] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [06:09:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P21650 and previous config saved to /var/cache/conftool/dbconfig/20220302-060924-ladsgroup.json [06:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P21651 and previous config saved to /var/cache/conftool/dbconfig/20220302-062428-ladsgroup.json [06:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T302185)', diff saved to https://phabricator.wikimedia.org/P21652 and previous config saved to /var/cache/conftool/dbconfig/20220302-063933-ladsgroup.json [06:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:37] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [06:39:50] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:41:20] PROBLEM - SSH on kubernetes2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:42:10] (ProbeHttpFailed) firing: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [06:50:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T302185)', diff saved to https://phabricator.wikimedia.org/P21653 and previous config saved to /var/cache/conftool/dbconfig/20220302-065056-ladsgroup.json [06:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:00] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [07:00:02] RECOVERY - SSH on thumbor2003.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:01:52] (03PS11) 10Giuseppe Lavagetto: conftool: add request-actions / request-patterns [puppet] - 10https://gerrit.wikimedia.org/r/763486 [07:01:59] (03PS1) 10Ladsgroup: Revert "db1104: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767090 [07:02:26] (03PS1) 10Ladsgroup: Revert "db1114: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767091 [07:02:42] (03PS1) 10Ladsgroup: Revert "db1177: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767092 [07:02:50] (03PS2) 10Ladsgroup: Revert "db1104: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767090 [07:02:55] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1104: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767090 (owner: 10Ladsgroup) [07:03:07] (03PS2) 10Ladsgroup: Revert "db1114: Disable 
notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767091 [07:03:10] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1114: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767091 (owner: 10Ladsgroup) [07:03:23] (03PS2) 10Ladsgroup: Revert "db1177: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767092 [07:03:28] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1177: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767092 (owner: 10Ladsgroup) [07:06:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P21654 and previous config saved to /var/cache/conftool/dbconfig/20220302-070601-ladsgroup.json [07:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:36] <_joe_> !log installing scap 4.4.1 everywhere T302464 [07:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:40] T302464: Deploy Scap version 4.4.1 - https://phabricator.wikimedia.org/T302464 [07:13:14] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Joe) 05Open→03Stalled a:05Joe→03None De-assigning from myself as I can't do anything more for this task in its current status. Also reflecting it... 
[07:15:34] (03PS1) 10Ladsgroup: Revert "db1101: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767093 [07:15:36] 10SRE, 10Patch-For-Review, 10Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10Joe) [07:15:38] 10SRE, 10discovery-system: confctl SubjectAltNameWarning after python-urllib3 upgrade - https://phabricator.wikimedia.org/T156232 (10Joe) 05Open→03Resolved a:03Joe [07:16:23] (03PS2) 10Ladsgroup: Revert "db1101: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767093 [07:16:26] 10SRE, 10MediaWiki-Configuration, 10discovery-system: Use EtcdConfig in production to allow automation of a datacenter switch - https://phabricator.wikimedia.org/T182597 (10Joe) 05Open→03Resolved a:03Joe [07:16:32] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1101: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767093 (owner: 10Ladsgroup) [07:16:35] 10SRE, 10discovery-system: Replace etcd internal auth mechanism with a frontend proxy - https://phabricator.wikimedia.org/T146355 (10Joe) 05Open→03Resolved a:03Joe This has been implemented years ago. [07:16:59] 10SRE, 10discovery-system: confctl should provide tags information after writing data - https://phabricator.wikimedia.org/T124413 (10Joe) 05Open→03Resolved a:03Joe This has been solved years ago. [07:18:37] 10SRE, 10discovery-system: Create a conftool "agent" that overcomes confd deficiencies - https://phabricator.wikimedia.org/T107285 (10Joe) 05Open→03Declined 7 years later no one is working on this and I doubt it will ever be. Declining the task as a consequence. [07:18:46] (03PS1) 10Ladsgroup: db1167: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/767440 (https://phabricator.wikimedia.org/T302185) [07:19:03] 10SRE, 10Traffic-Icebox, 10discovery-system, 10services-tooling: Figure out an etcd deploy strategy that includes multi DC failure scenarios. 
- https://phabricator.wikimedia.org/T98165 (10Joe) 05Open→03Resolved a:03Joe This task was left open by mistake; we've had a multi-dc setup for years now. [07:19:15] (03PS2) 10Ladsgroup: db1167: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/767440 (https://phabricator.wikimedia.org/T302185) [07:19:19] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1167: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/767440 (https://phabricator.wikimedia.org/T302185) (owner: 10Ladsgroup) [07:19:47] 10SRE, 10Kubernetes, 10discovery-system: Document what #discovery-system is - https://phabricator.wikimedia.org/T282948 (10Joe) 05Open→03Resolved @Aklapper all done. I think we can retire the tag. [07:21:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P21655 and previous config saved to /var/cache/conftool/dbconfig/20220302-072105-ladsgroup.json [07:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:40] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10Joe) I would frankly either keep third-party modules under /modules or move them to /vendor/modules. While I do love r10k as an idea and I even considered it as an option for puppet for cloud... [07:29:27] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "Merging as no new comments appeared here or on the design document in the last week or so, and we need to move forward with this." 
[puppet] - 10https://gerrit.wikimedia.org/r/763486 (owner: 10Giuseppe Lavagetto) [07:30:16] (03CR) 10Giuseppe Lavagetto: [C: 03+2] conftool: add request-actions / request-patterns (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/763486 (owner: 10Giuseppe Lavagetto) [07:35:35] <_joe_> !log filling request patterns in etcd [07:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T302185)', diff saved to https://phabricator.wikimedia.org/P21656 and previous config saved to /var/cache/conftool/dbconfig/20220302-073610-ladsgroup.json [07:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:13] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [07:40:46] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:41:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [07:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [07:42:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:42:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:10] RECOVERY - SSH on kubernetes2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:42:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T302185)', diff saved to https://phabricator.wikimedia.org/P21657 and previous config saved to /var/cache/conftool/dbconfig/20220302-074210-ladsgroup.json [07:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:13] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [07:45:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [07:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:14] 10SRE, 10Kubernetes, 10discovery-system: Document what #discovery-system is - https://phabricator.wikimedia.org/T282948 (10RhinosF1) Should it be archived then? 
[07:45:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [07:45:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [07:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [07:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [07:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [07:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21658 and previous config saved to /var/cache/conftool/dbconfig/20220302-074602-ladsgroup.json [07:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:05] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [07:48:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21659 and previous config saved to /var/cache/conftool/dbconfig/20220302-074822-ladsgroup.json [07:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:04] Amir1, awight, Urbanecm, and taavi: I, the Bot under the Fountain, call 
upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220302T0800). [08:00:04] No Gerrit patches in the queue for this window AFAICS. [08:00:19] indeed, nothing to do! [08:01:34] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [08:02:20] !log killing all entity dumpers of wikidata in snapshot1008 (T300255) [08:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:24] T300255: Wikidata entity dumper keeps connecting to depooled host for really long time - https://phabricator.wikimedia.org/T300255 [08:03:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P21660 and previous config saved to /var/cache/conftool/dbconfig/20220302-080327-ladsgroup.json [08:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:35] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting, 10Patch-For-Review, and 2 others: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10fgiunchedi) [08:04:56] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [08:09:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1167.eqiad.wmnet with OS bullseye [08:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:20] !log test thanos 0.24.0 on thanos-fe2001 to check if https://github.com/thanos-io/thanos/issues/4531 is fixed [08:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:43] (03CR) 10Filippo Giunchedi: "+ Cole for visibility" [puppet] - 
10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) (owner: 10Bking) [08:18:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P21661 and previous config saved to /var/cache/conftool/dbconfig/20220302-081832-ladsgroup.json [08:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:50] (03CR) 10Muehlenhoff: Require Python 3.7/buster for logout scripts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/767064 (owner: 10Muehlenhoff) [08:20:06] 10SRE, 10observability, 10serviceops, 10Patch-For-Review: aggregate mismatched wikiversions alert - https://phabricator.wikimedia.org/T302832 (10fgiunchedi) I think a short term easy fix would be to make the check warning (i.e. icinga/alerts.w.o only) instead of critical so it doesn't spam irc, what do you... [08:20:38] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: open per-device librenms tasks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767179 (https://phabricator.wikimedia.org/T300836) (owner: 10Filippo Giunchedi) [08:20:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1167.eqiad.wmnet with reason: host reimage [08:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1167.eqiad.wmnet with reason: host reimage [08:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:31] (03CR) 10Muehlenhoff: [C: 03+2] zuul: gracefully shutdown [puppet] - 10https://gerrit.wikimedia.org/r/732978 (https://phabricator.wikimedia.org/T257040) (owner: 10Hashar) [08:33:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21662 and previous config saved to 
/var/cache/conftool/dbconfig/20220302-083338-ladsgroup.json [08:33:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [08:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [08:33:42] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [08:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T300992)', diff saved to https://phabricator.wikimedia.org/P21663 and previous config saved to /var/cache/conftool/dbconfig/20220302-083345-ladsgroup.json [08:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:13] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:36:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300992)', diff saved to https://phabricator.wikimedia.org/P21664 and previous config saved to /var/cache/conftool/dbconfig/20220302-083606-ladsgroup.json [08:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:55] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:38:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1167.eqiad.wmnet with OS bullseye [08:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:42] (03CR) 10Gehel: [C: 04-1] "See minor comment inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) (owner: 10Bking) [08:41:11] (03CR) 10Gehel: [C: 04-1] elastic: prevent rundir from deletion (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) (owner: 10Bking) [08:44:30] (03CR) 10Muehlenhoff: [C: 03+2] Remove ema from router config [homer/public] - 10https://gerrit.wikimedia.org/r/767083 (owner: 10Muehlenhoff) [08:45:11] (03PS1) 10Hashar: Revert "zuul: gracefully shutdown" [puppet] - 10https://gerrit.wikimedia.org/r/767094 [08:45:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T302185)', diff saved to https://phabricator.wikimedia.org/P21665 and previous config saved to /var/cache/conftool/dbconfig/20220302-084513-ladsgroup.json [08:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:17] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [08:46:35] (03PS2) 10Hashar: Revert "zuul: gracefully shutdown" [puppet] - 10https://gerrit.wikimedia.org/r/767094 (https://phabricator.wikimedia.org/T257040) [08:50:22] (03CR) 10Muehlenhoff: [C: 03+2] Revert "zuul: gracefully shutdown" [puppet] - 10https://gerrit.wikimedia.org/r/767094 (https://phabricator.wikimedia.org/T257040) (owner: 10Hashar) [08:51:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P21666 and previous config saved to /var/cache/conftool/dbconfig/20220302-085111-ladsgroup.json [08:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:11] (03CR) 10Volans: [C: 03+1] "Replies inline, no blocker for me. 
The code does what it advertises :)" [puppet] - 10https://gerrit.wikimedia.org/r/767064 (owner: 10Muehlenhoff) [08:58:58] (03CR) 10Volans: [C: 03+1] "Maybe let's add a warning message to the cookbook though, so to not forget" [puppet] - 10https://gerrit.wikimedia.org/r/767064 (owner: 10Muehlenhoff) [08:59:01] (03CR) 10Muehlenhoff: elastic: prevent rundir from deletion (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) (owner: 10Bking) [09:00:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P21667 and previous config saved to /var/cache/conftool/dbconfig/20220302-090018-ladsgroup.json [09:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:41] (03PS1) 10Elukey: Add kubernetes2018 to wikikube codfw [puppet] - 10https://gerrit.wikimedia.org/r/767465 (https://phabricator.wikimedia.org/T302208) [09:04:43] (03PS1) 10Ladsgroup: Revert "db1167: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767095 [09:05:14] (03PS2) 10Ladsgroup: Revert "db1167: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767095 [09:05:35] !log push Capirca managed labs-in firewall filter to eqiad routers [09:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:41] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1167: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767095 (owner: 10Ladsgroup) [09:06:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P21668 and previous config saved to /var/cache/conftool/dbconfig/20220302-090615-ladsgroup.json [09:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:13] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34021/console" [puppet] - 10https://gerrit.wikimedia.org/r/767465 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [09:08:13] (03PS1) 10David Caro: wmcs: add runbook url to the backup_cinder_volumes alert [puppet] - 10https://gerrit.wikimedia.org/r/767467 (https://phabricator.wikimedia.org/T302855) [09:09:24] (03CR) 10Elukey: Add kubernetes2018 to wikikube codfw [puppet] - 10https://gerrit.wikimedia.org/r/767465 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [09:09:33] (03CR) 10Jelto: [C: 03+2] gitlab: avoid $realm check, simplify ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/762897 (owner: 10Majavah) [09:12:52] (03CR) 10Jelto: [C: 03+2] "looks good to me, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/762897 (owner: 10Majavah) [09:13:05] (03CR) 10Vgutierrez: [C: 03+2] mtail::atstls: Use native histogram type [puppet] - 10https://gerrit.wikimedia.org/r/767069 (owner: 10Vgutierrez) [09:13:32] (03PS2) 10Jelto: gitlab: avoid $realm check, simplify ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/762897 (owner: 10Majavah) [09:13:38] (03PS3) 10Vgutierrez: mtail::atstls: Provide trafficserver_tls_client_healthcheck_ttfb [puppet] - 10https://gerrit.wikimedia.org/r/767185 [09:13:40] (03CR) 10Ayounsi: [C: 03+2] Port labs-in4/6 to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/701347 (https://phabricator.wikimedia.org/T285461) (owner: 10Ayounsi) [09:14:32] (03Merged) 10jenkins-bot: Port labs-in4/6 to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/701347 (https://phabricator.wikimedia.org/T285461) (owner: 10Ayounsi) [09:15:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P21669 and previous config saved to /var/cache/conftool/dbconfig/20220302-091523-ladsgroup.json [09:15:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:56] (03CR) 10Vgutierrez: [C: 03+2] mtail::atstls: Provide trafficserver_tls_client_healthcheck_ttfb [puppet] - 10https://gerrit.wikimedia.org/r/767185 (owner: 10Vgutierrez) [09:16:03] !log rolling restart of varnishkafka-* on cp6* [09:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:18] (03PS1) 10Elukey: Add BGP config for kubernetes2018 [homer/public] - 10https://gerrit.wikimedia.org/r/767468 (https://phabricator.wikimedia.org/T302208) [09:16:55] (03CR) 10Elukey: "Related BGP change for Homer: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/767468" [puppet] - 10https://gerrit.wikimedia.org/r/767465 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [09:21:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300992)', diff saved to https://phabricator.wikimedia.org/P21670 and previous config saved to /var/cache/conftool/dbconfig/20220302-092120-ladsgroup.json [09:21:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [09:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [09:21:24] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [09:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T300992)', diff saved to https://phabricator.wikimedia.org/P21671 and previous config saved to /var/cache/conftool/dbconfig/20220302-092128-ladsgroup.json [09:21:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300992)', diff saved to https://phabricator.wikimedia.org/P21672 and previous config saved to /var/cache/conftool/dbconfig/20220302-092348-ladsgroup.json [09:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:08] (03CR) 10Jelto: [V: 03+1 C: 03+2] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34022/console" [puppet] - 10https://gerrit.wikimedia.org/r/762897 (owner: 10Majavah) [09:28:38] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Amin Al Hazwani - https://phabricator.wikimedia.org/T302775 (10JMeybohm) [09:28:59] (03PS1) 10Ayounsi: Add labs-in4/6 to codfw cloud-hosts vlan [homer/public] - 10https://gerrit.wikimedia.org/r/767471 [09:30:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T302185)', diff saved to https://phabricator.wikimedia.org/P21673 and previous config saved to /var/cache/conftool/dbconfig/20220302-093027-ladsgroup.json [09:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:31] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [09:30:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add labs-in4/6 to codfw cloud-hosts vlan [homer/public] - 10https://gerrit.wikimedia.org/r/767471 (owner: 10Ayounsi) [09:31:23] (03PS1) 10JMeybohm: admin: Add aminalhazwani to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/767472 (https://phabricator.wikimedia.org/T302775) [09:35:41] !log klausman@cumin2002 START - Cookbook sre.ganeti.makevm for new host ml-staging-ctrl2001.codfw.wmnet [09:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:43] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [09:35:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:27] (03CR) 10JMeybohm: [C: 03+1] Add kubernetes2018 to wikikube codfw [puppet] - 10https://gerrit.wikimedia.org/r/767465 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [09:38:04] (03CR) 10JMeybohm: [C: 03+1] Add BGP config for kubernetes2018 [homer/public] - 10https://gerrit.wikimedia.org/r/767468 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [09:38:24] (03PS1) 10Jelto: gitlab: update sevice_ip and ferm_drange for wmcs [puppet] - 10https://gerrit.wikimedia.org/r/767473 (https://phabricator.wikimedia.org/T302803) [09:38:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P21674 and previous config saved to /var/cache/conftool/dbconfig/20220302-093853-ladsgroup.json [09:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:03] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:31] (03PS1) 10David Caro: wmcs-cinder-backup-manager: increase individual timeout to 30h [puppet] - 10https://gerrit.wikimedia.org/r/767474 (https://phabricator.wikimedia.org/T302855) [09:41:54] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34023/console" [puppet] - 10https://gerrit.wikimedia.org/r/767473 (https://phabricator.wikimedia.org/T302803) (owner: 10Jelto) [09:42:53] (03CR) 10Elukey: [C: 03+2] Add kubernetes2018 to wikikube codfw [puppet] - 10https://gerrit.wikimedia.org/r/767465 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [09:42:55] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org [09:43:57] (03CR) 10Elukey: [V: 03+1 C: 03+2] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34025/console" [puppet] - 10https://gerrit.wikimedia.org/r/767465 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [09:44:21] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@fd6bc59] (eqiad): Temporarily increase poolsize for debugging [09:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:34] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@fd6bc59] (eqiad): Temporarily increase poolsize for debugging (duration: 02m 13s) [09:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:43] (03CR) 10JMeybohm: [C: 03+2] "Context: https://phabricator.wikimedia.org/T296706" [deployment-charts] - 10https://gerrit.wikimedia.org/r/751439 (owner: 10PipelineBot) [09:47:11] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@fd6bc59] (codfw): Temporarily increase poolsize for debugging [09:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:47] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-staging-ctrl2001.codfw.wmnet [09:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:03] (03CR) 10Jelto: [V: 03+1] "With correct floating IP in place (see https://phabricator.wikimedia.org/T302803) we don't need a dedicated ferm_drange in WMCS (beside mi" [puppet] - 10https://gerrit.wikimedia.org/r/767473 (https://phabricator.wikimedia.org/T302803) (owner: 10Jelto) [09:49:23] !log klausman@cumin2002 START - Cookbook sre.ganeti.makevm for new host ml-staging-ctrl2002.codfw.wmnet [09:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:24] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [09:49:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:35] (03Merged) 10jenkins-bot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/751439 (owner: 10PipelineBot) [09:51:37] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@fd6bc59] (codfw): Temporarily increase poolsize for debugging (duration: 04m 26s) [09:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:20] (03PS1) 10Ayounsi: Rename labs and cloud filters [homer/public] - 10https://gerrit.wikimedia.org/r/767476 [09:53:13] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [09:53:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P21675 and previous config saved to /var/cache/conftool/dbconfig/20220302-095358-ladsgroup.json [09:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:06] 10SRE, 10Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10jcrespo) Backups worked without errors tonight, all migration work done and ready to upgrade the backup hosts next. 
[09:55:00] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:31] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [09:55:48] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply [09:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:55] (03CR) 10Elukey: [C: 03+2] Add BGP config for kubernetes2018 [homer/public] - 10https://gerrit.wikimedia.org/r/767468 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [09:56:47] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [09:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:34] (03CR) 10Arturo Borrero Gonzalez: "LGTM, question inlined." [homer/public] - 10https://gerrit.wikimedia.org/r/767476 (owner: 10Ayounsi) [09:59:09] (03CR) 10Ayounsi: Rename labs and cloud filters (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/767476 (owner: 10Ayounsi) [09:59:31] there's a weird behavior on Maps geoshape endpoint that is a possible DDoS situation [10:00:18] it's starving PG connections in maps eqiad https://grafana.wikimedia.org/goto/5V7twIY7k?orgId=1 [10:00:19] (03CR) 10Ayounsi: "Checked the filter and it should work with codfw without changes." 
[homer/public] - 10https://gerrit.wikimedia.org/r/767471 (owner: 10Ayounsi) [10:00:31] (03CR) 10Ayounsi: [C: 03+2] Add labs-in4/6 to codfw cloud-hosts vlan [homer/public] - 10https://gerrit.wikimedia.org/r/767471 (owner: 10Ayounsi) [10:00:52] (03PS6) 10Muehlenhoff: sre.ganeti.addnode: Validate bridge config of the switches [cookbooks] - 10https://gerrit.wikimedia.org/r/765309 [10:01:32] from turnilo web requests sample it seems that geoshapes is being highly requested (and one weird Chinese tile) https://w.wiki/4uCj [10:01:59] it seems that 3rd parties found a way to work around our block [10:02:23] cc/ _joe_ [10:02:51] <_joe_> mbsantos: 301 traffic :P [10:03:03] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: update sevice_ip and ferm_drange for wmcs [puppet] - 10https://gerrit.wikimedia.org/r/767473 (https://phabricator.wikimedia.org/T302803) (owner: 10Jelto) [10:03:09] <_joe_> sorry but I have too much on my plate already [10:03:29] no worries, is there a different channel for traffic? [10:03:38] <_joe_> #wikimedia-traffic [10:03:51] <_joe_> but I would advise opening a restricted task with more information [10:04:43] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-staging-ctrl2002.codfw.wmnet [10:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:50] thanks [10:08:11] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.addnode: Validate bridge config of the switches [cookbooks] - 10https://gerrit.wikimedia.org/r/765309 (owner: 10Muehlenhoff) [10:09:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300992)', diff saved to https://phabricator.wikimedia.org/P21676 and previous config saved to /var/cache/conftool/dbconfig/20220302-100903-ladsgroup.json [10:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:06] T300992: Add linter_template and linter_tag columns to the Linter table - 
https://phabricator.wikimedia.org/T300992 [10:10:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, I'll merge in a bit." [puppet] - 10https://gerrit.wikimedia.org/r/766291 (owner: 10Majavah) [10:11:40] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@d049589] (codfw): Revert "Temporarily increase poolsize for debugging" [10:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:39] (03PS1) 10Ladsgroup: Add --dbgroupdefault=dump to every major dump run [dumps] - 10https://gerrit.wikimedia.org/r/767477 (https://phabricator.wikimedia.org/T138208) [10:13:17] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@d049589] (codfw): Revert "Temporarily increase poolsize for debugging" (duration: 01m 36s) [10:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:27] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@d049589] (eqiad): Revert "Temporarily increase poolsize for debugging" [10:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:23] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply [10:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:12] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@d049589] (eqiad): Revert "Temporarily increase poolsize for debugging" (duration: 01m 45s) [10:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:31] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [10:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/767226 (https://phabricator.wikimedia.org/T301679) (owner: 10JMeybohm) [10:17:34] (03PS1) 10Klausman: Add entries for ML staging control plane VMs [puppet] - 10https://gerrit.wikimedia.org/r/767478 
(https://phabricator.wikimedia.org/T302504) [10:18:23] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [10:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:31] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/767227 (https://phabricator.wikimedia.org/T301659) (owner: 10JMeybohm) [10:19:43] (03PS2) 10Klausman: Add entries for ML staging control plane VMs [puppet] - 10https://gerrit.wikimedia.org/r/767478 (https://phabricator.wikimedia.org/T302504) [10:20:08] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [10:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:50] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/767472 (https://phabricator.wikimedia.org/T302775) (owner: 10JMeybohm) [10:22:13] (03CR) 10Muehlenhoff: [C: 03+2] Extract ssh fingerprint publishing to an independent class [puppet] - 10https://gerrit.wikimedia.org/r/766291 (owner: 10Majavah) [10:29:28] (03CR) 10JMeybohm: "Deployed everywhere" [deployment-charts] - 10https://gerrit.wikimedia.org/r/751439 (owner: 10PipelineBot) [10:30:00] (03CR) 10JMeybohm: [C: 03+2] admin: add tmlt-tmager to krb & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/767226 (https://phabricator.wikimedia.org/T301679) (owner: 10JMeybohm) [10:30:04] (03CR) 10JMeybohm: [C: 03+2] admin: add damiendf to krb & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/767227 (https://phabricator.wikimedia.org/T301659) (owner: 10JMeybohm) [10:30:09] (03CR) 10JMeybohm: [C: 03+2] admin: Add aminalhazwani to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/767472 (https://phabricator.wikimedia.org/T302775) (owner: 10JMeybohm) [10:30:56] (03PS3) 10Klausman: Add entries for ML staging control plane VMs [puppet] - 10https://gerrit.wikimedia.org/r/767478 
(https://phabricator.wikimedia.org/T302504) [10:31:12] (03PS4) 10Klausman: Add entries for ML staging control plane VMs [puppet] - 10https://gerrit.wikimedia.org/r/767478 (https://phabricator.wikimedia.org/T302504) [10:31:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [10:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [10:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:08] (03CR) 10Kormat: [C: 03+1] Add Cumin alias to match core-test role [puppet] - 10https://gerrit.wikimedia.org/r/765562 (owner: 10Muehlenhoff) [10:32:43] (03CR) 10Kormat: Add Cumin alias to match core-test role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765562 (owner: 10Muehlenhoff) [10:34:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [10:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [10:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T300992)', diff saved to https://phabricator.wikimedia.org/P21677 and previous config saved to /var/cache/conftool/dbconfig/20220302-103407-ladsgroup.json [10:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:10] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [10:38:16] 10SRE, 10Kubernetes: Document what #discovery-system is - https://phabricator.wikimedia.org/T282948 
(10Aklapper) Project archived - https://www.mediawiki.org/wiki/Phabricator/Project_management#Archiving_a_project [10:38:30] 10SRE, 10Project-Admins, 10Kubernetes: Document what #discovery-system is - https://phabricator.wikimedia.org/T282948 (10Aklapper) [10:38:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T300992)', diff saved to https://phabricator.wikimedia.org/P21678 and previous config saved to /var/cache/conftool/dbconfig/20220302-103832-ladsgroup.json [10:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:41] (03PS1) 10Elukey: Add kubernetes20[19-22] to wikikube codfw [puppet] - 10https://gerrit.wikimedia.org/r/767482 (https://phabricator.wikimedia.org/T302208) [10:39:19] (03CR) 10Elukey: [C: 03+1] Add entries for ML staging control plane VMs [puppet] - 10https://gerrit.wikimedia.org/r/767478 (https://phabricator.wikimedia.org/T302504) (owner: 10Klausman) [10:39:41] (03CR) 10Klausman: [C: 03+2] Add entries for ML staging control plane VMs [puppet] - 10https://gerrit.wikimedia.org/r/767478 (https://phabricator.wikimedia.org/T302504) (owner: 10Klausman) [10:42:55] (ProbeHttpFailed) firing: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [10:42:58] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10observability, and 2 others: Upgrade Kafka Risk Evaluation - https://phabricator.wikimedia.org/T302610 (10JMeybohm) p:05Triage→03Medium [10:43:32] 10SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (10JMeybohm) p:05Triage→03Medium [10:44:53] (03CR) 10Elukey: "pcc diff https://puppet-compiler.wmflabs.org/pcc-worker1001/34026/" [puppet] - 10https://gerrit.wikimedia.org/r/767482 (https://phabricator.wikimedia.org/T302208) (owner: 
10Elukey) [10:45:00] (03PS2) 10Elukey: Add kubernetes20[19-22] to wikikube codfw [puppet] - 10https://gerrit.wikimedia.org/r/767482 (https://phabricator.wikimedia.org/T302208) [10:45:23] (03PS1) 10Jelto: gitlab: remove realm check, move listen_addresses to hiera [puppet] - 10https://gerrit.wikimedia.org/r/767484 (https://phabricator.wikimedia.org/T297411) [10:45:25] (03PS10) 10Filippo Giunchedi: Introduce 'alertmanager' and 'alerting' modules [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) [10:46:41] (03CR) 10Filippo Giunchedi: Introduce 'alertmanager' and 'alerting' modules (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) (owner: 10Filippo Giunchedi) [10:48:41] (03PS1) 10Elukey: Add BGP config for kubernetes20[19-22] in wikikube codfw [homer/public] - 10https://gerrit.wikimedia.org/r/767485 (https://phabricator.wikimedia.org/T302208) [10:49:00] (03PS3) 10Muehlenhoff: Add Cumin alias to match core-test role [puppet] - 10https://gerrit.wikimedia.org/r/765562 [10:49:08] (03CR) 10Elukey: "bgp config in https://gerrit.wikimedia.org/r/c/operations/homer/public/+/767485" [puppet] - 10https://gerrit.wikimedia.org/r/767482 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [10:53:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P21680 and previous config saved to /var/cache/conftool/dbconfig/20220302-105336-ladsgroup.json [10:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:53] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10JMeybohm) p:05Triage→03Medium deployment-mediawiki11 has been replaced by deployment-mediawiki12 (although th... 
[10:56:18] !log restarting apache2 and mailman3-web on lists.wikimedia.org for expat security update [10:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:24] (03PS2) 10Jelto: gitlab: remove realm check, move listen_addresses to hiera [puppet] - 10https://gerrit.wikimedia.org/r/767484 (https://phabricator.wikimedia.org/T297411) [11:01:05] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for Amin Al Hazwani - https://phabricator.wikimedia.org/T302775 (10JMeybohm) 05Open→03Resolved a:03JMeybohm You should be good to go [11:03:16] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34028/console" [puppet] - 10https://gerrit.wikimedia.org/r/767484 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [11:04:37] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Accidentally unsubscribed everyone from open-glam mailing list - https://phabricator.wikimedia.org/T302816 (10Ladsgroup) 05Open→03Resolved I re-added everyone from a backup that was made in 2022-03-01 05:53:07 (so anyone subscribing between that time an... [11:05:50] !log installing expat security updates [11:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:28] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10Vgutierrez) hmm if that's the case horizon data for deployment-prep-cache needs to be updated as well cause right... [11:07:08] (03CR) 10Jelto: [V: 03+1] "removing one realm check by moving addresses to hiera, similar to I517f1a51b932b933e4ae42ee5a92db32d433b2fc. Should be noop to production." 
[puppet] - 10https://gerrit.wikimedia.org/r/767484 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [11:07:58] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Damiendf - https://phabricator.wikimedia.org/T301659 (10JMeybohm) 05In progress→03Resolved a:03JMeybohm >>! In T301659#7745083, @Damiendf wrote: > Arg sorry, this is the wrong email address. I correct... [11:08:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata for Tom Magerlein - https://phabricator.wikimedia.org/T301679 (10JMeybohm) 05In progress→03Resolved a:03JMeybohm Access has been granted and krb5 principal has been created. [11:08:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P21681 and previous config saved to /var/cache/conftool/dbconfig/20220302-110842-ladsgroup.json [11:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:17] (03PS2) 10Aqu: Set default Airflow concurrency limits [puppet] - 10https://gerrit.wikimedia.org/r/767220 (https://phabricator.wikimedia.org/T300870) [11:21:42] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@3dc404c] (eqiad): Merge "Update kartotherian-package to f239c6e" [11:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:48] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10JMeybohm) I was really just relaying from T300525 but it looks like something is off. deployment-mediawiki11 was... 
[11:22:21] !log rollback maps eqiad to a previous working state to mitigate geoshape errors [11:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:12] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@3dc404c] (eqiad): Merge "Update kartotherian-package to f239c6e" (duration: 01m 29s) [11:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T300992)', diff saved to https://phabricator.wikimedia.org/P21682 and previous config saved to /var/cache/conftool/dbconfig/20220302-112347-ladsgroup.json [11:23:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:50] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [11:23:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:37] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 105 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:28:18] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [11:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [11:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T300992)', diff saved to https://phabricator.wikimedia.org/P21683 and previous config saved to /var/cache/conftool/dbconfig/20220302-112824-ladsgroup.json [11:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:06] (03PS1) 10Majavah: policies/cr-labs: Include cloudbackup-dev hosts [homer/public] - 10https://gerrit.wikimedia.org/r/767487 [11:32:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T300992)', diff saved to https://phabricator.wikimedia.org/P21684 and previous config saved to /var/cache/conftool/dbconfig/20220302-113240-ladsgroup.json [11:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:44] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [11:37:02] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10Majavah) >>! In T302699#7746972, @JMeybohm wrote: > I was really just relaying from T300525 but it looks like som... 
[11:38:10] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for smokeping [puppet] - 10https://gerrit.wikimedia.org/r/767488 (https://phabricator.wikimedia.org/T135991) [11:43:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/767488 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:47:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P21685 and previous config saved to /var/cache/conftool/dbconfig/20220302-114745-ladsgroup.json [11:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:10] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias to match core-test role [puppet] - 10https://gerrit.wikimedia.org/r/765562 (owner: 10Muehlenhoff) [11:57:45] RECOVERY - Confd vcl based reload on cp6015 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:57:59] RECOVERY - Confd vcl based reload on cp6009 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:59:01] RECOVERY - traffic_server backend process restarted on cp6010 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6010&var-layer=backend [12:02:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P21686 and previous config saved to /var/cache/conftool/dbconfig/20220302-120250-ladsgroup.json [12:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:28] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Amin Al Hazwani - https://phabricator.wikimedia.org/T302775 (10aminalhazwani) Yes, indeed! 
Thanks @JMeybohm 🙏🏼 [12:04:13] RECOVERY - Check systemd state on cp6010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:19] (03PS1) 10Giuseppe Lavagetto: utils: add script to sync abuse networks with conftool ipblocks [puppet] - 10https://gerrit.wikimedia.org/r/767489 (https://phabricator.wikimedia.org/T302471) [12:05:12] (03PS1) 10Vgutierrez: site: Reimage cp4034 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/767490 (https://phabricator.wikimedia.org/T290005) [12:07:03] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp4034 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/767490 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:09:13] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4034.ulsfo.wmnet with OS buster [12:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:25] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4034.ulsfo.wmnet with OS buster [12:10:41] PROBLEM - SSH on analytics1067.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:17:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T300992)', diff saved to https://phabricator.wikimedia.org/P21687 and previous config saved to /var/cache/conftool/dbconfig/20220302-121754-ladsgroup.json [12:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:58] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [12:18:01] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [12:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [12:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [12:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [12:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:33] (03PS1) 10Jbond: C:geoip::data::maxmind: update systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/767491 (https://phabricator.wikimedia.org/T302864) [12:19:08] (03CR) 10jerkins-bot: [V: 04-1] C:geoip::data::maxmind: update systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/767491 (https://phabricator.wikimedia.org/T302864) (owner: 10Jbond) [12:20:41] (03PS2) 10Jbond: C:geoip::data::maxmind: update systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/767491 (https://phabricator.wikimedia.org/T302864) [12:20:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [12:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [12:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T300992)', diff saved to https://phabricator.wikimedia.org/P21688 and previous config saved to 
/var/cache/conftool/dbconfig/20220302-122049-ladsgroup.json [12:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:15] (03CR) 10jerkins-bot: [V: 04-1] C:geoip::data::maxmind: update systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/767491 (https://phabricator.wikimedia.org/T302864) (owner: 10Jbond) [12:21:38] (03PS3) 10Jbond: C:geoip::data::maxmind: update systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/767491 (https://phabricator.wikimedia.org/T302864) [12:21:53] (03PS4) 10Jbond: C:geoip::data::maxmind: update systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/767491 (https://phabricator.wikimedia.org/T302864) [12:22:37] (03CR) 10jerkins-bot: [V: 04-1] C:geoip::data::maxmind: update systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/767491 (https://phabricator.wikimedia.org/T302864) (owner: 10Jbond) [12:24:10] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10MoritzMuehlenhoff) Thanks for opening this task, having this discussion in a searchable, open medium is very useful! > Based on the discussion so far my inclination is that we stick with our cur... 
[12:24:31] (03PS5) 10Jbond: C:geoip::data::maxmind: update systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/767491 (https://phabricator.wikimedia.org/T302864) [12:25:05] (03PS1) 10Zabe: Change the mwapi host back to mediawiki11 [puppet] - 10https://gerrit.wikimedia.org/r/767492 [12:25:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T300992)', diff saved to https://phabricator.wikimedia.org/P21689 and previous config saved to /var/cache/conftool/dbconfig/20220302-122510-ladsgroup.json [12:25:11] (03CR) 10jerkins-bot: [V: 04-1] C:geoip::data::maxmind: update systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/767491 (https://phabricator.wikimedia.org/T302864) (owner: 10Jbond) [12:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:14] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [12:26:36] (03PS2) 10Zabe: deployment-prep: change the mwapi host back to mediawiki11 [puppet] - 10https://gerrit.wikimedia.org/r/767492 [12:30:07] (03PS6) 10Jbond: C:geoip::data::maxmind: update systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/767491 (https://phabricator.wikimedia.org/T302864) [12:30:47] (03CR) 10jerkins-bot: [V: 04-1] C:geoip::data::maxmind: update systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/767491 (https://phabricator.wikimedia.org/T302864) (owner: 10Jbond) [12:31:52] (03CR) 10Ayounsi: [C: 03+1] Enable profile::auto_restarts::service for smokeping [puppet] - 10https://gerrit.wikimedia.org/r/767488 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:32:26] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:00] (03CR) 10Ayounsi: [C: 03+1] Add BGP config for kubernetes20[19-22] in wikikube codfw [homer/public] - 10https://gerrit.wikimedia.org/r/767485 
(https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [12:33:08] (03PS7) 10Jbond: C:geoip::data::maxmind: update systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/767491 (https://phabricator.wikimedia.org/T302864) [12:34:31] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10netbox: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (10Majavah) [12:35:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34037/console" [puppet] - 10https://gerrit.wikimedia.org/r/767491 (https://phabricator.wikimedia.org/T302864) (owner: 10Jbond) [12:37:38] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:24] RECOVERY - traffic_server tls process restarted on cp6009 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6009&var-layer=tls [12:38:47] (03PS1) 10Reedy: Delete incorrect en-gb.json [extensions/MassMessage] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767098 (https://phabricator.wikimedia.org/T302840) [12:39:30] jouncebot: nowandnext [12:39:31] No deployments scheduled for the next 1 hour(s) and 20 minute(s) [12:39:31] In 1 hour(s) and 20 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220302T1400) [12:39:38] (03CR) 10Reedy: [C: 03+2] Delete incorrect en-gb.json [extensions/MassMessage] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767098 (https://phabricator.wikimedia.org/T302840) (owner: 10Reedy) [12:40:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P21690 and 
previous config saved to /var/cache/conftool/dbconfig/20220302-124014-ladsgroup.json [12:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:06] RECOVERY - traffic_server tls process restarted on cp6014 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6014&var-layer=tls [12:43:25] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4034.ulsfo.wmnet with OS buster [12:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:38] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4034.ulsfo.wmnet with OS buster e... 
[12:43:42] (03Merged) 10jenkins-bot: Delete incorrect en-gb.json [extensions/MassMessage] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767098 (https://phabricator.wikimedia.org/T302840) (owner: 10Reedy) [12:45:50] !log reedy@deploy1002 Started scap: Fix MassMessage translations T302840 [12:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:53] T302840: Wrong language in en-gb MassMessage interface - https://phabricator.wikimedia.org/T302840 [12:46:44] RECOVERY - traffic_server tls process restarted on cp6015 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6015&var-layer=tls [12:47:41] !log reedy@deploy1002 Finished scap: Fix MassMessage translations T302840 (duration: 01m 50s) [12:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:46] (03CR) 10Tchanders: Add IPInfo viewing rights for certain groups (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499) (owner: 10STran) [12:54:20] (03PS3) 10Zabe: deployment-prep: change the mwapi host back to mediawiki11 [puppet] - 10https://gerrit.wikimedia.org/r/767492 (https://phabricator.wikimedia.org/T302699) [12:55:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P21692 and previous config saved to /var/cache/conftool/dbconfig/20220302-125519-ladsgroup.json [12:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:14] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Accidentally unsubscribed everyone from open-glam mailing list - https://phabricator.wikimedia.org/T302816 (10Scann) **THANK YOU SO MUCH**, I can't stress enough how grateful I am for all of you solving this issue in such a timely manner. 
Here I'm sending... [13:00:22] RECOVERY - traffic_server tls process restarted on cp6016 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6016&var-layer=tls [13:10:24] PROBLEM - Disk space on centrallog1001 is CRITICAL: DISK CRITICAL - free space: /srv 34196 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops [13:10:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T300992)', diff saved to https://phabricator.wikimedia.org/P21693 and previous config saved to /var/cache/conftool/dbconfig/20220302-131024-ladsgroup.json [13:10:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [13:10:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [13:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:29] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [13:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T300992)', diff saved to https://phabricator.wikimedia.org/P21694 and previous config saved to /var/cache/conftool/dbconfig/20220302-131032-ladsgroup.json [13:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:14] RECOVERY - SSH on analytics1067.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:13:36] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10observability, and 2 others: Upgrade Kafka Risk Evaluation - https://phabricator.wikimedia.org/T302610 (10elukey) @EChetty hi! Could you add some details about what you expect to see in this task? [13:15:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T300992)', diff saved to https://phabricator.wikimedia.org/P21695 and previous config saved to /var/cache/conftool/dbconfig/20220302-131550-ladsgroup.json [13:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:54] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [13:17:06] (03CR) 10JMeybohm: [C: 03+1] Add BGP config for kubernetes20[19-22] in wikikube codfw [homer/public] - 10https://gerrit.wikimedia.org/r/767485 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [13:18:30] (03CR) 10JMeybohm: [C: 03+1] "just a nit" [puppet] - 10https://gerrit.wikimedia.org/r/767482 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [13:20:24] (03CR) 10Filippo Giunchedi: [C: 03+1] Enable profile::auto_restarts::service for smokeping [puppet] - 10https://gerrit.wikimedia.org/r/767488 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:24:16] (03PS3) 10Jbond: P:configmaster: parametrise server names [puppet] - 10https://gerrit.wikimedia.org/r/766585 (owner: 10Majavah) [13:24:59] (03CR) 10Joal: [C: 03+1] "LGTM except for a typo in commit message :) Thanks @Aqu" [puppet] - 10https://gerrit.wikimedia.org/r/767220 (https://phabricator.wikimedia.org/T300870) (owner: 10Aqu) [13:25:15] (03PS3) 10Elukey: Add kubernetes20[19-22] to wikikube codfw [puppet] - 10https://gerrit.wikimedia.org/r/767482 (https://phabricator.wikimedia.org/T302208) [13:25:29] (03CR) 10Elukey: Add kubernetes20[19-22] to wikikube codfw (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/767482 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [13:27:39] (03CR) 10Jbond: [C: 03+1] Enable profile::auto_restarts::service for puppetdb microservice [puppet] - 10https://gerrit.wikimedia.org/r/767174 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:27:42] (03CR) 10Jbond: [C: 03+2] P:configmaster: parametrise server names [puppet] - 10https://gerrit.wikimedia.org/r/766585 (owner: 10Majavah) [13:27:47] (03CR) 10Jbond: [C: 03+2] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/766585 (owner: 10Majavah) [13:30:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P21696 and previous config saved to /var/cache/conftool/dbconfig/20220302-133055-ladsgroup.json [13:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:17] (03CR) 10Joal: [C: 04-1] "Duplicate field - Asking for a reorder but this is not mandatory - the duplicated field removal is :)" [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [13:42:24] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:geoip::data::maxmind: update systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/767491 (https://phabricator.wikimedia.org/T302864) (owner: 10Jbond) [13:42:55] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:46:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P21697 and previous config saved to /var/cache/conftool/dbconfig/20220302-134600-ladsgroup.json [13:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:16] (03CR) 
10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for smokeping [puppet] - 10https://gerrit.wikimedia.org/r/767488 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:49:09] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Accidentally unsubscribed everyone from open-glam mailing list - https://phabricator.wikimedia.org/T302816 (10Ladsgroup) Glad to be of service ^^ [13:50:39] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4034.ulsfo.wmnet with OS buster [13:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:51] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4034.ulsfo.wmnet with OS buster [13:52:35] (03PS1) 10Jbond: geoip: add explicit syslog_identifier [puppet] - 10https://gerrit.wikimedia.org/r/767513 [13:54:05] (03PS9) 10Jbond: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 (owner: 10Giuseppe Lavagetto) [13:55:08] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for klaxon gunicorn webapp [puppet] - 10https://gerrit.wikimedia.org/r/767516 (https://phabricator.wikimedia.org/T135991) [13:57:48] (03CR) 10Jbond: [C: 03+2] geoip: add explicit syslog_identifier [puppet] - 10https://gerrit.wikimedia.org/r/767513 (owner: 10Jbond) [14:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220302T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. 
[14:00:14] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/767516 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:00:32] ok [14:01:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T300992)', diff saved to https://phabricator.wikimedia.org/P21698 and previous config saved to /var/cache/conftool/dbconfig/20220302-140105-ladsgroup.json [14:01:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:01:09] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [14:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T300992)', diff saved to https://phabricator.wikimedia.org/P21699 and previous config saved to /var/cache/conftool/dbconfig/20220302-140112-ladsgroup.json [14:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:14] (03CR) 10Gmodena: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/767220 (https://phabricator.wikimedia.org/T300870) (owner: 10Aqu) [14:05:03] (03CR) 10Jbond: [C: 04-1] varnish/frontend: consume etcd data for dynamic banning of requests. 
[puppet] - 10https://gerrit.wikimedia.org/r/763557 (owner: 10Giuseppe Lavagetto) [14:05:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T300992)', diff saved to https://phabricator.wikimedia.org/P21700 and previous config saved to /var/cache/conftool/dbconfig/20220302-140532-ladsgroup.json [14:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:26] (03PS1) 10Ladsgroup: ext.flaggedRevs.review: Restore tolerance when setting "disabled" prop [extensions/FlaggedRevs] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/767099 [14:13:03] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jbond) moving third party modules to /vendor/modules also makes it a bit easier to exclude these modules from CI which is a nice minor benefit [14:13:16] (03CR) 10Vgutierrez: varnish/frontend: consume etcd data for dynamic banning of requests. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763557 (owner: 10Giuseppe Lavagetto) [14:13:43] !log pool cp6013 [14:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:36] jouncebot: nowandnext [14:14:36] For the next 0 hour(s) and 45 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220302T1400) [14:14:36] In 4 hour(s) and 45 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220302T1900) [14:14:36] In 4 hour(s) and 45 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220302T1900) [14:14:47] awesome [14:14:51] (03CR) 10Ladsgroup: [C: 03+2] ext.flaggedRevs.review: Restore tolerance when setting "disabled" prop [extensions/FlaggedRevs] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/767099 (owner: 10Ladsgroup) [14:18:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 
10https://gerrit.wikimedia.org/r/767073 (owner: 10Volans) [14:18:48] (03Merged) 10jenkins-bot: ext.flaggedRevs.review: Restore tolerance when setting "disabled" prop [extensions/FlaggedRevs] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/767099 (owner: 10Ladsgroup) [14:19:02] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/767074 (owner: 10Volans) [14:20:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P21701 and previous config saved to /var/cache/conftool/dbconfig/20220302-142037-ladsgroup.json [14:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:43] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for apache/pki discovery [puppet] - 10https://gerrit.wikimedia.org/r/767520 (https://phabricator.wikimedia.org/T135991) [14:21:42] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.23/extensions/FlaggedRevs/modules/ext.flaggedRevs.review/review.js: Backport: [[gerrit:767099|ext.flaggedRevs.review: Restore tolerance when setting "disabled" prop]] (duration: 00m 52s) [14:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:12] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/767071 (owner: 10Volans) [14:24:28] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4034.ulsfo.wmnet with OS buster [14:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:40] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4034.ulsfo.wmnet with OS buster e... 
[14:24:52] grrr [14:25:16] RECOVERY - Check systemd state on durum6001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:08] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4034.ulsfo.wmnet with OS buster [14:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:25] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4034.ulsfo.wmnet with OS buster [14:27:22] !log rebalance VMs in Ganeti row A after adding new servers (and decommissioning old ones) [14:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:15] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: retry once on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/767073 (owner: 10Volans) [14:33:22] (03PS3) 10Volans: sre.hosts.provision: retry once on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/767073 [14:34:55] !log vgutierrez@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4034.ulsfo.wmnet with OS buster [14:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:07] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4034.ulsfo.wmnet with OS buster e... 
[14:35:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P21702 and previous config saved to /var/cache/conftool/dbconfig/20220302-143541-ladsgroup.json [14:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:13] (03CR) 10Volans: [C: 03+2] redfish: DellSCP, allow creation of new entities [software/spicerack] - 10https://gerrit.wikimedia.org/r/767071 (owner: 10Volans) [14:37:52] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4034.ulsfo.wmnet with OS buster [14:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:01] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4034.ulsfo.wmnet with OS buster [14:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:05] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4034.ulsfo.wmnet with OS buster [14:38:15] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4034.ulsfo.wmnet with OS buster e... 
[14:41:51] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4034.ulsfo.wmnet with OS buster [14:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:59] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4034.ulsfo.wmnet with OS buster [14:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:03] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4034.ulsfo.wmnet with OS buster [14:42:10] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4034.ulsfo.wmnet with OS buster e... 
[14:42:55] (ProbeHttpFailed) firing: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [14:43:12] (03Merged) 10jenkins-bot: redfish: DellSCP, allow creation of new entities [software/spicerack] - 10https://gerrit.wikimedia.org/r/767071 (owner: 10Volans) [14:44:44] (03PS1) 10Hashar: gerrit: use raw subject for Phabricator comments [puppet] - 10https://gerrit.wikimedia.org/r/767521 (https://phabricator.wikimedia.org/T280197) [14:47:20] RECOVERY - Check unit status of prune_old_srv_syslog_directories on centrallog2002 is OK: OK: Status of the systemd unit prune_old_srv_syslog_directories https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:48:30] RECOVERY - Check systemd state on centrallog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:36] (03PS1) 10Vgutierrez: site: Reimage cp5014 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/767522 (https://phabricator.wikimedia.org/T290005) [14:50:01] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp5014 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/767522 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:50:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T300992)', diff saved to https://phabricator.wikimedia.org/P21703 and previous config saved to /var/cache/conftool/dbconfig/20220302-145046-ladsgroup.json [14:50:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [14:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1143.eqiad.wmnet with reason: Maintenance [14:50:50] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [14:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T300992)', diff saved to https://phabricator.wikimedia.org/P21704 and previous config saved to /var/cache/conftool/dbconfig/20220302-145054-ladsgroup.json [14:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:25] (03PS3) 10Aqu: Set default Airflow concurrency limits [puppet] - 10https://gerrit.wikimedia.org/r/767220 (https://phabricator.wikimedia.org/T300870) [14:52:11] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp5014.eqsin.wmnet with OS buster [14:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:27] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5014.eqsin.wmnet with OS buster [14:54:37] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10akosiaris) >>! In T302423#7744908, @jhathaway wrote: >> On a side note, I see there is a proposal of using /vendor/modules. It seems interesting and I 've never tried it, I am wondering what t... 
[14:55:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T300992)', diff saved to https://phabricator.wikimedia.org/P21705 and previous config saved to /var/cache/conftool/dbconfig/20220302-145510-ladsgroup.json [14:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:38] (03CR) 10Ottomata: [C: 03+2] Set default Airflow concurrency limits [puppet] - 10https://gerrit.wikimedia.org/r/767220 (https://phabricator.wikimedia.org/T300870) (owner: 10Aqu) [14:58:24] (03PS1) 10Urbanecm: enwiki: Deploy Growth features to 100% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767525 (https://phabricator.wikimedia.org/T302846) [15:00:32] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/767520 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:01:18] (03PS1) 10Ottomata: Ah, this is the wrong file. My fault! This is for the search's airflow 1 deployment. [puppet] - 10https://gerrit.wikimedia.org/r/767100 [15:01:26] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Ah, this is the wrong file. My fault! This is for the search's airflow 1 deployment. 
[puppet] - 10https://gerrit.wikimedia.org/r/767100 (owner: 10Ottomata) [15:06:59] (03PS1) 10Ottomata: Set default Airflow concurrency limits for an- airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/767527 (https://phabricator.wikimedia.org/T300870) [15:10:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P21706 and previous config saved to /var/cache/conftool/dbconfig/20220302-151015-ladsgroup.json [15:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:08] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34040/console" [puppet] - 10https://gerrit.wikimedia.org/r/767527 (https://phabricator.wikimedia.org/T300870) (owner: 10Ottomata) [15:13:10] (03PS1) 10Hnowlan: maps: enable slow query log in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/767529 (https://phabricator.wikimedia.org/T302862) [15:13:20] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Set default Airflow concurrency limits for an- airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/767527 (https://phabricator.wikimedia.org/T300870) (owner: 10Ottomata) [15:13:43] (03PS1) 10Ssingh: icinga: add ssingh to cgi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/767530 [15:14:22] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34041/console" [puppet] - 10https://gerrit.wikimedia.org/r/767529 (https://phabricator.wikimedia.org/T302862) (owner: 10Hnowlan) [15:17:07] (03CR) 10Jgiannelos: [C: 03+1] maps: enable slow query log in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/767529 (https://phabricator.wikimedia.org/T302862) (owner: 10Hnowlan) [15:18:11] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] maps: enable slow query log in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/767529 
(https://phabricator.wikimedia.org/T302862) (owner: 10Hnowlan) [15:18:30] o/ I'm looking to deploy a Beta-Cluster-only change. There are no deployments going on at the moment, right? [15:18:49] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5014.eqsin.wmnet with reason: host reimage [15:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:52] ^ urbanecm, Lucas_WMDE: You're marked as the deployers for the last window [15:18:58] I'm also around if needed [15:19:04] phuedx: yeah, go ahead [15:19:12] AFAIK we didn’t do anything during the window, but I saw something from Amir1 IIRC [15:19:15] (probably done by now) [15:19:24] yeah, done [15:19:29] Great. Thanks! [15:19:49] ok :) [15:23:13] (03CR) 10Vgutierrez: [C: 03+1] icinga: add ssingh to cgi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/767530 (owner: 10Ssingh) [15:23:31] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5014.eqsin.wmnet with reason: host reimage [15:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:01] (03CR) 10Phuedx: [C: 03+2] Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: 10AGueyte) [15:24:30] (03PS1) 10Jbond: pontoon: add profile::base::pontoon to list of classes [puppet] - 10https://gerrit.wikimedia.org/r/767533 [15:24:42] (03Merged) 10jenkins-bot: Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: 10AGueyte) [15:25:01] (03CR) 10jerkins-bot: [V: 04-1] pontoon: add profile::base::pontoon to list of classes [puppet] - 10https://gerrit.wikimedia.org/r/767533 (owner: 10Jbond) [15:25:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P21707 and previous 
config saved to /var/cache/conftool/dbconfig/20220302-152519-ladsgroup.json [15:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:47] (03CR) 10WMDE-Fisch: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767498 (https://phabricator.wikimedia.org/T280024) (owner: 10WMDE-Fisch) [15:26:54] (03CR) 10WMDE-Fisch: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767499 (https://phabricator.wikimedia.org/T280023) (owner: 10WMDE-Fisch) [15:27:02] (03CR) 10WMDE-Fisch: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767508 (https://phabricator.wikimedia.org/T286990) (owner: 10WMDE-Fisch) [15:27:10] (03CR) 10WMDE-Fisch: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767510 (https://phabricator.wikimedia.org/T286990) (owner: 10WMDE-Fisch) [15:27:19] (03CR) 10WMDE-Fisch: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/767512 (https://phabricator.wikimedia.org/T286991) (owner: 10WMDE-Fisch) [15:27:46] (03PS1) 10Muehlenhoff: envoy-hot-restart: Switch shebang to /usr/bin/python3 [puppet] - 10https://gerrit.wikimedia.org/r/767536 [15:28:43] The Beta Cluster config update Jenkins job has run [15:28:50] I'll pull the change onto the deployment host [15:28:58] sounds good :) [15:32:58] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/767520 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:35:50] Done :) [15:40:15] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for apache/pki discovery [puppet] - 10https://gerrit.wikimedia.org/r/767520 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:40:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T300992)', diff saved to https://phabricator.wikimedia.org/P21708 and previous config saved to /var/cache/conftool/dbconfig/20220302-154026-ladsgroup.json [15:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [15:40:30] (03PS5) 10Bking: elastic: prevent rundir from deletion [puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) [15:40:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [15:40:30] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [15:40:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:32] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T300992)', diff saved to https://phabricator.wikimedia.org/P21709 and previous config saved to /var/cache/conftool/dbconfig/20220302-154039-ladsgroup.json [15:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:49] (03CR) 10Bking: elastic: prevent rundir from deletion (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) (owner: 10Bking) [15:41:20] (03CR) 10jerkins-bot: [V: 04-1] elastic: prevent rundir from deletion [puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) (owner: 10Bking) [15:41:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [15:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:58] (03PS6) 10Bking: elastic: prevent rundir from deletion [puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) [15:45:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [15:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:06] !log pool cp5014 running HAProxy as TLS termination layer - T290005 T271421 [15:47:07] (03PS1) 10Jbond: O:idp: correctly escape regex dot in service urls [puppet] - 10https://gerrit.wikimedia.org/r/767540 [15:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:10] 
T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:47:10] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [15:48:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T300992)', diff saved to https://phabricator.wikimedia.org/P21710 and previous config saved to /var/cache/conftool/dbconfig/20220302-154807-ladsgroup.json [15:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:10] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [15:49:25] (03CR) 10Jbond: [C: 03+2] O:idp: correctly escape regex dot in service urls [puppet] - 10https://gerrit.wikimedia.org/r/767540 (owner: 10Jbond) [15:49:32] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5014.eqsin.wmnet with OS buster [15:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:44] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp5014.eqsin.wmnet with OS buster c... 
[15:52:10] (03PS1) 10Vgutierrez: site: Reimage cp3061 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/767542 (https://phabricator.wikimedia.org/T290005) [15:52:49] (03CR) 10Jbond: "this fixed WMF-01-015" [puppet] - 10https://gerrit.wikimedia.org/r/767540 (owner: 10Jbond) [15:55:14] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp3061 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/767542 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:56:31] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp3061.esams.wmnet with OS buster [15:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:43] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3061.esams.wmnet with OS buster [16:03:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P21711 and previous config saved to /var/cache/conftool/dbconfig/20220302-160312-ladsgroup.json [16:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:39] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:08:27] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:16:31] (03CR) 10Dzahn: [C: 03+2] deployment-prep: change the mwapi host back to mediawiki11 [puppet] - 
10https://gerrit.wikimedia.org/r/767492 (https://phabricator.wikimedia.org/T302699) (owner: 10Zabe) [16:18:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P21713 and previous config saved to /var/cache/conftool/dbconfig/20220302-161817-ladsgroup.json [16:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:16] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3061.esams.wmnet with reason: host reimage [16:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:47] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:36] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3061.esams.wmnet with reason: host reimage [16:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:06] (03CR) 10RLazarus: [C: 03+1] add link to status page (031 comment) [software/klaxon] - 10https://gerrit.wikimedia.org/r/766839 (owner: 10CDanis) [16:30:49] (03CR) 10RLazarus: [C: 03+2] kubernetes: Upgrade default envoy version to 1.15.5 [puppet] - 10https://gerrit.wikimedia.org/r/766840 (https://phabricator.wikimedia.org/T300324) (owner: 10RLazarus) [16:33:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T300992)', diff saved to https://phabricator.wikimedia.org/P21714 and previous config saved to /var/cache/conftool/dbconfig/20220302-163322-ladsgroup.json [16:33:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [16:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:25] !log ladsgroup@cumin1001 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [16:33:28] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [16:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T300992)', diff saved to https://phabricator.wikimedia.org/P21715 and previous config saved to /var/cache/conftool/dbconfig/20220302-163329-ladsgroup.json [16:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:10] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) (owner: 10Bking) [16:45:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T300992)', diff saved to https://phabricator.wikimedia.org/P21716 and previous config saved to /var/cache/conftool/dbconfig/20220302-164550-ladsgroup.json [16:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:54] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [16:50:40] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3061.esams.wmnet with OS buster [16:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:51] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started 
by vgutierrez@cumin1001 for host cp3061.esams.wmnet with OS buster c... [16:51:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) (owner: 10Bking) [16:51:42] !log pool cp3061 running HAProxy as TLS termination layer - T290005 T271421 [16:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:45] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [16:51:46] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [16:53:47] (03PS1) 10Ladsgroup: auto_schema: Add support for --check in running schema changes [software] - 10https://gerrit.wikimedia.org/r/767554 (https://phabricator.wikimedia.org/T301896) [16:54:30] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [17:00:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P21717 and previous config saved to /var/cache/conftool/dbconfig/20220302-170055-ladsgroup.json [17:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:51] (03PS2) 10Ladsgroup: auto_schema: Add support for --check in running schema changes [software] - 10https://gerrit.wikimedia.org/r/767554 (https://phabricator.wikimedia.org/T301896) [17:02:45] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:11:51] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[17:13:12] (03PS1) 10STran: Revert "Update Event Stream for IPInfo events" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767101 [17:13:48] (03CR) 10Tchanders: [C: 03+1] Revert "Update Event Stream for IPInfo events" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767101 (owner: 10STran) [17:16:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P21718 and previous config saved to /var/cache/conftool/dbconfig/20220302-171559-ladsgroup.json [17:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:20] (03CR) 10Phuedx: Update Event Stream for IPInfo events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: 10AGueyte) [17:21:25] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:22:12] (03PS2) 10CDanis: add link to status page [software/klaxon] - 10https://gerrit.wikimedia.org/r/766839 [17:22:49] (03CR) 10CDanis: [C: 03+2] add link to status page [software/klaxon] - 10https://gerrit.wikimedia.org/r/766839 (owner: 10CDanis) [17:23:52] (03Merged) 10jenkins-bot: add link to status page [software/klaxon] - 10https://gerrit.wikimedia.org/r/766839 (owner: 10CDanis) [17:27:22] https://gerrit.wikimedia.org/r/756635 accidentally overrode the event streams configuration //for the Beta Cluster only//. I merged it and so accept responsibility. 
The revert is about to be merged and the deployment host updated [17:30:06] (03PS39) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [17:30:09] (03CR) 10Jbond: "done thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [17:31:03] (03CR) 10Tchanders: [C: 03+2] Revert "Update Event Stream for IPInfo events" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767101 (owner: 10STran) [17:31:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T300992)', diff saved to https://phabricator.wikimedia.org/P21719 and previous config saved to /var/cache/conftool/dbconfig/20220302-173104-ladsgroup.json [17:31:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [17:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [17:31:08] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [17:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T300992)', diff saved to https://phabricator.wikimedia.org/P21720 and previous config saved to /var/cache/conftool/dbconfig/20220302-173112-ladsgroup.json [17:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:01] (03Merged) 10jenkins-bot: Revert "Update Event Stream for IPInfo events" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767101 (owner: 10STran) [17:34:43] 
Who's syncing that config patch? [17:34:54] phuedx: ! [17:35:03] Oh it's labs only [17:35:13] RhinosF1: It's labs only so I presumed no sync [17:35:27] unfortunately no stickers will be awarded for breaking and fixing beta [17:35:34] phuedx: I missed the -labs [17:35:54] * bd808 can make stickers if folks will fix beta ;) [17:36:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T300992)', diff saved to https://phabricator.wikimedia.org/P21721 and previous config saved to /var/cache/conftool/dbconfig/20220302-173631-ladsgroup.json [17:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:34] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [17:37:05] bd808: I've fixed it multiple times :-P [17:38:13] (03CR) 10Jbond: [C: 03+2] sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [17:38:20] taavi: indeed! And that has been much appreciated by me. At this point I think you deserve a "real" 'I broke Wikipedia...' sticker [17:38:30] (03PS1) 10Tchanders: Define IPInfo event stream on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767558 (https://phabricator.wikimedia.org/T296415) [17:38:30] phuedx: does anything need to be done as won't /srv/mediawiki-staging be outdated [17:38:48] Oh you said doing [17:38:59] I guess I should go back to cooking [17:39:14] (03CR) 10Paladox: [C: 03+1] gerrit: use raw subject for Phabricator comments [puppet] - 10https://gerrit.wikimedia.org/r/767521 (https://phabricator.wikimedia.org/T280197) (owner: 10Hashar) [17:39:17] * taavi gets hopeful for an in-person hackathon one day [17:40:05] (03CR) 10Volans: [C: 03+1] "Ship it!"
[software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [17:40:14] RhinosF1: Just to confirm: Tran has updated the deployment host [17:40:43] bd808: About those stickers... ;) [17:40:48] phuedx: good [17:42:20] (03CR) 10Jbond: [C: 03+2] reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [17:42:43] (03CR) 10TsepoThoabala: [C: 03+1] Define IPInfo event stream on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767558 (https://phabricator.wikimedia.org/T296415) (owner: 10Tchanders) [17:42:55] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:49:19] (03CR) 10STran: [C: 03+2] Define IPInfo event stream on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767558 (https://phabricator.wikimedia.org/T296415) (owner: 10Tchanders) [17:49:37] We reverted a config patch and are now deploying the correct patch to beta https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/767558 [17:49:59] (03Merged) 10jenkins-bot: Define IPInfo event stream on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767558 (https://phabricator.wikimedia.org/T296415) (owner: 10Tchanders) [17:51:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P21722 and previous config saved to /var/cache/conftool/dbconfig/20220302-175136-ladsgroup.json [17:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:01] PROBLEM - SSH on kubernetes2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:04:04] (03CR) 10Dzahn: "thank you for merging" [puppet] - 10https://gerrit.wikimedia.org/r/762897 (owner: 10Majavah) [18:04:25] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:06:19] (03CR) 10Dzahn: [C: 03+2] gerrit: use raw subject for Phabricator comments [puppet] - 10https://gerrit.wikimedia.org/r/767521 (https://phabricator.wikimedia.org/T280197) (owner: 10Hashar) [18:06:37] mutante: hopefully that one will not break too many things. Thx! [18:06:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P21723 and previous config saved to /var/cache/conftool/dbconfig/20220302-180640-ladsgroup.json [18:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:56] worse thing some phab comments look slightly off [18:07:08] hashar: ACK, let's test the way you described it, by uploading a patch containing double quotes [18:07:24] it's been applied ..now. 
[18:07:31] I might have one in test/gerrit-ping [18:07:41] but I gotta focus on my current meeting, will test later this evening :] [18:08:30] (03PS1) 10Dzahn: ""double quotes"" are 'fun' "fun" ''fun'' \fun [puppet] - 10https://gerrit.wikimedia.org/r/767560 (https://phabricator.wikimedia.org/T281552) [18:09:09] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/apertium: apply [18:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:23] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/apertium: apply [18:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:24] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [18:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:42] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [18:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:43] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [18:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:00] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [18:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:01] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [18:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:27] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [18:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:28] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [18:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:53] !log rzl@deploy1002 helmfile
[staging] DONE helmfile.d/services/eventgate-analytics: apply [18:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:54] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [18:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:18] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [18:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:20] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [18:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:43] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [18:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:44] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [18:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:07] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [18:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:08] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [18:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:33] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [18:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:34] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [18:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:51] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [18:12:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:52] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply [18:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:11] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [18:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:12] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [18:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:31] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [18:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:32] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:24] (03CR) 10Dzahn: "looks good here https://phabricator.wikimedia.org/T281552#7748235" [puppet] - 10https://gerrit.wikimedia.org/r/767560 (https://phabricator.wikimedia.org/T281552) (owner: 10Dzahn) [18:14:32] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:33] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [18:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:35] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T281552#7748235" [puppet] - 10https://gerrit.wikimedia.org/r/767521 (https://phabricator.wikimedia.org/T280197) (owner: 10Hashar) [18:14:48] (03Abandoned) 10Dzahn: ""double quotes"" are 'fun' "fun" ''fun'' \fun [puppet] - 10https://gerrit.wikimedia.org/r/767560 (https://phabricator.wikimedia.org/T281552) (owner: 10Dzahn) [18:14:53] !log rzl@deploy1002 helmfile [staging] DONE 
helmfile.d/services/shellbox-media: apply [18:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:15] mutante: ahh thanks for the test. That looks correct :] [18:15:48] hashar: :) yep, thanks for confirming [18:16:01] 10SRE, 10Discovery: Test network optimizations in RELForge - https://phabricator.wikimedia.org/T301683 (10bking) 05Open→03Declined [18:16:08] 10SRE, 10Discovery, 10Infrastructure-Foundations, 10netops: Speed up network connections for Elastic hosts - https://phabricator.wikimedia.org/T301577 (10bking) [18:16:15] I am marking the task solved again [18:16:28] +1 [18:16:31] 10SRE, 10Discovery: Test network optimizations in RELForge - https://phabricator.wikimedia.org/T301683 (10bking) Closing for now, will revisit when we have more concrete goals [18:21:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T300992)', diff saved to https://phabricator.wikimedia.org/P21724 and previous config saved to /var/cache/conftool/dbconfig/20220302-182145-ladsgroup.json [18:21:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [18:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [18:21:49] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [18:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T300992)', diff saved to https://phabricator.wikimedia.org/P21725 and previous config saved to /var/cache/conftool/dbconfig/20220302-182153-ladsgroup.json [18:21:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:11] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [18:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:58] (03PS1) 10Cathal Mooney: Adding includes for Netbox-generated zone files for eqiad evpn lb [dns] - 10https://gerrit.wikimedia.org/r/767562 (https://phabricator.wikimedia.org/T299758) [18:28:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T300992)', diff saved to https://phabricator.wikimedia.org/P21726 and previous config saved to /var/cache/conftool/dbconfig/20220302-182809-ladsgroup.json [18:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:14] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [18:28:34] (03PS2) 10MewOphaswongse: GLAM event: Update wgGECampaigns and wgGECampaignTopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766869 (https://phabricator.wikimedia.org/T301029) [18:28:48] (03CR) 10jerkins-bot: [V: 04-1] Adding includes for Netbox-generated zone files for eqiad evpn lb [dns] - 10https://gerrit.wikimedia.org/r/767562 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [18:30:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:49] (03PS2) 10Cathal Mooney: Adding includes for Netbox-generated zone files for eqiad evpn lb [dns] - 10https://gerrit.wikimedia.org/r/767562 (https://phabricator.wikimedia.org/T299758) [18:42:55] (ProbeHttpFailed) firing: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [18:43:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P21727 and previous config saved to /var/cache/conftool/dbconfig/20220302-184314-ladsgroup.json [18:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10Jclark-ctr) [18:45:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10Jclark-ctr) [18:46:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10Jclark-ctr) [18:47:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10Jclark-ctr) a:05Jclark-ctr→03RobH [18:48:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [18:49:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [18:50:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [18:56:33] RECOVERY - SSH on kubernetes2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:58:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P21728 and previous config saved to 
/var/cache/conftool/dbconfig/20220302-185819-ladsgroup.json [18:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] brennen and dduvall: It is that lovely time of the day again! You are hereby commanded to deploy Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220302T1900). [19:00:05] brennen and dduvall: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220302T1900). [19:01:52] o/ [19:03:54] (in a meeting, rolling forward shortly) [19:08:56] brennen: o/ howdy [19:10:20] !log 1.38.0-wmf.24 train (T300200): no current blockers; proceeding to group1 [19:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:23] T300200: 1.38.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T300200 [19:10:42] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10Jclark-ctr) [19:11:32] (03PS1) 10Brennen Bearnes: group1 wikis to 1.38.0-wmf.24 refs T300200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767569 [19:11:34] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.38.0-wmf.24 refs T300200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767569 (owner: 10Brennen Bearnes) [19:12:18] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.24 refs T300200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767569 (owner: 10Brennen Bearnes) [19:13:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T300992)', diff saved to https://phabricator.wikimedia.org/P21729 and previous config saved to /var/cache/conftool/dbconfig/20220302-191323-ladsgroup.json [19:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:27] T300992: Add linter_template and linter_tag columns to the Linter table - 
https://phabricator.wikimedia.org/T300992 [19:13:45] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.24 refs T300200 [19:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:36] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.24 refs T300200 (duration: 00m 50s) [19:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:56] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10Jclark-ctr) [19:20:05] (03PS1) 10Cathal Mooney: Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) [19:20:39] (03CR) 10jerkins-bot: [V: 04-1] Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [19:22:28] brennen: i'm seeing a new error `Class 'ApiFeatureUsageQueryEngineElastica' not found` [19:22:35] i'll file a task [19:22:43] (03PS2) 10Cathal Mooney: Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) [19:23:18] (03CR) 10jerkins-bot: [V: 04-1] Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [19:24:00] dduvall: just filed [19:24:05] sorry, missed the ping a second ago [19:24:27] PROBLEM - Ensure local MW versions match expected deployment on deploy2002 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:24:28] ah ok [19:24:32] T302907 - worth a rollback, you think? fairly low level but it's ticking upwards. 
[19:24:33] T302907: Error: Class 'ApiFeatureUsageQueryEngineElastica' not found - https://phabricator.wikimedia.org/T302907 [19:25:05] PROBLEM - Ensure local MW versions match expected deployment on mw2318 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:11] PROBLEM - Ensure local MW versions match expected deployment on mw1379 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:11] PROBLEM - Ensure local MW versions match expected deployment on snapshot1008 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:12] hmm.. that stuff again! [19:25:21] PROBLEM - Ensure local MW versions match expected deployment on mw1339 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:21] PROBLEM - Ensure local MW versions match expected deployment on mwdebug1001 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:21] PROBLEM - Ensure local MW versions match expected deployment on wtp1038 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:21] PROBLEM - Ensure local MW versions match expected deployment on mw2295 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:21] PROBLEM - Ensure local MW versions match expected deployment on mw2321 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:21] PROBLEM - Ensure local MW versions match expected deployment on mw2388 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:21] PROBLEM - Ensure local MW versions match expected deployment on mw2366 is CRITICAL: CRITICAL: 528 mismatched 
wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:22] PROBLEM - Ensure local MW versions match expected deployment on mw2389 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:22] PROBLEM - Ensure local MW versions match expected deployment on mw2258 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:23] PROBLEM - Ensure local MW versions match expected deployment on mw2261 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:31] PROBLEM - Ensure local MW versions match expected deployment on labweb1001 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:33] PROBLEM - Ensure local MW versions match expected deployment on mw1382 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:33] PROBLEM - Ensure local MW versions match expected deployment on mw1380 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:36] Should clear faster this time. 
:-/ [19:25:39] PROBLEM - Ensure local MW versions match expected deployment on snapshot1012 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:39] PROBLEM - Ensure local MW versions match expected deployment on mw1448 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:39] PROBLEM - Ensure local MW versions match expected deployment on mw1418 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:41] PROBLEM - Ensure local MW versions match expected deployment on wtp1026 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:47] PROBLEM - Ensure local MW versions match expected deployment on mw1396 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:47] PROBLEM - Ensure local MW versions match expected deployment on mw1407 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:47] PROBLEM - Ensure local MW versions match expected deployment on mw2259 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:48] PROBLEM - Ensure local MW versions match expected deployment on mw2289 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:51] PROBLEM - Ensure local MW versions match expected deployment on mw1323 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:53] PROBLEM - Ensure local MW versions match expected deployment on mw2358 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:55] PROBLEM - Ensure local MW versions match expected deployment on mw1427 is CRITICAL: 
CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:55] PROBLEM - Ensure local MW versions match expected deployment on mw1431 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:57] PROBLEM - Ensure local MW versions match expected deployment on mw2411 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:57] PROBLEM - Ensure local MW versions match expected deployment on parse2004 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:57] PROBLEM - Ensure local MW versions match expected deployment on mw2323 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:58] PROBLEM - Ensure local MW versions match expected deployment on mw2351 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:58] PROBLEM - Ensure local MW versions match expected deployment on mw2352 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:58] PROBLEM - Ensure local MW versions match expected deployment on mw2273 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:26:03] PROBLEM - Ensure local MW versions match expected deployment on wtp1025 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:26:03] PROBLEM - Ensure local MW versions match expected deployment on mw2378 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:26:03] PROBLEM - Ensure local MW versions match expected deployment on mw2402 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:26:05] 
PROBLEM - Ensure local MW versions match expected deployment on mw1415 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:26:05] PROBLEM - Ensure local MW versions match expected deployment on parse2009 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:26:11] PROBLEM - Ensure local MW versions match expected deployment on mw2296 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:26:11] PROBLEM - Ensure local MW versions match expected deployment on mw2300 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:26:11] PROBLEM - Ensure local MW versions match expected deployment on mw2262 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:26:13] PROBLEM - Ensure local MW versions match expected deployment on mw2333 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:26:15] PROBLEM - Ensure local MW versions match expected deployment on mw2399 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:26:23] brennen: I'll have a fix for that in a few moments [19:26:41] taavi: cool, ty [19:27:15] PROBLEM - Ensure local MW versions match expected deployment on mw1385 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:15] PROBLEM - Ensure local MW versions match expected deployment on mw1371 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:15] PROBLEM - Ensure local MW versions match expected deployment on mw1332 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:21] PROBLEM - Ensure 
local MW versions match expected deployment on parse2002 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:21] PROBLEM - Ensure local MW versions match expected deployment on mw2255 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:22] PROBLEM - Ensure local MW versions match expected deployment on mw2264 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:25] PROBLEM - Ensure local MW versions match expected deployment on mw1317 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:35] PROBLEM - Ensure local MW versions match expected deployment on mw1438 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:35] PROBLEM - Ensure local MW versions match expected deployment on mw1454 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:35] PROBLEM - Ensure local MW versions match expected deployment on mw1375 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:37] PROBLEM - Ensure local MW versions match expected deployment on mw2252 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:43] PROBLEM - Ensure local MW versions match expected deployment on mw2326 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:43] PROBLEM - Ensure local MW versions match expected deployment on mw2304 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:49] PROBLEM - Ensure local MW versions match expected deployment on mw1307 is CRITICAL: CRITICAL: 528 mismatched wikiversions 
https://wikitech.wikimedia.org/wiki/Application_servers [19:27:49] PROBLEM - Ensure local MW versions match expected deployment on mw1369 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:49] PROBLEM - Ensure local MW versions match expected deployment on wtp1029 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:49] PROBLEM - Ensure local MW versions match expected deployment on mw2357 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:49] PROBLEM - Ensure local MW versions match expected deployment on mw2383 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:55] PROBLEM - Ensure local MW versions match expected deployment on mw1321 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:55] PROBLEM - Ensure local MW versions match expected deployment on wtp1039 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:27:55] PROBLEM - Ensure local MW versions match expected deployment on wtp1033 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:03] PROBLEM - Ensure local MW versions match expected deployment on mw1408 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:03] PROBLEM - Ensure local MW versions match expected deployment on mw1433 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:03] PROBLEM - Ensure local MW versions match expected deployment on mw1393 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:03] PROBLEM - Ensure local MW versions 
match expected deployment on mw1423 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:03] PROBLEM - Ensure local MW versions match expected deployment on mw1424 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:03] PROBLEM - Ensure local MW versions match expected deployment on wtp1036 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:04] PROBLEM - Ensure local MW versions match expected deployment on mw1370 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:04] PROBLEM - Ensure local MW versions match expected deployment on mw1334 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:07] PROBLEM - Ensure local MW versions match expected deployment on mw1333 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:07] PROBLEM - Ensure local MW versions match expected deployment on mw1318 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:08] PROBLEM - Ensure local MW versions match expected deployment on mw1358 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:08] PROBLEM - Ensure local MW versions match expected deployment on mw2327 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:08] PROBLEM - Ensure local MW versions match expected deployment on mw2369 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:08] PROBLEM - Ensure local MW versions match expected deployment on mw2338 is CRITICAL: CRITICAL: 528 mismatched wikiversions 
https://wikitech.wikimedia.org/wiki/Application_servers [19:28:08] PROBLEM - Ensure local MW versions match expected deployment on mw2391 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:08] PROBLEM - Ensure local MW versions match expected deployment on mw2410 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:09] PROBLEM - Ensure local MW versions match expected deployment on mw2266 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:11] PROBLEM - Ensure local MW versions match expected deployment on mw1443 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:11] PROBLEM - Ensure local MW versions match expected deployment on mw1455 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:11] PROBLEM - Ensure local MW versions match expected deployment on mw1377 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:11] PROBLEM - Ensure local MW versions match expected deployment on mw1376 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:11] PROBLEM - Ensure local MW versions match expected deployment on mw2301 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:12] PROBLEM - Ensure local MW versions match expected deployment on mw2376 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:12] PROBLEM - Ensure local MW versions match expected deployment on mw2269 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:15] PROBLEM - Ensure local MW versions match 
expected deployment on mw1434 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:15] PROBLEM - Ensure local MW versions match expected deployment on mw1450 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:15] PROBLEM - Ensure local MW versions match expected deployment on mw1456 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:15] PROBLEM - Ensure local MW versions match expected deployment on mw1439 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:15] PROBLEM - Ensure local MW versions match expected deployment on mw1419 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:16] PROBLEM - Ensure local MW versions match expected deployment on mw2294 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:16] PROBLEM - Ensure local MW versions match expected deployment on mw2401 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:16] PROBLEM - Ensure local MW versions match expected deployment on parse2007 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:17] PROBLEM - Ensure local MW versions match expected deployment on mw2373 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:17] PROBLEM - Ensure local MW versions match expected deployment on parse2010 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:18] PROBLEM - Ensure local MW versions match expected deployment on mw2272 is CRITICAL: CRITICAL: 528 mismatched wikiversions 
https://wikitech.wikimedia.org/wiki/Application_servers [19:28:18] PROBLEM - Ensure local MW versions match expected deployment on mw2288 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:19] PROBLEM - Ensure local MW versions match expected deployment on mwdebug2002 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:19] PROBLEM - Ensure local MW versions match expected deployment on mw1446 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:20] PROBLEM - Ensure local MW versions match expected deployment on mw1309 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:20] PROBLEM - Ensure local MW versions match expected deployment on mw2395 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:21] PROBLEM - Ensure local MW versions match expected deployment on wtp1041 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:21] PROBLEM - Ensure local MW versions match expected deployment on wtp1040 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:22] PROBLEM - Ensure local MW versions match expected deployment on mw2380 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:22] PROBLEM - Ensure local MW versions match expected deployment on mw2406 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:23] PROBLEM - Ensure local MW versions match expected deployment on mw2375 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:23] PROBLEM - Ensure local MW versions 
match expected deployment on mw2387 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:24] PROBLEM - Ensure local MW versions match expected deployment on parse2003 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:24] PROBLEM - Ensure local MW versions match expected deployment on wtp1047 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:25] PROBLEM - Ensure local MW versions match expected deployment on mw2291 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:25] PROBLEM - Ensure local MW versions match expected deployment on mw2356 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:26] PROBLEM - Ensure local MW versions match expected deployment on mw2283 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:31] PROBLEM - Ensure local MW versions match expected deployment on mw2270 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:38] PROBLEM - Ensure local MW versions match expected deployment on wtp1045 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:38] PROBLEM - Ensure local MW versions match expected deployment on mw1304 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:43] PROBLEM - Ensure local MW versions match expected deployment on mw1366 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:43] PROBLEM - Ensure local MW versions match expected deployment on mw1337 is CRITICAL: CRITICAL: 528 mismatched wikiversions 
https://wikitech.wikimedia.org/wiki/Application_servers [19:28:43] PROBLEM - Ensure local MW versions match expected deployment on mw2308 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:43] PROBLEM - Ensure local MW versions match expected deployment on mw2271 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:43] PROBLEM - Ensure local MW versions match expected deployment on mw2267 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:45] PROBLEM - Ensure local MW versions match expected deployment on mw2355 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:45] PROBLEM - Ensure local MW versions match expected deployment on mw2372 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:45] PROBLEM - Ensure local MW versions match expected deployment on mw2398 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:45] PROBLEM - Ensure local MW versions match expected deployment on mw2408 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:45] PROBLEM - Ensure local MW versions match expected deployment on mw2409 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:47] PROBLEM - Ensure local MW versions match expected deployment on mw1451 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:51] PROBLEM - Ensure local MW versions match expected deployment on mw1322 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:53] PROBLEM - Ensure local MW versions match 
expected deployment on parse2005 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:53] PROBLEM - Ensure local MW versions match expected deployment on mw1414 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:53] PROBLEM - Ensure local MW versions match expected deployment on mw1406 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:53] PROBLEM - Ensure local MW versions match expected deployment on mw1357 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:53] PROBLEM - Ensure local MW versions match expected deployment on mw2370 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:28:58] PROBLEM - Ensure local MW versions match expected deployment on mw2400 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:01] PROBLEM - Ensure local MW versions match expected deployment on mw2407 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:05] PROBLEM - Ensure local MW versions match expected deployment on wtp1034 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:15] PROBLEM - Ensure local MW versions match expected deployment on mw1417 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:15] PROBLEM - Ensure local MW versions match expected deployment on mw1435 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:15] PROBLEM - Ensure local MW versions match expected deployment on mw1312 is CRITICAL: CRITICAL: 528 mismatched wikiversions 
https://wikitech.wikimedia.org/wiki/Application_servers [19:29:17] PROBLEM - Ensure local MW versions match expected deployment on mw1345 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:19] PROBLEM - Ensure local MW versions match expected deployment on mw1308 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:19] PROBLEM - Ensure local MW versions match expected deployment on mw1356 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:19] PROBLEM - Ensure local MW versions match expected deployment on mw1342 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:23] PROBLEM - Ensure local MW versions match expected deployment on mw1359 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:23] PROBLEM - Ensure local MW versions match expected deployment on mw1413 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:23] PROBLEM - Ensure local MW versions match expected deployment on mw1421 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:23] PROBLEM - Ensure local MW versions match expected deployment on mw1437 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:27] brennen: Did the sync-apaches hang? 
[19:29:27] PROBLEM - Ensure local MW versions match expected deployment on mw1425 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:28] PROBLEM - Ensure local MW versions match expected deployment on mw1373 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:28] PROBLEM - Ensure local MW versions match expected deployment on mw1348 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:28] PROBLEM - Ensure local MW versions match expected deployment on mw1441 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:29] PROBLEM - Ensure local MW versions match expected deployment on mw2316 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:29] PROBLEM - Ensure local MW versions match expected deployment on mw2319 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:29] PROBLEM - Ensure local MW versions match expected deployment on mw2386 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:29] PROBLEM - Ensure local MW versions match expected deployment on mw2279 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:31] PROBLEM - Ensure local MW versions match expected deployment on mw1311 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:31] PROBLEM - Ensure local MW versions match expected deployment on mw2354 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:31] PROBLEM - Ensure local MW versions match expected deployment on mw2397 is CRITICAL: CRITICAL: 528 
mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:33] PROBLEM - Ensure local MW versions match expected deployment on mw1326 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:33] PROBLEM - Ensure local MW versions match expected deployment on mw1316 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:35] PROBLEM - Ensure local MW versions match expected deployment on mw1368 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:35] PROBLEM - Ensure local MW versions match expected deployment on mwmaint2002 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:35] PROBLEM - Ensure local MW versions match expected deployment on mw1395 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:35] PROBLEM - Ensure local MW versions match expected deployment on mw1411 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:35] PROBLEM - Ensure local MW versions match expected deployment on mw1399 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:35] PROBLEM - Ensure local MW versions match expected deployment on mw1447 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:35] PROBLEM - Ensure local MW versions match expected deployment on mw1403 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:36] PROBLEM - Ensure local MW versions match expected deployment on mw1392 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:36] PROBLEM - 
Ensure local MW versions match expected deployment on mw1372 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:37] PROBLEM - Ensure local MW versions match expected deployment on snapshot1009 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:37] PROBLEM - Ensure local MW versions match expected deployment on mw1453 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:38] PROBLEM - Ensure local MW versions match expected deployment on mw1330 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:38] PROBLEM - Ensure local MW versions match expected deployment on mw2381 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:39] PROBLEM - Ensure local MW versions match expected deployment on mw2396 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:39] PROBLEM - Ensure local MW versions match expected deployment on mw1388 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:40] PROBLEM - Ensure local MW versions match expected deployment on wtp1028 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:41] PROBLEM - Ensure local MW versions match expected deployment on mw1338 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:41] PROBLEM - Ensure local MW versions match expected deployment on parse2011 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:41] PROBLEM - Ensure local MW versions match expected deployment on mw1422 is CRITICAL: CRITICAL: 528 mismatched 
wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:42] o_O [19:29:42] PROBLEM - Ensure local MW versions match expected deployment on mw1440 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:43] PROBLEM - Ensure local MW versions match expected deployment on mw1367 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:43] PROBLEM - Ensure local MW versions match expected deployment on wtp1031 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:43] PROBLEM - Ensure local MW versions match expected deployment on mw1306 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:44] PROBLEM - Ensure local MW versions match expected deployment on mw2313 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:44] PROBLEM - Ensure local MW versions match expected deployment on mw2309 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:45] PROBLEM - Ensure local MW versions match expected deployment on mw2336 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:51] PROBLEM - Ensure local MW versions match expected deployment on mw1432 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:51] PROBLEM - Ensure local MW versions match expected deployment on mw2293 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:51] PROBLEM - Ensure local MW versions match expected deployment on mw2292 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:51] PROBLEM - 
Ensure local MW versions match expected deployment on mw2325 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:51] PROBLEM - Ensure local MW versions match expected deployment on mw2297 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:51] PROBLEM - Ensure local MW versions match expected deployment on mw2353 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:52] PROBLEM - Ensure local MW versions match expected deployment on mw2310 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:52] PROBLEM - Ensure local MW versions match expected deployment on mw2286 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:53] PROBLEM - Ensure local MW versions match expected deployment on mw2284 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:59] PROBLEM - Ensure local MW versions match expected deployment on mw1347 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:29:59] PROBLEM - Ensure local MW versions match expected deployment on parse2001 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:30:01] PROBLEM - Ensure local MW versions match expected deployment on mw1428 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:30:01] PROBLEM - Ensure local MW versions match expected deployment on wtp1027 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:30:03] PROBLEM - Ensure local MW versions match expected deployment on mw2374 is CRITICAL: CRITICAL: 528 mismatched 
wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:30:03] PROBLEM - Ensure local MW versions match expected deployment on mw2251 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:30:09] PROBLEM - Ensure local MW versions match expected deployment on mw2298 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:30:09] PROBLEM - Ensure local MW versions match expected deployment on mw2306 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:30:09] PROBLEM - Ensure local MW versions match expected deployment on mw2299 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:30:11] PROBLEM - Ensure local MW versions match expected deployment on mw1331 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:30:13] PROBLEM - Ensure local MW versions match expected deployment on mw2339 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:30:13] PROBLEM - Ensure local MW versions match expected deployment on mw2360 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:30:13] PROBLEM - Ensure local MW versions match expected deployment on mw2359 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:30:15] PROBLEM - Ensure local MW versions match expected deployment on mw1409 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:30:15] PROBLEM - Ensure local MW versions match expected deployment on mw1436 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:30:15] PROBLEM - Ensure local MW 
versions match expected deployment on mw2382 is CRITICAL: CRITICAL: 528 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:30:20] dancy: sync-apaches: 100% (in-flight: 0, ok: 347; fail: 0; left: 0) [19:30:21] !log stopped icinga-wm [19:30:22] 19:14:29 Finished sync-apaches (duration: 00m 08s) [19:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:23] (PS3) Cathal Mooney: Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) [19:30:29] brennen: dduvall: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ApiFeatureUsage/+/767571/ [19:30:43] thx mutante [19:30:48] it's different this time. not just the 3 test wikis but ALL 528 versions [19:30:52] (PS1) Majavah: Add a non-namespaced alias for ApiFeatureUsageQueryEngineElastica [extensions/ApiFeatureUsage] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/767103 (https://phabricator.wikimedia.org/T302907) [19:30:56] (CR) jerkins-bot: [V: -1] Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [19:30:58] so there is a slightly different issue this time [19:31:08] * dancy looks around [19:31:35] yeah, seems like wikiversions.json isn't being updated on targets.
[19:31:44] mutante: same issue, different set promoted [19:31:57] Only 3 wikis changed version yesterday [19:32:05] RhinosF1: ACK [19:32:20] All of group 1 just did [19:32:37] I assume group1.dblist has 528 wikis in [19:33:21] !log robh@cumin1001 START - Cookbook sre.dns.netbox [19:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:06] (03PS4) 10Cathal Mooney: Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) [19:35:30] dancy: i think it's getting updated - i get 659 for `grep -c '[.]24' /srv/mediawiki/wikiversions.json` on m1436, for example [19:35:36] (03CR) 10jerkins-bot: [V: 04-1] Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [19:36:10] brennen: Ok.. that's good.. So it looks like deploy1002's /srv/mediawiki/ dir isn't being updated. Looking into that. [19:36:21] right on, thanks [19:36:23] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:09] (03PS5) 10Cathal Mooney: Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) [19:37:39] (03CR) 10jerkins-bot: [V: 04-1] Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [19:38:10] taavi: thanks for patch. i'll verify the backport on an mwdebug and then sync. 
[19:38:45] (03CR) 10Brennen Bearnes: [C: 03+2] Add a non-namespaced alias for ApiFeatureUsageQueryEngineElastica [extensions/ApiFeatureUsage] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767103 (https://phabricator.wikimedia.org/T302907) (owner: 10Majavah) [19:40:41] (03Merged) 10jenkins-bot: Add a non-namespaced alias for ApiFeatureUsageQueryEngineElastica [extensions/ApiFeatureUsage] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767103 (https://phabricator.wikimedia.org/T302907) (owner: 10Majavah) [19:43:34] brennen: Bug located. Working on packaging it up. [19:44:54] (03PS1) 10Eigyan: wmf-config: Undeploy the fawiki test survey from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767574 (https://phabricator.wikimedia.org/T300291) [19:45:12] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED [19:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:27] !log brennen@deploy1002 Synchronized php-1.38.0-wmf.24/extensions/ApiFeatureUsage: Backport: [[gerrit:767103|Add a non-namespaced alias for ApiFeatureUsageQueryEngineElastica (T302907)]] (duration: 00m 50s) [19:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:30] T302907: Error: Class 'ApiFeatureUsageQueryEngineElastica' not found - https://phabricator.wikimedia.org/T302907 [19:47:34] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED [19:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:12] dancy: should I manually copy wikiversions.json on deploy1002 maybe? [19:49:34] seeing fix now [19:49:38] No thank you. I have a fix to test in a bit. [19:50:56] I hit +2 on that one. But you would have to deploy again, right? 
[19:50:59] (03CR) 10Eigyan: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767574 (https://phabricator.wikimedia.org/T300291) (owner: 10Eigyan) [19:51:09] or we can scap pull manually on deploy1002 now [19:51:48] Re-running `scap sync-wikiversions` (with the updated scap code) should do the trick. I asked Brennen to run it when he has a moment. [19:52:41] dancy, mutante: running now [19:52:46] ok, thanks all! [19:53:31] hrm: 19:53:08 ['bin/scap', 'pull', '--no-update-l10n', 'deploy2002.codfw.wmnet', 'deploy1002.eqiad.wmnet', 'deploy1002.eqiad.wmnet'] (ran as mwdeploy@mw1450.eqiad.wmnet) returned [127]: bash: bin/scap: No such file or directory [19:53:41] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: (no justification provided) [19:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:43] ah.. prefix the command with SCAP=scap [19:54:14] running [19:56:09] rescheduling all the icinga alerts for MW versions [19:56:11] (03CR) 10Mepps: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767574 (https://phabricator.wikimedia.org/T300291) (owner: 10Eigyan) [19:57:58] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: (no justification provided) [19:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:13] (03CR) 10Ahmon Dancy: [C: 03+1] check_mw_versions.py: Fix problem induced by recent scap changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767242 (https://phabricator.wikimedia.org/T302832) (owner: 10Ahmon Dancy) [19:58:22] finished cleanly that go. [19:58:23] (03CR) 10Jsn.sherman: [C: 04-1] "This looks like it removes the survey from labs (beta) but not prod?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767574 (https://phabricator.wikimedia.org/T300291) (owner: 10Eigyan) [19:58:45] Thanks for testing. I'll package up a scap release. 
[19:58:54] brennen: ACK, many recoveries in Icinga, I keep telling it to speed up [19:59:32] down to 329 from 400 [20:00:11] dduvall: also filed T302918, not sure if user facing impact at the moment. [20:00:11] T302918: Linter: PHP Warning: in_array() expects parameter 2 to be array, null given - https://phabricator.wikimedia.org/T302918 [20:01:05] ok, fixed most. "only" 37 CRITs (:/) that are all unrelated though. so I will turn the bot back on [20:03:26] !log robh@cumin1001 START - Cookbook sre.dns.netbox [20:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:58] https://phabricator.wikimedia.org/T302919 filed to request a new scap release. [20:04:23] thanks dancy. meanwhile i'll use the local checkout if a version change comes up again. [20:04:33] 👍🏾 [20:07:43] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:22] (03PS2) 10Eigyan: wmf-config: Undeploy the fawiki test survey from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767574 (https://phabricator.wikimedia.org/T300291) [20:08:23] Taking a break now that the chaos has died down. 
[20:11:03] (03CR) 10Eigyan: wmf-config: Undeploy the fawiki test survey from production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767574 (https://phabricator.wikimedia.org/T300291) (owner: 10Eigyan) [20:11:31] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host dumpsdata1007.mgmt.eqiad.wmnet with reboot policy FORCED [20:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:42] (03Abandoned) 10Eigyan: wmf-config: Undeploy the fawiki test survey from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767574 (https://phabricator.wikimedia.org/T300291) (owner: 10Eigyan) [20:12:55] (ProbeHttpFailed) firing: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [20:18:36] (03PS1) 10Eigyan: wmf-config: Undeploy the fawiki test survey from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767580 (https://phabricator.wikimedia.org/T300291) [20:20:41] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dumpsdata1007.mgmt.eqiad.wmnet with reboot policy FORCED [20:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:26] dancy: brennen: new scap built and uploaded, updating ticket and docs ..because new build host [20:23:14] (03PS2) 10Eigyan: wmf-config: Undeploy the fawiki test survey from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767580 (https://phabricator.wikimedia.org/T300291) [20:23:52] not deployed yet,be back after lunch [20:23:55] (03PS6) 10Cathal Mooney: Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) [20:24:30] (03CR) 10jerkins-bot: [V: 04-1] Add site variable for EVPN 
overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [20:24:40] (03CR) 10Eigyan: "Fixed wrong file update 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767580 (https://phabricator.wikimedia.org/T300291) (owner: 10Eigyan) [20:26:24] (03CR) 10JHathaway: [C: 03+1] firmware fact: drop firmware_bios (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765574 (owner: 10Jbond) [20:31:13] (03PS7) 10Cathal Mooney: Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) [20:31:43] (03CR) 10jerkins-bot: [V: 04-1] Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [20:33:16] (03PS8) 10Cathal Mooney: Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) [20:33:47] (03CR) 10jerkins-bot: [V: 04-1] Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [20:35:35] (03PS9) 10Cathal Mooney: Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) [20:36:05] (03CR) 10jerkins-bot: [V: 04-1] Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [20:37:18] (03PS10) 10Cathal Mooney: Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 
(https://phabricator.wikimedia.org/T299758) [20:37:48] (03CR) 10jerkins-bot: [V: 04-1] Add site variable for EVPN overlay loopback subnets and CR filter [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [20:38:04] !log rolling out scap 4.4.2 to A:mw-canary or A:parsoid-canary or A:mw-jobrunner-canary (T302919) [20:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:08] T302919: Deploy Scap version 4.4.2 - https://phabricator.wikimedia.org/T302919 [20:41:37] (03CR) 10Jsn.sherman: [C: 03+1] "Looks good to me!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767580 (https://phabricator.wikimedia.org/T300291) (owner: 10Eigyan) [20:44:43] !log tested 'scap pull' still worked on mwdebug1001; rolling out scap 4.4.2 to A:restbase-canary (T302919) [20:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:46] T302919: Deploy Scap version 4.4.2 - https://phabricator.wikimedia.org/T302919 [20:45:00] PROBLEM - Check systemd state on cp6010 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_exim4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:47:51] !log dzahn@deploy1002 Started deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) [20:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:32] !log dzahn@deploy1002 Finished deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) (duration: 00m 41s) [20:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:47] !log running test-deploy to devcluster (restbase) to test new scap version, successful and then rolled back, as the docs say T302919 [20:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:54] Thanks mutante!! 
[20:56:33] ^ [20:57:06] all is done except the "roll out to all" [20:57:12] you have it on canaries [20:57:29] The one and only place it is really needed is deploy1002 [20:57:56] scap clients (all other hosts) are unaffected by the code change [20:59:11] greetings [20:59:29] Hello there. [20:59:44] (03PS1) 10RobH: dumpsdata1007 info [puppet] - 10https://gerrit.wikimedia.org/r/767584 (https://phabricator.wikimedia.org/T299443) [21:00:05] RoanKattouw and Urbanecm: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220302T2100). Please do the needful. [21:00:05] eigyan: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:22] !log deploy1002 - upgraded scap to 4.4.2-1 T302919 [21:00:23] I am here [21:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:24] (03CR) 10RobH: [C: 03+2] dumpsdata1007 info [puppet] - 10https://gerrit.wikimedia.org/r/767584 (https://phabricator.wikimedia.org/T299443) (owner: 10RobH) [21:00:26] dancy: done! [21:00:26] T302919: Deploy Scap version 4.4.2 - https://phabricator.wikimedia.org/T302919 [21:00:32] woooord [21:00:33] Thanks! [21:01:24] yep, good docs were helpful [21:02:08] <3 mutante [21:03:50] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:04:22] I'm going to test it. 
[21:05:09] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye [21:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bull... [21:09:45] eigyan: hi [21:09:53] dancy: are you able to help with B&C [21:10:01] Sure [21:10:14] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: testing scap 4.4.2 [21:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:33] What's on the list? [21:10:37] dancy: there's just 1 patch from eigyan [21:10:43] mutante: Test confirmed. [21:10:50] dancy: :) great, thanks [21:10:58] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/767580 [21:11:02] PROBLEM - Disk space on centrallog1001 is CRITICAL: DISK CRITICAL - free space: /srv 34728 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops [21:11:02] thx [21:11:23] (03CR) 10RhinosF1: [C: 03+1] wmf-config: Undeploy the fawiki test survey from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767580 (https://phabricator.wikimedia.org/T300291) (owner: 10Eigyan) [21:11:36] ok, letting 'er rip [21:11:48] (03CR) 10Ahmon Dancy: [C: 03+2] wmf-config: Undeploy the fawiki test survey from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767580 (https://phabricator.wikimedia.org/T300291) (owner: 10Eigyan) [21:12:31] (03Merged) 10jenkins-bot: wmf-config: Undeploy the fawiki test survey from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767580 
(https://phabricator.wikimedia.org/T300291) (owner: 10Eigyan) [21:13:19] deployed to mwdebug. [21:13:33] eigyan: please test ^ [21:13:45] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1007.eqiad.wmnet with OS bullseye [21:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:... [21:13:49] Thanks RhinosF1will do [21:13:51] Just need to make sure it no longer shows when forced I guess [21:14:29] eigyan: please ping dancy once you've checked [21:15:04] sure thing RhinosF1 [21:17:59] dancy mwdebug looks good on my end [21:18:10] ok, rolling out. [21:19:12] !log dancy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:767580|wmf-config: Undeploy the fawiki test survey from production (T300291)]] (duration: 00m 50s) [21:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:16] T300291: Undeploy the fawiki test survey FROM PRODUCTION - https://phabricator.wikimedia.org/T300291 [21:20:04] dancy: thanks for helping [21:20:13] No problem. [21:20:53] eigyan: it should be live in production now, please let us know if you need anything else / have issues [21:21:00] And have a good evening! [21:21:17] thank you so much everyone have a great night [21:21:58] :) [21:35:39] 10SRE, 10Security-Team, 10Performance-Team (Radar), 10SecTeam-Processed, 10Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (10sbassett) Hey @Joe - just wondering if you had any thoughts or guidance regarding my previous comment. If not, I think we'll explore using MySQL... 
[21:36:15] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye [21:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye [21:38:35] jouncebot: now [21:38:35] For the next 0 hour(s) and 21 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220302T2100) [21:39:15] dancy: everything quiet? Then I roll out scap on everything now [21:42:55] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [21:43:20] mutante: All is well. [21:44:39] !log rolling out scap 4.4.2 on 'all' T302919 [21:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:42] T302919: Deploy Scap version 4.4.2 - https://phabricator.wikimedia.org/T302919 [21:46:14] dancy: looks alright, it finished and should be done globally [21:46:30] Thanks again. That was a fast turnaround. 
[21:46:35] ignores this: [21:46:36] The following hosts were unreachable: [21:46:36] puppet [21:46:37] :) [21:46:44] haha [21:47:39] cool, I know it took longer sometimes in the past [21:49:46] jouncebot now [21:49:46] For the next 0 hour(s) and 10 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220302T2100) [21:50:38] i'm going to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Linter/+/767582 [21:51:11] (03PS1) 10Brennen Bearnes: Hooks.php: Check for non-array $tags [extensions/Linter] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767104 (https://phabricator.wikimedia.org/T302918) [21:51:24] (03CR) 10Brennen Bearnes: [C: 03+2] Hooks.php: Check for non-array $tags [extensions/Linter] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767104 (https://phabricator.wikimedia.org/T302918) (owner: 10Brennen Bearnes) [21:52:47] (03PS1) 10Reedy: Use namespaced ApiFeatureUsageQueryEngineElastica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767596 (https://phabricator.wikimedia.org/T301044) [21:53:23] !log T276198 Disabled puppet across all of elastic*, cloudelastic*, and relforge* to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/766876/ on a single elastic host [21:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:27] T276198: /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 [21:53:51] (03Merged) 10jenkins-bot: Hooks.php: Check for non-array $tags [extensions/Linter] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767104 (https://phabricator.wikimedia.org/T302918) (owner: 10Brennen Bearnes) [21:53:58] (03CR) 10Reedy: [C: 04-2] "Needs to wait till `wmf/1.38.0-wmf.24` is stable and everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767596 (https://phabricator.wikimedia.org/T301044) (owner: 10Reedy) [21:55:17] (03CR) 10Bking: [C: 03+2] elastic: prevent rundir from deletion 
[puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) (owner: 10Bking) [21:57:01] (03CR) 10Volans: [C: 03+1] "LGTM, one optional nit inline" [dns] - 10https://gerrit.wikimedia.org/r/767562 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [21:59:44] (03PS2) 10Reedy: Use namespaced ApiFeatureUsageQueryEngineElastica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767596 (https://phabricator.wikimedia.org/T302907) [21:59:44] !log brennen@deploy1002 Synchronized php-1.38.0-wmf.24/extensions/Linter/includes/Hooks.php: Backport: [[gerrit:767104|Hooks.php: Check for non-array $tags (T302918)]] (duration: 00m 50s) [21:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:48] T302918: Linter: PHP Warning: in_array() expects parameter 2 to be array, null given - https://phabricator.wikimedia.org/T302918 [22:05:05] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dumpsdata1007.eqiad.wmnet with OS bullseye [22:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:... 
[22:05:32] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:12:14] PROBLEM - Check systemd state on elastic1052 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_6@production-search-eqiad.service,elasticsearch_6@production-search-psi-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:12:42] inflatador: ^ [22:13:03] ryankemper: ^ [22:13:22] RhinosF1: thanks [22:13:29] also reminds me I missed a log message [22:13:34] np [22:16:30] !log T276198 Testing https://gerrit.wikimedia.org/r/c/operations/puppet/+/766876/ on `elastic1052`; elasticsearch service fails to start. It's expecting to find `/etc/tmpfiles.d/elasticsearch-production-search-psi-eqiad.conf` but the actual filename is `elasticsearch-production-search-psi-eqiad-conf.conf`. Not sure why that trailing `-conf` is there in the filename. It doesn't look like something `systemd::tmpfile` is doing. 
[22:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:36] T276198: /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 [22:19:42] ACKNOWLEDGEMENT - Check systemd state on elastic1052 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_6@production-search-eqiad.service,elasticsearch_6@production-search-psi-eqiad.service Ryan Kemper https://phabricator.wikimedia.org/T276198 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:20:05] acked the alerts, going to downtime now (should have downtimed it earlier) [22:21:04] !log T276198 Downtimed `elastic1052` for 2 hours while troubleshooting [22:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:10] 10SRE, 10Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10MoritzMuehlenhoff) >>! In T276198#7749141, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/r360TH8B8Fs0LH... [22:35:03] 10SRE, 10Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10RKemper) >>! In T276198#7749192, @MoritzMuehlenhoff wrote: >>>! In T276198#7749141, @Stashbot wrote: >> {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=ht... 
[22:35:32] (03PS1) 10Dzahn: devtools: copy yaml key/values over from gitlab-runner project for test [puppet] - 10https://gerrit.wikimedia.org/r/767599 (https://phabricator.wikimedia.org/T297659) [22:36:36] (03PS1) 10Ryan Kemper: elastic: fix filename of tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/767600 (https://phabricator.wikimedia.org/T276198) [22:37:05] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/767600 (https://phabricator.wikimedia.org/T276198) (owner: 10Ryan Kemper) [22:38:08] (03PS2) 10Dzahn: devtools: copy yaml key/values over from gitlab-runner project for test [puppet] - 10https://gerrit.wikimedia.org/r/767599 (https://phabricator.wikimedia.org/T297659) [22:38:48] (03CR) 10Dzahn: [C: 03+2] devtools: copy yaml key/values over from gitlab-runner project for test [puppet] - 10https://gerrit.wikimedia.org/r/767599 (https://phabricator.wikimedia.org/T297659) (owner: 10Dzahn) [22:41:03] (03PS2) 10Ryan Kemper: elastic: fix filename of tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/767600 (https://phabricator.wikimedia.org/T276198) [22:41:19] (03PS3) 10Dzahn: devtools: copy yaml key/values over from gitlab-runner project for test [puppet] - 10https://gerrit.wikimedia.org/r/767599 (https://phabricator.wikimedia.org/T297659) [22:41:36] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/apertium: apply [22:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:18] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/apertium: apply [22:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:19] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply [22:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:50] (03PS3) 10Ryan Kemper: elastic: fix filename of tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/767600 (https://phabricator.wikimedia.org/T276198) 
[22:43:05] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply [22:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:06] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [22:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:54] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [22:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:55] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [22:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:13] (03PS4) 10Dzahn: devtools: copy yaml key/values over from gitlab-runner project for test [puppet] - 10https://gerrit.wikimedia.org/r/767599 (https://phabricator.wikimedia.org/T297659) [22:44:33] (03CR) 10Dzahn: [V: 03+2 C: 03+2] devtools: copy yaml key/values over from gitlab-runner project for test [puppet] - 10https://gerrit.wikimedia.org/r/767599 (https://phabricator.wikimedia.org/T297659) (owner: 10Dzahn) [22:45:19] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [22:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:21] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [22:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:24] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [22:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:26] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [22:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:00] (03CR) 10Ryan Kemper: [C: 03+2] elastic: fix filename of tmpfile [puppet] - 
10https://gerrit.wikimedia.org/r/767600 (https://phabricator.wikimedia.org/T276198) (owner: 10Ryan Kemper) [22:47:18] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [22:47:20] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [22:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:43] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [22:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:44] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [22:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:40] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [22:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:42] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [22:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:30] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [22:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:31] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [22:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:48] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [22:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:49] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply [22:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:44] !log rzl@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mathoid: apply [22:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:45] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [22:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:32] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [22:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:33] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [22:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:45] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [22:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:46] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [22:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:37] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [22:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:12] (03PS1) 10RobH: dumpsdata1007 raid testing [puppet] - 10https://gerrit.wikimedia.org/r/767602 (https://phabricator.wikimedia.org/T299443) [23:05:35] (03CR) 10RobH: [C: 03+2] dumpsdata1007 raid testing [puppet] - 10https://gerrit.wikimedia.org/r/767602 (https://phabricator.wikimedia.org/T299443) (owner: 10RobH) [23:06:34] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: wikidatardf-truthy-dumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:06:46] (03CR) 10Krinkle: [C: 03+1] check_mw_versions.py: Fix problem induced by recent scap changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767242 (https://phabricator.wikimedia.org/T302832) (owner: 10Ahmon Dancy) [23:07:26] 
(03PS1) 10Ryan Kemper: elastic: disable readahead script needs new fp [puppet] - 10https://gerrit.wikimedia.org/r/767603 (https://phabricator.wikimedia.org/T276198) [23:08:41] (03CR) 10Ryan Kemper: [C: 03+2] elastic: disable readahead script needs new fp [puppet] - 10https://gerrit.wikimedia.org/r/767603 (https://phabricator.wikimedia.org/T276198) (owner: 10Ryan Kemper) [23:08:57] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye [23:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bull... [23:10:56] RECOVERY - Check systemd state on elastic1052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:15:20] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1007.eqiad.wmnet with OS bullseye [23:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:... 
[23:17:41] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10RobH) [23:20:55] (03PS1) 10Dzahn: aptrepo: import gitlab-runner package for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/767604 (https://phabricator.wikimedia.org/T297659) [23:21:08] !log T276198 https://gerrit.wikimedia.org/r/c/operations/puppet/+/767600 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/767603/ fixed all the problems. Re-enabling puppet on elastic*, cloudelastic*, and relforge* shortly [23:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:11] T276198: /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 [23:21:56] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye [23:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:00] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye [23:24:00] (03PS2) 10Dzahn: aptrepo: import gitlab-runner package for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/767604 (https://phabricator.wikimedia.org/T297659) [23:25:15] !log T276198 Re-enabled puppet across fleet: `ryankemper@cumin1001:~$ sudo -E cumin 'R:Elasticsearch::instance' 'enable-puppet "deploy fix from T276198"'` [23:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:48] 10SRE, 10WMF-General-or-Unknown, 10WMF-Legal, 10Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10Dzahn) I hereby license all my existing contributions to the operations/puppet under the Apache 2.0 license. --- Maybe we can get the patch from... 
[23:32:58] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dumpsdata1007.eqiad.wmnet with reason: host reimage [23:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:28] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on dumpsdata1007.eqiad.wmnet with reason: host reimage [23:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:04] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dumpsdata1007.eqiad.wmnet with OS bullseye [23:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:08] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye completed: - dumpsdata1007 (**WARN**) - Removed from Puppet and PuppetDB if presen... [23:49:42] (03CR) 10Dzahn: [C: 03+1] "nice! looks good, removes "labs" etc :)" [puppet] - 10https://gerrit.wikimedia.org/r/767484 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [23:50:40] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [23:52:30] robh: icinga config does not like dumpsdata1007 right now ..because of: [23:52:37] Error: 'lsw1-f1-eqiad.mgmt.eqiad.wmnet' is not a valid parent for host 'dumpsdata1007' [23:52:41] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10RobH) so this is installed now with hwraid1 single disk setup just to see if it even works within the OS. When I then launch the OS, it loads, but any megacli commands hang it. [23:53:09] interesting, perhaps a new issue due to new row? 
[23:53:17] seems like it, yea [23:53:32] as if the new "parent interface" needs to be added somewhere [23:53:42] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10RobH) > > 15:52 mutante: > robh: icinga config does not like dumpsdata1007 right now ..because of: Error: 'lsw1-f1-eqiad.mgmt.eqiad.wmnet' is not a valid parent for host 'dumpsdata1007' [23:53:42] appending to the test task [23:53:54] thx for the heads up, ill just maint mode it for now [23:53:59] ACK [23:54:19] so the thing is ..nothing happens as long as Icinga does not get restarted but if it does then it would go down [23:54:48] oh wait [23:54:50] i misunderstood [23:54:53] icinga CONFIG [23:55:00] yea, the config check [23:55:13] mutante: hrmmm, ok, so i guess .... i have no idea who would go about fixing that [23:55:33] i dont understand the parent thing, like i guess other servers have their switches as parents? [23:56:27] yea, so some icinga hosts or services can have "parents" in the sense that children are not supposed to alert if the parent is down [23:56:42] like "if the whole switch is down dont flood the channel with all the HOST down messages" [23:57:09] somewhere we must have the switches itself in icinga [23:57:12] taking a look [23:57:23] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) dumpsdata1007 is online with OS but doesnlt seem megacli works for it? robh@dumpsdata1007:~$ sudo megacli -LDInfo -Lall -aALL Adapter 0 -- Virtual Drive Info... [23:58:02] icinga does not know "lsw1-f1-eqiad.mgmt.eqiad.wmnet" but for some reason the host is already trying to say that is its parent [23:59:06] that is totally the switch it connects to [23:59:12] except it's right there... 
lsw1-f1-eqiad.mgmt.eqiad.wmnet [23:59:14] so that makes sense but i didnt realize that icinga didnt know what it was [23:59:31] now I thought I found it and we just have to add the new switch [23:59:41] in hieradata/common/monitoring.yaml
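Note on the exchange above: the config check failed because dumpsdata1007 names lsw1-f1-eqiad.mgmt.eqiad.wmnet as its parent while Icinga has no host object by that name, and mutante describes parents as the way to keep children quiet when the switch itself is down. A minimal Python sketch of both ideas follows. It is an illustration only, not Icinga's implementation and not the Puppet-generated config; the `hosts` mapping and the function names `find_invalid_parents` and `should_notify` are hypothetical, and the reachability rule is a simplified Nagios-style one.

```python
from typing import Dict, List, Set

# Hypothetical data model: host name -> list of parent host names.
# This is NOT the real Icinga object format or the hieradata layout.
HostParents = Dict[str, List[str]]


def find_invalid_parents(hosts: HostParents) -> List[str]:
    """Return one error per parent reference that names an undefined host."""
    errors = []
    for host in sorted(hosts):
        for parent in hosts[host]:
            if parent not in hosts:
                errors.append(
                    f"Error: '{parent}' is not a valid parent for host '{host}'"
                )
    return errors


def should_notify(host: str, hosts: HostParents, down: Set[str]) -> bool:
    """Simplified rule: a down host alerts unless all of its parents are down too."""
    if host not in down:
        return False
    parents = hosts.get(host, [])
    if not parents:
        return True
    return any(parent not in down for parent in parents)


switch = "lsw1-f1-eqiad.mgmt.eqiad.wmnet"

# The situation from the log: the child references a switch that is not defined.
hosts: HostParents = {"dumpsdata1007": [switch]}
print(find_invalid_parents(hosts))

# Once the switch exists as a host, the check passes and suppression can work.
hosts[switch] = []
print(find_invalid_parents(hosts))
print(should_notify("dumpsdata1007", hosts, {"dumpsdata1007", switch}))
```

In this sketch, defining the switch as a host is what clears the error, which matches the suggestion in the log that the new switch simply has to be added.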