[00:09:19] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_eqsin and A:cp [00:38:24] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951091 [00:38:30] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951091 (owner: TrainBranchBot) [00:39:40] (Nonwrite HTTP requests with primary DB connections alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [00:45:18] !log removing two files for legal compliance [00:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:26] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951091 (owner: TrainBranchBot) [00:59:40] (Nonwrite HTTP requests with primary DB connections alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [01:03:35] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T344659 (phaultfinder) [01:16:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:55:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see 
Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T0200) [02:00:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:06:41] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:20] (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.23 [core] (wmf/1.41.0-wmf.23) - https://gerrit.wikimedia.org/r/951092 (https://phabricator.wikimedia.org/T343725) [02:07:26] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.23 [core] (wmf/1.41.0-wmf.23) - https://gerrit.wikimedia.org/r/951092 (https://phabricator.wikimedia.org/T343725) (owner: TrainBranchBot) [02:12:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [02:13:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [02:13:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T344589)', diff saved to https://phabricator.wikimedia.org/P50700 and previous config saved to /var/cache/conftool/dbconfig/20230822-021307-ladsgroup.json [02:13:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [02:13:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [02:16:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [02:17:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [02:17:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T344589)', diff saved to https://phabricator.wikimedia.org/P50701 and previous config saved to /var/cache/conftool/dbconfig/20230822-021715-ladsgroup.json [02:18:04] PROBLEM - cassandra-a service on restbase1030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:18:04] PROBLEM - cassandra-a CQL 10.64.48.234:9042 on restbase1030 is CRITICAL: connect to address 10.64.48.234 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [02:19:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T344589)', diff saved to https://phabricator.wikimedia.org/P50702 and previous config saved to /var/cache/conftool/dbconfig/20230822-021942-ladsgroup.json [02:21:16] (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.23 [core] (wmf/1.41.0-wmf.23) - https://gerrit.wikimedia.org/r/951092 (https://phabricator.wikimedia.org/T343725) (owner: TrainBranchBot) [02:23:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T344589)', diff saved to https://phabricator.wikimedia.org/P50703 and previous config saved to /var/cache/conftool/dbconfig/20230822-022328-ladsgroup.json [02:29:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [02:29:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: 
Maintenance [02:29:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T343718)', diff saved to https://phabricator.wikimedia.org/P50704 and previous config saved to /var/cache/conftool/dbconfig/20230822-022926-ladsgroup.json [02:29:31] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [02:30:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [02:30:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [02:31:41] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P50705 and previous config saved to /var/cache/conftool/dbconfig/20230822-023448-ladsgroup.json [02:38:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P50706 and previous config saved to /var/cache/conftool/dbconfig/20230822-023835-ladsgroup.json [02:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:41:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T343718)', diff saved to https://phabricator.wikimedia.org/P50707 and previous config saved to /var/cache/conftool/dbconfig/20230822-024144-ladsgroup.json [02:41:48] T343718: Drop old columns of externallinks - 
https://phabricator.wikimedia.org/T343718 [02:44:26] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:48:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [02:48:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [02:48:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T343718)', diff saved to https://phabricator.wikimedia.org/P50708 and previous config saved to /var/cache/conftool/dbconfig/20230822-024822-ladsgroup.json [02:48:26] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [02:49:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P50709 and previous config saved to /var/cache/conftool/dbconfig/20230822-024954-ladsgroup.json [02:50:08] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:53:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P50710 and previous config saved to /var/cache/conftool/dbconfig/20230822-025341-ladsgroup.json [02:56:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P50711 and previous config saved to /var/cache/conftool/dbconfig/20230822-025650-ladsgroup.json [03:00:06] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T0300) [03:01:24] (PS1) TrainBranchBot: testwikis wikis to 1.41.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/951201 (https://phabricator.wikimedia.org/T343725) [03:01:26] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.41.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/951201 (https://phabricator.wikimedia.org/T343725) (owner: TrainBranchBot) [03:02:17] (Merged) jenkins-bot: testwikis wikis to 1.41.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/951201 (https://phabricator.wikimedia.org/T343725) (owner: TrainBranchBot) [03:02:48] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.23 refs T343725 [03:02:52] T343725: 1.41.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T343725 [03:05:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T344589)', diff saved to https://phabricator.wikimedia.org/P50712 and previous config saved to /var/cache/conftool/dbconfig/20230822-030501-ladsgroup.json [03:05:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [03:05:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [03:05:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T344589)', diff saved to https://phabricator.wikimedia.org/P50713 and previous config saved to /var/cache/conftool/dbconfig/20230822-030526-ladsgroup.json [03:08:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T343718)', diff saved to https://phabricator.wikimedia.org/P50714 and previous config saved to /var/cache/conftool/dbconfig/20230822-030823-ladsgroup.json [03:08:27] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [03:08:48] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T344589)', diff saved to https://phabricator.wikimedia.org/P50715 and previous config saved to /var/cache/conftool/dbconfig/20230822-030847-ladsgroup.json [03:08:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [03:09:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [03:09:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T344589)', diff saved to https://phabricator.wikimedia.org/P50716 and previous config saved to /var/cache/conftool/dbconfig/20230822-030911-ladsgroup.json [03:11:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P50717 and previous config saved to /var/cache/conftool/dbconfig/20230822-031156-ladsgroup.json [03:13:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T344589)', diff saved to https://phabricator.wikimedia.org/P50718 and previous config saved to /var/cache/conftool/dbconfig/20230822-031312-ladsgroup.json [03:17:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T344589)', diff saved to https://phabricator.wikimedia.org/P50719 and previous config saved to /var/cache/conftool/dbconfig/20230822-031733-ladsgroup.json [03:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:23:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P50720 and previous config saved to /var/cache/conftool/dbconfig/20230822-032329-ladsgroup.json 
[03:27:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T343718)', diff saved to https://phabricator.wikimedia.org/P50721 and previous config saved to /var/cache/conftool/dbconfig/20230822-032703-ladsgroup.json [03:27:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [03:27:07] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [03:27:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [03:27:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T343718)', diff saved to https://phabricator.wikimedia.org/P50722 and previous config saved to /var/cache/conftool/dbconfig/20230822-032713-ladsgroup.json [03:28:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P50723 and previous config saved to /var/cache/conftool/dbconfig/20230822-032819-ladsgroup.json [03:32:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P50724 and previous config saved to /var/cache/conftool/dbconfig/20230822-033239-ladsgroup.json [03:38:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P50725 and previous config saved to /var/cache/conftool/dbconfig/20230822-033835-ladsgroup.json [03:42:49] (CR) DLynch: [C: +1] Remove unneeded $wgDefaultUserOptions['visualeditor-enable'] settings [mediawiki-config] - https://gerrit.wikimedia.org/r/933998 (https://phabricator.wikimedia.org/T340696) (owner: Bartosz Dziewoński) [03:43:24] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:43:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P50726 and previous config saved to /var/cache/conftool/dbconfig/20230822-034325-ladsgroup.json [03:45:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T343718)', diff saved to https://phabricator.wikimedia.org/P50727 and previous config saved to /var/cache/conftool/dbconfig/20230822-034539-ladsgroup.json [03:45:44] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [03:45:57] (CR) DLynch: [C: +1] Move visual editor out of Beta Features (without changing prefs) [mediawiki-config] - https://gerrit.wikimedia.org/r/947015 (https://phabricator.wikimedia.org/T335056) (owner: Bartosz Dziewoński) [03:47:40] (CR) DLynch: [C: +1] Clarify 2017 wikitext editor's Beta Feature status [mediawiki-config] - https://gerrit.wikimedia.org/r/949588 (https://phabricator.wikimedia.org/T344158) (owner: Bartosz Dziewoński) [03:47:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P50728 and previous config saved to /var/cache/conftool/dbconfig/20230822-034745-ladsgroup.json [03:50:22] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:53:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T343718)', diff saved to https://phabricator.wikimedia.org/P50729 and previous config saved to /var/cache/conftool/dbconfig/20230822-035342-ladsgroup.json [03:53:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [03:53:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [03:53:47] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [03:53:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T343718)', diff saved to https://phabricator.wikimedia.org/P50730 and previous config saved to /var/cache/conftool/dbconfig/20230822-035352-ladsgroup.json [03:57:23] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.23 refs T343725 (duration: 54m 34s) [03:57:27] T343725: 1.41.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T343725 [03:58:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T344589)', diff saved to https://phabricator.wikimedia.org/P50731 and previous config saved to /var/cache/conftool/dbconfig/20230822-035831-ladsgroup.json [03:58:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [03:58:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [03:59:31] !log mwpresync@deploy1002 Pruned MediaWiki: 1.41.0-wmf.20 (duration: 02m 06s) [04:00:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P50732 and previous config saved to /var/cache/conftool/dbconfig/20230822-040045-ladsgroup.json [04:02:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T344589)', diff saved to https://phabricator.wikimedia.org/P50733 and previous config saved to /var/cache/conftool/dbconfig/20230822-040251-ladsgroup.json [04:02:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [04:03:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [04:03:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T344589)', diff saved to https://phabricator.wikimedia.org/P50734 and previous config saved to /var/cache/conftool/dbconfig/20230822-040315-ladsgroup.json [04:04:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [04:04:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [04:04:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T344589)', diff saved to https://phabricator.wikimedia.org/P50735 and previous config saved to /var/cache/conftool/dbconfig/20230822-040451-ladsgroup.json [04:07:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T343718)', diff saved to https://phabricator.wikimedia.org/P50736 and previous config saved to /var/cache/conftool/dbconfig/20230822-040658-ladsgroup.json [04:07:05] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [04:11:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T344589)', diff saved to https://phabricator.wikimedia.org/P50737 and previous config saved to /var/cache/conftool/dbconfig/20230822-041119-ladsgroup.json [04:11:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T344589)', diff saved to https://phabricator.wikimedia.org/P50738 and previous config saved to /var/cache/conftool/dbconfig/20230822-041132-ladsgroup.json [04:15:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P50739 and previous config saved to /var/cache/conftool/dbconfig/20230822-041551-ladsgroup.json [04:22:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P50740 and previous config saved to /var/cache/conftool/dbconfig/20230822-042206-ladsgroup.json [04:26:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P50741 and previous config saved to /var/cache/conftool/dbconfig/20230822-042625-ladsgroup.json [04:26:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P50742 and previous config saved to /var/cache/conftool/dbconfig/20230822-042639-ladsgroup.json [04:30:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T343718)', diff saved to https://phabricator.wikimedia.org/P50743 and previous config saved to /var/cache/conftool/dbconfig/20230822-043058-ladsgroup.json [04:31:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [04:31:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [04:31:02] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [04:37:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P50744 and previous config saved to /var/cache/conftool/dbconfig/20230822-043712-ladsgroup.json [04:41:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P50745 and previous config saved to /var/cache/conftool/dbconfig/20230822-044131-ladsgroup.json [04:41:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P50746 and previous config saved to 
/var/cache/conftool/dbconfig/20230822-044145-ladsgroup.json [04:47:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [04:47:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [04:47:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [04:47:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [04:47:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T343718)', diff saved to https://phabricator.wikimedia.org/P50747 and previous config saved to /var/cache/conftool/dbconfig/20230822-044715-ladsgroup.json [04:47:22] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [04:52:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T343718)', diff saved to https://phabricator.wikimedia.org/P50748 and previous config saved to /var/cache/conftool/dbconfig/20230822-045218-ladsgroup.json [04:52:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [04:52:23] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [04:52:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [04:52:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [04:52:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance 
[04:52:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T343718)', diff saved to https://phabricator.wikimedia.org/P50749 and previous config saved to /var/cache/conftool/dbconfig/20230822-045233-ladsgroup.json [04:56:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T344589)', diff saved to https://phabricator.wikimedia.org/P50750 and previous config saved to /var/cache/conftool/dbconfig/20230822-045638-ladsgroup.json [04:56:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [04:56:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T344589)', diff saved to https://phabricator.wikimedia.org/P50751 and previous config saved to /var/cache/conftool/dbconfig/20230822-045651-ladsgroup.json [04:56:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [04:56:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [04:56:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [04:57:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [04:57:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T344589)', diff saved to https://phabricator.wikimedia.org/P50752 and previous config saved to /var/cache/conftool/dbconfig/20230822-045707-ladsgroup.json [04:57:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [04:57:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T344589)', diff saved to https://phabricator.wikimedia.org/P50753 
and previous config saved to /var/cache/conftool/dbconfig/20230822-045716-ladsgroup.json [05:03:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T344589)', diff saved to https://phabricator.wikimedia.org/P50754 and previous config saved to /var/cache/conftool/dbconfig/20230822-050337-ladsgroup.json [05:05:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T343718)', diff saved to https://phabricator.wikimedia.org/P50755 and previous config saved to /var/cache/conftool/dbconfig/20230822-050511-ladsgroup.json [05:05:18] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [05:05:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [05:05:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T343718)', diff saved to https://phabricator.wikimedia.org/P50756 and previous config saved to /var/cache/conftool/dbconfig/20230822-050532-ladsgroup.json [05:05:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [05:05:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T344589)', diff saved to https://phabricator.wikimedia.org/P50757 and previous config saved to /var/cache/conftool/dbconfig/20230822-050543-ladsgroup.json [05:09:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T344589)', diff saved to https://phabricator.wikimedia.org/P50758 and previous config saved to /var/cache/conftool/dbconfig/20230822-050938-ladsgroup.json [05:12:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T344589)', diff saved to https://phabricator.wikimedia.org/P50759 and previous config saved to /var/cache/conftool/dbconfig/20230822-051204-ladsgroup.json [05:13:03] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T344621 [05:13:06] T344621: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T344621 [05:13:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T344621 [05:13:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1184 with weight 0 T344621', diff saved to https://phabricator.wikimedia.org/P50760 and previous config saved to /var/cache/conftool/dbconfig/20230822-051347-ladsgroup.json [05:18:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P50761 and previous config saved to /var/cache/conftool/dbconfig/20230822-051843-ladsgroup.json [05:20:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P50762 and previous config saved to /var/cache/conftool/dbconfig/20230822-052018-ladsgroup.json [05:20:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P50763 and previous config saved to /var/cache/conftool/dbconfig/20230822-052038-ladsgroup.json [05:21:05] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:24:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P50764 and previous config saved to /var/cache/conftool/dbconfig/20230822-052445-ladsgroup.json [05:33:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db2156', diff saved to https://phabricator.wikimedia.org/P50765 and previous config saved to /var/cache/conftool/dbconfig/20230822-053349-ladsgroup.json [05:35:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P50766 and previous config saved to /var/cache/conftool/dbconfig/20230822-053524-ladsgroup.json [05:35:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P50767 and previous config saved to /var/cache/conftool/dbconfig/20230822-053544-ladsgroup.json [05:37:24] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1113.eqiad.wmnet with OS bullseye [05:38:40] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1114.eqiad.wmnet with OS bullseye [05:39:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P50768 and previous config saved to /var/cache/conftool/dbconfig/20230822-053951-ladsgroup.json [05:42:06] (CR) Ladsgroup: [C: +2] mariadb: Promote db1184 to s1 master [puppet] - https://gerrit.wikimedia.org/r/951088 (https://phabricator.wikimedia.org/T344621) (owner: Gerrit maintenance bot) [05:42:25] (PS2) Ladsgroup: mariadb: Promote db1184 to s1 master [puppet] - https://gerrit.wikimedia.org/r/951088 (https://phabricator.wikimedia.org/T344621) (owner: Gerrit maintenance bot) [05:42:27] (CR) Ladsgroup: [V: +2] mariadb: Promote db1184 to s1 master [puppet] - https://gerrit.wikimedia.org/r/951088 (https://phabricator.wikimedia.org/T344621) (owner: Gerrit maintenance bot) [05:48:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T344589)', diff saved to https://phabricator.wikimedia.org/P50769 and previous config saved to 
/var/cache/conftool/dbconfig/20230822-054855-ladsgroup.json [05:49:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [05:49:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [05:49:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T344589)', diff saved to https://phabricator.wikimedia.org/P50770 and previous config saved to /var/cache/conftool/dbconfig/20230822-054920-ladsgroup.json [05:50:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T343718)', diff saved to https://phabricator.wikimedia.org/P50771 and previous config saved to /var/cache/conftool/dbconfig/20230822-055030-ladsgroup.json [05:50:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [05:50:35] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [05:50:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [05:50:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T343718)', diff saved to https://phabricator.wikimedia.org/P50772 and previous config saved to /var/cache/conftool/dbconfig/20230822-055040-ladsgroup.json [05:50:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T343718)', diff saved to https://phabricator.wikimedia.org/P50773 and previous config saved to /var/cache/conftool/dbconfig/20230822-055050-ladsgroup.json [05:50:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [05:50:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: 
Maintenance [05:51:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T343718)', diff saved to https://phabricator.wikimedia.org/P50774 and previous config saved to /var/cache/conftool/dbconfig/20230822-055101-ladsgroup.json [05:51:06] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:51:58] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1113.eqiad.wmnet with reason: host reimage [05:53:00] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1114.eqiad.wmnet with reason: host reimage [05:54:28] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1113.eqiad.wmnet with reason: host reimage [05:54:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T344589)', diff saved to https://phabricator.wikimedia.org/P50775 and previous config saved to /var/cache/conftool/dbconfig/20230822-055457-ladsgroup.json [05:55:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [05:55:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [05:55:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [05:55:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [05:55:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Depooling db1212 (T344589)', diff saved to https://phabricator.wikimedia.org/P50776 and previous config saved to /var/cache/conftool/dbconfig/20230822-055528-ladsgroup.json [05:55:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T344589)', diff saved to https://phabricator.wikimedia.org/P50777 and previous config saved to /var/cache/conftool/dbconfig/20230822-055548-ladsgroup.json [05:56:54] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 12897 [05:57:00] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1114.eqiad.wmnet with reason: host reimage [05:58:32] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 12897 [06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T0600) [06:00:06] kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T0600). 
[06:00:28] o/ [06:00:44] !log Starting s1 eqiad failover from db1163 to db1184 - T344621 [06:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:48] T344621: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T344621 [06:01:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T344621', diff saved to https://phabricator.wikimedia.org/P50778 and previous config saved to /var/cache/conftool/dbconfig/20230822-060104-ladsgroup.json [06:01:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1184 to s1 primary and set section read-write T344621', diff saved to https://phabricator.wikimedia.org/P50779 and previous config saved to /var/cache/conftool/dbconfig/20230822-060131-ladsgroup.json [06:01:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T343718)', diff saved to https://phabricator.wikimedia.org/P50780 and previous config saved to /var/cache/conftool/dbconfig/20230822-060147-ladsgroup.json [06:01:52] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:02:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T344589)', diff saved to https://phabricator.wikimedia.org/P50781 and previous config saved to /var/cache/conftool/dbconfig/20230822-060246-ladsgroup.json [06:03:13] (CR) Ladsgroup: [C: +2] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/951089 (https://phabricator.wikimedia.org/T344621) (owner: Gerrit maintenance bot) [06:04:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET -
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:07:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1163 T344621', diff saved to https://phabricator.wikimedia.org/P50782 and previous config saved to /var/cache/conftool/dbconfig/20230822-060710-ladsgroup.json [06:07:15] T344621: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T344621 [06:09:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T343718)', diff saved to https://phabricator.wikimedia.org/P50783 and previous config saved to /var/cache/conftool/dbconfig/20230822-060911-ladsgroup.json [06:09:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:09:23] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:09:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [06:09:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [06:10:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P50784 and previous config saved to /var/cache/conftool/dbconfig/20230822-061054-ladsgroup.json [06:15:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [06:15:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1163.eqiad.wmnet with 
reason: Maintenance [06:16:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P50785 and previous config saved to /var/cache/conftool/dbconfig/20230822-061653-ladsgroup.json [06:17:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P50786 and previous config saved to /var/cache/conftool/dbconfig/20230822-061752-ladsgroup.json [06:17:54] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1113.eqiad.wmnet with OS bullseye [06:18:28] (PS1) Gerrit maintenance bot: mariadb: Promote db2112 to s1 master [puppet] - https://gerrit.wikimedia.org/r/951093 (https://phabricator.wikimedia.org/T344666) [06:19:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [06:19:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [06:19:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2112 (T344589)', diff saved to https://phabricator.wikimedia.org/P50787 and previous config saved to /var/cache/conftool/dbconfig/20230822-061956-ladsgroup.json [06:21:30] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1114.eqiad.wmnet with OS bullseye [06:21:58] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1115.eqiad.wmnet with OS bullseye [06:24:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P50788 and previous config saved to /var/cache/conftool/dbconfig/20230822-062417-ladsgroup.json [06:24:33] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 -
https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [06:26:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P50789 and previous config saved to /var/cache/conftool/dbconfig/20230822-062600-ladsgroup.json [06:26:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T344666 [06:26:58] T344666: Switchover s1 master (db2103 -> db2112) - https://phabricator.wikimedia.org/T344666 [06:27:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T344666 [06:28:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2112 with weight 0 T344666', diff saved to https://phabricator.wikimedia.org/P50790 and previous config saved to /var/cache/conftool/dbconfig/20230822-062854-ladsgroup.json [06:31:57] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:32:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P50791 and previous config saved to /var/cache/conftool/dbconfig/20230822-063200-ladsgroup.json [06:32:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P50792 and previous config saved to /var/cache/conftool/dbconfig/20230822-063258-ladsgroup.json [06:33:47] (PS1) Ladsgroup: Enable URL shortener in sidebar in jawiki and zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/951366 (https://phabricator.wikimedia.org/T267921) [06:36:00] (CR) Ladsgroup: [C:
+2] Enable URL shortener in sidebar in jawiki and zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/951366 (https://phabricator.wikimedia.org/T267921) (owner: Ladsgroup) [06:36:23] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/951366 (https://phabricator.wikimedia.org/T267921) (owner: Ladsgroup) [06:36:42] (Merged) jenkins-bot: Enable URL shortener in sidebar in jawiki and zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/951366 (https://phabricator.wikimedia.org/T267921) (owner: Ladsgroup) [06:36:57] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1115.eqiad.wmnet with reason: host reimage [06:37:25] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:951366|Enable URL shortener in sidebar in jawiki and zhwiki (T267921)]] [06:37:29] T267921: Roll out the Toolbox link for URL Shortener in Wikimedia sites - https://phabricator.wikimedia.org/T267921 [06:39:08] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:951366|Enable URL shortener in sidebar in jawiki and zhwiki (T267921)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [06:39:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P50793 and previous config saved to /var/cache/conftool/dbconfig/20230822-063923-ladsgroup.json [06:40:04] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1115.eqiad.wmnet with reason: host reimage [06:40:56] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [06:41:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T344589)', diff saved to
https://phabricator.wikimedia.org/P50794 and previous config saved to /var/cache/conftool/dbconfig/20230822-064106-ladsgroup.json [06:44:51] (PS1) Majavah: Set OATHAuth multiple devices WRITE_BOTH for all fishbowls [mediawiki-config] - https://gerrit.wikimedia.org/r/951367 (https://phabricator.wikimedia.org/T242031) [06:45:03] (PS2) Majavah: Set OATHAuth multiple devices WRITE_BOTH for all fishbowls [mediawiki-config] - https://gerrit.wikimedia.org/r/951367 (https://phabricator.wikimedia.org/T242031) [06:46:23] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1116.eqiad.wmnet with OS bullseye [06:47:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T343718)', diff saved to https://phabricator.wikimedia.org/P50795 and previous config saved to /var/cache/conftool/dbconfig/20230822-064706-ladsgroup.json [06:47:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [06:47:10] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:47:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [06:47:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T343718)', diff saved to https://phabricator.wikimedia.org/P50796 and previous config saved to /var/cache/conftool/dbconfig/20230822-064716-ladsgroup.json [06:47:32] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:951366|Enable URL shortener in sidebar in jawiki and zhwiki (T267921)]] (duration: 10m 06s) [06:47:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:47:35] T267921: Roll
out the Toolbox link for URL Shortener in Wikimedia sites - https://phabricator.wikimedia.org/T267921 [06:48:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T344589)', diff saved to https://phabricator.wikimedia.org/P50797 and previous config saved to /var/cache/conftool/dbconfig/20230822-064804-ladsgroup.json [06:48:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [06:48:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [06:48:23] SRE, Infrastructure-Foundations, netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (ayounsi) On the resiliency side, this protects us from a double failure: the cr1-cr2 link to fail as well as a transport link. Low risk but still a risk. I agree... [06:48:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1223 (T344589)', diff saved to https://phabricator.wikimedia.org/P50798 and previous config saved to /var/cache/conftool/dbconfig/20230822-064828-ladsgroup.json [06:51:21] (CR) Ladsgroup: [C: +2] mariadb: Promote db2112 to s1 master [puppet] - https://gerrit.wikimedia.org/r/951093 (https://phabricator.wikimedia.org/T344666) (owner: Gerrit maintenance bot) [06:52:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:52:46] !log Starting s1 codfw failover from db2103 to db2112 - T344666 [06:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:51] T344666: Switchover s1 master (db2103 -> db2112) - https://phabricator.wikimedia.org/T344666 [06:53:17] !log
ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2112 to s1 primary T344666', diff saved to https://phabricator.wikimedia.org/P50799 and previous config saved to /var/cache/conftool/dbconfig/20230822-065316-ladsgroup.json [06:53:45] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org [06:54:25] (PS2) Sergio Gimeno: GrowthExperiments: turn off AddLink in aswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/950168 (https://phabricator.wikimedia.org/T344319) [06:54:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T343718)', diff saved to https://phabricator.wikimedia.org/P50800 and previous config saved to /var/cache/conftool/dbconfig/20230822-065430-ladsgroup.json [06:54:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [06:54:34] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:54:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [06:54:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T343718)', diff saved to https://phabricator.wikimedia.org/P50801 and previous config saved to /var/cache/conftool/dbconfig/20230822-065440-ladsgroup.json [06:55:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2103 T344666', diff saved to https://phabricator.wikimedia.org/P50802 and previous config saved to /var/cache/conftool/dbconfig/20230822-065518-ladsgroup.json [06:55:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T344589)', diff saved to https://phabricator.wikimedia.org/P50803 and previous config saved to /var/cache/conftool/dbconfig/20230822-065547-ladsgroup.json [06:56:26] PROBLEM - Host gitlab.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [06:56:39] ^ expected,
restarting [06:57:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [06:57:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [06:57:23] !log installing intel-microcode security updates on buster hosts [06:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T343718)', diff saved to https://phabricator.wikimedia.org/P50804 and previous config saved to /var/cache/conftool/dbconfig/20230822-065819-ladsgroup.json [06:58:38] RECOVERY - Host gitlab.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [07:00:05] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T0700). [07:00:05] sergi0: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:15] hello [07:00:37] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1116.eqiad.wmnet with reason: host reimage [07:00:50] hi sergi0 can you self-serve? [07:01:20] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org [07:01:42] Amir1: I'm waiting to graduate this Thu on the deployment training but I can try. Could you shadow me? [07:01:59] sure. Wanna hop on a call? [07:02:07] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 45356 [07:02:14] Amir1: yep [07:02:25] okay, I send you a link soon, give me a min [07:02:49] great, thank you! 
[07:02:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 45356 [07:03:46] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1116.eqiad.wmnet with reason: host reimage [07:04:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [07:04:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [07:05:39] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1115.eqiad.wmnet with OS bullseye [07:06:53] (CR) Giuseppe Lavagetto: ClusterConfig: also allow to return hostname (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/951047 (owner: Giuseppe Lavagetto) [07:08:40] (CR) TrainBranchBot: [C: +2] "Approved by sgimeno@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/950168 (https://phabricator.wikimedia.org/T344319) (owner: Sergio Gimeno) [07:09:19] (Merged) jenkins-bot: GrowthExperiments: turn off AddLink in aswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/950168 (https://phabricator.wikimedia.org/T344319) (owner: Sergio Gimeno) [07:09:47] !log sgimeno@deploy1002 Started scap: Backport for [[gerrit:950168|GrowthExperiments: turn off AddLink in aswiki (T344319)]] [07:09:51] T344319: Remove models with poor evaluation metrics from the published datasets repo - https://phabricator.wikimedia.org/T344319 [07:10:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P50805 and previous config saved to /var/cache/conftool/dbconfig/20230822-071053-ladsgroup.json [07:11:22] !log sgimeno@deploy1002 sgimeno: Backport for [[gerrit:950168|GrowthExperiments: turn off AddLink in aswiki
(T344319)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:11:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T343718)', diff saved to https://phabricator.wikimedia.org/P50806 and previous config saved to /var/cache/conftool/dbconfig/20230822-071158-ladsgroup.json [07:11:59] (PS2) Filippo Giunchedi: confd: create run_dir via tmpfile [puppet] - https://gerrit.wikimedia.org/r/951141 (https://phabricator.wikimedia.org/T321678) [07:12:04] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:13:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P50807 and previous config saved to /var/cache/conftool/dbconfig/20230822-071325-ladsgroup.json [07:14:14] !log sgimeno@deploy1002 sgimeno: Continuing with sync [07:16:32] (PS3) Filippo Giunchedi: confd: create run_dir via tmpfile [puppet] - https://gerrit.wikimedia.org/r/951141 (https://phabricator.wikimedia.org/T321678) [07:17:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:18:23] (CR) Filippo Giunchedi: confd: create run_dir via tmpfile (3 comments) [puppet] - https://gerrit.wikimedia.org/r/951141 (https://phabricator.wikimedia.org/T321678) (owner: Filippo Giunchedi) [07:20:10] (CR) Filippo Giunchedi: [C: +2] confd: create run_dir via tmpfile [puppet] - https://gerrit.wikimedia.org/r/951141 (https://phabricator.wikimedia.org/T321678) (owner: Filippo Giunchedi) [07:21:28] !log sgimeno@deploy1002 Finished scap: Backport for
[[gerrit:950168|GrowthExperiments: turn off AddLink in aswiki (T344319)]] (duration: 11m 41s) [07:21:33] T344319: Remove models with poor evaluation metrics from the published datasets repo - https://phabricator.wikimedia.org/T344319 [07:21:41] (CR) Filippo Giunchedi: [V: +1 C: +2] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42950/console" [puppet] - https://gerrit.wikimedia.org/r/951141 (https://phabricator.wikimedia.org/T321678) (owner: Filippo Giunchedi) [07:21:47] (CR) Muehlenhoff: [C: +2] firewall::service: Use correct type for port range [puppet] - https://gerrit.wikimedia.org/r/951135 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [07:22:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:23:46] jouncebot: next [07:23:46] In 2 hour(s) and 36 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T1000) [07:23:53] (PS4) Muehlenhoff: firewall: Add SSH rules in firewall-agnostic form [puppet] - https://gerrit.wikimedia.org/r/951118 (https://phabricator.wikimedia.org/T336497) [07:24:02] I'm going to reboot graphite1005 shortly [07:24:28] (PS1) Gmodena: Expose mediawiki.page_change.v1 publicly.
[deployment-charts] - https://gerrit.wikimedia.org/r/951426 (https://phabricator.wikimedia.org/T336817) [07:25:56] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1005.eqiad.wmnet [07:26:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P50808 and previous config saved to /var/cache/conftool/dbconfig/20230822-072600-ladsgroup.json [07:27:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P50809 and previous config saved to /var/cache/conftool/dbconfig/20230822-072704-ladsgroup.json [07:27:18] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1116.eqiad.wmnet with OS bullseye [07:28:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P50810 and previous config saved to /var/cache/conftool/dbconfig/20230822-072831-ladsgroup.json [07:30:14] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet [07:30:26] and mwlog hosts too, rebooting shortly [07:30:49] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwlog2002.codfw.wmnet [07:30:55] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/951118 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [07:31:49] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:33:27] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1005.eqiad.wmnet [07:33:47] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:34:03] PROBLEM - BGP status on
cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:34:41] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:35:36] that's centrallog ^ [07:36:05] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:36:33] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:36:51] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 111, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:37:04] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet [07:37:38] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog2002.codfw.wmnet [07:38:04] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:41:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T344589)', diff saved to https://phabricator.wikimedia.org/P50811 and previous config saved to /var/cache/conftool/dbconfig/20230822-074106-ladsgroup.json [07:41:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [07:41:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [07:41:58] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host centrallog1002.eqiad.wmnet [07:42:06] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host 
mwlog1002.eqiad.wmnet [07:42:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P50812 and previous config saved to /var/cache/conftool/dbconfig/20230822-074210-ladsgroup.json [07:43:04] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:43:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T343718)', diff saved to https://phabricator.wikimedia.org/P50813 and previous config saved to /var/cache/conftool/dbconfig/20230822-074338-ladsgroup.json [07:43:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [07:43:42] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:43:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [07:43:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1210 (T343718)', diff saved to https://phabricator.wikimedia.org/P50814 and previous config saved to /var/cache/conftool/dbconfig/20230822-074358-ladsgroup.json [07:45:13] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:46:09] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:46:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [07:46:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: 
Maintenance [07:47:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [07:47:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [07:47:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T344589)', diff saved to https://phabricator.wikimedia.org/P50815 and previous config saved to /var/cache/conftool/dbconfig/20230822-074725-ladsgroup.json [07:47:31] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:47:59] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:48:17] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog1002.eqiad.wmnet [07:48:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [07:48:37] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1117.eqiad.wmnet with OS bullseye [07:48:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [07:48:45] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog1002.eqiad.wmnet [07:50:17] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] benthos::instance: remove unused parameter port [puppet] - 10https://gerrit.wikimedia.org/r/951120 (owner: 10Giuseppe Lavagetto) [07:50:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [07:50:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance 
[07:51:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:51:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:53:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T344589)', diff saved to https://phabricator.wikimedia.org/P50816 and previous config saved to /var/cache/conftool/dbconfig/20230822-075329-ladsgroup.json [07:53:55] (03CR) 10Muehlenhoff: [C: 03+2] apt: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/951136 (owner: 10Muehlenhoff) [07:54:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [07:54:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [07:54:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T344589)', diff saved to https://phabricator.wikimedia.org/P50817 and previous config saved to /var/cache/conftool/dbconfig/20230822-075453-ladsgroup.json [07:56:45] (03PS2) 10Giuseppe Lavagetto: benthos: stop using strings as configuration [puppet] - 10https://gerrit.wikimedia.org/r/951121 [07:57:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T343718)', diff saved to https://phabricator.wikimedia.org/P50818 and previous config saved to /var/cache/conftool/dbconfig/20230822-075717-ladsgroup.json [07:57:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [07:57:21] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:57:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance 
[07:57:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T343718)', diff saved to https://phabricator.wikimedia.org/P50819 and previous config saved to /var/cache/conftool/dbconfig/20230822-075728-ladsgroup.json [07:58:10] (03PS1) 10EoghanGaffney: gitlab: Add warning banner to replica instances [puppet] - 10https://gerrit.wikimedia.org/r/951429 (https://phabricator.wikimedia.org/T344620) [07:58:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [08:01:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T344589)', diff saved to https://phabricator.wikimedia.org/P50820 and previous config saved to /var/cache/conftool/dbconfig/20230822-080119-ladsgroup.json [08:01:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T343718)', diff saved to https://phabricator.wikimedia.org/P50821 and previous config saved to /var/cache/conftool/dbconfig/20230822-080155-ladsgroup.json [08:02:31] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1117.eqiad.wmnet with reason: host reimage [08:03:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1021.eqiad.wmnet with reason: Maintenance [08:03:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1021.eqiad.wmnet with reason: Maintenance [08:03:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1021 (T344589)', diff saved to https://phabricator.wikimedia.org/P50822 and previous config saved to /var/cache/conftool/dbconfig/20230822-080328-ladsgroup.json [08:03:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maintenance [08:04:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maintenance 
[08:04:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2021 (T344589)', diff saved to https://phabricator.wikimedia.org/P50823 and previous config saved to /var/cache/conftool/dbconfig/20230822-080413-ladsgroup.json [08:05:34] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1117.eqiad.wmnet with reason: host reimage [08:05:38] !log installing Linux 4.19.289-2 on Buster hosts [08:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P50824 and previous config saved to /var/cache/conftool/dbconfig/20230822-080836-ladsgroup.json [08:10:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P50825 and previous config saved to /var/cache/conftool/dbconfig/20230822-081054-ladsgroup.json [08:12:30] !log bounce ferm on aux-k8s-ctrl1001 [08:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T343718)', diff saved to https://phabricator.wikimedia.org/P50826 and previous config saved to /var/cache/conftool/dbconfig/20230822-081458-ladsgroup.json [08:15:03] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:15:35] RECOVERY - Check whether ferm is active by checking the default input chain on aux-k8s-ctrl1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:15:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2021 (T344589)', diff saved to https://phabricator.wikimedia.org/P50827 and previous config saved to /var/cache/conftool/dbconfig/20230822-081537-ladsgroup.json [08:16:26] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P50828 and previous config saved to /var/cache/conftool/dbconfig/20230822-081626-ladsgroup.json [08:17:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P50829 and previous config saved to /var/cache/conftool/dbconfig/20230822-081701-ladsgroup.json [08:21:42] (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:22:47] (03CR) 10Jbond: confd: create run_dir via tmpfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/951141 (https://phabricator.wikimedia.org/T321678) (owner: 10Filippo Giunchedi) [08:23:29] (03CR) 10Jbond: [C: 03+1] confd: Explicitly require directory for systemd cleanup timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949496 (owner: 10Muehlenhoff) [08:23:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P50830 and previous config saved to /var/cache/conftool/dbconfig/20230822-082342-ladsgroup.json [08:24:00] (03PS1) 10Dreamy Jazz: clienthints: Collect Client Hints data on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951431 (https://phabricator.wikimedia.org/T341110) [08:24:09] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/951196 (owner: 10BCornwall) [08:24:53] (03CR) 10Btullis: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/951128 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [08:26:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P50831 and 
previous config saved to /var/cache/conftool/dbconfig/20230822-082559-ladsgroup.json [08:26:42] (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:26:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] benthos: stop using strings as configuration [puppet] - 10https://gerrit.wikimedia.org/r/951121 (owner: 10Giuseppe Lavagetto) [08:29:33] RECOVERY - Check systemd state on aux-k8s-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:29:47] RECOVERY - Check systemd state on aux-k8s-ctrl1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:30:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P50832 and previous config saved to /var/cache/conftool/dbconfig/20230822-083004-ladsgroup.json [08:30:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2021', diff saved to https://phabricator.wikimedia.org/P50833 and previous config saved to /var/cache/conftool/dbconfig/20230822-083044-ladsgroup.json [08:31:07] !log mwmaint1002: Stop frwiki instance of T315510 scripts due to a large volume of T343859 errors [08:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:13] T343859: DiscussionTools: LogicException: Database can't find our row and won't let us insert it - https://phabricator.wikimedia.org/T343859 [08:31:13] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [08:31:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to 
https://phabricator.wikimedia.org/P50834 and previous config saved to /var/cache/conftool/dbconfig/20230822-083132-ladsgroup.json [08:32:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P50835 and previous config saved to /var/cache/conftool/dbconfig/20230822-083207-ladsgroup.json [08:32:45] RECOVERY - Check whether ferm is active by checking the default input chain on aux-k8s-ctrl1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:38:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T344589)', diff saved to https://phabricator.wikimedia.org/P50836 and previous config saved to /var/cache/conftool/dbconfig/20230822-083848-ladsgroup.json [08:38:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [08:39:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [08:39:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T344589)', diff saved to https://phabricator.wikimedia.org/P50837 and previous config saved to /var/cache/conftool/dbconfig/20230822-083912-ladsgroup.json [08:41:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P50838 and previous config saved to /var/cache/conftool/dbconfig/20230822-084104-ladsgroup.json [08:41:47] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 3 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10akosiaris) >>! In T340036#9062519, @vadim-kovalenko wrote: > Hi there! I'm responsible for Kiwix migration to another API, but... 
[08:42:12] !log fabfur@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_ulsfo and A:cp [08:42:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [08:42:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [08:43:41] !log restart ATS on cp5024 to clean the ATS restart alert - T344674 [08:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:45] T344674: ATS automatically restarted due to receiving SIGUSR2 - https://phabricator.wikimedia.org/T344674 [08:43:58] (03PS1) 10Muehlenhoff: firewall::service/firewall::client: Fix function name for dump_params() [puppet] - 10https://gerrit.wikimedia.org/r/951432 (https://phabricator.wikimedia.org/T336497) [08:44:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1135.eqiad.wmnet with reason: Maintenance [08:44:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1135.eqiad.wmnet with reason: Maintenance [08:44:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T336886)', diff saved to https://phabricator.wikimedia.org/P50839 and previous config saved to /var/cache/conftool/dbconfig/20230822-084445-ladsgroup.json [08:44:49] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [08:45:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P50840 and previous config saved to /var/cache/conftool/dbconfig/20230822-084510-ladsgroup.json [08:45:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T344589)', diff saved to https://phabricator.wikimedia.org/P50841 and previous config saved to 
/var/cache/conftool/dbconfig/20230822-084536-ladsgroup.json [08:45:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2021', diff saved to https://phabricator.wikimedia.org/P50842 and previous config saved to /var/cache/conftool/dbconfig/20230822-084550-ladsgroup.json [08:46:06] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951432 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:46:07] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 3 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10vadim-kovalenko) @akosiaris , I've updated regexp, and now it works, thank you! [08:46:25] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:46:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T344589)', diff saved to https://phabricator.wikimedia.org/P50843 and previous config saved to /var/cache/conftool/dbconfig/20230822-084638-ladsgroup.json [08:46:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [08:46:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [08:47:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T344589)', diff saved to https://phabricator.wikimedia.org/P50844 and previous config saved to /var/cache/conftool/dbconfig/20230822-084703-ladsgroup.json [08:47:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P50845 and previous config saved to /var/cache/conftool/dbconfig/20230822-084712-root.json [08:47:14] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T343718)', diff saved to https://phabricator.wikimedia.org/P50846 and previous config saved to /var/cache/conftool/dbconfig/20230822-084713-ladsgroup.json [08:47:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [08:47:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [08:47:19] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:47:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3315 (T343718)', diff saved to https://phabricator.wikimedia.org/P50847 and previous config saved to /var/cache/conftool/dbconfig/20230822-084724-ladsgroup.json [08:48:25] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Ricki Jay (WMDE) - https://phabricator.wikimedia.org/T343700 (10jbond) [08:48:36] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10jbond) [08:48:39] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10jbond) >>! In T343508#9085371, @Eevans wrote: > @KFrancis can you confirm we have an NDA on file? 
confirmed by @KFrancis in T343508 [08:49:02] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10jbond) [08:49:15] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:00] (03CR) 10JMeybohm: [C: 04-1] aux: add tlsHostnames for jaeger collector and query (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [08:52:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:53:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T344589)', diff saved to https://phabricator.wikimedia.org/P50848 and previous config saved to /var/cache/conftool/dbconfig/20230822-085332-ladsgroup.json [08:53:51] (03PS1) 10Jbond: dmin: add RickiJay-WMDE to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/951433 (https://phabricator.wikimedia.org/T343508) [08:55:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42951/console" [puppet] - 10https://gerrit.wikimedia.org/r/951433 (https://phabricator.wikimedia.org/T343508) (owner: 10Jbond) [08:55:21] (03CR) 10Jbond: [V: 03+1 C: 03+2] dmin: add RickiJay-WMDE to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/951433 (https://phabricator.wikimedia.org/T343508) (owner: 10Jbond) [08:57:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:57:25] (03CR) 10Hashar: Recognize ~/.config/docker-pkg.yaml (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar) [08:57:36] (03PS5) 10Hashar: Recognize ~/.config/docker-pkg.yaml [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 [08:58:37] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] confd: create run_dir via tmpfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/951141 (https://phabricator.wikimedia.org/T321678) (owner: 10Filippo Giunchedi) [08:58:55] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10jbond) 05Stalled→03Resolved a:03jbond @RickiJay-WMDE I have now added access to the releasers-wikibase unix group and wmde ldap group please reope... [09:00:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T343718)', diff saved to https://phabricator.wikimedia.org/P50849 and previous config saved to /var/cache/conftool/dbconfig/20230822-090016-ladsgroup.json [09:00:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [09:00:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [09:00:27] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:00:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T343718)', diff saved to https://phabricator.wikimedia.org/P50850 and previous config saved to /var/cache/conftool/dbconfig/20230822-090026-ladsgroup.json [09:00:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T336886)', diff saved to 
https://phabricator.wikimedia.org/P50851 and previous config saved to /var/cache/conftool/dbconfig/20230822-090036-ladsgroup.json [09:00:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P50852 and previous config saved to /var/cache/conftool/dbconfig/20230822-090042-ladsgroup.json [09:00:51] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [09:00:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2021 (T344589)', diff saved to https://phabricator.wikimedia.org/P50853 and previous config saved to /var/cache/conftool/dbconfig/20230822-090056-ladsgroup.json [09:02:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P50854 and previous config saved to /var/cache/conftool/dbconfig/20230822-090217-root.json [09:03:25] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-on-k8s: Raise traffic to 2% [puppet] - 10https://gerrit.wikimedia.org/r/951131 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [09:06:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T343718)', diff saved to https://phabricator.wikimedia.org/P50855 and previous config saved to /var/cache/conftool/dbconfig/20230822-090646-ladsgroup.json [09:06:51] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:08:13] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (10Clement_Goubert) Pending more hardware, we will move on to 2% first. 
[09:08:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1021 (T344589)', diff saved to https://phabricator.wikimedia.org/P50856 and previous config saved to /var/cache/conftool/dbconfig/20230822-090832-ladsgroup.json [09:08:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P50857 and previous config saved to /var/cache/conftool/dbconfig/20230822-090838-ladsgroup.json [09:09:28] jouncebot: nowandnext [09:09:28] No deployments scheduled for the next 0 hour(s) and 50 minute(s) [09:09:28] In 0 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T1000) [09:10:40] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Raise traffic to 2% [puppet] - 10https://gerrit.wikimedia.org/r/951131 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [09:11:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T343718)', diff saved to https://phabricator.wikimedia.org/P50858 and previous config saved to /var/cache/conftool/dbconfig/20230822-091113-ladsgroup.json [09:11:15] !log Redirecting 2% of global traffic to mw-on-k8s - T341780 [09:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:22] T341780: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 [09:11:42] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [09:12:16] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Move config-master to dedicated VMs - https://phabricator.wikimedia.org/T341717 (10jbond) 05Open→03Resolved a:03jbond [09:12:27] vgutierrez: I'm gonna run puppet on A:cp-text and P{P:trafficserver::backend}, ok with 
you? [09:12:39] Or do you prefer I let it run on the normal schedule [09:13:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P50859 and previous config saved to /var/cache/conftool/dbconfig/20230822-091334-ladsgroup.json [09:13:42] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Add conftool::master to puppetserver - https://phabricator.wikimedia.org/T341721 (10jbond) 05Open→03Resolved a:03jbond [09:13:53] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [09:14:09] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetmaster::fetch_swift_rings: rename profile [puppet] - 10https://gerrit.wikimedia.org/r/951138 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [09:15:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P50860 and previous config saved to /var/cache/conftool/dbconfig/20230822-091542-ladsgroup.json [09:15:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P50861 and previous config saved to /var/cache/conftool/dbconfig/20230822-091549-ladsgroup.json [09:17:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P50862 and previous config saved to /var/cache/conftool/dbconfig/20230822-091722-root.json [09:17:46] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42952/console" [puppet] - 10https://gerrit.wikimedia.org/r/951138 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [09:19:39] (03PS1) 10Ladsgroup: Stop writing to the old 
columns of extlinks in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951436 (https://phabricator.wikimedia.org/T342683) [09:21:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P50863 and previous config saved to /var/cache/conftool/dbconfig/20230822-092153-ladsgroup.json [09:21:53] (03PS11) 10Jbond: puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) [09:23:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1021', diff saved to https://phabricator.wikimedia.org/P50864 and previous config saved to /var/cache/conftool/dbconfig/20230822-092338-ladsgroup.json [09:23:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P50865 and previous config saved to /var/cache/conftool/dbconfig/20230822-092344-ladsgroup.json [09:26:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P50866 and previous config saved to /var/cache/conftool/dbconfig/20230822-092620-ladsgroup.json [09:28:05] claime: missed that, sorry [09:28:12] No worries [09:28:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:28:16] I'm letting it run naturally [09:28:24] The slow ramp up is ok with me [09:28:31] Ngh parsoid [09:28:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 25%: Maint over', diff saved to 
https://phabricator.wikimedia.org/P50867 and previous config saved to /var/cache/conftool/dbconfig/20230822-092838-ladsgroup.json [09:28:41] It's not linked to the mw-on-k8s change, btw [09:30:09] (03PS3) 10Filippo Giunchedi: aux: add tlsHostnames for jaeger collector and query [deployment-charts] - 10https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253) [09:30:11] (03CR) 10Filippo Giunchedi: aux: add tlsHostnames for jaeger collector and query (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [09:30:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P50868 and previous config saved to /var/cache/conftool/dbconfig/20230822-093049-ladsgroup.json [09:30:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T344589)', diff saved to https://phabricator.wikimedia.org/P50869 and previous config saved to /var/cache/conftool/dbconfig/20230822-093055-ladsgroup.json [09:31:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [09:31:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [09:31:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:31:32] !log pooling temporarily kartotherian codfw [09:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:31:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 
(T344589)', diff saved to https://phabricator.wikimedia.org/P50870 and previous config saved to /var/cache/conftool/dbconfig/20230822-093147-ladsgroup.json [09:32:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P50871 and previous config saved to /var/cache/conftool/dbconfig/20230822-093227-root.json [09:32:29] (03PS1) 10Jelto: trafficserver: use eqiad cname for all miscweb services [puppet] - 10https://gerrit.wikimedia.org/r/951437 [09:33:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:35:47] (03CR) 10JMeybohm: [C: 03+1] aux: add tlsHostnames for jaeger collector and query [deployment-charts] - 10https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [09:36:33] (03CR) 10Filippo Giunchedi: [C: 03+2] aux: add tlsHostnames for jaeger collector and query [deployment-charts] - 10https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [09:36:36] (03PS12) 10Jbond: puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) [09:36:55] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] aux: add tlsHostnames for jaeger collector and query [deployment-charts] - 10https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [09:36:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P50872 and previous 
config saved to /var/cache/conftool/dbconfig/20230822-093659-ladsgroup.json [09:38:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1021', diff saved to https://phabricator.wikimedia.org/P50873 and previous config saved to /var/cache/conftool/dbconfig/20230822-093844-ladsgroup.json [09:38:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T344589)', diff saved to https://phabricator.wikimedia.org/P50874 and previous config saved to /var/cache/conftool/dbconfig/20230822-093850-ladsgroup.json [09:38:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [09:39:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [09:39:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T344589)', diff saved to https://phabricator.wikimedia.org/P50875 and previous config saved to /var/cache/conftool/dbconfig/20230822-093915-ladsgroup.json [09:39:54] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [09:40:14] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[09:40:33] (CR) Giuseppe Lavagetto: [C: -1] "The patch needs to be updated to the new naming of the installserver role; apart from that, I've added a comment outlining two roles that " [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond) [09:41:26] (PS13) Jbond: puppetserver: add volatile config [puppet] - https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) [09:41:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P50876 and previous config saved to /var/cache/conftool/dbconfig/20230822-094126-ladsgroup.json [09:42:38] (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42956/console" [puppet] - https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) (owner: Jbond) [09:43:05] SRE, SRE-OnFire, Incident Tooling, Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377 (Urbanecm) >>! In T343377#9107361, @RLazarus wrote: >>>! In T343377#9105446, @SLyngshede-WMF wrote: >> If it's just a matter of managing a LDA... 
[09:43:18] (CR) Filippo Giunchedi: [C: +1] data-engineering: flink: alert based on active site (1 comment) [alerts] - https://gerrit.wikimedia.org/r/939651 (https://phabricator.wikimedia.org/T342258) (owner: Gmodena) [09:43:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P50877 and previous config saved to /var/cache/conftool/dbconfig/20230822-094343-ladsgroup.json [09:43:47] !log depool codfw kartotherian (maps) [09:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:56] (CR) Jbond: [V: +1 C: +2] puppetserver: add volatile config [puppet] - https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) (owner: Jbond) [09:45:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T336886)', diff saved to https://phabricator.wikimedia.org/P50878 and previous config saved to /var/cache/conftool/dbconfig/20230822-094555-ladsgroup.json [09:46:00] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [09:46:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T344589)', diff saved to https://phabricator.wikimedia.org/P50879 and previous config saved to /var/cache/conftool/dbconfig/20230822-094653-ladsgroup.json [09:47:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T344589)', diff saved to https://phabricator.wikimedia.org/P50880 and previous config saved to /var/cache/conftool/dbconfig/20230822-094712-ladsgroup.json [09:52:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T343718)', diff saved to https://phabricator.wikimedia.org/P50881 and previous config saved to /var/cache/conftool/dbconfig/20230822-095205-ladsgroup.json [09:52:07] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [09:52:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "I just realized that this parameter is picked up by Prometheus to know which port to query for metrics, we could default to the env or sth" [puppet] - 10https://gerrit.wikimedia.org/r/951120 (owner: 10Giuseppe Lavagetto) [09:52:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [09:52:10] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:53:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1021 (T344589)', diff saved to https://phabricator.wikimedia.org/P50882 and previous config saved to /var/cache/conftool/dbconfig/20230822-095351-ladsgroup.json [09:54:40] (03PS1) 10Giuseppe Lavagetto: Revert "benthos::instance: remove unused parameter port" [puppet] - 10https://gerrit.wikimedia.org/r/950821 [09:54:53] (03CR) 10CI reject: [V: 04-1] Revert "benthos::instance: remove unused parameter port" [puppet] - 10https://gerrit.wikimedia.org/r/950821 (owner: 10Giuseppe Lavagetto) [09:56:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T343718)', diff saved to https://phabricator.wikimedia.org/P50883 and previous config saved to /var/cache/conftool/dbconfig/20230822-095632-ladsgroup.json [09:57:41] 10SRE, 10SRE-OnFire, 10Incident Tooling, 10Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377 (10Urbanecm) As a matter of the first step, I don't think we need to give //all// stewards access to Klaxon immediately. This is already the ca... 
[09:58:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P50884 and previous config saved to /var/cache/conftool/dbconfig/20230822-095848-ladsgroup.json [09:58:55] (03PS2) 10Giuseppe Lavagetto: Revert "benthos::instance: remove unused parameter port" [puppet] - 10https://gerrit.wikimedia.org/r/950821 [09:59:25] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "benthos::instance: remove unused parameter port" [puppet] - 10https://gerrit.wikimedia.org/r/950821 (owner: 10Giuseppe Lavagetto) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T1000) [10:02:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P50885 and previous config saved to /var/cache/conftool/dbconfig/20230822-100200-ladsgroup.json [10:02:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P50886 and previous config saved to /var/cache/conftool/dbconfig/20230822-100219-ladsgroup.json [10:02:21] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "benthos::instance: remove unused parameter port" [puppet] - 10https://gerrit.wikimedia.org/r/950821 (owner: 10Giuseppe Lavagetto) [10:04:45] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-aux - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:04:45] (03PS1) 10JMeybohm: jaeger: Don't skip host verification for connections to ES [deployment-charts] - 10https://gerrit.wikimedia.org/r/951438 (https://phabricator.wikimedia.org/T344253) [10:05:27] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10vm-requests: 1 VMs requested for 
stewards - https://phabricator.wikimedia.org/T344164 (10TheresNoTime) //+1 from me, fwiw// [10:05:39] (03PS1) 10Hnowlan: sites: add new kubernetes hosts [homer/public] - 10https://gerrit.wikimedia.org/r/951439 (https://phabricator.wikimedia.org/T343993) [10:07:37] (03CR) 10Filippo Giunchedi: [C: 03+1] jaeger: Don't skip host verification for connections to ES [deployment-charts] - 10https://gerrit.wikimedia.org/r/951438 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [10:08:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:08:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:09:46] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-aux - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:11:50] 10SRE, 10Maps: Allow Wikimedia Maps usage on wikidata.pl - https://phabricator.wikimedia.org/T344678 (10Ada_Jakubowska_WMPL) [10:12:01] (03CR) 10JMeybohm: [C: 03+2] jaeger: Don't skip host verification for connections to ES [deployment-charts] - 10https://gerrit.wikimedia.org/r/951438 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [10:17:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P50887 and previous config saved to /var/cache/conftool/dbconfig/20230822-101706-ladsgroup.json [10:17:13] (03PS1) 10Jbond: external_clouds_vendors: add way top specify the private repo [puppet] - 10https://gerrit.wikimedia.org/r/951440 (https://phabricator.wikimedia.org/T341056) [10:17:15] (03PS1) 10Jbond: puppetserver::volatile: pass through correct private repo path [puppet] - 
10https://gerrit.wikimedia.org/r/951441 (https://phabricator.wikimedia.org/T341056) [10:17:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P50888 and previous config saved to /var/cache/conftool/dbconfig/20230822-101725-ladsgroup.json [10:19:12] (03PS1) 10Hnowlan: install_server: configure disks for new kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/951442 (https://phabricator.wikimedia.org/T343993) [10:20:19] (03CR) 10Muehlenhoff: [C: 03+2] firewall::service/firewall::client: Fix function name for dump_params() [puppet] - 10https://gerrit.wikimedia.org/r/951432 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:20:48] (03PS1) 10Filippo Giunchedi: Revert "aux: add grpc/http ports for jaeger collector" [deployment-charts] - 10https://gerrit.wikimedia.org/r/950822 [10:24:37] (03PS5) 10Muehlenhoff: firewall: Add SSH rules in firewall-agnostic form [puppet] - 10https://gerrit.wikimedia.org/r/951118 (https://phabricator.wikimedia.org/T336497) [10:26:45] (03CR) 10Anzx: Some initial configurations for suwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949183 (https://phabricator.wikimedia.org/T344314) (owner: 10Anzx) [10:27:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951118 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:29:16] (03CR) 10Clément Goubert: [C: 03+1] install_server: configure disks for new kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/951442 (https://phabricator.wikimedia.org/T343993) (owner: 10Hnowlan) [10:29:53] (03CR) 10Effie Mouzeli: [C: 03+1] sites: add new kubernetes hosts [homer/public] - 10https://gerrit.wikimedia.org/r/951439 (https://phabricator.wikimedia.org/T343993) (owner: 10Hnowlan) [10:31:42] (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:32:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T344589)', diff saved to https://phabricator.wikimedia.org/P50889 and previous config saved to /var/cache/conftool/dbconfig/20230822-103212-ladsgroup.json [10:32:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [10:32:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T344589)', diff saved to https://phabricator.wikimedia.org/P50890 and previous config saved to /var/cache/conftool/dbconfig/20230822-103231-ladsgroup.json [10:32:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [10:32:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:32:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T344589)', diff saved to https://phabricator.wikimedia.org/P50891 and previous config saved to /var/cache/conftool/dbconfig/20230822-103237-ladsgroup.json [10:32:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:32:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P50892 and previous config saved to /var/cache/conftool/dbconfig/20230822-103255-ladsgroup.json [10:34:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T344589)', diff saved to https://phabricator.wikimedia.org/P50893 and previous config saved to /var/cache/conftool/dbconfig/20230822-103417-ladsgroup.json 
[10:34:25] (03CR) 10JMeybohm: [C: 04-1] Revert "aux: add grpc/http ports for jaeger collector" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/950822 (owner: 10Filippo Giunchedi) [10:38:35] (03PS1) 10JMeybohm: jaeger: Don't pull images via CDN [deployment-charts] - 10https://gerrit.wikimedia.org/r/951443 (https://phabricator.wikimedia.org/T344253) [10:39:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P50894 and previous config saved to /var/cache/conftool/dbconfig/20230822-103919-ladsgroup.json [10:41:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T344589)', diff saved to https://phabricator.wikimedia.org/P50895 and previous config saved to /var/cache/conftool/dbconfig/20230822-104106-ladsgroup.json [10:45:02] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/951439 (https://phabricator.wikimedia.org/T343993) (owner: 10Hnowlan) [10:46:43] (03CR) 10Hnowlan: [C: 03+2] sites: add new kubernetes hosts [homer/public] - 10https://gerrit.wikimedia.org/r/951439 (https://phabricator.wikimedia.org/T343993) (owner: 10Hnowlan) [10:47:16] (03Merged) 10jenkins-bot: sites: add new kubernetes hosts [homer/public] - 10https://gerrit.wikimedia.org/r/951439 (https://phabricator.wikimedia.org/T343993) (owner: 10Hnowlan) [10:47:43] (03PS1) 10Gmodena: Declare v1 of the page_content_change stream. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951444 (https://phabricator.wikimedia.org/T307959) [10:48:04] (03CR) 10Muehlenhoff: [C: 03+2] standard_packages: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/950133 (owner: 10Muehlenhoff) [10:51:12] (03CR) 10Clément Goubert: "Couple comments inline, I can't find a definitive answer." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/951043 (https://phabricator.wikimedia.org/T334064) (owner: 10Giuseppe Lavagetto) [10:52:05] (03CR) 10EoghanGaffney: [C: 03+1] trafficserver: use eqiad cname for all miscweb services [puppet] - 10https://gerrit.wikimedia.org/r/951437 (owner: 10Jelto) [10:54:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P50896 and previous config saved to /var/cache/conftool/dbconfig/20230822-105425-ladsgroup.json [10:56:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P50897 and previous config saved to /var/cache/conftool/dbconfig/20230822-105613-ladsgroup.json [10:59:11] (03PS1) 10Gmodena: mw-page-content-change-enrich: stream version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/951446 [10:59:15] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host vrts1001.eqiad.wmnet [11:00:03] (03PS2) 10Gmodena: mw-page-content-change-enrich: stream version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/951446 (https://phabricator.wikimedia.org/T307959) [11:02:32] (03PS1) 10Muehlenhoff: smart: Simplify check for hpsa [puppet] - 10https://gerrit.wikimedia.org/r/951448 [11:03:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951448 (owner: 10Muehlenhoff) [11:03:19] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1001.eqiad.wmnet [11:04:13] (03CR) 10Hnowlan: [C: 03+2] install_server: configure disks for new kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/951442 (https://phabricator.wikimedia.org/T343993) (owner: 10Hnowlan) [11:04:29] !log delete old ams-ix circuits from ams-ix potal - T344579 [11:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:33] PROBLEM - BGP status on 
cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:06:11] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:08:10] !log eoghan@cumin1001 START - Cookbook sre.hosts.reboot-single for host planet2002.codfw.wmnet [11:09:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P50898 and previous config saved to /var/cache/conftool/dbconfig/20230822-110932-ladsgroup.json [11:11:10] (03CR) 10Jelto: [C: 03+2] trafficserver: use eqiad cname for all miscweb services [puppet] - 10https://gerrit.wikimedia.org/r/951437 (owner: 10Jelto) [11:11:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P50899 and previous config saved to /var/cache/conftool/dbconfig/20230822-111119-ladsgroup.json [11:12:10] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host planet2002.codfw.wmnet [11:12:21] !log delete RIPE route object for 91.198.174.0/24 - T344579 [11:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:32] !log eoghan@cumin1001 START - Cookbook sre.hosts.reboot-single for host planet1002.eqiad.wmnet [11:13:28] !log delete RIPE route6 object for 2a02:ec80:500::/48 - T344579 [11:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:33] !log delete RPKI ROAs for 91.198.174.0/24 and 2a02:ec80:500::/48 - T344579 [11:15:35] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:32] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host planet1002.eqiad.wmnet [11:18:06] jouncebot: nownandnext [11:18:11] jouncebot: nowandnext [11:18:11] No deployments scheduled for the next 1 hour(s) and 41 minute(s) [11:18:11] In 1 hour(s) and 41 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T1300) [11:18:11] In 1 hour(s) and 41 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T1300) [11:18:24] Amir1: hi, any objections with me pushing out https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/951367? [11:20:11] (03PS2) 10Muehlenhoff: smart: Simplify check for hpsa [puppet] - 10https://gerrit.wikimedia.org/r/951448 [11:20:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42959/console" [puppet] - 10https://gerrit.wikimedia.org/r/951440 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [11:22:24] (03CR) 10Jbond: [V: 03+1 C: 03+2] external_clouds_vendors: add way top specify the private repo [puppet] - 10https://gerrit.wikimedia.org/r/951440 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [11:22:28] (03CR) 10Jbond: [C: 03+2] puppetserver::volatile: pass through correct private repo path [puppet] - 10https://gerrit.wikimedia.org/r/951441 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [11:22:50] (03PS1) 10JMeybohm: jaeger: Enable TLS for query (UI) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951450 (https://phabricator.wikimedia.org/T344253) [11:24:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P50900 and previous config saved to /var/cache/conftool/dbconfig/20230822-112438-ladsgroup.json 
[11:24:46] (03CR) 10JMeybohm: "According to https://www.jaegertracing.io/docs/1.41/cli/#jaeger-query-elasticsearch something like this should work" [deployment-charts] - 10https://gerrit.wikimedia.org/r/951450 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [11:25:58] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1057.eqiad.wmnet with OS bullseye [11:26:06] 10SRE, 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1001 for host kubernetes1057.eqiad.wmnet with OS bullseye [11:26:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T344589)', diff saved to https://phabricator.wikimedia.org/P50901 and previous config saved to /var/cache/conftool/dbconfig/20230822-112625-ladsgroup.json [11:26:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [11:26:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [11:26:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T344589)', diff saved to https://phabricator.wikimedia.org/P50902 and previous config saved to /var/cache/conftool/dbconfig/20230822-112650-ladsgroup.json [11:27:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T344589)', diff saved to https://phabricator.wikimedia.org/P50903 and previous config saved to /var/cache/conftool/dbconfig/20230822-112659-ladsgroup.json [11:28:47] !log eoghan@cumin1001 START - Cookbook sre.hosts.reboot-single for host moscovium.eqiad.wmnet [11:29:35] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1058.eqiad.wmnet with OS bullseye [11:29:40] 10SRE, 10ops-eqiad, 10serviceops: Move eqiad 
thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1001 for host kubernetes1058.eqiad.wmnet with OS bullseye [11:32:37] (03CR) 10Muehlenhoff: [C: 03+2] Add a nftables::file::service define to install a custom nftables input rule [puppet] - 10https://gerrit.wikimedia.org/r/951123 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:32:43] (03PS2) 10Muehlenhoff: Add a nftables::file::service define to install a custom nftables input rule [puppet] - 10https://gerrit.wikimedia.org/r/951123 (https://phabricator.wikimedia.org/T336497) [11:32:43] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moscovium.eqiad.wmnet [11:33:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T344589)', diff saved to https://phabricator.wikimedia.org/P50904 and previous config saved to /var/cache/conftool/dbconfig/20230822-113313-ladsgroup.json [11:36:01] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2055.codfw.wmnet with OS bullseye [11:36:08] 10SRE, 10ops-codfw, 10serviceops: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host kubernetes2055.codfw.wmnet with OS bullseye [11:36:38] (03PS1) 10Jbond: puppetserver::volatile: ony update conftool on puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/951451 (https://phabricator.wikimedia.org/T341056) [11:36:47] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [11:37:18] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [11:40:01] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2056.codfw.wmnet with OS bullseye [11:40:09] 10SRE, 10ops-codfw, 10serviceops: 
Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host kubernetes2056.codfw.wmnet with OS bullseye [11:41:50] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [11:41:53] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [11:42:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P50905 and previous config saved to /var/cache/conftool/dbconfig/20230822-114206-ladsgroup.json [11:47:25] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host miscweb2003.codfw.wmnet [11:47:36] (03PS6) 10Muehlenhoff: firewall: Add SSH rules in firewall-agnostic form [puppet] - 10https://gerrit.wikimedia.org/r/951118 (https://phabricator.wikimedia.org/T336497) [11:47:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951448 (owner: 10Muehlenhoff) [11:48:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P50906 and previous config saved to /var/cache/conftool/dbconfig/20230822-114820-ladsgroup.json [11:48:38] (03CR) 10Jbond: [C: 03+2] puppetserver::volatile: ony update conftool on puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/951451 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [11:49:36] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: sync [11:49:39] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: sync [11:50:43] (03CR) 10JMeybohm: Add cookbook to configure router's BGP sessions to k8s hosts (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [11:51:31] !log jelto@cumin1001 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host miscweb2003.codfw.wmnet [11:56:07] (03CR) 10Filippo Giunchedi: "thank you for the quick review, see inline" [deployment-charts] - 10https://gerrit.wikimedia.org/r/950822 (owner: 10Filippo Giunchedi) [11:57:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P50907 and previous config saved to /var/cache/conftool/dbconfig/20230822-115712-ladsgroup.json [11:59:37] (03PS1) 10Jelto: trafficserver: switch all miscweb services to codfw cname [puppet] - 10https://gerrit.wikimedia.org/r/951456 [12:00:21] !log eoghan@cumin1001 START - Cookbook sre.hosts.reboot-single for host aphlict1002.eqiad.wmnet [12:02:03] (03PS1) 10Jbond: puppetserver::volatile: we need to ensure if *not* empty [puppet] - 10https://gerrit.wikimedia.org/r/951457 (https://phabricator.wikimedia.org/T341056) [12:02:10] (03PS1) 10Stevemunene: switch an-worker[17-48] to reuse-analytics-hadoop recipe [puppet] - 10https://gerrit.wikimedia.org/r/951458 (https://phabricator.wikimedia.org/T332570) [12:02:45] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aphlict1002.eqiad.wmnet [12:03:00] !log eoghan@cumin1001 START - Cookbook sre.hosts.reboot-single for host aphlict2001.codfw.wmnet [12:03:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P50908 and previous config saved to /var/cache/conftool/dbconfig/20230822-120326-ladsgroup.json [12:04:24] (03PS3) 10Ayounsi: Only advertise local customers to external peers [homer/public] - 10https://gerrit.wikimedia.org/r/947993 (https://phabricator.wikimedia.org/T334530) [12:04:27] (03CR) 10Filippo Giunchedi: [C: 03+1] jaeger: Don't pull images via CDN [deployment-charts] - 10https://gerrit.wikimedia.org/r/951443 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [12:05:08] 
(03CR) 10Filippo Giunchedi: [C: 03+1] "Very cool" [deployment-charts] - 10https://gerrit.wikimedia.org/r/951450 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [12:05:52] (03PS1) 10Muehlenhoff: Add additional sets for monitoring/prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/951459 (https://phabricator.wikimedia.org/T336497) [12:06:16] (03CR) 10CI reject: [V: 04-1] Add additional sets for monitoring/prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/951459 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:06:42] (03CR) 10Jbond: [C: 03+2] puppetserver::volatile: we need to ensure if *not* empty [puppet] - 10https://gerrit.wikimedia.org/r/951457 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [12:07:46] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aphlict2001.codfw.wmnet [12:08:26] (03PS3) 10Filippo Giunchedi: otel-collector: export traces to jaeger [deployment-charts] - 10https://gerrit.wikimedia.org/r/946518 (https://phabricator.wikimedia.org/T343302) [12:08:28] (03PS2) 10Filippo Giunchedi: Revert "aux: add grpc/http ports for jaeger collector" [deployment-charts] - 10https://gerrit.wikimedia.org/r/950822 [12:09:13] !log eoghan@cumin1001 START - Cookbook sre.hosts.reboot-single for host doc2002.codfw.wmnet [12:10:08] jouncebot: next [12:10:08] In 0 hour(s) and 49 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T1300) [12:10:08] In 0 hour(s) and 49 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T1300) [12:10:17] ok I'll do a bunch of prometheus reboots [12:10:33] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet [12:11:57] (03CR) 10Gmodena: data-engineering: flink: alert based on active site (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/939651 
(https://phabricator.wikimedia.org/T342258) (owner: 10Gmodena) [12:12:00] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet [12:12:02] (03CR) 10Gmodena: [C: 03+2] data-engineering: flink: alert based on active site [alerts] - 10https://gerrit.wikimedia.org/r/939651 (https://phabricator.wikimedia.org/T342258) (owner: 10Gmodena) [12:12:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T344589)', diff saved to https://phabricator.wikimedia.org/P50909 and previous config saved to /var/cache/conftool/dbconfig/20230822-121218-ladsgroup.json [12:12:19] taavi: I'm out sick [12:12:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:12:24] But looks good [12:12:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:12:38] (03PS2) 10Muehlenhoff: Add additional sets for monitoring/prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/951459 (https://phabricator.wikimedia.org/T336497) [12:13:09] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host doc2002.codfw.wmnet [12:13:49] (03Merged) 10jenkins-bot: data-engineering: flink: alert based on active site [alerts] - 10https://gerrit.wikimedia.org/r/939651 (https://phabricator.wikimedia.org/T342258) (owner: 10Gmodena) [12:13:50] !log eoghan@cumin1001 START - Cookbook sre.hosts.reboot-single for host doc1003.eqiad.wmnet [12:15:00] (03CR) 10CI reject: [V: 04-1] Add additional sets for monitoring/prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/951459 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:15:37] 10sre-alert-triage, 10Data-Platform-SRE, 10Patch-For-Review: Alert: Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - 
https://phabricator.wikimedia.org/T343318 (10gmodena) >>! In T343318#9101649, @gerritbot wrote: > Change 939651 had a related patch set uploaded (by Gmode... [12:16:29] (03CR) 10Stevemunene: [C: 03+2] datahub: fix cidr typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/951128 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [12:16:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [12:17:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [12:17:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T344589)', diff saved to https://phabricator.wikimedia.org/P50910 and previous config saved to /var/cache/conftool/dbconfig/20230822-121714-ladsgroup.json [12:17:19] (03Merged) 10jenkins-bot: datahub: fix cidr typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/951128 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [12:17:32] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet [12:17:44] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host doc1003.eqiad.wmnet [12:18:05] (03CR) 10Ayounsi: "Thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [12:18:13] !log eoghan@cumin1001 START - Cookbook sre.hosts.reboot-single for host releases2003.codfw.wmnet [12:18:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T344589)', diff saved to https://phabricator.wikimedia.org/P50911 and previous config saved to /var/cache/conftool/dbconfig/20230822-121832-ladsgroup.json [12:18:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [12:18:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [12:18:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [12:19:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [12:19:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T344589)', diff saved to https://phabricator.wikimedia.org/P50912 and previous config saved to /var/cache/conftool/dbconfig/20230822-121913-ladsgroup.json [12:19:34] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet [12:20:10] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [12:20:13] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus [12:20:15] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet [12:20:39] the thanos alerts are expected [12:21:27] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet [12:22:06] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1058.eqiad.wmnet with OS bullseye [12:22:11] 10SRE, 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1001 for host kubernetes1058.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1058 (**... [12:22:20] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [12:22:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:22:50] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host releases2003.codfw.wmnet [12:23:01] (03PS1) 10Jbond: swift/thanos: allow puppetservers to also pull swift rings [puppet] - 10https://gerrit.wikimedia.org/r/951462 (https://phabricator.wikimedia.org/T341056) [12:23:04] !log eoghan@cumin1001 START - Cookbook sre.hosts.reboot-single for host releases1003.eqiad.wmnet [12:23:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T344589)', diff saved to https://phabricator.wikimedia.org/P50913 and previous config saved to 
/var/cache/conftool/dbconfig/20230822-122338-ladsgroup.json [12:24:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42967/console" [puppet] - 10https://gerrit.wikimedia.org/r/951462 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [12:24:36] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host grafana1002.eqiad.wmnet [12:25:10] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [12:25:13] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus [12:25:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T344589)', diff saved to https://phabricator.wikimedia.org/P50914 and previous config saved to /var/cache/conftool/dbconfig/20230822-122538-ladsgroup.json [12:25:52] !log stevemunene@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [12:25:54] (03CR) 10Slyngshede: [V: 03+1] "This is done, obviously, but before I allocate time to configure a fixed date for each device, I would like input on the solution." 
[puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [12:26:58] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host releases1003.eqiad.wmnet [12:27:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:28:00] (03CR) 10Klausman: "This change is ready for review." [labs/private] - 10https://gerrit.wikimedia.org/r/951464 (owner: 10Klausman) [12:28:41] !log fabfur@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_ulsfo and A:cp [12:29:30] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana1002.eqiad.wmnet [12:29:32] (03CR) 10FNegri: [C: 04-1] admin: add wmcs-roots to wmcs-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923684 (owner: 10Jbond) [12:29:57] (03PS3) 10Muehlenhoff: Add additional sets for monitoring/prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/951459 (https://phabricator.wikimedia.org/T336497) [12:30:10] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [12:31:21] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "While most references to deploy-service come from scap::target and related things that don't have much to do with the group itself, which " [puppet] - 10https://gerrit.wikimedia.org/r/934411 (https://phabricator.wikimedia.org/T340165) (owner: 10Majavah) [12:32:15] (03PS2) 10Jbond: admin: add wmcs-roots to wmcs-admins [puppet] - 10https://gerrit.wikimedia.org/r/923684 [12:32:29] (03CR) 10CI reject: [V: 04-1] admin: add wmcs-roots to wmcs-admins [puppet] - 10https://gerrit.wikimedia.org/r/923684 (owner: 10Jbond) [12:33:38] (03CR) 10EoghanGaffney: [C: 03+1] trafficserver: switch all miscweb services to codfw cname [puppet] - 10https://gerrit.wikimedia.org/r/951456 (owner: 10Jelto) [12:33:56] (03CR) 10Jbond: admin: add wmcs-roots to wmcs-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923684 (owner: 10Jbond) [12:35:47] (03CR) 10Klausman: [C: 03+2] deployment_server: Add fake secrets fir LW readability isvc [labs/private] - 10https://gerrit.wikimedia.org/r/951464 (owner: 10Klausman) [12:35:52] (03CR) 10Klausman: [V: 03+2 C: 03+2] deployment_server: Add fake secrets fir LW readability isvc [labs/private] - 10https://gerrit.wikimedia.org/r/951464 (owner: 10Klausman) [12:35:53] (03CR) 10FNegri: wmcs: add wmcs-roots use hiera merge to allow more fine grained control (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923681 (owner: 10Jbond) [12:36:02] (03PS3) 10Jbond: admin: add wmcs-roots to wmcs-admins [puppet] - 10https://gerrit.wikimedia.org/r/923684 [12:37:01] (03CR) 10Jbond: "also worth noting that dr0ptp4kt was added to wmcs-admins since this was first created so i have updated to reflect that" [puppet] - 10https://gerrit.wikimedia.org/r/923684 (owner: 10Jbond) [12:37:34] 
(HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:38:05] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host dispatch-be1001.eqiad.wmnet [12:38:06] (03CR) 10Jbond: admin: add wmcs-roots to wmcs-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923684 (owner: 10Jbond) [12:38:27] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host dispatch-be2001.codfw.wmnet [12:38:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P50915 and previous config saved to /var/cache/conftool/dbconfig/20230822-123844-ladsgroup.json [12:39:23] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1057.eqiad.wmnet with OS bullseye [12:39:27] 10SRE, 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1001 for host kubernetes1057.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1057 (**... 
[12:39:51] (03PS2) 10Majavah: Drop deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/934411 (https://phabricator.wikimedia.org/T340165) [12:40:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P50916 and previous config saved to /var/cache/conftool/dbconfig/20230822-124044-ladsgroup.json [12:41:00] (03CR) 10FNegri: [C: 03+1] admin: add wmcs-roots to wmcs-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923684 (owner: 10Jbond) [12:42:08] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dispatch-be1001.eqiad.wmnet [12:42:32] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dispatch-be2001.codfw.wmnet [12:43:32] (03CR) 10Majavah: Drop deploy-service group (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/934411 (https://phabricator.wikimedia.org/T340165) (owner: 10Majavah) [12:43:35] (03CR) 10Jelto: [C: 03+2] trafficserver: switch all miscweb services to codfw cname [puppet] - 10https://gerrit.wikimedia.org/r/951456 (owner: 10Jelto) [12:44:47] (03PS2) 10Giuseppe Lavagetto: mw-cli-wrapper: fix own dc reference in Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/935448 (owner: 10Krinkle) [12:46:40] !log mwmaint1002: foreachwikiindblist growthexperiments extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php --scoreLessThan=0.6 --verbose | tee growth-T316079-revalidate-0.6.log # T316079 [12:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:45] T316079: Bump threshold for confidence score on link recommendation service suggestions - https://phabricator.wikimedia.org/T316079 [12:46:49] (03PS6) 10Jbond: wmcs: add wmcs-roots use hiera merge to allow more fine grained control [puppet] - 10https://gerrit.wikimedia.org/r/923681 [12:46:51] (03PS1) 10Jbond: admin: deprecate laptest-roots group [puppet] - 
10https://gerrit.wikimedia.org/r/951469 (https://phabricator.wikimedia.org/T337848) [12:47:42] (03CR) 10Jbond: wmcs: add wmcs-roots use hiera merge to allow more fine grained control (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923681 (owner: 10Jbond) [12:49:26] (03PS7) 10Jbond: wmcs: add wmcs-roots use hiera merge to allow more fine grained control [puppet] - 10https://gerrit.wikimedia.org/r/923681 [12:49:28] (03PS2) 10Jbond: admin: deprecate laptest-roots group [puppet] - 10https://gerrit.wikimedia.org/r/951469 (https://phabricator.wikimedia.org/T337848) [12:49:57] (03CR) 10FNegri: wmcs: add wmcs-roots use hiera merge to allow more fine grained control (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923681 (owner: 10Jbond) [12:51:02] (03CR) 10Jbond: wmcs: add wmcs-roots use hiera merge to allow more fine grained control (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/923681 (owner: 10Jbond) [12:51:27] (03CR) 10FNegri: admin: deprecate laptest-roots group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/951469 (https://phabricator.wikimedia.org/T337848) (owner: 10Jbond) [12:51:40] (03CR) 10Giuseppe Lavagetto: [C: 03+1] python-build: set date of source files in the wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [12:51:57] (03PS1) 10Dreamy Jazz: clienthints: Remove server-side check for browser support [extensions/CheckUser] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/950823 (https://phabricator.wikimedia.org/T344679) [12:52:15] (03PS1) 10Dreamy Jazz: clienthints: Remove server-side check for browser support [extensions/CheckUser] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950824 (https://phabricator.wikimedia.org/T344679) [12:53:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P50917 and previous config saved to 
/var/cache/conftool/dbconfig/20230822-125350-ladsgroup.json [12:53:53] (03CR) 10Giuseppe Lavagetto: [C: 03+1] modules/base: Copy networkpolicy_1.0.0 to networkpolicy_1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/950186 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm) [12:53:55] (03CR) 10FNegri: wmcs: add wmcs-roots use hiera merge to allow more fine grained control (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/923681 (owner: 10Jbond) [12:54:01] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2055.codfw.wmnet with OS bullseye [12:54:21] 10SRE, 10ops-codfw, 10serviceops: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host kubernetes2055.codfw.wmnet with OS bullseye executed with errors: - kubernetes2055 (**... [12:55:00] (03PS9) 10Ssingh: knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) [12:55:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P50918 and previous config saved to /var/cache/conftool/dbconfig/20230822-125550-ladsgroup.json [12:55:56] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T344659 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [12:56:13] (03CR) 10JMeybohm: [C: 03+2] jaeger: Don't pull images via CDN [deployment-charts] - 10https://gerrit.wikimedia.org/r/951443 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [12:56:36] (03CR) 10JMeybohm: [C: 03+2] jaeger: Enable TLS for query (UI) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951450 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [12:56:38] 10ops-codfw: InterfaceSpeedError - 
https://phabricator.wikimedia.org/T344269 (10Jhancock.wm) 05Open→03Resolved [12:56:54] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2056.codfw.wmnet with OS bullseye [12:56:54] (03CR) 10Ssingh: "PS6 was reviewed while this is PS9. The difference is that ncredir-addrs is now uncommented and we will be pooling that as well for esams." [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [12:57:00] 10SRE, 10ops-codfw, 10serviceops: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host kubernetes2056.codfw.wmnet with OS bullseye executed with errors: - kubernetes2056 (**... [12:57:51] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Like the idea, kinda hate the implementation." [deployment-charts] - 10https://gerrit.wikimedia.org/r/950187 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm) [12:58:41] (03PS2) 10Anzx: knwiki add import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950827 (https://phabricator.wikimedia.org/T344573) [12:58:43] (03Merged) 10jenkins-bot: jaeger: Don't pull images via CDN [deployment-charts] - 10https://gerrit.wikimedia.org/r/951443 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [12:58:45] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, but see my comment on the preceding patch." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/950188 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm) [12:58:50] (03PS2) 10Anzx: Update tcywiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950826 (https://phabricator.wikimedia.org/T344557) [12:58:58] (03CR) 10Giuseppe Lavagetto: [C: 03+1] admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/950189 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm) [12:59:14] (03Merged) 10jenkins-bot: jaeger: Enable TLS for query (UI) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951450 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [12:59:21] (03CR) 10TChin: [C: 03+1] mw-page-content-change-enrich: stream version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/951446 (https://phabricator.wikimedia.org/T307959) (owner: 10Gmodena) [12:59:38] (03PS5) 10Samtar: Remove unneeded $wgDefaultUserOptions['visualeditor-enable'] settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933998 (https://phabricator.wikimedia.org/T340696) (owner: 10Bartosz Dziewoński) [12:59:47] (03CR) 10TChin: [C: 03+1] Declare v1 of the page_content_change stream. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951444 (https://phabricator.wikimedia.org/T307959) (owner: 10Gmodena) [12:59:57] (03PS3) 10Samtar: Move visual editor out of Beta Features (without changing prefs) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947015 (https://phabricator.wikimedia.org/T335056) (owner: 10Bartosz Dziewoński) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T1300). Please do the needful. [13:00:05] MatmaRex, aanzx, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T1300) [13:00:10] \o [13:00:11] (03CR) 10TChin: [C: 03+1] Expose mediawiki.page_change.v1 publicly. [deployment-charts] - 10https://gerrit.wikimedia.org/r/951426 (https://phabricator.wikimedia.org/T336817) (owner: 10Gmodena) [13:00:11] * TheresNoTime can deploy [13:00:13] I can deploy today [13:00:17] well, TheresNoTime was faster! [13:00:22] urbanecm: yours if you want it :D [13:00:31] Almost like -en-revdel when an OS request comes in :) [13:00:37] (03PS3) 10Samtar: Clarify 2017 wikitext editor's Beta Feature status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949588 (https://phabricator.wikimedia.org/T344158) (owner: 10Bartosz Dziewoński) [13:00:51] o/ [13:00:51] (03CR) 10Urbanecm: [C: 03+2] clienthints: Remove server-side check for browser support [extensions/CheckUser] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950824 (https://phabricator.wikimedia.org/T344679) (owner: 10Dreamy Jazz) [13:00:55] (03CR) 10Urbanecm: [C: 03+2] clienthints: Remove server-side check for browser support [extensions/CheckUser] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/950823 (https://phabricator.wikimedia.org/T344679) (owner: 10Dreamy Jazz) [13:01:09] MatmaRex: hi, are you around? 
:) [13:01:09] hi [13:01:33] !log stat1008: Remove `krcwiki` and `ganwiki` from `/srv/published/datasets/one-off/research-mwaddlink/wikis.txt` (T344686) [13:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:39] T344686: linkrecommendation-internal-load-datasets pod is failing - https://phabricator.wikimedia.org/T344686 [13:01:55] (03CR) 10Urbanecm: [C: 03+2] Remove unneeded $wgDefaultUserOptions['visualeditor-enable'] settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933998 (https://phabricator.wikimedia.org/T340696) (owner: 10Bartosz Dziewoński) [13:02:15] (03CR) 10Urbanecm: [C: 03+2] Move visual editor out of Beta Features (without changing prefs) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947015 (https://phabricator.wikimedia.org/T335056) (owner: 10Bartosz Dziewoński) [13:02:18] (03CR) 10Btullis: [C: 03+2] Retain yarn logs for 60 days and compress with gzip [puppet] - 10https://gerrit.wikimedia.org/r/950191 (https://phabricator.wikimedia.org/T342923) (owner: 10Btullis) [13:02:38] (03Merged) 10jenkins-bot: Remove unneeded $wgDefaultUserOptions['visualeditor-enable'] settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933998 (https://phabricator.wikimedia.org/T340696) (owner: 10Bartosz Dziewoński) [13:02:40] (03CR) 10Urbanecm: [C: 03+2] Clarify 2017 wikitext editor's Beta Feature status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949588 (https://phabricator.wikimedia.org/T344158) (owner: 10Bartosz Dziewoński) [13:02:55] (03Merged) 10jenkins-bot: Move visual editor out of Beta Features (without changing prefs) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947015 (https://phabricator.wikimedia.org/T335056) (owner: 10Bartosz Dziewoński) [13:03:18] * TheresNoTime will be around if needed [13:03:18] (03Merged) 10jenkins-bot: Clarify 2017 wikitext editor's Beta Feature status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949588 (https://phabricator.wikimedia.org/T344158) 
(owner: 10Bartosz Dziewoński) [13:03:30] ty TheresNoTime [13:03:51] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:933998|Remove unneeded $wgDefaultUserOptions['visualeditor-enable'] settings (T340696)]], [[gerrit:947015|Move visual editor out of Beta Features (without changing prefs) (T335056)]], [[gerrit:949588|Clarify 2017 wikitext editor's Beta Feature status (T344158)]] [13:03:58] T344158: Clarify 2017 wikitext editor's Beta Feature status - https://phabricator.wikimedia.org/T344158 [13:03:59] T335056: Move the visual editor out of the Beta Features section to the expected location for editing-related prefs, without changing people's prefs - https://phabricator.wikimedia.org/T335056 [13:03:59] T340696: Remove overrides for 'visualeditor-enable' and 'visualeditor-betatempdisable' from WMF config - https://phabricator.wikimedia.org/T340696 [13:04:05] (03CR) 10FNegri: [C: 03+2] toolsdb: add skipped table to the config [puppet] - 10https://gerrit.wikimedia.org/r/949854 (https://phabricator.wikimedia.org/T344411) (owner: 10David Caro) [13:04:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/951469 (https://phabricator.wikimedia.org/T337848) (owner: 10Jbond) [13:04:11] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:05:31] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[13:05:32] !log urbanecm@deploy1002 urbanecm and matmarex: Backport for [[gerrit:933998|Remove unneeded $wgDefaultUserOptions['visualeditor-enable'] settings (T340696)]], [[gerrit:947015|Move visual editor out of Beta Features (without changing prefs) (T335056)]], [[gerrit:949588|Clarify 2017 wikitext editor's Beta Feature status (T344158)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codf [13:05:32] w.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:05:45] MatmaRex: all three patches are on mwdebug1001 now; can you test? [13:06:09] looking [13:06:59] (03CR) 10Ssingh: [C: 03+2] knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:07:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/934411 (https://phabricator.wikimedia.org/T340165) (owner: 10Majavah) [13:07:33] !log running authdns-update to remove old references to esams [13:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T344589)', diff saved to https://phabricator.wikimedia.org/P50919 and previous config saved to /var/cache/conftool/dbconfig/20230822-130856-ladsgroup.json [13:08:58] !log [done] authdns-update for old references [13:09:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [13:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [13:09:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T344589)', diff saved to 
https://phabricator.wikimedia.org/P50921 and previous config saved to /var/cache/conftool/dbconfig/20230822-130920-ladsgroup.json [13:10:36] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/951100 [13:10:40] urbanecm: still testing things [13:10:46] ack, waiting. [13:10:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T344589)', diff saved to https://phabricator.wikimedia.org/P50922 and previous config saved to /var/cache/conftool/dbconfig/20230822-131057-ladsgroup.json [13:10:59] if there is something i can do to help, let me know. [13:11:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [13:11:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [13:11:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T344589)', diff saved to https://phabricator.wikimedia.org/P50923 and previous config saved to /var/cache/conftool/dbconfig/20230822-131122-ladsgroup.json [13:11:25] (03PS8) 10Jbond: wmcs: add wmcs-roots to roles where it is missing [puppet] - 10https://gerrit.wikimedia.org/r/923681 [13:11:49] (03PS9) 10Jbond: wmcs: add wmcs-roots to roles where it is missing [puppet] - 10https://gerrit.wikimedia.org/r/923681 [13:12:15] (03CR) 10Jbond: wmcs: add wmcs-roots to roles where it is missing (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/923681 (owner: 10Jbond) [13:12:24] urbanecm: i think everything is good. i was a bit confused by the behavior of checkboxes in preferences, because firefox was restoring the checks when i refreshed the page [13:12:39] chrome doesn't do that and i recently switched [13:12:43] makes sense [13:12:45] so, let's proceed? 
[13:12:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T344589)', diff saved to https://phabricator.wikimedia.org/P50924 and previous config saved to /var/cache/conftool/dbconfig/20230822-131250-ladsgroup.json [13:12:57] so i though the defaults were behaving incorrectly, but it was just the browser [13:13:00] yes. looks good [13:13:04] !log urbanecm@deploy1002 urbanecm and matmarex: Continuing with sync [13:13:05] doing [13:13:08] (03PS3) 10Jbond: admin: deprecate labtest-roots group [puppet] - 10https://gerrit.wikimedia.org/r/951469 (https://phabricator.wikimedia.org/T337848) [13:13:11] (03PS2) 10Ssingh: Repool esams after knams migration (merge on Monday Aug 21) [dns] - 10https://gerrit.wikimedia.org/r/950176 (https://phabricator.wikimedia.org/T329219) [13:13:15] (03CR) 10Jbond: admin: deprecate labtest-roots group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/951469 (https://phabricator.wikimedia.org/T337848) (owner: 10Jbond) [13:13:23] (03PS4) 10Jbond: admin: deprecate labtest-roots group [puppet] - 10https://gerrit.wikimedia.org/r/951469 (https://phabricator.wikimedia.org/T337848) [13:13:37] MatmaRex: now that i have you on the line, fyi, i've stopped the s6 run (again); the error's still very frequent for frwiki for some reason :-(. see T315510#9108545. 
[13:13:37] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [13:13:59] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/951459 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:14:06] (03PS3) 10Urbanecm: knwiki add import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950827 (https://phabricator.wikimedia.org/T344573) (owner: 10Anzx) [13:14:10] (03CR) 10Urbanecm: [C: 03+2] knwiki add import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950827 (https://phabricator.wikimedia.org/T344573) (owner: 10Anzx) [13:14:14] (03PS3) 10Urbanecm: Update tcywiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950826 (https://phabricator.wikimedia.org/T344557) (owner: 10Anzx) [13:14:17] (03CR) 10Urbanecm: [C: 03+2] Update tcywiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950826 (https://phabricator.wikimedia.org/T344557) (owner: 10Anzx) [13:14:37] (03Merged) 10jenkins-bot: clienthints: Remove server-side check for browser support [extensions/CheckUser] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950824 (https://phabricator.wikimedia.org/T344679) (owner: 10Dreamy Jazz) [13:14:38] urbanecm: oh :( i haven't read my notifications today yet, i'll respond. 
thanks for trying it [13:14:39] (03PS1) 10JMeybohm: jager: Fix typo in tls.cert name [deployment-charts] - 10https://gerrit.wikimedia.org/r/951471 (https://phabricator.wikimedia.org/T344253) [13:14:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:14:56] (03Merged) 10jenkins-bot: clienthints: Remove server-side check for browser support [extensions/CheckUser] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/950823 (https://phabricator.wikimedia.org/T344679) (owner: 10Dreamy Jazz) [13:14:58] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/951100 (owner: 10PipelineBot) [13:15:00] (03Merged) 10jenkins-bot: knwiki add import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950827 (https://phabricator.wikimedia.org/T344573) (owner: 10Anzx) [13:15:01] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:15:02] (03Merged) 10jenkins-bot: Update tcywiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950826 (https://phabricator.wikimedia.org/T344557) (owner: 10Anzx) [13:15:05] no worries, thanks for having a look. 
[13:15:06] (03CR) 10Jbond: [C: 03+2] P:cloudceph::osd: explicitly set the interface and make route persist [puppet] - 10https://gerrit.wikimedia.org/r/927622 (owner: 10Jbond) [13:15:09] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] jager: Fix typo in tls.cert name [deployment-charts] - 10https://gerrit.wikimedia.org/r/951471 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [13:15:26] (03CR) 10Jbond: [C: 03+2] interface::route: Make interface mandatory [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 10Jbond) [13:15:41] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/951100 (owner: 10PipelineBot) [13:16:16] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons. [13:16:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T344589)', diff saved to https://phabricator.wikimedia.org/P50925 and previous config saved to /var/cache/conftool/dbconfig/20230822-131651-ladsgroup.json [13:17:07] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:17:28] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[13:17:33] !log Draining ml-serve2007 for kubelet partition resize [13:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:55] (03CR) 10Filippo Giunchedi: [C: 03+1] smart: Simplify check for hpsa [puppet] - 10https://gerrit.wikimedia.org/r/951448 (owner: 10Muehlenhoff) [13:18:43] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:18:47] (03CR) 10Ssingh: [C: 03+2] Repool esams after knams migration (merge on Monday Aug 21) [dns] - 10https://gerrit.wikimedia.org/r/950176 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:18:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T344589)', diff saved to https://phabricator.wikimedia.org/P50926 and previous config saved to /var/cache/conftool/dbconfig/20230822-131859-ladsgroup.json [13:19:19] !log repool esams [13:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:35] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:933998|Remove unneeded $wgDefaultUserOptions['visualeditor-enable'] settings (T340696)]], [[gerrit:947015|Move visual editor out of Beta Features (without changing prefs) (T335056)]], [[gerrit:949588|Clarify 2017 wikitext editor's Beta Feature status (T344158)]] (duration: 15m 43s) [13:19:41] T344158: Clarify 2017 wikitext editor's Beta Feature status - https://phabricator.wikimedia.org/T344158 [13:19:42] T335056: Move the visual editor out of the Beta Features section to the expected location for editing-related prefs, without changing people's prefs - https://phabricator.wikimedia.org/T335056 [13:19:42] T340696: Remove overrides for 'visualeditor-enable' and 'visualeditor-betatempdisable' from WMF config - https://phabricator.wikimedia.org/T340696 [13:19:53] MatmaRex: should be live :) [13:20:01] thanks! 
[13:20:09] np [13:20:09] aanzx: Dreamy_Jazz: your patches are next now. [13:20:12] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:950827|knwiki add import sources (T344573)]], [[gerrit:950826|Update tcywiki logos (T344557)]], [[gerrit:950824|clienthints: Remove server-side check for browser support (T344679)]], [[gerrit:950823|clienthints: Remove server-side check for browser support (T344679)]] [13:20:14] !log [done] finished repooling esams [13:20:16] Thanks! [13:20:20] T344679: CheckUser Client Hints ResourceLoader module causing cache polution - https://phabricator.wikimedia.org/T344679 [13:20:20] T344557: Update logo/wordmark/tagline for tcywiki - https://phabricator.wikimedia.org/T344557 [13:20:21] T344573: add import sources on knwiki - https://phabricator.wikimedia.org/T344573 [13:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:25] ok [13:20:49] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host miscweb1003.eqiad.wmnet [13:21:06] (03Abandoned) 10Jbond: P:cloudceph::osd: drop the profile::cloudceph::osd::hosts [puppet] - 10https://gerrit.wikimedia.org/r/927628 (owner: 10Jbond) [13:21:39] (03CR) 10Jbond: [V: 03+1 C: 03+2] swift/thanos: allow puppetservers to also pull swift rings [puppet] - 10https://gerrit.wikimedia.org/r/951462 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [13:21:51] !log urbanecm@deploy1002 urbanecm and dreamyjazz and anzx: Backport for [[gerrit:950827|knwiki add import sources (T344573)]], [[gerrit:950826|Update tcywiki logos (T344557)]], [[gerrit:950824|clienthints: Remove server-side check for browser support (T344679)]], [[gerrit:950823|clienthints: Remove server-side check for browser support (T344679)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, [13:21:51] mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:22:01] Dreamy_Jazz: 
aanzx: can you test at mwdebug1001 please? [13:22:14] Yes. [13:22:14] testing [13:23:15] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host miscweb1003.eqiad.wmnet [13:24:25] !log Draining ml-serve2008 for kubelet partition resize [13:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:28] (03PS1) 10Jelto: Revert "trafficserver: switch all miscweb services to codfw cname" [puppet] - 10https://gerrit.wikimedia.org/r/950825 [13:28:40] urbanecm: tcywiki logos looks correct, couldn't test knwiki import sources due to The database is currently locked to new entries and other modifications, probably for routine database maintenance, after which it will be back to normal. [13:28:40] The system administrator who locked it offered this explanation: X-Wikimedia-Debug [13:29:00] Nearly complete with tests [13:29:07] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Jhancock.wm) @Papaul fixed the thing. [13:29:13] aanzx: that's because you've checked "Read only DB" in X-Wikimedia-Debug. once you turn that off, it should unlock. [13:29:34] (and while on it, would you mind unchecking the other checkboxes as well? they're only needed during advanced debugging, and not during routine testing) [13:29:53] ack Dreamy_Jazz [13:29:56] urbanecm: testing complete [13:30:01] all good? [13:30:11] yes [13:30:15] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:30:18] okay, waiting for the other patch then. 
[13:30:39] (03CR) 10Muehlenhoff: [C: 03+2] smart: Simplify check for hpsa [puppet] - 10https://gerrit.wikimedia.org/r/951448 (owner: 10Muehlenhoff) [13:31:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P50927 and previous config saved to /var/cache/conftool/dbconfig/20230822-133157-ladsgroup.json [13:33:13] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [13:33:18] urbanecm: Tests complete (tested working on both wmf.22 and wmf.23) [13:33:21] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:33:27] great! proceeding [13:33:29] !log urbanecm@deploy1002 urbanecm and dreamyjazz and anzx: Continuing with sync [13:34:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P50928 and previous config saved to /var/cache/conftool/dbconfig/20230822-133405-ladsgroup.json [13:35:37] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet [13:38:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [13:38:34] (03PS1) 10Btullis: Remove duplicate definition of oidc client secret from datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/951474 (https://phabricator.wikimedia.org/T305874) [13:38:55] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM
GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:39:25] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:39:33] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:39:43] (03PS3) 10Samtar: wikidiff2: set maxSplitSize = 10 on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950049 (https://phabricator.wikimedia.org/T341754) (owner: 10HMonroy) [13:39:57] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:950827|knwiki add import sources (T344573)]], [[gerrit:950826|Update tcywiki logos (T344557)]], [[gerrit:950824|clienthints: Remove server-side check for browser support (T344679)]], [[gerrit:950823|clienthints: Remove server-side check for browser support (T344679)]] (duration: 19m 44s) [13:40:04] T344679: CheckUser Client Hints ResourceLoader module causing cache polution - https://phabricator.wikimedia.org/T344679 [13:40:04] Dreamy_Jazz: aanzx: should be live now! 
:) [13:40:04] T344557: Update logo/wordmark/tagline for tcywiki - https://phabricator.wikimedia.org/T344557 [13:40:05] T344573: add import sources on knwiki - https://phabricator.wikimedia.org/T344573 [13:40:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:41:15] (03PS1) 10Ayounsi: Send germany and UK to drmrs [dns] - 10https://gerrit.wikimedia.org/r/951475 [13:41:27] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet [13:41:27] thanks urbanecm [13:41:40] np [13:41:57] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 78, down: 20, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:42:02] (03CR) 10BBlack: [C: 03+1] Send germany and UK to drmrs [dns] - 10https://gerrit.wikimedia.org/r/951475 (owner: 10Ayounsi) [13:42:06] Thanks! 
[13:42:41] (03CR) 10Ayounsi: [C: 03+2] Send germany and UK to drmrs [dns] - 10https://gerrit.wikimedia.org/r/951475 (owner: 10Ayounsi) [13:42:45] (03CR) 10Ssingh: [C: 03+2] Send germany and UK to drmrs [dns] - 10https://gerrit.wikimedia.org/r/951475 (owner: 10Ayounsi) [13:43:01] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet [13:43:13] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2023'] [13:44:14] (03CR) 10Btullis: [C: 03+2] Remove duplicate definition of oidc client secret from datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/951474 (https://phabricator.wikimedia.org/T305874) (owner: 10Btullis) [13:44:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Papaul) @Jhancock.wm thanks all good now ` papaul@asw-a-codfw> show interfaces xe-2/0/19 descriptions Interface Admin Link Description xe-2/0/19 up... 
[13:45:05] (03Merged) 10jenkins-bot: Remove duplicate definition of oidc client secret from datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/951474 (https://phabricator.wikimedia.org/T305874) (owner: 10Btullis) [13:45:31] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [13:45:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:46:19] np [13:46:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:46:57] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:47:06] (03PS1) 10Eevans: aqs: upgrade rack1 nodes to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/951476 (https://phabricator.wikimedia.org/T339299) [13:47:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P50929 and previous config saved to /var/cache/conftool/dbconfig/20230822-134703-ladsgroup.json [13:48:02] (03PS1) 10JMeybohm: jaeger: Enable TLS for the collector as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/951477 (https://phabricator.wikimedia.org/T344253) [13:48:23] !log
stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:48:29] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gerrit2002.wikimedia.org [13:48:50] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet [13:49:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P50930 and previous config saved to /var/cache/conftool/dbconfig/20230822-134911-ladsgroup.json [13:49:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2023'] [13:49:54] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet [13:50:37] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:45] (03PS1) 10Eevans: aqs: upgrade rack2 nodes to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/951478 (https://phabricator.wikimedia.org/T339299) [13:50:47] (03PS1) 10Eevans: aqs: upgrade rack3 nodes to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/951479 (https://phabricator.wikimedia.org/T339299) [13:51:09] (03CR) 10JMeybohm: [C: 03+2] jaeger: Enable TLS for the collector as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/951477 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [13:51:24] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951476 (https://phabricator.wikimedia.org/T339299) (owner: 10Eevans) [13:52:12] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit2002.wikimedia.org [13:53:30] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqiad.wikimedia.org recovered from 
Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [13:53:32] (03Merged) 10jenkins-bot: jaeger: Enable TLS for the collector as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/951477 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [13:55:23] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.618 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:55:29] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:55:30] (Primary inbound port utilisation over 80% #page) resolved: Device cr1-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [13:55:41] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet [13:57:02] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [13:57:19] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:00:03] (03PS1) 10Jbond: admin: update zsh file [puppet] - 10https://gerrit.wikimedia.org/r/951483 [14:00:44] (03PS1) 10Jaime Nuche: jwt_authorizer: reflect changes to accept multiple issuers [puppet] - 10https://gerrit.wikimedia.org/r/951484 (https://phabricator.wikimedia.org/T337474) [14:00:59] (03PS1) 10Stevemunene: datahub:chart version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/951485 (https://phabricator.wikimedia.org/T305874) [14:01:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2023.codfw.wmnet with OS bullseye [14:01:51] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2023.codfw.wmnet with OS bullseye [14:01:56] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet [14:02:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T344589)', diff saved to https://phabricator.wikimedia.org/P50932 and previous config saved to /var/cache/conftool/dbconfig/20230822-140213-ladsgroup.json [14:02:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [14:02:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [14:02:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T344589)', diff saved to https://phabricator.wikimedia.org/P50933 and previous config saved to /var/cache/conftool/dbconfig/20230822-140236-ladsgroup.json [14:02:43] (03CR) 10Jbond: [C: 03+2] admin: update zsh file [puppet] - 10https://gerrit.wikimedia.org/r/951483 (owner: 10Jbond) [14:03:06] (03PS1) 10Ayounsi: Revert "Send germany and UK to drmrs" [dns] - 10https://gerrit.wikimedia.org/r/951486 
[14:03:37] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T344589)', diff saved to https://phabricator.wikimedia.org/P50934 and previous config saved to /var/cache/conftool/dbconfig/20230822-140417-ladsgroup.json [14:07:05] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:07:13] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:07:50] 10SRE, 10SRE-OnFire, 10IP Info, 10Traffic, 10IP-Blocking-Impacts: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (10AlexisJazz) [14:07:58] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet [14:08:09] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafkamon2003.codfw.wmnet [14:08:23] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.296 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:08:31] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:08:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T344589)', diff saved to https://phabricator.wikimedia.org/P50935 and previous config saved to /var/cache/conftool/dbconfig/20230822-140835-ladsgroup.json [14:09:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T344589)', diff saved to https://phabricator.wikimedia.org/P50936 and previous 
config saved to /var/cache/conftool/dbconfig/20230822-140859-ladsgroup.json [14:09:43] (03PS1) 10Jbond: profile::swift:fetch_rings: ensure we create directories [puppet] - 10https://gerrit.wikimedia.org/r/951506 (https://phabricator.wikimedia.org/T341056) [14:10:22] !log gmodena@deploy1002 Started deploy [analytics/refinery@d62f281]: Regular analytics weekly train [analytics/refinery@d62f281] [14:10:54] !log hnowlan@cumin1001 START - Cookbook sre.dns.netbox [14:11:42] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:56] (03CR) 10Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/951484/42968/" [puppet] - 10https://gerrit.wikimedia.org/r/951484 (https://phabricator.wikimedia.org/T337474) (owner: 10Jaime Nuche) [14:11:58] (03CR) 10CI reject: [V: 04-1] profile::swift:fetch_rings: ensure we create directories [puppet] - 10https://gerrit.wikimedia.org/r/951506 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [14:12:12] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:12:24] (03CR) 10Muehlenhoff: [C: 03+2] firewall: Add SSH rules in firewall-agnostic form [puppet] - 10https://gerrit.wikimedia.org/r/951118 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:12:40] 10SRE, 10Traffic: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (10taavi) [14:13:30] (03CR) 10Eevans: [C: 03+2] aqs: upgrade rack1 nodes to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/951476 (https://phabricator.wikimedia.org/T339299) (owner: 10Eevans) [14:13:38] (03PS2) 10Jbond: profile::swift:fetch_rings: ensure we create directories [puppet] - 10https://gerrit.wikimedia.org/r/951506 
(https://phabricator.wikimedia.org/T341056) [14:14:27] (03CR) 10Btullis: [C: 03+1] datahub:chart version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/951485 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [14:14:33] jouncebot: nowandnext [14:14:33] No deployments scheduled for the next 1 hour(s) and 45 minute(s) [14:14:33] In 1 hour(s) and 45 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T1600) [14:14:43] !log hnowlan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "refreshing kubernetes205[56] kubernetes105[78] status T343996 T343993 - hnowlan@cumin1001" [14:14:49] T343996: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 [14:14:49] T343993: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 [14:15:48] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "refreshing kubernetes205[56] kubernetes105[78] status T343996 T343993 - hnowlan@cumin1001" [14:15:52] (03CR) 10CI reject: [V: 04-1] profile::swift:fetch_rings: ensure we create directories [puppet] - 10https://gerrit.wikimedia.org/r/951506 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [14:16:02] !log gmodena@deploy1002 Finished deploy [analytics/refinery@d62f281]: Regular analytics weekly train [analytics/refinery@d62f281] (duration: 05m 39s) [14:16:17] (03PS3) 10Jbond: profile::swift:fetch_rings: ensure we create directories [puppet] - 10https://gerrit.wikimedia.org/r/951506 (https://phabricator.wikimedia.org/T341056) [14:16:41] !log gmodena@deploy1002 Started deploy [analytics/refinery@d62f281] (thin): Regular analytics weekly train THIN [analytics/refinery@d62f281] [14:16:42] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:45] !log gmodena@deploy1002 Finished deploy [analytics/refinery@d62f281] (thin): Regular analytics weekly train THIN [analytics/refinery@d62f281] (duration: 00m 04s) [14:16:53] (03CR) 10Stevemunene: [C: 03+2] datahub:chart version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/951485 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [14:17:29] !log gmodena@deploy1002 Started deploy [analytics/refinery@d62f281] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d62f281] [14:17:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42971/console" [puppet] - 10https://gerrit.wikimedia.org/r/951506 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [14:17:45] (03Merged) 10jenkins-bot: datahub:chart version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/951485 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [14:18:09] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1057.eqiad.wmnet with OS bullseye [14:18:24] 10SRE, 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1001 for host kubernetes1057.eqiad.wmnet with OS bullseye [14:18:39] 10SRE, 10Traffic: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (10AlexisJazz) I'm no longer blocked as on https://en.wikipedia.org/wiki/Special:Contributions/ST47ProxyBot this message can be read: 14:13, 22 August 2023 Yamla talk contribs bl... 
[14:18:53] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1058.eqiad.wmnet with OS bullseye [14:19:00] 10SRE, 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1001 for host kubernetes1058.eqiad.wmnet with OS bullseye [14:19:07] (03PS1) 10Ssingh: wmf-config: update new esams IP ranges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951508 (https://phabricator.wikimedia.org/T329219) [14:20:15] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:20:23] (03CR) 10Muehlenhoff: [C: 03+2] Add additional sets for monitoring/prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/951459 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:20:44] !log gmodena@deploy1002 Finished deploy [analytics/refinery@d62f281] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d62f281] (duration: 03m 15s) [14:21:05] (03PS4) 10Jbond: profile::swift:fetch_rings: ensure we create directories [puppet] - 10https://gerrit.wikimedia.org/r/951506 (https://phabricator.wikimedia.org/T341056) [14:21:25] (03PS2) 10Ssingh: wmf-config: update new esams IP ranges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951508 (https://phabricator.wikimedia.org/T329219) [14:22:20] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2055.codfw.wmnet with OS bullseye [14:22:27] 10SRE, 10ops-codfw, 10serviceops: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host kubernetes2055.codfw.wmnet with OS bullseye [14:22:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42972/console" [puppet] 
- 10https://gerrit.wikimedia.org/r/951506 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [14:22:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:22:49] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:23:03] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:23:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P50937 and previous config saved to /var/cache/conftool/dbconfig/20230822-142341-ladsgroup.json [14:23:47] (03CR) 10Ayounsi: wmf-config: update new esams IP ranges (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951508 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [14:24:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[14:24:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P50938 and previous config saved to /var/cache/conftool/dbconfig/20230822-142405-ladsgroup.json [14:24:30] !log stevemunene@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:25:17] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2056.codfw.wmnet with OS bullseye [14:25:25] 10SRE, 10ops-codfw, 10serviceops: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host kubernetes2056.codfw.wmnet with OS bullseye [14:25:34] (03PS3) 10Ssingh: wmf-config: update new esams IP ranges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951508 (https://phabricator.wikimedia.org/T329219) [14:26:00] (03CR) 10Cathal Mooney: "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951508 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [14:26:23] (03CR) 10CI reject: [V: 04-1] wmf-config: update new esams IP ranges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951508 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [14:26:29] wow ok [14:27:13] !log gmodena@deploy1002 Started deploy [analytics/refinery@d62f281] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d62f281] [14:27:17] !log gmodena@deploy1002 Finished deploy [analytics/refinery@d62f281] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d62f281] (duration: 00m 04s) [14:28:08] (03PS4) 10Ssingh: wmf-config: update new esams IP ranges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951508 (https://phabricator.wikimedia.org/T329219) [14:29:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:23] PROBLEM - Check systemd state on kafkamon2003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-logging-codfw.service,burrow-main-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951508 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [14:30:05] (03Merged) 10jenkins-bot: wmf-config: update new esams IP ranges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951508 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [14:30:34] !log taavi@deploy1002 Started scap: Backport for [[gerrit:951508|wmf-config: update new esams IP ranges (T329219)]] [14:30:39] T329219: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 [14:30:54] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1057.eqiad.wmnet with reason: host reimage [14:31:05] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1058.eqiad.wmnet with reason: host reimage [14:31:37] 10SRE, 10Traffic: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (10AlexisJazz) [14:31:41] 10SRE, 10Traffic: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (10Johannnes89) Note: Multiple dewiki users were reporting a similar problem regarding the IP 10.80.1.7 which also doesn't belong to those users (same /28-range as 10.80.1.11) https... 
[14:32:07] !log taavi@deploy1002 taavi and sukhe: Backport for [[gerrit:951508|wmf-config: update new esams IP ranges (T329219)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [14:32:35] !log taavi@deploy1002 taavi and sukhe: Continuing with sync [14:33:19] (03PS1) 10Muehlenhoff: Adapt monitoring/metrics rules for nft and ferm providers [puppet] - 10https://gerrit.wikimedia.org/r/951512 (https://phabricator.wikimedia.org/T336497) [14:33:19] 10SRE, 10Traffic: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (10ssingh) https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/951508 @taavi has rolled this out so this should be resolving shortly. Thanks for filing the task. [14:33:33] 10SRE, 10Traffic: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (10taavi) a:03ssingh [14:33:57] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1057.eqiad.wmnet with reason: host reimage [14:34:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:06] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1058.eqiad.wmnet with reason: host reimage [14:36:21] (03CR) 10Jbond: [V: 03+1 C: 03+2] profile::swift:fetch_rings: ensure we create directories [puppet] - 10https://gerrit.wikimedia.org/r/951506 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [14:36:37] RECOVERY - Check systemd state on kafkamon2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[14:36:47] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon2003.codfw.wmnet [14:37:01] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs101[6,9].eqiad.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001 [14:37:05] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299 [14:37:07] 10SRE, 10Traffic: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (10Yamla) Feel free to lift my block on https://en.wikipedia.org/w/index.php?title=Special:Log/block&page=User%3AST47ProxyBot once this fix is deployed and working. No need to consu... [14:37:30] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafkamon1003.eqiad.wmnet [14:38:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:38:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P50939 and previous config saved to /var/cache/conftool/dbconfig/20230822-143847-ladsgroup.json [14:39:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P50940 and previous config saved to /var/cache/conftool/dbconfig/20230822-143912-ladsgroup.json [14:40:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951512 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:40:25] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:951508|wmf-config: update new esams IP ranges (T329219)]] (duration: 09m 50s) [14:40:29] T329219: Main Tracking Task for ESAMS Migration to 
KNAMS - https://phabricator.wikimedia.org/T329219 [14:40:42] 10SRE, 10Traffic: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (10taavi) 05Open→03Resolved [14:41:21] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon1003.eqiad.wmnet [14:43:25] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2055.codfw.wmnet with reason: host reimage [14:43:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:43:40] (03PS1) 10Ssingh: realm.pp: add comments about updating mediawiki-config [puppet] - 10https://gerrit.wikimedia.org/r/951514 (https://phabricator.wikimedia.org/T344704) [14:44:56] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs101[6,9].eqiad.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001 [14:45:00] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299 [14:46:06] (03CR) 10Eevans: [C: 03+2] aqs: upgrade rack2 nodes to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/951478 (https://phabricator.wikimedia.org/T339299) (owner: 10Eevans) [14:46:14] (03CR) 10JMeybohm: modules/base: networkpolicy_1.0.1 Add support for extraRules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/950187 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm) [14:46:26] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2056.codfw.wmnet with reason: host reimage [14:46:40] (03CR) 10Majavah: [C: 03+1] realm.pp: add comments about updating mediawiki-config [puppet] - 10https://gerrit.wikimedia.org/r/951514 
(https://phabricator.wikimedia.org/T344704) (owner: 10Ssingh) [14:46:45] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2055.codfw.wmnet with reason: host reimage [14:46:53] (03CR) 10Ssingh: [C: 03+2] realm.pp: add comments about updating mediawiki-config [puppet] - 10https://gerrit.wikimedia.org/r/951514 (https://phabricator.wikimedia.org/T344704) (owner: 10Ssingh) [14:49:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:09] 10SRE, 10Traffic, 10Patch-For-Review: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (10AlexisJazz) https://en.wikipedia.org/w/api.php?action=query&meta=userinfo&callback=&format=json&formatversion=2 now returns my actual IP. Thanks for the qui... [14:49:11] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2023.codfw.wmnet with OS bullseye [14:49:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2023.codfw.wmnet with OS bullseye executed with errors: - wdqs2... 
[14:49:42] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2056.codfw.wmnet with reason: host reimage [14:49:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:16] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [14:50:18] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1131 days) https://wikitech.wikimedia.org/wiki/Logs [14:50:26] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1057.eqiad.wmnet with OS bullseye [14:50:33] 10SRE, 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1001 for host kubernetes1057.eqiad.wmnet with OS bullseye completed: - kubernetes1057 (**WARN**) -... [14:50:47] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs10[11,14,17,20].eqiad.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001 [14:50:51] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299 [14:53:04] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1058.eqiad.wmnet with OS bullseye [14:53:12] 10SRE, 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1001 for host kubernetes1058.eqiad.wmnet with OS bullseye completed: - kubernetes1058 (**PASS**) -... 
[14:53:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T344589)', diff saved to https://phabricator.wikimedia.org/P50942 and previous config saved to /var/cache/conftool/dbconfig/20230822-145353-ladsgroup.json [14:54:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:54:03] (ProbeDown) resolved: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:54:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:54:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T344589)', diff saved to https://phabricator.wikimedia.org/P50944 and previous config saved to /var/cache/conftool/dbconfig/20230822-145418-ladsgroup.json [14:54:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T344589)', diff saved to https://phabricator.wikimedia.org/P50945 and previous config saved to /var/cache/conftool/dbconfig/20230822-145419-ladsgroup.json [14:54:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [14:54:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [14:54:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T344589)', diff saved to https://phabricator.wikimedia.org/P50946 and previous config saved to /var/cache/conftool/dbconfig/20230822-145442-ladsgroup.json [14:55:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T344589)', diff saved to https://phabricator.wikimedia.org/P50947 and 
previous config saved to /var/cache/conftool/dbconfig/20230822-145544-ladsgroup.json [14:56:08] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host webperf2003.codfw.wmnet [14:56:28] (03PS1) 10Stevemunene: datahub: set the oidc client authentication method [deployment-charts] - 10https://gerrit.wikimedia.org/r/951518 (https://phabricator.wikimedia.org/T305874) [14:58:40] (03PS2) 10Stevemunene: datahub: set the oidc client authentication method [deployment-charts] - 10https://gerrit.wikimedia.org/r/951518 (https://phabricator.wikimedia.org/T305874) [14:59:45] !log tools.stashbot stat1008: Remove `aswiki` from `/srv/published/datasets/one-off/research-mwaddlink/wikis.txt` (T344319) [14:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:49] T344319: Remove models with poor evaluation metrics from the published datasets repo - https://phabricator.wikimedia.org/T344319 [15:00:05] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2003.codfw.wmnet [15:00:28] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host webperf1003.eqiad.wmnet [15:00:43] kevinbazira: ty for the update! fwiw, there's no need to prefix your log with `tools.stashbot` here. 
[15:01:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T344589)', diff saved to https://phabricator.wikimedia.org/P50948 and previous config saved to /var/cache/conftool/dbconfig/20230822-150102-ladsgroup.json [15:01:35] urbanecm: Thank you for the clarification :) [15:01:41] np [15:01:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T344589)', diff saved to https://phabricator.wikimedia.org/P50949 and previous config saved to /var/cache/conftool/dbconfig/20230822-150153-ladsgroup.json [15:01:54] (03PS1) 10Hnowlan: kubernetes: add new kubernetes nodes to calico [puppet] - 10https://gerrit.wikimedia.org/r/951522 (https://phabricator.wikimedia.org/T343993) [15:03:35] (03PS1) 10Jbond: puppetserver::rsync: open firwall port [puppet] - 10https://gerrit.wikimedia.org/r/951523 (https://phabricator.wikimedia.org/T341056) [15:03:58] !log stat1008: Remove `aswiki` from the published datasets repo `/srv/published/datasets/one-off/research-mwaddlink` (T344319) [15:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:18] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2055.codfw.wmnet with OS bullseye [15:04:29] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host kubernetes2055.codfw.wmnet with OS bullseye completed: - kuberne... 
[15:04:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42973/console" [puppet] - 10https://gerrit.wikimedia.org/r/951523 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [15:04:55] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1003.eqiad.wmnet [15:06:35] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host arclamp2001.codfw.wmnet [15:07:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42974/console" [puppet] - 10https://gerrit.wikimedia.org/r/951523 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [15:07:17] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2056.codfw.wmnet with OS bullseye [15:07:23] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs10[11,14,17,20].eqiad.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001 [15:07:27] !log installing hdf5 security updates [15:07:27] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299 [15:07:28] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host kubernetes2056.codfw.wmnet with OS bullseye completed: - kuberne... 
[15:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:54] (03PS1) 10Hnowlan: kubernetes: add new nodes [puppet] - 10https://gerrit.wikimedia.org/r/951524 (https://phabricator.wikimedia.org/T343993) [15:09:36] PROBLEM - Host ms-backup1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:09:42] (03CR) 10Eevans: [C: 03+2] aqs: upgrade rack3 nodes to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/951479 (https://phabricator.wikimedia.org/T339299) (owner: 10Eevans) [15:10:26] RECOVERY - Host ms-backup1001 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [15:10:28] ^that is me [15:10:30] fixing [15:10:34] PROBLEM - Host ms-backup1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:10:56] ms-backup is me [15:11:10] I am silencing then [15:11:18] RECOVERY - Host ms-backup1002 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [15:11:29] (03CR) 10Effie Mouzeli: [C: 03+1] kubernetes: add new kubernetes nodes to calico [puppet] - 10https://gerrit.wikimedia.org/r/951522 (https://phabricator.wikimedia.org/T343993) (owner: 10Hnowlan) [15:12:01] (03CR) 10Effie Mouzeli: [C: 03+1] kubernetes: add new nodes [puppet] - 10https://gerrit.wikimedia.org/r/951524 (https://phabricator.wikimedia.org/T343993) (owner: 10Hnowlan) [15:12:17] !nowandnext [15:12:19] jouncebot: nowandnext [15:12:19] No deployments scheduled for the next 0 hour(s) and 47 minute(s) [15:12:19] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver::rsync: open firwall port [puppet] - 10https://gerrit.wikimedia.org/r/951523 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [15:12:19] In 0 hour(s) and 47 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T1600) [15:12:21] :( [15:12:28] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs10[12,15,18,21].eqiad.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001 [15:12:32] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - 
https://phabricator.wikimedia.org/T339299 [15:12:46] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host arclamp2001.codfw.wmnet [15:15:03] (03CR) 10Hnowlan: [C: 03+2] kubernetes: add new kubernetes nodes to calico [puppet] - 10https://gerrit.wikimedia.org/r/951522 (https://phabricator.wikimedia.org/T343993) (owner: 10Hnowlan) [15:16:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P50951 and previous config saved to /var/cache/conftool/dbconfig/20230822-151608-ladsgroup.json [15:17:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P50952 and previous config saved to /var/cache/conftool/dbconfig/20230822-151700-ladsgroup.json [15:17:25] (03PS1) 10Filippo Giunchedi: sre: add bandaid alert for prometheus not reloading its k8s certs [alerts] - 10https://gerrit.wikimedia.org/r/951526 (https://phabricator.wikimedia.org/T343529) [15:18:32] (03CR) 10Hnowlan: [C: 03+2] kubernetes: add new nodes [puppet] - 10https://gerrit.wikimedia.org/r/951524 (https://phabricator.wikimedia.org/T343993) (owner: 10Hnowlan) [15:20:50] PROBLEM - Check systemd state on kubernetes2024 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:02] PROBLEM - Check systemd state on kubernetes1026 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:16] (03PS1) 10Muehlenhoff: Add library hint for hdf5 [puppet] - 10https://gerrit.wikimedia.org/r/951527 [15:22:40] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:23:44] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for hdf5 [puppet] - 
10https://gerrit.wikimedia.org/r/951527 (owner: 10Muehlenhoff) [15:24:28] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1026 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:25:50] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:34] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2024 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:29:17] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs10[12,15,18,21].eqiad.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001 [15:29:23] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299 [15:30:11] (03PS1) 10Ssingh: site: add wikidough VMs for esams [puppet] - 10https://gerrit.wikimedia.org/r/951528 (https://phabricator.wikimedia.org/T344355) [15:30:50] (03CR) 10Ssingh: [C: 03+2] site: add wikidough VMs for esams [puppet] - 10https://gerrit.wikimedia.org/r/951528 (https://phabricator.wikimedia.org/T344355) (owner: 10Ssingh) [15:31:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P50953 and previous config saved to /var/cache/conftool/dbconfig/20230822-153115-ladsgroup.json [15:32:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P50954 and previous config saved to /var/cache/conftool/dbconfig/20230822-153206-ladsgroup.json [15:32:14] 10SRE, 10ops-codfw, 10collaboration-services, 
10decommission-hardware: Decommission contint2001.wikimedia.org - https://phabricator.wikimedia.org/T342017 (10Jhancock.wm) 05Open→03Resolved [15:33:25] 10SRE, 10Traffic: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (10AlexisJazz) >>! In T344704#9109942, @Johannnes89 wrote: > Note: Multiple dewiki users were reporting a similar problem regarding the IP 10.80.1.7 which also doesn't belong to tho... [15:35:58] (03PS1) 10Jbond: puppetserver::rsync: fix dir and ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/951529 (https://phabricator.wikimedia.org/T341056) [15:37:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42975/console" [puppet] - 10https://gerrit.wikimedia.org/r/951529 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [15:38:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42976/console" [puppet] - 10https://gerrit.wikimedia.org/r/951529 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [15:39:02] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver::rsync: fix dir and ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/951529 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [15:40:21] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_eqsin and A:cp [15:41:16] (03PS1) 10Arlolra: Remove deprecated config wgVisualEditorParsoidAutoConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951530 [15:41:19] (03PS1) 10Ssingh: site: remove older references to doh300[34] [puppet] - 10https://gerrit.wikimedia.org/r/951531 (https://phabricator.wikimedia.org/T344355) [15:42:19] (03CR) 10Ssingh: [C: 03+2] site: remove older references to doh300[34] [puppet] - 10https://gerrit.wikimedia.org/r/951531 (https://phabricator.wikimedia.org/T344355) (owner: 10Ssingh) 
[15:42:35] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Jhancock.wm) [15:45:40] PROBLEM - Blazegraph process -wdqs-categories- on wdqs1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:45:46] PROBLEM - Blazegraph Port for wdqs-categories on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:46:00] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10Patch-For-Review, and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Krinkle) [15:46:03] (03CR) 10FNegri: [C: 03+1] admin: deprecate labtest-roots group [puppet] - 10https://gerrit.wikimedia.org/r/951469 (https://phabricator.wikimedia.org/T337848) (owner: 10Jbond) [15:46:18] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T344589)', diff saved to https://phabricator.wikimedia.org/P50955 and previous config saved to /var/cache/conftool/dbconfig/20230822-154621-ladsgroup.json [15:46:28] (03CR) 10FNegri: [C: 03+1] wmcs: add wmcs-roots to roles where it is missing [puppet] - 10https://gerrit.wikimedia.org/r/923681 (owner: 10Jbond) [15:47:02] (03PS1) 10Ssingh: hiera: update authorized_hosts for acme_chief for WDNS [puppet] - 10https://gerrit.wikimedia.org/r/951532 (https://phabricator.wikimedia.org/T344355) [15:47:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T344589)', diff saved to 
https://phabricator.wikimedia.org/P50956 and previous config saved to /var/cache/conftool/dbconfig/20230822-154712-ladsgroup.json [15:47:42] (SystemdUnitFailed) firing: (2) wdqs-blazegraph.service Failed on wdqs1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:48:23] (03PS1) 10JMeybohm: deployment_server: Add jaeger user to aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/951533 (https://phabricator.wikimedia.org/T344253) [15:50:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T344589)', diff saved to https://phabricator.wikimedia.org/P50957 and previous config saved to /var/cache/conftool/dbconfig/20230822-155025-ladsgroup.json [15:50:35] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42977/console" [puppet] - 10https://gerrit.wikimedia.org/r/951533 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [15:52:44] (03PS1) 10Jbond: ferm::service: make port optional so we can use port_range [puppet] - 10https://gerrit.wikimedia.org/r/951534 [15:52:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [15:53:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [15:54:49] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:56:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:56:27] (03CR) 10Jbond: "lgtm but see comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/951512 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:56:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with 
reboot policy FORCED [15:57:40] !log sukhe@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh3003.wikimedia.org [15:57:41] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [15:58:29] !log sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 8 --disk 15 --network public --os bullseye --cluster esams01 --group BY27 -t T344355 doh3003 [15:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:33] T344355: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 [15:58:49] (03CR) 10Bartosz Dziewoński: "I have a bigger cleanup at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/949593 . It should be good to go, but I've been " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951530 (owner: 10Arlolra) [15:59:12] (03CR) 10Bartosz Dziewoński: [C: 03+1] Remove deprecated config wgVisualEditorParsoidAutoConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951530 (owner: 10Arlolra) [15:59:44] (03CR) 10Vgutierrez: [C: 03+1] hiera: update authorized_hosts for acme_chief for WDNS [puppet] - 10https://gerrit.wikimedia.org/r/951532 (https://phabricator.wikimedia.org/T344355) (owner: 10Ssingh) [15:59:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2023.codfw.wmnet with OS bullseye [16:00:05] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:01:00] (03PS1) 10Eevans: aqs: upgrade codfw/a_c nodes to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/951535 (https://phabricator.wikimedia.org/T339299) [16:01:02] (03PS1) 10Eevans: aqs: upgrade codfw/b_c nodes to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/951536 (https://phabricator.wikimedia.org/T339299) [16:01:04] (03PS1) 10Eevans: aqs: upgrade codfw/c_f nodes to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/951537 (https://phabricator.wikimedia.org/T339299) [16:02:42] (SystemdUnitFailed) firing: (2) wdqs-blazegraph.service Failed on wdqs1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:03:13] (03CR) 10Ssingh: [C: 03+2] hiera: update authorized_hosts for acme_chief for WDNS [puppet] - 10https://gerrit.wikimedia.org/r/951532 (https://phabricator.wikimedia.org/T344355) (owner: 10Ssingh) [16:04:56] authdns-update is failing because of duplicate IPs [16:04:57] resolving [16:05:06] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:05:10] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doh3003.wikimedia.org [16:05:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P50958 and previous config saved to /var/cache/conftool/dbconfig/20230822-160532-ladsgroup.json [16:05:56] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [16:06:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2023.codfw.wmnet with reason: host reimage [16:07:42] (SystemdUnitFailed) firing: (3) wdqs-blazegraph.service Failed on wdqs1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:50] authdns-update is back [16:07:55] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: clear wikidough ips - sukhe@cumin2002" [16:08:34] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: clear wikidough ips - sukhe@cumin2002" [16:08:34] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:09:04] !log sukhe@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh3003.wikimedia.org [16:09:05] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [16:09:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2023.codfw.wmnet with reason: host reimage [16:11:04] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3003.wikimedia.org - sukhe@cumin2002" [16:11:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3003.wikimedia.org - sukhe@cumin2002" [16:11:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:11:50] !log sukhe@cumin2002 START - Cookbook sre.dns.wipe-cache doh3003.wikimedia.org on all recursors [16:11:53] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh3003.wikimedia.org on all recursors [16:11:58] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [16:12:59] (03PS5) 10Slyngshede: Facter: Python version [puppet] - 10https://gerrit.wikimedia.org/r/942641 (https://phabricator.wikimedia.org/T271196) [16:13:34] (03CR) 10CI reject: [V: 04-1] Facter: Python version [puppet] - 10https://gerrit.wikimedia.org/r/942641 
(https://phabricator.wikimedia.org/T271196) (owner: 10Slyngshede) [16:13:37] (03PS6) 10Slyngshede: Facter: Python version [puppet] - 10https://gerrit.wikimedia.org/r/942641 (https://phabricator.wikimedia.org/T271196) [16:13:50] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM doh3003.wikimedia.org - sukhe@cumin2002" [16:14:37] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM doh3003.wikimedia.org - sukhe@cumin2002" [16:14:37] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:14:37] !log sukhe@cumin2002 START - Cookbook sre.dns.wipe-cache doh3003.wikimedia.org on all recursors [16:14:40] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh3003.wikimedia.org on all recursors [16:14:43] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doh3003.wikimedia.org [16:18:59] (03Abandoned) 10Slyngshede: Allow users to be created in MediaWiki. [software/bitu] - 10https://gerrit.wikimedia.org/r/935376 (owner: 10Slyngshede) [16:20:36] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [16:20:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P50959 and previous config saved to /var/cache/conftool/dbconfig/20230822-162038-ladsgroup.json [16:21:25] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons. 
[16:23:58] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [16:25:44] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:25:50] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: re-add wikidough ips - sukhe@cumin2002" [16:26:48] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: re-add wikidough ips - sukhe@cumin2002" [16:26:48] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:34:20] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:34:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2023.codfw.wmnet with OS bullseye [16:34:28] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2023.codfw.wmnet with OS bullseye completed: - wdqs2023 (**WARN... 
[16:34:42] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: jnl export [16:34:55] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: jnl export [16:35:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T344589)', diff saved to https://phabricator.wikimedia.org/P50960 and previous config saved to /var/cache/conftool/dbconfig/20230822-163544-ladsgroup.json [16:35:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [16:36:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [16:36:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T344589)', diff saved to https://phabricator.wikimedia.org/P50961 and previous config saved to /var/cache/conftool/dbconfig/20230822-163609-ladsgroup.json [16:41:35] (03PS4) 10Hnowlan: WIP helmfile: add namespace and service definition for geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400) [16:41:37] (03PS1) 10Hnowlan: helmfile: add entries and namespace for media-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951544 (https://phabricator.wikimedia.org/T336400) [16:42:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T344589)', diff saved to https://phabricator.wikimedia.org/P50962 and previous config saved to /var/cache/conftool/dbconfig/20230822-164229-ladsgroup.json [16:44:00] (03PS5) 10Hnowlan: helmfile: add namespace and service definition for geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400) [16:45:57] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: 
Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Papaul) [16:48:36] 10SRE-OnFire, 10Incident Tooling: implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10BCornwall) [16:48:53] 10SRE-OnFire, 10Incident Tooling: implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10BCornwall) a:03BCornwall [16:51:28] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2004.codfw.wmnet [16:53:28] (03PS2) 10Hnowlan: helmfile: add entries and namespace for media-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951544 (https://phabricator.wikimedia.org/T336380) [16:53:32] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Papaul) 05Open→03Resolved complete [16:55:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:55:36] (03CR) 10Eevans: [C: 03+2] aqs: upgrade codfw/a_c nodes to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/951535 (https://phabricator.wikimedia.org/T339299) (owner: 10Eevans) [16:55:55] 10SRE, 10ops-codfw, 10serviceops: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10hnowlan) 05Open→03Resolved a:03hnowlan [16:56:03] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (10hnowlan) [16:56:15] 10SRE, 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster - 
https://phabricator.wikimedia.org/T343993 (10hnowlan) [16:57:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P50963 and previous config saved to /var/cache/conftool/dbconfig/20230822-165736-ladsgroup.json [16:57:44] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2004.codfw.wmnet [16:58:33] (03PS1) 10Hnowlan: service: move thumbor from thumbor pool to kubesvc [puppet] - 10https://gerrit.wikimedia.org/r/951545 (https://phabricator.wikimedia.org/T334488) [16:58:44] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs200[2-4].codfw.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001 [16:58:48] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299 [17:00:04] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:00:06] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T1700) [17:00:11] (03PS1) 10Hnowlan: conftool: clean up thumbor pools [puppet] - 10https://gerrit.wikimedia.org/r/951546 (https://phabricator.wikimedia.org/T334488) [17:01:02] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2003.codfw.wmnet [17:01:08] RECOVERY - Host maps2009 is UP: PING OK - Packet loss = 0%, RTA = 31.82 ms [17:01:18] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service,ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:07] 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Stats, 10observability, and 5 others: MediaWiki Prometheus support - 
https://phabricator.wikimedia.org/T240685 (10colewhite) [17:06:50] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts doh3003.wikimedia.org [17:07:37] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2003.codfw.wmnet [17:09:40] (03PS1) 10Abijeet Patro: ext.uls.interface.js: Inline isNamed() method [extensions/UniversalLanguageSelector] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/951487 (https://phabricator.wikimedia.org/T344635) [17:09:47] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (10hnowlan) 05Open→03Resolved [17:09:59] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan) [17:10:02] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2002.codfw.wmnet [17:10:22] PROBLEM - Host maps2009 is DOWN: PING CRITICAL - Packet loss = 100% [17:10:46] ^ possibly an expiry? [17:10:47] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [17:10:53] oh nm [17:12:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P50964 and previous config saved to /var/cache/conftool/dbconfig/20230822-171242-ladsgroup.json [17:12:44] Got the parts to fix maps2009 up and working on some last adjustments. Should be good in a few minutes [17:12:45] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh3003.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [17:13:17] JennH: nice, thanks! 
[17:13:38] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs200[2-4].codfw.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001 [17:13:40] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh3003.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [17:13:40] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:13:41] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh3003.wikimedia.org [17:13:42] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299 [17:13:55] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `doh3003.wikimedia.org` - doh3003.wikimedia.org (**WARN**) - //Hos... 
[17:14:52] !log sukhe@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh3003.wikimedia.org [17:14:53] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [17:15:44] (03PS2) 10Hnowlan: service, conftool: add base configuration for geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947864 (https://phabricator.wikimedia.org/T336400) [17:15:46] (03PS1) 10Hnowlan: kubernetes: add users for media_analytics service [puppet] - 10https://gerrit.wikimedia.org/r/951547 (https://phabricator.wikimedia.org/T336380) [17:16:21] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2002.codfw.wmnet [17:16:50] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3003.wikimedia.org - sukhe@cumin2002" [17:17:34] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3003.wikimedia.org - sukhe@cumin2002" [17:17:34] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:17:34] !log sukhe@cumin2002 START - Cookbook sre.dns.wipe-cache doh3003.wikimedia.org on all recursors [17:17:38] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh3003.wikimedia.org on all recursors [17:18:05] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh3003.wikimedia.org - sukhe@cumin2002" [17:18:51] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh3003.wikimedia.org - sukhe@cumin2002" [17:19:07] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host doh3003.wikimedia.org with OS bullseye [17:19:20] 10SRE, 10ops-knams, 10DC-Ops, 
10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host doh3003.wikimedia.org with OS bullseye [17:24:33] !log joal@deploy1002 Started deploy [analytics/refinery@d62f281] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d62f281] [17:26:34] !log joal@deploy1002 Finished deploy [analytics/refinery@d62f281] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d62f281] (duration: 02m 01s) [17:26:38] (03CR) 10Eevans: [C: 03+2] aqs: upgrade codfw/b_c nodes to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/951536 (https://phabricator.wikimedia.org/T339299) (owner: 10Eevans) [17:27:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T344589)', diff saved to https://phabricator.wikimedia.org/P50965 and previous config saved to /var/cache/conftool/dbconfig/20230822-172748-ladsgroup.json [17:30:00] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs200[5-8].codfw.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001 [17:30:09] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299 [17:34:30] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 377105848 and 205 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:39:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:39:16] 
(MediaWikiLatencyExceeded) firing: Average latency high: eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:39:24] RECOVERY - Host maps2009 is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [17:39:24] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 756493264 and 1911 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:39:27] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh3003.wikimedia.org with reason: host reimage [17:39:48] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 838500048 and 1934 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:39:48] PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 840198232 and 1934 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:40:00] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 640312 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:40:08] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 375110064 and 824 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:40:18] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:40:40] RECOVERY - Postgres Replication Lag on maps2005 
is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1216568 and 857 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:41:12] RECOVERY - Postgres Replication Lag on maps2007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 233656 and 889 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:41:12] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 233656 and 889 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:41:34] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5616 and 910 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:41:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:41:40] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T344353 (10Jhancock.wm) 05Open→03Resolved got the root problem fixed, details in other ticket [17:42:37] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh3003.wikimedia.org with reason: host reimage [17:42:57] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:44:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:46:14] 10ops-codfw, 10serviceops-radar, 10Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (10Jhancock.wm) The backplane has been replaced and the settings in idrac have been updated. I think it's good to go back. I will leave this ticket up for a day to observe. Please let me k... [17:46:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:47:55] (03PS1) 10Bking: rdf-streaming-updater-dse-k8s: Add Zookeeper HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/951551 (https://phabricator.wikimedia.org/T344614) [17:48:33] 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Stats, 10observability, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10colewhite) [17:48:44] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs200[5-8].codfw.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001 [17:48:49] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299 [17:56:53] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh3003.wikimedia.org with OS bullseye [17:56:54] !log sukhe@cumin2002 END (PASS) - Cookbook 
sre.ganeti.makevm (exit_code=0) for new host doh3003.wikimedia.org [17:57:12] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host doh3003.wikimedia.org with OS bullseye completed: - doh3003 (**PASS... [17:57:44] !log sukhe@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh3004.wikimedia.org [17:57:45] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [17:58:37] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2001.codfw.wmnet [17:59:42] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3004.wikimedia.org - sukhe@cumin2002" [18:00:05] dduvall and dancy: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T1800). 
nyaa~ [18:01:02] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3004.wikimedia.org - sukhe@cumin2002" [18:01:02] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:01:03] !log sukhe@cumin2002 START - Cookbook sre.dns.wipe-cache doh3004.wikimedia.org on all recursors [18:01:06] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh3004.wikimedia.org on all recursors [18:01:30] (03Abandoned) 10Arlolra: Remove deprecated config wgVisualEditorParsoidAutoConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951530 (owner: 10Arlolra) [18:01:35] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh3004.wikimedia.org - sukhe@cumin2002" [18:02:00] (03CR) 10Arlolra: Remove deprecated config wgVisualEditorParsoidAutoConfig (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951530 (owner: 10Arlolra) [18:02:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh3004.wikimedia.org - sukhe@cumin2002" [18:02:57] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host doh3004.wikimedia.org with OS bullseye [18:03:09] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host doh3004.wikimedia.org with OS bullseye [18:04:33] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2001.codfw.wmnet [18:06:33] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1004.eqiad.wmnet [18:10:03] (ProbeDown) 
firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:12:55] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1004.eqiad.wmnet [18:15:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:16:42] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:17:32] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1003.eqiad.wmnet [18:17:42] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh3004.wikimedia.org with reason: host reimage [18:18:24] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951555 (https://phabricator.wikimedia.org/T343725) [18:18:26] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951555 (https://phabricator.wikimedia.org/T343725) (owner: 10TrainBranchBot) [18:19:09] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951555 (https://phabricator.wikimedia.org/T343725) (owner: 10TrainBranchBot) [18:21:14] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
doh3004.wikimedia.org with reason: host reimage [18:23:38] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1003.eqiad.wmnet [18:26:07] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1002.eqiad.wmnet [18:28:06] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.23 refs T343725 [18:28:11] T343725: 1.41.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T343725 [18:28:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:29:18] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:32:25] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1002.eqiad.wmnet [18:33:33] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:33:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:33:35] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1001.eqiad.wmnet 
[18:36:42] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:37:25] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh3004.wikimedia.org with OS bullseye [18:37:25] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh3004.wikimedia.org [18:37:38] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host doh3004.wikimedia.org with OS bullseye completed: - doh3004 (**PASS... [18:41:29] !log decommissioning doh3004 as it was added in the same ganeti cluster as 3003 [18:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:42] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:42:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts doh3004.wikimedia.org [18:43:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:44:59] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1001.eqiad.wmnet [18:46:13] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [18:47:03] (03CR) 10Btullis: 
rdf-streaming-updater-dse-k8s: Add Zookeeper HA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951551 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [18:48:03] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:48:07] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh3004.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [18:48:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh3004.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [18:48:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:48:57] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh3004.wikimedia.org [18:49:08] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `doh3004.wikimedia.org` - doh3004.wikimedia.org (**PASS**) - Downt... 
[18:49:53] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus6002.drmrs.wmnet [18:51:38] (03PS5) 10Bking: query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) [18:51:42] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:51:46] !log sukhe@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh3004.wikimedia.org [18:51:48] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [18:52:06] (03PS6) 10Bking: query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) [18:52:20] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [18:53:48] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3004.wikimedia.org - sukhe@cumin2002" [18:54:32] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3004.wikimedia.org - sukhe@cumin2002" [18:54:32] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:54:32] !log sukhe@cumin2002 START - Cookbook sre.dns.wipe-cache doh3004.wikimedia.org on all recursors [18:54:36] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh3004.wikimedia.org on all recursors [18:55:04] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh3004.wikimedia.org - 
sukhe@cumin2002" [18:55:17] (03CR) 10Gehel: [C: 03+1] query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [18:55:19] (03CR) 10Ryan Kemper: [C: 03+1] query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [18:55:41] (03CR) 10Bking: [C: 03+2] query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [18:55:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh3004.wikimedia.org - sukhe@cumin2002" [18:55:58] (03CR) 10Bking: [C: 03+2] query_service: let puppet manage whitelist (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [18:56:06] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus6002.drmrs.wmnet [18:56:12] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host doh3004.wikimedia.org with OS bullseye [18:56:26] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus5002.eqsin.wmnet [18:56:28] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host doh3004.wikimedia.org with OS bullseye [19:00:10] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:01:00] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.259 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:01:42] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:02:52] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus5002.eqsin.wmnet [19:03:20] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus4002.ulsfo.wmnet [19:09:10] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus4002.ulsfo.wmnet [19:09:34] (03PS1) 10Bking: query_service: Set correct path for allowlist [puppet] - 10https://gerrit.wikimedia.org/r/951562 (https://phabricator.wikimedia.org/T343856) [19:10:06] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus3003.esams.wmnet [19:10:17] (03CR) 10Ryan Kemper: [C: 03+1] query_service: Set correct path for allowlist [puppet] - 10https://gerrit.wikimedia.org/r/951562 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [19:10:32] (03CR) 10Bking: [C: 03+2] query_service: Set correct path for allowlist [puppet] - 10https://gerrit.wikimedia.org/r/951562 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [19:10:40] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Jclark-ctr) @wiki_willy @RobH we do not have any replacement ssd for this server and is out of warranty. 
we would need to order replacement [19:12:38] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh3004.wikimedia.org with reason: host reimage [19:14:32] 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10Htriedman) Hi all! It's been a few weeks without activity, so I'm following up on this request. It seems to me there are two remai... [19:16:04] (03PS1) 10Bking: query_service: Set correct path for allowlist in blazegraph.pp [puppet] - 10https://gerrit.wikimedia.org/r/951563 (https://phabricator.wikimedia.org/T343856) [19:16:12] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:16:17] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3003.esams.wmnet [19:16:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh3004.wikimedia.org with reason: host reimage [19:16:27] (03CR) 10CI reject: [V: 04-1] query_service: Set correct path for allowlist in blazegraph.pp [puppet] - 10https://gerrit.wikimedia.org/r/951563 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [19:16:28] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:16:59] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10RobH) [19:17:33] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet [19:17:50] (03PS2) 10Bking: query_service: Set correct path for allowlist in blazegraph.pp [puppet] - 10https://gerrit.wikimedia.org/r/951563 (https://phabricator.wikimedia.org/T343856) [19:18:06] RECOVERY - mailman list info on lists1001 is 
OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.289 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:18:15] (03CR) 10Ryan Kemper: [C: 03+1] query_service: Set correct path for allowlist in blazegraph.pp [puppet] - 10https://gerrit.wikimedia.org/r/951563 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [19:18:28] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 4.812 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:18:31] (03CR) 10Bking: [C: 03+2] query_service: Set correct path for allowlist in blazegraph.pp [puppet] - 10https://gerrit.wikimedia.org/r/951563 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [19:21:42] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:25:57] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2006.codfw.wmnet [19:26:38] (03PS1) 10Bking: query_service: move allowlist file resource [puppet] - 10https://gerrit.wikimedia.org/r/951565 (https://phabricator.wikimedia.org/T343856) [19:27:00] (03CR) 10CI reject: [V: 04-1] query_service: move allowlist file resource [puppet] - 10https://gerrit.wikimedia.org/r/951565 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [19:27:20] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) 05Open→03Resolved a:03thcipriani This task got too big to be useful. I've broken down each individual row in the table fro... 
[19:27:40] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:28:10] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [19:28:14] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:28:26] (03PS2) 10Bking: query_service: move allowlist file resource [puppet] - 10https://gerrit.wikimedia.org/r/951565 (https://phabricator.wikimedia.org/T343856) [19:28:46] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:29:12] (03CR) 10Ryan Kemper: [C: 03+1] query_service: move allowlist file resource [puppet] - 10https://gerrit.wikimedia.org/r/951565 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [19:29:16] (03CR) 10Bking: [C: 03+2] query_service: move allowlist file resource [puppet] - 10https://gerrit.wikimedia.org/r/951565 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [19:29:36] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:29:56] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:30:16] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:26] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:32:57] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh3004.wikimedia.org with OS bullseye [19:32:57] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh3004.wikimedia.org [19:33:10] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [19:33:15] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet [19:33:26] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host doh3004.wikimedia.org with OS bullseye completed: - doh3004 (**PASS... 
[19:33:35] (03PS1) 10Bking: query_service: move allowlist file resource [puppet] - 10https://gerrit.wikimedia.org/r/951566 (https://phabricator.wikimedia.org/T343856) [19:35:43] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951566 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [19:36:39] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42978/console" [puppet] - 10https://gerrit.wikimedia.org/r/951566 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [19:36:42] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:42:17] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1006.eqiad.wmnet [19:43:10] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [19:48:10] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [19:48:53] (03PS1) 10Bking: query service: rollback allowlist changes [puppet] - 10https://gerrit.wikimedia.org/r/951576 (https://phabricator.wikimedia.org/T343856) [19:49:34] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:56:51] (03PS2) 10Bking: query_service: rollback allowlist changes [puppet] - 10https://gerrit.wikimedia.org/r/951576 (https://phabricator.wikimedia.org/T343856) [19:57:04] (03CR) 10CI reject: [V: 04-1] query_service: rollback allowlist changes [puppet] - 10https://gerrit.wikimedia.org/r/951576 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230822T2000). [20:00:04] Dreamy_Jazz and gmodena: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:36] i can deploy [20:00:37] \o [20:00:57] (03PS2) 10Urbanecm: clienthints: Collect Client Hints data on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951431 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [20:01:00] (03CR) 10Urbanecm: [C: 03+2] clienthints: Collect Client Hints data on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951431 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [20:01:04] fingers crossed :)) [20:01:11] :D [20:01:15] \o [20:01:45] (03Merged) 10jenkins-bot: clienthints: Collect Client Hints data on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951431 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [20:01:46] hey gmodena! [20:02:02] (03PS3) 10Bking: query_service: rollback allowlist changes [puppet] - 10https://gerrit.wikimedia.org/r/951576 (https://phabricator.wikimedia.org/T343856) [20:02:04] urbanecm hey there! [20:02:39] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:951431|clienthints: Collect Client Hints data on all wikis (T341110)]] [20:02:44] T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110 [20:04:17] !log urbanecm@deploy1002 dreamyjazz and urbanecm: Backport for [[gerrit:951431|clienthints: Collect Client Hints data on all wikis (T341110)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:04:27] Dreamy_Jazz: mind testing? :) [20:04:30] Sure. 
[20:04:39] (03PS1) 10Bking: Revert "query_service: let puppet manage whitelist" [puppet] - 10https://gerrit.wikimedia.org/r/951577 [20:05:23] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Valerie Riley - https://phabricator.wikimedia.org/T344770 (10VRiley-WMF) [20:06:06] (03PS2) 10Bking: Revert "query_service: let puppet manage whitelist" [puppet] - 10https://gerrit.wikimedia.org/r/951577 [20:06:47] (03CR) 10Ryan Kemper: [C: 03+1] Revert "query_service: let puppet manage whitelist" [puppet] - 10https://gerrit.wikimedia.org/r/951577 (owner: 10Bking) [20:07:50] (03CR) 10Bking: [C: 03+2] Revert "query_service: let puppet manage whitelist" [puppet] - 10https://gerrit.wikimedia.org/r/951577 (owner: 10Bking) [20:08:30] urbanecm: Test complete. Could you check that rows exist in cu_useragent_clienthints and cu_useragent_clienthints_map on enwiki? [20:08:35] sure [20:09:07] Dreamy_Jazz: 11 and 22 rows respectively. do you want to see the rows? [20:09:12] or should i proceed? [20:09:20] I don't think I need to see the rows necessarily. [20:09:27] Proceeding should be okay. [20:09:29] okay, so proceeding :) [20:09:34] !log urbanecm@deploy1002 dreamyjazz and urbanecm: Continuing with sync [20:10:44] (03PS2) 10Urbanecm: Declare v1 of the page_content_change stream. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951444 (https://phabricator.wikimedia.org/T307959) (owner: 10Gmodena) [20:10:48] (03CR) 10Urbanecm: [C: 03+2] Declare v1 of the page_content_change stream. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951444 (https://phabricator.wikimedia.org/T307959) (owner: 10Gmodena) [20:11:00] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:13] (03Merged) 10jenkins-bot: Declare v1 of the page_content_change stream. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/951444 (https://phabricator.wikimedia.org/T307959) (owner: 10Gmodena) [20:15:09] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:951431|clienthints: Collect Client Hints data on all wikis (T341110)]] (duration: 12m 29s) [20:15:13] T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110 [20:15:19] Dreamy_Jazz: live :) [20:15:24] Thanks! [20:15:29] np [20:15:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:16:03] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:951444|Declare v1 of the page_content_change stream. (T307959)]] [20:16:07] T307959: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 [20:17:34] !log urbanecm@deploy1002 urbanecm and gmodena: Backport for [[gerrit:951444|Declare v1 of the page_content_change stream. (T307959)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:17:45] gmodena: do you mind testing at mwdebug1001, if possible? 
[20:19:25] urbanecm sure [20:19:34] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:01] (03Abandoned) 10Bking: query_service: rollback allowlist changes [puppet] - 10https://gerrit.wikimedia.org/r/951576 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [20:20:35] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:20:42] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:21:42] urbanecm tested on mwdebug1001, works as expected. Querying https://www.mediawiki.org/w/api.php?action=streamconfigs&streams=mediawiki.page_content_change.v1 returns the expected payload. 
[20:21:50] awesome, syncing
[20:21:52] !log urbanecm@deploy1002 urbanecm and gmodena: Continuing with sync
[20:24:22] PROBLEM - puppet last run on wcqs2001 is CRITICAL: CRITICAL: Puppet last ran 5 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:24:56] PROBLEM - puppet last run on wcqs2003 is CRITICAL: CRITICAL: Puppet last ran 5 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:25:18] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:25:44] PROBLEM - puppet last run on wcqs2002 is CRITICAL: CRITICAL: Puppet last ran 5 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:26:06] PROBLEM - puppet last run on wcqs1003 is CRITICAL: CRITICAL: Puppet last ran 5 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:26:56] PROBLEM - puppet last run on wcqs1002 is CRITICAL: CRITICAL: Puppet last ran 5 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:26:56] PROBLEM - puppet last run on wcqs1001 is CRITICAL: CRITICAL: Puppet last ran 5 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:27:22] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:951444|Declare v1 of the page_content_change stream. (T307959)]] (duration: 11m 19s)
[20:27:27] T307959: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959
[20:27:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:27:54] ^^ w[cd]qs puppet alerts should clear soon
[20:28:05] gmodena: should be live
[20:28:07] anything else?
[20:28:30] urbanecm it is. All looks good, and nothing else on my end
[20:28:35] 👍
[20:28:43] urbanecm many thanks!
[20:29:02] np
[20:29:32] !log bking@cumin1001 enable/run puppet on hosts after rollback T343856
[20:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:37] T343856: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856
[20:29:48] RECOVERY - puppet last run on wcqs2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:30:22] RECOVERY - puppet last run on wcqs2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:31:08] RECOVERY - puppet last run on wcqs2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:31:32] RECOVERY - puppet last run on wcqs1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:32:22] RECOVERY - puppet last run on wcqs1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:32:22] RECOVERY - puppet last run on wcqs1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:32:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:36:34] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:40:38] (CR) Eevans: [C: +2] aqs: upgrade codfw/c_f nodes to Cassandra 4.1.1 [puppet] - https://gerrit.wikimedia.org/r/951537 (https://phabricator.wikimedia.org/T339299) (owner: Eevans)
[20:44:17] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs20[09-12].codfw.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001
[20:44:22] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299
[20:48:51] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_eqsin and A:cp
[20:49:42] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:49:48] (CR) Bking: rdf-streaming-updater-dse-k8s: Add Zookeeper HA (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/951551 (https://phabricator.wikimedia.org/T344614) (owner: Bking)
[20:55:31] (PS1) Dduvall: gitlab: Support loading of local gems [puppet] - https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570)
[20:56:00] (CR) CI reject: [V: -1] gitlab: Support loading of local gems [puppet] - https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: Dduvall)
[20:58:17] (PS2) Dduvall: gitlab: Support loading of local gems [puppet] - https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570)
[21:02:27] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs20[09-12].codfw.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001
[21:02:32] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299
[21:04:04] (PS1) Ssingh: devices: add doh300[34] to asw1-b*27-esams [homer/public] - https://gerrit.wikimedia.org/r/951581
[21:05:29] (CR) Ssingh: "Please feel free to merge and run homer if there are other changes." [homer/public] - https://gerrit.wikimedia.org/r/951581 (owner: Ssingh)
[21:06:58] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for lwatson - https://phabricator.wikimedia.org/T344772 (lwatson)
[21:07:14] (PS2) Ssingh: devices: add doh300[34] to asw1-b*27-esams [homer/public] - https://gerrit.wikimedia.org/r/951581 (https://phabricator.wikimedia.org/T344355)
[21:07:53] SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (ssingh)
[21:13:38] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for lwatson - https://phabricator.wikimedia.org/T344772 (NHillard-WMF) I'm Lauralyn's manager and I approve this request!
[21:30:26] SRE, LDAP-Access-Requests: Access to Netbox - https://phabricator.wikimedia.org/T344764 (Peachey88)
[21:30:28] SRE, LDAP-Access-Requests: Grant Access to wmf for Valerie Riley - https://phabricator.wikimedia.org/T344770 (Peachey88)
[21:36:05] SRE, LDAP-Access-Requests: Grant Access to wmf for Valerie Riley - https://phabricator.wikimedia.org/T344770 (RhinosF1) Hi, You've put would like shell access. Netbox is nothing to do with shell access. If only read access is required, 'wmf' group will be enough.
[21:42:36] SRE, Observability-Logging, Wikimedia-Logstash, observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (colewhite) At this point, ingestion errors make up less than 0.006% of daily indexed log volume. There is still room to improve, but this task...
[21:47:11] (PS1) Eevans: aqs: move per-host settings back to role [puppet] - https://gerrit.wikimedia.org/r/951585 (https://phabricator.wikimedia.org/T339299)
[21:48:34] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/951585 (https://phabricator.wikimedia.org/T339299) (owner: Eevans)
[21:49:32] (PS2) Bartosz Dziewoński: Remove unused RESTBase-related VisualEditor config settings [mediawiki-config] - https://gerrit.wikimedia.org/r/949593 (https://phabricator.wikimedia.org/T341618)
[21:50:55] (CR) Eevans: [C: +2] aqs: move per-host settings back to role [puppet] - https://gerrit.wikimedia.org/r/951585 (https://phabricator.wikimedia.org/T339299) (owner: Eevans)
[22:15:29] PROBLEM - Host logstash2001 is DOWN: PING CRITICAL - Packet loss = 100%
[22:15:29] PROBLEM - Host logstash2026 is DOWN: PING CRITICAL - Packet loss = 100%
[22:16:59] RECOVERY - Host logstash2026 is UP: PING OK - Packet loss = 0%, RTA = 32.61 ms
[22:17:43] RECOVERY - Host logstash2001 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms
[22:26:42] SRE, ops-knams, DC-Ops, Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (cmooney)
[22:31:33] PROBLEM - Host logstash2002 is DOWN: PING CRITICAL - Packet loss = 100%
[22:33:19] RECOVERY - Host logstash2002 is UP: PING OK - Packet loss = 0%, RTA = 35.10 ms
[22:38:31] PROBLEM - Host logstash2027 is DOWN: PING CRITICAL - Packet loss = 100%
[22:39:45] RECOVERY - Host logstash2027 is UP: PING OK - Packet loss = 0%, RTA = 32.28 ms
[22:44:07] PROBLEM - Host logstash2003 is DOWN: PING CRITICAL - Packet loss = 100%
[22:44:59] ^^ is me
[22:45:03] RECOVERY - Host logstash2003 is UP: PING OK - Packet loss = 0%, RTA = 34.62 ms
[22:52:19] (PS1) Eevans: aqs: set legacy ssl port & optional encryption to false [puppet] - https://gerrit.wikimedia.org/r/951589 (https://phabricator.wikimedia.org/T339299)
[22:53:09] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/951589 (https://phabricator.wikimedia.org/T339299) (owner: Eevans)
[22:55:05] (CR) Eevans: [C: +2] aqs: set legacy ssl port & optional encryption to false [puppet] - https://gerrit.wikimedia.org/r/951589 (https://phabricator.wikimedia.org/T339299) (owner: Eevans)
[22:59:43] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Disable legacy SSL port — T339299 - eevans@cumin1001
[22:59:48] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299
[23:30:07] (ProbeDown) firing: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:30:07] (ProbeDown) firing: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:30:17] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - logs-api_443: Servers logstash2025.codfw.wmnet are marked down but pooled: kibana7_443: Servers logstash2025.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:30:25] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - logs-api_443: Servers logstash2025.codfw.wmnet, logstash2024.codfw.wmnet are marked down but pooled: kibana7_443: Servers logstash2025.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:30:33] 👋
[23:30:36] cwhite: any work still going?
[23:31:16] ah shoot, yeah
[23:31:46] sorry about the noise :(
[23:31:49] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:32:03] no worries! around if you need a pair of hands
[23:33:05] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:33:11] * sukhe here if required
[23:33:25] but recoveries are coming :)
[23:33:40] O.o
[23:33:54] just doing reboots on the standby cluster
[23:34:43] cool, stepping away for dinner in that case
[23:35:07] (ProbeDown) resolved: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:35:07] (ProbeDown) resolved: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:36:43] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:45:04] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Disable legacy SSL port — T339299 - eevans@cumin1001
[23:45:09] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299
[23:49:52] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Disable legacy SSL port — T339299 - eevans@cumin1001