[00:10:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P46368 and previous config saved to /var/cache/conftool/dbconfig/20230412-001051-ladsgroup.json [00:16:32] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:23:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T333332)', diff saved to https://phabricator.wikimedia.org/P46369 and previous config saved to /var/cache/conftool/dbconfig/20230412-002312-ladsgroup.json [00:23:19] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [00:25:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T333332)', diff saved to https://phabricator.wikimedia.org/P46370 and previous config saved to /var/cache/conftool/dbconfig/20230412-002557-ladsgroup.json [00:25:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1138.eqiad.wmnet with reason: Maintenance [00:26:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1138.eqiad.wmnet with reason: Maintenance [00:26:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T333332)', diff saved to https://phabricator.wikimedia.org/P46371 and previous config saved to /var/cache/conftool/dbconfig/20230412-002620-ladsgroup.json [00:26:32] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:28:29] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1138 (T333332)', diff saved to https://phabricator.wikimedia.org/P46372 and previous config saved to /var/cache/conftool/dbconfig/20230412-002829-ladsgroup.json [00:28:34] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [00:38:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P46373 and previous config saved to /var/cache/conftool/dbconfig/20230412-003819-ladsgroup.json [00:39:24] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/907831 [00:39:26] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/907831 (owner: TrainBranchBot) [00:40:16] (PS4) Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed [puppet] - https://gerrit.wikimedia.org/r/905243 [00:42:49] (CR) CI reject: [V: -1] maintain-dbusers: ensure get_global_wiki_user is only called when needed [puppet] - https://gerrit.wikimedia.org/r/905243 (owner: Raymond Ndibe) [00:43:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P46374 and previous config saved to /var/cache/conftool/dbconfig/20230412-004335-ladsgroup.json [00:46:59] RECOVERY - PHP opcache health on mw2351 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:49:11] (CR) Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (3 comments) [puppet] - https://gerrit.wikimedia.org/r/905243 (owner: Raymond Ndibe) [00:53:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to
https://phabricator.wikimedia.org/P46375 and previous config saved to /var/cache/conftool/dbconfig/20230412-005325-ladsgroup.json [00:57:13] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/907831 (owner: TrainBranchBot) [00:57:19] (CR) Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/905243 (owner: Raymond Ndibe) [00:58:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P46376 and previous config saved to /var/cache/conftool/dbconfig/20230412-005841-ladsgroup.json [01:00:21] (PS5) Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed [puppet] - https://gerrit.wikimedia.org/r/905243 [01:02:32] (CR) CI reject: [V: -1] maintain-dbusers: ensure get_global_wiki_user is only called when needed [puppet] - https://gerrit.wikimedia.org/r/905243 (owner: Raymond Ndibe) [01:07:10] SRE, Infrastructure-Foundations, Traffic, Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (Krinkle) [01:08:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T333332)', diff saved to https://phabricator.wikimedia.org/P46377 and previous config saved to /var/cache/conftool/dbconfig/20230412-010832-ladsgroup.json [01:08:37] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [01:13:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T333332)', diff saved to https://phabricator.wikimedia.org/P46378 and previous config saved to /var/cache/conftool/dbconfig/20230412-011348-ladsgroup.json [01:13:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1141.eqiad.wmnet with reason:
Maintenance [01:13:53] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [01:14:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1141.eqiad.wmnet with reason: Maintenance [01:14:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T333332)', diff saved to https://phabricator.wikimedia.org/P46379 and previous config saved to /var/cache/conftool/dbconfig/20230412-011411-ladsgroup.json [01:16:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T333332)', diff saved to https://phabricator.wikimedia.org/P46380 and previous config saved to /var/cache/conftool/dbconfig/20230412-011619-ladsgroup.json [01:16:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P46381 and previous config saved to /var/cache/conftool/dbconfig/20230412-013126-ladsgroup.json [01:37:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:42:00] SRE, MediaWiki-extensions-OAuth, The-Wikipedia-Library, Datacenter-Switchover, and 2 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (jsn.sherman) I wanted to follow up on the library side;...
[01:45:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:46:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P46382 and previous config saved to /var/cache/conftool/dbconfig/20230412-014632-ladsgroup.json [01:46:33] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:32] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:58:32] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:01:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T333332)', diff saved to https://phabricator.wikimedia.org/P46383 and previous config saved to /var/cache/conftool/dbconfig/20230412-020138-ladsgroup.json [02:01:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1142.eqiad.wmnet with reason: Maintenance [02:01:44] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [02:01:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1142.eqiad.wmnet with reason: Maintenance [02:02:02] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Depooling db1142 (T333332)', diff saved to https://phabricator.wikimedia.org/P46384 and previous config saved to /var/cache/conftool/dbconfig/20230412-020201-ladsgroup.json [02:03:32] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:04:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T333332)', diff saved to https://phabricator.wikimedia.org/P46385 and previous config saved to /var/cache/conftool/dbconfig/20230412-020410-ladsgroup.json [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:19:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P46386 and previous config saved to /var/cache/conftool/dbconfig/20230412-021916-ladsgroup.json [02:26:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P46387 and previous config saved to /var/cache/conftool/dbconfig/20230412-023422-ladsgroup.json [02:49:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T333332)', diff saved to https://phabricator.wikimedia.org/P46388 and previous config saved to 
/var/cache/conftool/dbconfig/20230412-024929-ladsgroup.json [02:49:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1143.eqiad.wmnet with reason: Maintenance [02:49:34] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [02:49:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1143.eqiad.wmnet with reason: Maintenance [02:49:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T333332)', diff saved to https://phabricator.wikimedia.org/P46389 and previous config saved to /var/cache/conftool/dbconfig/20230412-024952-ladsgroup.json [02:52:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T333332)', diff saved to https://phabricator.wikimedia.org/P46390 and previous config saved to /var/cache/conftool/dbconfig/20230412-025200-ladsgroup.json [03:06:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:07:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P46391 and previous config saved to /var/cache/conftool/dbconfig/20230412-030707-ladsgroup.json [03:10:45] RECOVERY - PHP opcache health on mw2353 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:15:39] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:15:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:18:51] PROBLEM - Check unit status of 
httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:22:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P46392 and previous config saved to /var/cache/conftool/dbconfig/20230412-032213-ladsgroup.json [03:37:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T333332)', diff saved to https://phabricator.wikimedia.org/P46393 and previous config saved to /var/cache/conftool/dbconfig/20230412-033719-ladsgroup.json [03:37:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [03:37:25] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [03:37:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [03:37:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46394 and previous config saved to /var/cache/conftool/dbconfig/20230412-033742-ladsgroup.json [03:39:00] (CR) Kevin Bazira: [C: +1] httpbb: remove tests from liftwing production [puppet] - https://gerrit.wikimedia.org/r/907809 (owner: Elukey) [03:39:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46395 and previous config saved to /var/cache/conftool/dbconfig/20230412-033951-ladsgroup.json [03:54:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P46396 and previous config saved to /var/cache/conftool/dbconfig/20230412-035457-ladsgroup.json [04:05:53] PROBLEM - Check systemd state
on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:10:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P46397 and previous config saved to /var/cache/conftool/dbconfig/20230412-041003-ladsgroup.json [04:13:43] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:15:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:22:31] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:25:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46398 and previous config saved to /var/cache/conftool/dbconfig/20230412-042510-ladsgroup.json [04:25:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [04:25:17] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [04:25:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [04:25:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [04:25:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [04:25:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 
(T333332)', diff saved to https://phabricator.wikimedia.org/P46399 and previous config saved to /var/cache/conftool/dbconfig/20230412-042552-ladsgroup.json [04:28:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46400 and previous config saved to /var/cache/conftool/dbconfig/20230412-042800-ladsgroup.json [04:36:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:43:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P46401 and previous config saved to /var/cache/conftool/dbconfig/20230412-044306-ladsgroup.json [04:46:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:58:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P46402 and previous config saved to /var/cache/conftool/dbconfig/20230412-045813-ladsgroup.json [05:11:29] (PS1) Krinkle: objectcache: Disable cool-off bounce feature [core] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/908027 [05:11:35] (PS3) Krinkle: objectcache: Disable cool-off bounce feature [core] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/902376 [05:11:39] (Abandoned) Krinkle: objectcache: Disable cool-off bounce feature [core] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/902376 (owner: Krinkle) [05:12:15] (CR) Krinkle: [C: +2] objectcache: Disable cool-off bounce feature [core] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/908027 (owner: Krinkle) [05:13:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling
after maintenance db1146:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46403 and previous config saved to /var/cache/conftool/dbconfig/20230412-051319-ladsgroup.json [05:13:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1147.eqiad.wmnet with reason: Maintenance [05:13:24] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [05:13:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1147.eqiad.wmnet with reason: Maintenance [05:13:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T333332)', diff saved to https://phabricator.wikimedia.org/P46404 and previous config saved to /var/cache/conftool/dbconfig/20230412-051342-ladsgroup.json [05:15:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T333332)', diff saved to https://phabricator.wikimedia.org/P46405 and previous config saved to /var/cache/conftool/dbconfig/20230412-051550-ladsgroup.json [05:21:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:23:05] (PS1) Marostegui: instances.yaml: Add db1222 to dbctl [puppet] - https://gerrit.wikimedia.org/r/908016 (https://phabricator.wikimedia.org/T326669) [05:23:43] (CR) Marostegui: [C: +2] instances.yaml: Add db1222 to dbctl [puppet] - https://gerrit.wikimedia.org/r/908016 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui) [05:25:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1222 to dbctl T326669', diff saved to https://phabricator.wikimedia.org/P46406 and previous config saved to /var/cache/conftool/dbconfig/20230412-052504-marostegui.json [05:25:10] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [05:26:27] (PS1) Marostegui:
db1222: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/908017 (https://phabricator.wikimedia.org/T326669) [05:27:02] (CR) Marostegui: [C: +2] db1222: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/908017 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui) [05:27:06] (Merged) jenkins-bot: objectcache: Disable cool-off bounce feature [core] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/908027 (owner: Krinkle) [05:27:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 1%: Pooling', diff saved to https://phabricator.wikimedia.org/P46407 and previous config saved to /var/cache/conftool/dbconfig/20230412-052733-root.json [05:29:05] * Krinkle testing on mwdebug2001 [05:30:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:30:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P46408 and previous config saved to /var/cache/conftool/dbconfig/20230412-053057-ladsgroup.json [05:34:29] (PS1) Marostegui: db1218: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/908114 (https://phabricator.wikimedia.org/T326669) [05:35:17] (CR) Marostegui: [C: +2] db1218: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/908114 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui) [05:37:32] (PS1) Marostegui: instances.yaml: Add db1218 to dbctl [puppet] - https://gerrit.wikimedia.org/r/908148 (https://phabricator.wikimedia.org/T326669) [05:38:57] (CR) Marostegui: [C: +2] instances.yaml: Add db1218 to dbctl [puppet] - https://gerrit.wikimedia.org/r/908148 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui) [05:40:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1218 to dbctl T326669', diff saved to
https://phabricator.wikimedia.org/P46409 and previous config saved to /var/cache/conftool/dbconfig/20230412-054024-marostegui.json [05:40:30] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [05:41:17] !log krinkle@deploy2002 Synchronized php-1.41.0-wmf.4/includes/libs/objectcache/: Ie3a2215d33: disable WANCache cool-off feature (duration: 06m 00s) [05:41:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 1%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46410 and previous config saved to /var/cache/conftool/dbconfig/20230412-054120-root.json [05:42:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 2%: Pooling', diff saved to https://phabricator.wikimedia.org/P46411 and previous config saved to /var/cache/conftool/dbconfig/20230412-054238-root.json [05:42:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110 to clone db1210 T326669', diff saved to https://phabricator.wikimedia.org/P46412 and previous config saved to /var/cache/conftool/dbconfig/20230412-054258-marostegui.json [05:46:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P46414 and previous config saved to /var/cache/conftool/dbconfig/20230412-054603-ladsgroup.json [05:56:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 2%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46415 and previous config saved to /var/cache/conftool/dbconfig/20230412-055624-root.json [05:56:30] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [05:57:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 3%: Pooling', diff saved to https://phabricator.wikimedia.org/P46416 and previous config saved to /var/cache/conftool/dbconfig/20230412-055743-root.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early)
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T0600) [06:01:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T333332)', diff saved to https://phabricator.wikimedia.org/P46417 and previous config saved to /var/cache/conftool/dbconfig/20230412-060109-ladsgroup.json [06:01:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1148.eqiad.wmnet with reason: Maintenance [06:01:15] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [06:01:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1148.eqiad.wmnet with reason: Maintenance [06:01:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T333332)', diff saved to https://phabricator.wikimedia.org/P46418 and previous config saved to /var/cache/conftool/dbconfig/20230412-060133-ladsgroup.json [06:01:54] (PS1) Marostegui: kormat/bashrc.wmf: Change alias location [puppet] - https://gerrit.wikimedia.org/r/908157 (https://phabricator.wikimedia.org/T334455) [06:02:41] (CR) Marostegui: [C: +2] kormat/bashrc.wmf: Change alias location [puppet] - https://gerrit.wikimedia.org/r/908157 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui) [06:02:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T333332)', diff saved to https://phabricator.wikimedia.org/P46419 and previous config saved to /var/cache/conftool/dbconfig/20230412-060241-ladsgroup.json [06:11:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 3%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46420 and previous config saved to /var/cache/conftool/dbconfig/20230412-061129-root.json [06:11:35] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [06:12:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 4%:
Pooling', diff saved to https://phabricator.wikimedia.org/P46421 and previous config saved to /var/cache/conftool/dbconfig/20230412-061248-root.json [06:14:13] (PS1) Marostegui: mariadb: Productionize db1210 [puppet] - https://gerrit.wikimedia.org/r/908158 (https://phabricator.wikimedia.org/T326669) [06:15:09] (PS13) Ilias Sarantopoulos: ml-services: FastAPI chart using sextant for ores-legacy service [deployment-charts] - https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) [06:16:50] (CR) Ilias Sarantopoulos: ml-services: FastAPI chart using sextant for ores-legacy service (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos) [06:17:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P46422 and previous config saved to /var/cache/conftool/dbconfig/20230412-061747-ladsgroup.json [06:21:31] (CR) Marostegui: [C: +2] mariadb: Productionize db1210 [puppet] - https://gerrit.wikimedia.org/r/908158 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui) [06:22:43] SRE, Deployments, Traffic-Icebox, Regression, and 2 others: [Regression] PHP files in /static (and /w/static) on text domains should not execute - https://phabricator.wikimedia.org/T106732 (Krinkle) Thanks @BCornwall. This is indeed resolved. The paths did change a bit so in this case the (expect...
[06:22:55] (CR) Krinkle: [C: +2] static.php: Restore short cache for temporary 'mismatch' response (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle) [06:23:11] SRE, Deployments, Traffic-Icebox, Regression, and 2 others: [Regression] PHP files in /static (and /w/static) on text domains should not execute - https://phabricator.wikimedia.org/T106732 (Krinkle) [06:23:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46423 and previous config saved to /var/cache/conftool/dbconfig/20230412-062353-root.json [06:26:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 4%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46424 and previous config saved to /var/cache/conftool/dbconfig/20230412-062634-root.json [06:26:39] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [06:27:12] (PS14) Ilias Sarantopoulos: ml-services: FastAPI chart using sextant for ores-legacy service [deployment-charts] - https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) [06:27:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 5%: Pooling', diff saved to https://phabricator.wikimedia.org/P46425 and previous config saved to /var/cache/conftool/dbconfig/20230412-062752-root.json [06:28:16] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 59 probes of 774 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:32:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121 to clone db1221 T326669', diff saved to https://phabricator.wikimedia.org/P46426 and previous config saved to
/var/cache/conftool/dbconfig/20230412-063224-marostegui.json [06:32:30] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [06:32:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P46427 and previous config saved to /var/cache/conftool/dbconfig/20230412-063253-ladsgroup.json [06:33:55] !log Stop mariadb on db1121 to clone db1221 this will generate lag on clouddb replicas for s4 T326669 [06:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:31] (PS1) Marostegui: db1221: Place it in s4 [puppet] - https://gerrit.wikimedia.org/r/908160 (https://phabricator.wikimedia.org/T326669) [06:35:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:36:13] (CR) Marostegui: [C: +2] db1221: Place it in s4 [puppet] - https://gerrit.wikimedia.org/r/908160 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui) [06:38:04] !log restart haproxy on cp2035 - T334448 [06:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:08] T334448: HAProxy 2.6.12 segfaults on cp2033 - https://phabricator.wikimedia.org/T334448 [06:38:42] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 7 probes of 774 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:38:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46429 and previous config saved to /var/cache/conftool/dbconfig/20230412-063858-root.json [06:41:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 5%: Pooling
db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46430 and previous config saved to /var/cache/conftool/dbconfig/20230412-064139-root.json [06:41:44] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [06:42:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 10%: Pooling', diff saved to https://phabricator.wikimedia.org/P46431 and previous config saved to /var/cache/conftool/dbconfig/20230412-064257-root.json [06:45:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:48:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T333332)', diff saved to https://phabricator.wikimedia.org/P46432 and previous config saved to /var/cache/conftool/dbconfig/20230412-064800-ladsgroup.json [06:48:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance [06:48:05] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [06:48:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance [06:48:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T333332)', diff saved to https://phabricator.wikimedia.org/P46433 and previous config saved to /var/cache/conftool/dbconfig/20230412-064823-ladsgroup.json [06:50:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T333332)', diff saved to https://phabricator.wikimedia.org/P46434 and previous config saved to /var/cache/conftool/dbconfig/20230412-065032-ladsgroup.json [06:51:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:54:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46435 and previous config saved to /var/cache/conftool/dbconfig/20230412-065402-root.json [06:54:29] (CR) Slyngshede: [C: +2] P:url_downloader send squid logs to Logstash [puppet] - https://gerrit.wikimedia.org/r/904783 (https://phabricator.wikimedia.org/T333676) (owner: Slyngshede) [06:54:46] (PS1) Ilias Sarantopoulos: ml-services: deployment of ores-legacy app in staging [deployment-charts] - https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) [06:56:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 10%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46436 and previous config saved to /var/cache/conftool/dbconfig/20230412-065644-root.json [06:56:49] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [06:58:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 25%: Pooling', diff saved to https://phabricator.wikimedia.org/P46437 and previous config saved to /var/cache/conftool/dbconfig/20230412-065802-root.json [06:59:15] (CR) CI reject: [V: -1] ml-services: deployment of ores-legacy app in staging [deployment-charts] - https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos) [07:00:05] Amir1, Urbanecm, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:23] o/ nothing to do it seems [07:00:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:05:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P46438 and previous config saved to /var/cache/conftool/dbconfig/20230412-070538-ladsgroup.json [07:06:32] (CR) Muehlenhoff: C:httpd move htcacheclean to httpd class (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede) [07:09:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46439 and previous config saved to /var/cache/conftool/dbconfig/20230412-070907-root.json [07:11:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 25%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46440 and previous config saved to /var/cache/conftool/dbconfig/20230412-071149-root.json [07:11:54] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [07:12:49] (CR) Jelto: [C: +1] "lgtm, lets test the new cookbook on the replicas 🎉" [cookbooks] - https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: EoghanGaffney) [07:13:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 50%: Pooling', diff saved to https://phabricator.wikimedia.org/P46441 and previous config saved to /var/cache/conftool/dbconfig/20230412-071307-root.json [07:16:34] !log Drop flaggerevs tables from ptwikisource T332594 [07:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:39] T332594: Drop FlaggedRevs tables in database for ptwikisource - https://phabricator.wikimedia.org/T332594 [07:19:52] SRE, Infrastructure-Foundations, Traffic, netops: Adjust
routing policy to increase SSH session speed from East Asia to toolforge - https://phabricator.wikimedia.org/T334530 (ayounsi) Thanks for the report. This is because we advertise our "customer" prefixes from all our POPs to improve the use... [07:20:36] (CR) Hashar: devtools: change gerrit hostname to use wmcloud, not wmflabs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888808 (https://phabricator.wikimedia.org/T329444) (owner: Dzahn) [07:20:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P46442 and previous config saved to /var/cache/conftool/dbconfig/20230412-072044-ladsgroup.json [07:20:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:21:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:21:16] !log installing xen security updates [07:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:31] (CR) Muehlenhoff: [C: +2] aqs: Remove use_nodejs10 [puppet] - https://gerrit.wikimedia.org/r/907718 (owner: Muehlenhoff) [07:24:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46443 and previous config saved to /var/cache/conftool/dbconfig/20230412-072412-root.json [07:24:15] (PS1) Marostegui: change_ptrp_tags_update_T334536.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) [07:25:50] (CR) Marostegui:
"This still awaits for clarification on why it is only needed on s1: https://phabricator.wikimedia.org/T334536#8774369" [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui) [07:26:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:26:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 50%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46444 and previous config saved to /var/cache/conftool/dbconfig/20230412-072654-root.json [07:26:59] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [07:27:09] (CR) Jelto: [C: +1] "lgtm. I agree stopping puppet and rolling this change out one by one makes sense so we don't wipe the ssh keys worst case."
[puppet] - https://gerrit.wikimedia.org/r/907878 (https://phabricator.wikimedia.org/T333840) (owner: EoghanGaffney) [07:28:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 75%: Pooling', diff saved to https://phabricator.wikimedia.org/P46445 and previous config saved to /var/cache/conftool/dbconfig/20230412-072812-root.json [07:28:19] (PS1) Muehlenhoff: kartotherian: Stop passing use_nodejs10 [puppet] - https://gerrit.wikimedia.org/r/908196 [07:30:31] !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [07:30:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:31:59] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/908196 (owner: Muehlenhoff) [07:32:26] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:33:49] SRE-swift-storage, Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (MatthewVernon) Thanks. Interestingly, `codfw` and `eqiad` have different creation dates and sizes: ` root@ms-fe2009:/home/mvernon# swift list -l wikipedia-en-local-pub...
[07:34:54] (PS1) Marostegui: instances.yaml: Remove db1107 from dbctl [puppet] - https://gerrit.wikimedia.org/r/908197 (https://phabricator.wikimedia.org/T334447) [07:35:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T333332)', diff saved to https://phabricator.wikimedia.org/P46446 and previous config saved to /var/cache/conftool/dbconfig/20230412-073550-ladsgroup.json [07:35:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [07:35:56] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [07:36:07] !log installing python-cryptography security updates [07:36:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [07:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1190.eqiad.wmnet with reason: Maintenance [07:36:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1190.eqiad.wmnet with reason: Maintenance [07:36:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T333332)', diff saved to https://phabricator.wikimedia.org/P46447 and previous config saved to /var/cache/conftool/dbconfig/20230412-073633-ladsgroup.json [07:38:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T333332)', diff saved to https://phabricator.wikimedia.org/P46448 and previous config saved to /var/cache/conftool/dbconfig/20230412-073841-ladsgroup.json [07:38:52] (CR) Marostegui: [C: +2] instances.yaml: Remove db1107 from dbctl [puppet] - https://gerrit.wikimedia.org/r/908197 (https://phabricator.wikimedia.org/T334447) (owner: Marostegui) [07:39:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110
(re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46449 and previous config saved to /var/cache/conftool/dbconfig/20230412-073917-root.json [07:39:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1107 from dbctl T334447', diff saved to https://phabricator.wikimedia.org/P46450 and previous config saved to /var/cache/conftool/dbconfig/20230412-073921-marostegui.json [07:39:25] T334447: decommission db1107.eqiad.wmnet - https://phabricator.wikimedia.org/T334447 [07:41:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 75%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46451 and previous config saved to /var/cache/conftool/dbconfig/20230412-074158-root.json [07:42:04] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [07:43:21] !log jelto@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [07:43:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 100%: Pooling', diff saved to https://phabricator.wikimedia.org/P46452 and previous config saved to /var/cache/conftool/dbconfig/20230412-074317-root.json [07:45:58] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [07:50:39] (PS16) Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 [07:51:03] (CR) CI reject: [V: -1] C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede) [07:53:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P46453 and previous config saved to /var/cache/conftool/dbconfig/20230412-075348-ladsgroup.json [07:54:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 100%: Repooling',
diff saved to https://phabricator.wikimedia.org/P46454 and previous config saved to /var/cache/conftool/dbconfig/20230412-075422-root.json [07:54:23] (PS17) Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 [07:55:59] (CR) Marostegui: "Amir, even if it can be run with replication directly, please take a look so I can merge and add this to the repo" [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui) [07:56:17] (CR) CI reject: [V: -1] C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede) [07:57:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 100%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46455 and previous config saved to /var/cache/conftool/dbconfig/20230412-075703-root.json [07:57:08] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [07:57:25] (CR) Ladsgroup: change_ptrp_tags_update_T334536.py: New schema change (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui) [07:57:50] (CR) Elukey: [C: +2] httpbb: remove tests from liftwing production [puppet] - https://gerrit.wikimedia.org/r/907809 (owner: Elukey) [07:57:54] (CR) Marostegui: change_ptrp_tags_update_T334536.py: New schema change (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui) [07:59:03] (CR) Ladsgroup: [C: +1] "small nitpick."
[software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui) [07:59:39] (PS2) Marostegui: change_ptrp_tags_update_T334536.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) [07:59:41] (CR) Ladsgroup: change_ptrp_tags_update_T334536.py: New schema change (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui) [07:59:43] (CR) Marostegui: change_ptrp_tags_update_T334536.py: New schema change (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui) [08:00:00] (CR) Ladsgroup: [C: +1] change_ptrp_tags_update_T334536.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui) [08:00:05] ^demon and hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T0800).
[08:00:12] (CR) Marostegui: [C: +2] change_ptrp_tags_update_T334536.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui) [08:00:38] (Merged) jenkins-bot: change_ptrp_tags_update_T334536.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui) [08:01:26] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye [08:01:48] SRE, Machine-Learning-Team, serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (elukey) [08:02:47] SRE, Machine-Learning-Team, serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (elukey) Rollout to ml-serve/aux/dse completed. To keep archives happy, I used ` istioctl-1.15.7 upgrade -f config.yaml` Last step: rollout to wikikube clusters [08:03:04] SRE, Machine-Learning-Team, serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (elukey) a:elukey→JMeybohm [08:03:31] !log dbmaint Deploy schema change on s3 codfw with replication enabled (only for testwiki and test2wiki) T334536 [08:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:36] T334536: Schema changes: Make ptrp_tags_updated NULLABLE - https://phabricator.wikimedia.org/T334536 [08:03:51] SRE-swift-storage, Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (jcrespo) eqiad backups: ` This is the list of 2 files found with the given criteria: 0) wiki | commonswiki title | The_Collected_Works...
[08:06:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:08:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P46456 and previous config saved to /var/cache/conftool/dbconfig/20230412-080854-ladsgroup.json [08:09:36] SRE-swift-storage, Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (jcrespo) These look to me as leftovers- check the paths of production_container + production_path, that is where mw thinks they should be (only). They must have failed... [08:10:05] (CR) Hashar: [C: +2] [Beta Cluster] Replicate WebResponseSetCookie wgHooks migration here too [mediawiki-config] - https://gerrit.wikimedia.org/r/907933 (https://phabricator.wikimedia.org/T333926) (owner: Jforrester) [08:10:57] (Merged) jenkins-bot: [Beta Cluster] Replicate WebResponseSetCookie wgHooks migration here too [mediawiki-config] - https://gerrit.wikimedia.org/r/907933 (https://phabricator.wikimedia.org/T333926) (owner: Jforrester) [08:14:12] SRE-swift-storage, Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (MatthewVernon) Yes, that fits with the "this file has been deleted" page, so I think that object is good to clear up in both clusters. Thank you! I'll be interested t... [08:14:58] (CR) Hashar: [C: +2] "Thanks! I have synced it in production."
[mediawiki-config] - https://gerrit.wikimedia.org/r/907933 (https://phabricator.wikimedia.org/T333926) (owner: Jforrester) [08:15:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:35] SRE-swift-storage, Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (jcrespo) >>! In T327253#8774731, @MatthewVernon wrote: > I'll be interested to hear about the other objects when you've some time :) The 22 at wikipedia-ja-local-publ... [08:17:37] !log hashar@deploy2002 Synchronized wmf-config/CommonSettings-labs.php: [Beta Cluster] Replicate WebResponseSetCookie wgHooks migration here too - T333926 (duration: 05m 51s) [08:17:40] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:17:41] SRE, ops-eqiad, Data-Engineering, Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (elukey) @Jclark-ctr progress! I was able to reimage, but the two disks in the flex bay seem in `Firmware state: Unconfigured(good), Spun Up`, so the OS got installed on... [08:17:41] T333926: PHP Deprecated: Accessing $wgHooks directly is deprecated, use HookContainer::getHandlers() or HookContainer::register() instead.
[Called from {closure}] - https://phabricator.wikimedia.org/T333926 [08:17:50] (CR) Filippo Giunchedi: [C: +1] prometheus: Apply prometheus::pop role to prometheus3002 [puppet] - https://gerrit.wikimedia.org/r/905705 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [08:17:54] (CR) Filippo Giunchedi: [C: +1] prometheus: Apply prometheus::pop role to prometheus4002 [puppet] - https://gerrit.wikimedia.org/r/907984 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [08:17:58] (CR) Filippo Giunchedi: [C: +1] prometheus: Apply prometheus::pop role to prometheus6002 [puppet] - https://gerrit.wikimedia.org/r/907987 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [08:18:40] (CR) Clément Goubert: [C: +2] P:httpbb: Add monitoring for kubernetes services [puppet] - https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) (owner: Clément Goubert) [08:20:06] SRE-swift-storage, Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (MatthewVernon) >>! In T327253#8774732, @jcrespo wrote: >>>! In T327253#8774731, @MatthewVernon wrote: >> I'll be interested to hear about the other objects when you've... [08:20:55] (CR) Elukey: [C: +2] "Great work Ilias!"
[deployment-charts] - https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos) [08:22:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:22:30] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:24:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T333332)', diff saved to https://phabricator.wikimedia.org/P46457 and previous config saved to /var/cache/conftool/dbconfig/20230412-082400-ladsgroup.json [08:24:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1199.eqiad.wmnet with reason: Maintenance [08:24:06] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [08:24:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1199.eqiad.wmnet with reason: Maintenance [08:24:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1199 (T333332)', diff saved to https://phabricator.wikimedia.org/P46458 and previous config saved to /var/cache/conftool/dbconfig/20230412-082424-ladsgroup.json [08:24:44] SRE, SRE-Access-Requests, Patch-For-Review, User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (ItamarWMDE) Hello @MoritzMuehlenhoff and @BCornwall, apologies for the delay in the response. I am just back from holidays. I am not so well ve...
[08:25:22] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:18] (CR) Clément Goubert: [C: +2] "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/908201 (owner: Clément Goubert) [08:26:26] (Merged) jenkins-bot: ml-services: FastAPI chart using sextant for ores-legacy service [deployment-charts] - https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos) [08:26:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T333332)', diff saved to https://phabricator.wikimedia.org/P46459 and previous config saved to /var/cache/conftool/dbconfig/20230412-082632-ladsgroup.json [08:26:48] The httpbb CRITs are my bad, greedy replace messed up the path, pushing a fix [08:27:23] (CR) Clément Goubert: [V: +2 C: +2] P:httpbb: Fix wrong test directory [puppet] - https://gerrit.wikimedia.org/r/908201 (owner: Clément Goubert) [08:29:07] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:30:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:30:43] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:32:02] (PS5) Clément Goubert: P:httpbb: Remove absented httpbb_kubernetes_hourly [puppet] -
https://gerrit.wikimedia.org/r/907848 (https://phabricator.wikimedia.org/T334456) [08:32:03] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:32:11] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:33:09] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:33:20] !log About to deploy analytics/refinery in production [08:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:15] !log aqu@deploy2002 Started deploy [analytics/refinery@f3389dc]: Deploy analytics_refinery in production [analytics/refinery@f3389dc] [08:34:56] !log aqu@deploy2002 Finished deploy [analytics/refinery@f3389dc]: Deploy analytics_refinery in production [analytics/refinery@f3389dc] (duration: 00m 41s) [08:35:34] !log imported puppet 5.5.22-2+deb12u1 for bookworm-wikimedia component/puppet5 T330495 [08:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:38] T330495: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 [08:35:38] !log aqu@deploy2002 Started deploy [analytics/refinery@f3389dc] (thin): Deploy analytics_refinery in production thin [analytics/refinery@f3389dc] [08:35:46] !log aqu@deploy2002 Finished deploy [analytics/refinery@f3389dc] (thin): Deploy analytics_refinery in production thin [analytics/refinery@f3389dc] (duration: 00m 07s) [08:37:26] !log dbmaint Deploy schema change on s1 codfw with replication T334536 [08:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:30] T334536: Schema changes: Make ptrp_tags_updated NULLABLE - https://phabricator.wikimedia.org/T334536 [08:38:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw
- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:40:48] (CR) Clément Goubert: [C: +2] P:httpbb: Remove absented httpbb_kubernetes_hourly [puppet] - https://gerrit.wikimedia.org/r/907848 (https://phabricator.wikimedia.org/T334456) (owner: Clément Goubert) [08:41:36] (CR) Hashar: [C: -1] "I have proposed a series of change to rely on a PuppetDB query instead of a manually maintained list starting at https://gerrit.wikimedia." [puppet] - https://gerrit.wikimedia.org/r/867670 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn) [08:41:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P46460 and previous config saved to /var/cache/conftool/dbconfig/20230412-084138-ladsgroup.json [08:42:49] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:42:51] SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert) [08:43:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:44:07] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:46:23] (PS1) Muehlenhoff: Install Puppet 5.5 on Bookworm [puppet] - https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495) [08:46:34] (PS2) Muehlenhoff: Install Puppet 5.5 on Bookworm [puppet] - https://gerrit.wikimedia.org/r/908202
(https://phabricator.wikimedia.org/T330495) [08:48:10] (PS4) Hashar: doc: upgrade php from 7.3 to 7.4 [puppet] - https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) [08:48:22] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff) [08:48:29] (CR) Hashar: "Rebased to clear a conflict with Id989c18b783d1bd58e3935a3d6418fa02b4f5652" [puppet] - https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: Hashar) [08:49:43] (PS1) Marostegui: mariadb: Productionize db1221 [puppet] - https://gerrit.wikimedia.org/r/908204 (https://phabricator.wikimedia.org/T326669) [08:50:08] (PS8) EoghanGaffney: Add keys for sshd-gitlab from the secrets repo [puppet] - https://gerrit.wikimedia.org/r/907878 [08:50:54] (CR) Marostegui: [C: +2] mariadb: Productionize db1221 [puppet] - https://gerrit.wikimedia.org/r/908204 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui) [08:50:55] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@18ae3be]: Deploy airflow-dags including webrequest load job - Analytics [airflow-dags@18ae3be] [08:51:07] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@18ae3be]: Deploy airflow-dags including webrequest load job - Analytics [airflow-dags@18ae3be] (duration: 00m 12s) [08:51:29] !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [08:54:15] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:54:26] (PS1) Vgutierrez: hiera: Use a single socket for haproxy/varnish on drmrs [puppet] - https://gerrit.wikimedia.org/r/908205 (https://phabricator.wikimedia.org/T333965) [08:56:18] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40609/console"
[puppet] - 10https://gerrit.wikimedia.org/r/908205 (https://phabricator.wikimedia.org/T333965) (owner: 10Vgutierrez) [08:56:29] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:56:40] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40610/console" [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [08:56:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P46462 and previous config saved to /var/cache/conftool/dbconfig/20230412-085644-ladsgroup.json [08:56:48] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Use a single socket for haproxy/varnish on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/908205 (https://phabricator.wikimedia.org/T333965) (owner: 10Vgutierrez) [08:57:15] (03PS1) 10Filippo Giunchedi: sre: report alert lint problems [alerts] - 10https://gerrit.wikimedia.org/r/908206 (https://phabricator.wikimedia.org/T309182) [08:57:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance [08:58:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance [08:58:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T333332)', diff saved to https://phabricator.wikimedia.org/P46463 and previous config saved to /var/cache/conftool/dbconfig/20230412-085816-ladsgroup.json [08:58:21] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [08:59:26] 10SRE, 10Machine-Learning-Team, 10serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (10JMeybohm) 05Open→03Resolved Thanks! 
Wikikube is done as well [08:59:36] 10SRE, 10Machine-Learning-Team, 10serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (10JMeybohm) [08:59:37] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:59:46] (03PS4) 10Arturo Borrero Gonzalez: cloud_private_subnet: add route to public IPv4 range [puppet] - 10https://gerrit.wikimedia.org/r/903622 (https://phabricator.wikimedia.org/T324992) [09:00:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T333332)', diff saved to https://phabricator.wikimedia.org/P46464 and previous config saved to /var/cache/conftool/dbconfig/20230412-090032-ladsgroup.json [09:04:16] !log jelto@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [09:05:03] (03CR) 10JMeybohm: [C: 03+1] linkrecommendation: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905941 (https://phabricator.wikimedia.org/T334060) (owner: 10Clément Goubert) [09:05:17] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:05:20] 10SRE-swift-storage, 10Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10jcrespo) >>! In T327253#8774738, @MatthewVernon wrote: > Yes, and if you have any further thoughts on 8/80/Anotheryear.jpg So backups records are not (and do not inte... 
[09:05:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:06:14] !log Migrating cxserver to mw-api-int on kubernetes - T334204 [09:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:19] (03CR) 10Clément Goubert: [C: 03+2] cxserver: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903646 (https://phabricator.wikimedia.org/T334204) (owner: 10Clément Goubert) [09:06:20] T334204: Migrate cxserver to mw-api-int - https://phabricator.wikimedia.org/T334204 [09:06:44] (03PS2) 10Ilias Sarantopoulos: ml-services: deployment of ores-legacy app in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) [09:07:39] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [09:08:44] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:09:52] (03CR) 10CI reject: [V: 04-1] ml-services: deployment of ores-legacy app in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [09:11:08] (03Merged) 10jenkins-bot: cxserver: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903646 (https://phabricator.wikimedia.org/T334204) (owner: 10Clément Goubert) [09:11:31] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [09:11:46] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [09:11:50] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:11:51] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1199 (T333332)', diff saved to https://phabricator.wikimedia.org/P46466 and previous config saved to /var/cache/conftool/dbconfig/20230412-091151-ladsgroup.json [09:11:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [09:11:55] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [09:12:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [09:12:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2099.codfw.wmnet with reason: Maintenance [09:12:26] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [09:12:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2099.codfw.wmnet with reason: Maintenance [09:12:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2106.codfw.wmnet with reason: Maintenance [09:12:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2106.codfw.wmnet with reason: Maintenance [09:12:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T333332)', diff saved to https://phabricator.wikimedia.org/P46467 and previous config saved to /var/cache/conftool/dbconfig/20230412-091255-ladsgroup.json [09:13:53] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [09:15:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T333332)', diff saved to https://phabricator.wikimedia.org/P46468 and previous config saved to /var/cache/conftool/dbconfig/20230412-091507-ladsgroup.json [09:15:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to 
https://phabricator.wikimedia.org/P46469 and previous config saved to /var/cache/conftool/dbconfig/20230412-091539-ladsgroup.json [09:15:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:19:28] (03CR) 10Hashar: [C: 04-1] ci: split contint hosts to different roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [09:19:30] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40611/console" [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [09:20:58] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [09:21:14] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [09:21:18] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye [09:21:56] (03PS3) 10Ilias Sarantopoulos: ml-services: deployment of ores-legacy app in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) [09:22:28] (03PS1) 10Clément Goubert: Revert "cxserver: Switch to mw-api-int-async on k8s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908032 [09:23:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46470 and previous config saved to /var/cache/conftool/dbconfig/20230412-092327-root.json [09:27:11] (03PS1) 10Elukey: aptrepo: import AMD ROCm 5.4 to bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/908208 (https://phabricator.wikimedia.org/T295661) [09:27:48] (03PS2) 10Sergio Gimeno: GrowthExperiments: enable add link frontend in 7,8th round wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/907899 (https://phabricator.wikimedia.org/T304551) [09:28:24] (03CR) 10Clément Goubert: [C: 03+2] Revert "cxserver: Switch to mw-api-int-async on k8s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908032 (owner: 10Clément Goubert) [09:28:56] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:30:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P46472 and previous config saved to /var/cache/conftool/dbconfig/20230412-093013-ladsgroup.json [09:30:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P46473 and previous config saved to /var/cache/conftool/dbconfig/20230412-093045-ladsgroup.json [09:31:50] (03CR) 10Jbond: [C: 04-1] "see comments inline i tl;dr i think we should switch back to the previous implementation. Any systems that need cache_disk should declare" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [09:32:02] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:33:37] (03Merged) 10jenkins-bot: Revert "cxserver: Switch to mw-api-int-async on k8s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908032 (owner: 10Clément Goubert) [09:34:12] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/903622 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [09:34:29] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [09:34:39] (03PS1) 10Elukey: role::dse_k8s::worker: add AMD GPU support [puppet] - 10https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009) [09:34:50] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [09:34:57] !log Reverted migrating cxserver to mw-api-int on kubernetes - T334204 [09:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:01] T334204: Migrate cxserver to mw-api-int - https://phabricator.wikimedia.org/T334204 [09:35:45] (03PS2) 10Elukey: role::dse_k8s::worker: add AMD GPU support [puppet] - 10https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009) [09:37:42] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40612/console" [puppet] - 10https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [09:38:22] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:38:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46474 and previous config saved to /var/cache/conftool/dbconfig/20230412-093831-root.json [09:39:24] (03CR) 10Hnowlan: [C: 03+1] kartotherian: Stop passing use_nodejs10 [puppet] - 10https://gerrit.wikimedia.org/r/908196 (owner: 10Muehlenhoff) [09:41:23] (03PS1) 10Filippo Giunchedi: alerting-host: toggle auto-restart for ircecho/icinga-am on failover [puppet] - 10https://gerrit.wikimedia.org/r/908211 [09:42:19] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: report alert lint problems [alerts] - 10https://gerrit.wikimedia.org/r/908206 
(https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:42:54] (03PS18) 10Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102 [09:43:12] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:45:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P46475 and previous config saved to /var/cache/conftool/dbconfig/20230412-094520-ladsgroup.json [09:45:22] (03PS1) 10Jaime Nuche: scap: block Scap deployments on inactive deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/908212 (https://phabricator.wikimedia.org/T330756) [09:45:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T333332)', diff saved to https://phabricator.wikimedia.org/P46476 and previous config saved to /var/cache/conftool/dbconfig/20230412-094551-ladsgroup.json [09:45:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1136.eqiad.wmnet with reason: Maintenance [09:45:56] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [09:46:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1136.eqiad.wmnet with reason: Maintenance [09:46:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T333332)', diff saved to https://phabricator.wikimedia.org/P46477 and previous config saved to /var/cache/conftool/dbconfig/20230412-094615-ladsgroup.json [09:46:54] (03PS2) 10Jaime Nuche: scap: block Scap deployments on inactive deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/908212 (https://phabricator.wikimedia.org/T330756) [09:47:40] (03CR) 10Jbond: "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) 
[09:48:02] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:48:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T333332)', diff saved to https://phabricator.wikimedia.org/P46478 and previous config saved to /var/cache/conftool/dbconfig/20230412-094829-ladsgroup.json [09:48:36] (03CR) 10Slyngshede: C:httpd move htcacheclean to httpd class (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [09:50:25] (03CR) 10David Caro: maintain-dbusers: ensure get_global_wiki_user is only called when needed (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [09:50:27] (03CR) 10Jbond: [C: 03+1] aptrepo: import AMD ROCm 5.4 to bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/908208 (https://phabricator.wikimedia.org/T295661) (owner: 10Elukey) [09:51:14] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:52:39] (03PS1) 10Hashar: utils: rm hiera_lookup (replaced by puppet lookup) [puppet] - 10https://gerrit.wikimedia.org/r/908214 [09:53:36] (03CR) 10Hashar: "There is no other reference to `hiera_lookup` in the Puppet repo :]" [puppet] - 10https://gerrit.wikimedia.org/r/908214 (owner: 10Hashar) [09:53:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46479 and previous config saved to /var/cache/conftool/dbconfig/20230412-095336-root.json [09:54:15] (03PS1) 10Filippo Giunchedi: Rename cadvisor_exporter to cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/908215 (https://phabricator.wikimedia.org/T108027) [09:56:04] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:56:11] (03CR) 10Jaime Nuche: "Once this patch has been merged, we should apply it to both 
deployments servers and run some simply scap sync command from the active one " [puppet] - 10https://gerrit.wikimedia.org/r/908212 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche) [09:57:40] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:58:03] 10SRE-swift-storage, 10Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10jcrespo) Regarding jawiki, there is no latest or or archived (public) files with those names: ` root@db1140:~$ cat images.txt | while read image; do echo "SELECT * FR... [09:58:04] (03CR) 10Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/908212/40613/" [puppet] - 10https://gerrit.wikimedia.org/r/908212 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche) [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1000) [10:00:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T333332)', diff saved to https://phabricator.wikimedia.org/P46480 and previous config saved to /var/cache/conftool/dbconfig/20230412-100026-ladsgroup.json [10:00:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2110.codfw.wmnet with reason: Maintenance [10:00:31] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [10:00:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2110.codfw.wmnet with reason: Maintenance [10:00:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T333332)', diff saved to https://phabricator.wikimedia.org/P46481 and previous config saved to /var/cache/conftool/dbconfig/20230412-100049-ladsgroup.json [10:01:00] (03CR) 10Jbond: C:httpd move htcacheclean to httpd class (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [10:01:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1123 to clone db1223 T326669', diff saved to https://phabricator.wikimedia.org/P46482 and previous config saved to /var/cache/conftool/dbconfig/20230412-100111-marostegui.json [10:01:16] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [10:02:19] (03CR) 10Clément Goubert: [C: 03+2] scap: block Scap deployments on inactive deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/908212 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche) [10:02:30] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:03:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T333332)', diff saved to https://phabricator.wikimedia.org/P46484 and previous config saved to /var/cache/conftool/dbconfig/20230412-100301-ladsgroup.json [10:03:09] (03CR) 10Jbond: [C: 03+1] "lgtm, would have been nice to update this but i wasn't able. 
adding a few other puppet people in case they see an easy win or object" [puppet] - 10https://gerrit.wikimedia.org/r/908214 (owner: 10Hashar) [10:03:17] (03PS1) 10Marostegui: db1223: Place it in s3 [puppet] - 10https://gerrit.wikimedia.org/r/908216 (https://phabricator.wikimedia.org/T326669) [10:03:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P46485 and previous config saved to /var/cache/conftool/dbconfig/20230412-100335-ladsgroup.json [10:04:18] (03CR) 10Marostegui: [C: 03+2] db1223: Place it in s3 [puppet] - 10https://gerrit.wikimedia.org/r/908216 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [10:06:22] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [10:06:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46486 and previous config saved to /var/cache/conftool/dbconfig/20230412-100841-root.json [10:08:56] (03CR) 10David Caro: maintain-dbusers: ensure get_global_wiki_user is only called when needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [10:08:58] (03CR) 10Muehlenhoff: Install Puppet 5.5 on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [10:10:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/908211 (owner: 10Filippo Giunchedi) [10:10:47] !log cgoubert@deploy2002 Synchronized README: (no justification provided) (duration: 05m 44s) [10:11:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks 
good" [puppet] - 10https://gerrit.wikimedia.org/r/908208 (https://phabricator.wikimedia.org/T295661) (owner: 10Elukey) [10:12:10] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:12:48] (03CR) 10Filippo Giunchedi: [C: 03+2] alerting-host: toggle auto-restart for ircecho/icinga-am on failover [puppet] - 10https://gerrit.wikimedia.org/r/908211 (owner: 10Filippo Giunchedi) [10:13:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline." [puppet] - 10https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [10:15:38] 10SRE-swift-storage, 10Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) Thanks, that is super helpful! I agree that some tooling that's able to look up objects in backups //and// production would be really useful (if nothin... [10:15:49] (03CR) 10Filippo Giunchedi: "Note that due to the change in exported resources name, the Prometheus configuration will converge after puppet has ran on all affected ho" [puppet] - 10https://gerrit.wikimedia.org/r/908215 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [10:16:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:58] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:18:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P46487 and previous config saved to /var/cache/conftool/dbconfig/20230412-101808-ladsgroup.json [10:18:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P46488 and previous config saved 
to /var/cache/conftool/dbconfig/20230412-101841-ladsgroup.json [10:18:46] !log clearing out 24 ghost objects from Swift T327253 [10:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:50] T327253: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 [10:21:14] 10SRE, 10LDAP-Access-Requests, 10User-MarcoAurelio: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (10MarcoAurelio) [10:22:00] (03PS2) 10Clément Goubert: contint: manage dsh target from Puppet DB [puppet] - 10https://gerrit.wikimedia.org/r/893483 (owner: 10Hashar) [10:23:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud_private_subnet: add route to public IPv4 range [puppet] - 10https://gerrit.wikimedia.org/r/903622 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [10:23:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46489 and previous config saved to /var/cache/conftool/dbconfig/20230412-102346-root.json [10:24:04] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40616/console" [puppet] - 10https://gerrit.wikimedia.org/r/893483 (owner: 10Hashar) [10:26:05] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] contint: manage dsh target from Puppet DB [puppet] - 10https://gerrit.wikimedia.org/r/893483 (owner: 10Hashar) [10:26:40] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:27:01] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1220 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/907833 (https://phabricator.wikimedia.org/T334564) [10:27:55] (03Abandoned) 10Marostegui: mariadb: Promote db1220 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/907833 
(https://phabricator.wikimedia.org/T334564) (owner: 10Gerrit maintenance bot) [10:28:41] !log hashar@deploy2002 Started deploy [zuul/deploy@4c6859c]: Dummy deploy with dsh file managed by Puppet [10:28:44] !log hashar@deploy2002 Finished deploy [zuul/deploy@4c6859c]: Dummy deploy with dsh file managed by Puppet (duration: 00m 02s) [10:29:02] (03PS3) 10Muehlenhoff: Install Puppet 5.5 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495) [10:29:12] (03PS4) 10Muehlenhoff: Install Puppet 5.5 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495) [10:29:13] !log hashar@deploy2002 Started deploy [integration/docroot@ab848e3]: Dummy deploy with dsh file managed by Puppet [10:29:16] !log hashar@deploy2002 Finished deploy [integration/docroot@ab848e3]: Dummy deploy with dsh file managed by Puppet (duration: 00m 02s) [10:29:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [10:29:29] !log hashar@deploy2002 Started deploy [integration/docroot@ab848e3]: Dummy deploy with dsh file managed by Puppet [10:29:35] !log hashar@deploy2002 Finished deploy [integration/docroot@ab848e3]: Dummy deploy with dsh file managed by Puppet (duration: 00m 06s) [10:29:46] !log hashar@deploy2002 Started deploy [integration/docroot@ab848e3]: Dummy deploy with dsh file managed by Puppet [10:29:51] !log hashar@deploy2002 Finished deploy [integration/docroot@ab848e3]: Dummy deploy with dsh file managed by Puppet (duration: 00m 04s) [10:31:28] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:32:00] (03PS2) 10Hashar: contint: manage jenkins-ci dsh group from Puppet DB [puppet] - 10https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920) [10:32:13] (03CR) 10Hashar: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920) (owner: 10Hashar) [10:32:37] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10Clement_Goubert) Starting 10 min cpu stress test: ` cgoubert@mw2448:~$ stress -c 48 --timeout 600s... [10:33:06] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:33:14] (03PS2) 10Hashar: releases: manage jenkins-rel dsh group from Puppet DB [puppet] - 10https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909) [10:33:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P46490 and previous config saved to /var/cache/conftool/dbconfig/20230412-103314-ladsgroup.json [10:33:30] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909) (owner: 10Hashar) [10:33:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T333332)', diff saved to https://phabricator.wikimedia.org/P46491 and previous config saved to /var/cache/conftool/dbconfig/20230412-103348-ladsgroup.json [10:33:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [10:33:53] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [10:34:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [10:34:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:34:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:34:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46492 and previous config saved to /var/cache/conftool/dbconfig/20230412-103421-ladsgroup.json [10:35:02] (03CR) 10Hashar: [C: 04-1] "https://puppet-compiler.wmflabs.org/output/893484/1724/deploy1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920) (owner: 10Hashar) [10:35:55] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui) @ayounsi we are placing new DB hosts in production, can you run the same query you ran to gather the affected DBs just in case we have new on... [10:36:11] (03PS5) 10Arturo Borrero Gonzalez: cloud_private_subnet: codfw: relocate some hiera [puppet] - 10https://gerrit.wikimedia.org/r/903623 (https://phabricator.wikimedia.org/T324992) [10:36:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46493 and previous config saved to /var/cache/conftool/dbconfig/20230412-103635-ladsgroup.json [10:38:12] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/908215 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [10:38:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46494 and previous config saved to /var/cache/conftool/dbconfig/20230412-103851-root.json [10:41:03] (03PS3) 10Hashar: releases: manage jenkins-rel dsh group from Puppet DB [puppet] - 10https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909) [10:41:16] (03CR) 10Hashar: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909) (owner: 10Hashar) [10:41:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud_private_subnet: codfw: relocate some hiera [puppet] - 10https://gerrit.wikimedia.org/r/903623 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [10:42:34] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:42:34] (03PS3) 10Elukey: role::dse_k8s::worker: add AMD GPU support [puppet] - 10https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009) [10:42:45] (03CR) 10Elukey: role::dse_k8s::worker: add AMD GPU support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [10:42:48] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/907834 (https://phabricator.wikimedia.org/T334567) [10:43:34] (03Abandoned) 10Ladsgroup: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/907834 (https://phabricator.wikimedia.org/T334567) (owner: 10Gerrit maintenance bot) [10:43:54] (03PS4) 10Hashar: releases: manage jenkins-rel dsh group from Puppet DB [puppet] - 10https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909) [10:44:00] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/907835 (https://phabricator.wikimedia.org/T334568) [10:44:23] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909) (owner: 10Hashar) [10:47:41] (03PS3) 10David Caro: maintain_dbusers: move all the files under service [puppet] - 10https://gerrit.wikimedia.org/r/906637 [10:47:50] (03CR) 10David Caro: maintain_dbusers: move all the files under service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906637 
(owner: 10David Caro) [10:48:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T333332)', diff saved to https://phabricator.wikimedia.org/P46495 and previous config saved to /var/cache/conftool/dbconfig/20230412-104820-ladsgroup.json [10:48:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2119.codfw.wmnet with reason: Maintenance [10:48:26] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [10:48:26] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/907836 (https://phabricator.wikimedia.org/T334569) [10:48:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2119.codfw.wmnet with reason: Maintenance [10:48:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T333332)', diff saved to https://phabricator.wikimedia.org/P46496 and previous config saved to /var/cache/conftool/dbconfig/20230412-104843-ladsgroup.json [10:49:41] (03Abandoned) 10Ladsgroup: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/907836 (https://phabricator.wikimedia.org/T334569) (owner: 10Gerrit maintenance bot) [10:49:43] (03CR) 10Hashar: [C: 03+1] "PCC https://puppet-compiler.wmflabs.org/output/893485/1727/deploy1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909) (owner: 10Hashar) [10:49:51] (03Abandoned) 10Ladsgroup: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/907835 (https://phabricator.wikimedia.org/T334568) (owner: 10Gerrit maintenance bot) [10:50:27] (03CR) 10Clément Goubert: [C: 03+2] releases: manage jenkins-rel dsh group from Puppet DB [puppet] - 10https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909) (owner: 10Hashar) [10:50:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db2119 (T333332)', diff saved to https://phabricator.wikimedia.org/P46497 and previous config saved to /var/cache/conftool/dbconfig/20230412-105056-ladsgroup.json [10:51:01] (03CR) 10Hashar: "The CI Jenkins do not have the scap::target['releng/jenkins-deploy'] yet for some reason. I have to investigate a bit more with Jaime." [puppet] - 10https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920) (owner: 10Hashar) [10:51:09] (03CR) 10Hashar: [C: 04-1] contint: manage jenkins-ci dsh group from Puppet DB [puppet] - 10https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920) (owner: 10Hashar) [10:51:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P46498 and previous config saved to /var/cache/conftool/dbconfig/20230412-105141-ladsgroup.json [10:53:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST apiservices) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:53:49] (03CR) 10Muehlenhoff: C:httpd move htcacheclean to httpd class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [10:53:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46499 and previous config saved to /var/cache/conftool/dbconfig/20230412-105356-root.json [10:55:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [10:55:28] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:55:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [10:55:55] (03Abandoned) 10Hashar: scap: add contint2002 to ci-docroot, jenkins, zuul deploy [puppet] - 10https://gerrit.wikimedia.org/r/867670 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [10:56:16] !log installing apache2 security updates on Bullseye [10:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:44] !log installing apache2 security updates on Buster [10:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:10] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10Clement_Goubert) Stress test went without issue, removing downtime and repooling host. 
[10:58:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST apiservices) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:59:19] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw2448.codfw.wmnet [10:59:20] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2448.codfw.wmnet [10:59:45] !log repooling mw2448.codfw.wmnet - T334429 [10:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:49] T334429: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 [11:00:07] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw2448.*.codfw.wmnet [11:00:10] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale-full only: 1 (install1004), Fresh: 123 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:00:18] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:02:18] (03CR) 10Hashar: "So the "issue" is that the CI Jenkins are not yet using scap for deployment of the Jenkins configuration. 
It is an opt-in via:" [puppet] - 10https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920) (owner: 10Hashar) [11:06:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P46500 and previous config saved to /var/cache/conftool/dbconfig/20230412-110602-ladsgroup.json [11:06:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P46501 and previous config saved to /var/cache/conftool/dbconfig/20230412-110647-ladsgroup.json [11:08:17] 10SRE, 10Infrastructure-Foundations: IDM milestone 3 "Build-out for self service" - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF) [11:08:23] 10SRE, 10Infrastructure-Foundations: Figure out a captcha option for IDM - https://phabricator.wikimedia.org/T320809 (10SLyngshede-WMF) 05In progress→03Resolved a:03SLyngshede-WMF [11:08:24] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:10:32] RECOVERY - mediawiki-installation DSH group on mw2448 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:12:34] !log installing gnutls28 security updates on buster [11:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:14] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:16:29] (03PS1) 10Marostegui: core_test.pp: Add mariadb 11.1 package [puppet] - 10https://gerrit.wikimedia.org/r/908221 (https://phabricator.wikimedia.org/T333289) [11:18:33] (03CR) 10Marostegui: [C: 03+2] core_test.pp: Add mariadb 11.1 package [puppet] - 10https://gerrit.wikimedia.org/r/908221 (https://phabricator.wikimedia.org/T333289) (owner: 10Marostegui) [11:20:28] (03PS1) 10Marostegui: db1106: Migrate to MariaDB 11.1 [puppet] - 
10https://gerrit.wikimedia.org/r/908222 (https://phabricator.wikimedia.org/T333289) [11:20:57] (03CR) 10Marostegui: [C: 03+2] db1106: Migrate to MariaDB 11.1 [puppet] - 10https://gerrit.wikimedia.org/r/908222 (https://phabricator.wikimedia.org/T333289) (owner: 10Marostegui) [11:21:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P46502 and previous config saved to /var/cache/conftool/dbconfig/20230412-112108-ladsgroup.json [11:21:42] RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2023-04-11 00:00:06 (4300 GiB, +1.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [11:21:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46503 and previous config saved to /var/cache/conftool/dbconfig/20230412-112154-ladsgroup.json [11:21:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:21:59] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [11:22:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:22:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46504 and previous config saved to /var/cache/conftool/dbconfig/20230412-112217-ladsgroup.json [11:23:31] !log dbmaint Upgrade db1106 to mariadb 11.1 (eqiad) T333289 [11:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:35] T333289: Compile and package MariaDB 11.1.0 - https://phabricator.wikimedia.org/T333289 [11:23:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T333332)', diff saved 
to https://phabricator.wikimedia.org/P46505 and previous config saved to /var/cache/conftool/dbconfig/20230412-112334-ladsgroup.json [11:36:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T333332)', diff saved to https://phabricator.wikimedia.org/P46506 and previous config saved to /var/cache/conftool/dbconfig/20230412-113615-ladsgroup.json [11:36:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2136.codfw.wmnet with reason: Maintenance [11:36:20] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [11:36:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2136.codfw.wmnet with reason: Maintenance [11:36:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T333332)', diff saved to https://phabricator.wikimedia.org/P46507 and previous config saved to /var/cache/conftool/dbconfig/20230412-113638-ladsgroup.json [11:38:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P46508 and previous config saved to /var/cache/conftool/dbconfig/20230412-113840-ladsgroup.json [11:38:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T333332)', diff saved to https://phabricator.wikimedia.org/P46509 and previous config saved to /var/cache/conftool/dbconfig/20230412-113850-ladsgroup.json [11:45:12] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:50:02] PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:51:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:47] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P46512 and previous config saved to /var/cache/conftool/dbconfig/20230412-115347-ladsgroup.json [11:53:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P46513 and previous config saved to /var/cache/conftool/dbconfig/20230412-115357-ladsgroup.json [11:59:14] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:59:20] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:59:49] Puppet, Infrastructure-Foundations: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (jbond) p:Triage→Medium [12:00:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:12] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.252 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:01:18] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.379 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:01:43] (PS19) Jbond: C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 (https://phabricator.wikimedia.org/T334577) (owner: Slyngshede) [12:01:56] (CR) Jbond: [C: +1] C:httpd move htcacheclean to httpd class (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904102 (https://phabricator.wikimedia.org/T334577) (owner: Slyngshede) [12:02:56] jouncebot: now [12:02:56] No deployments scheduled for the next 0 
hour(s) and 57 minute(s) [12:03:02] jouncebot: next [12:03:02] In 0 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1300) [12:06:59] (PS1) DCausse: rdf-streaming-updater: increase cpu limits [deployment-charts] - https://gerrit.wikimedia.org/r/908225 [12:07:19] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (Marostegui) Is this task still needed? [12:08:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46514 and previous config saved to /var/cache/conftool/dbconfig/20230412-120853-ladsgroup.json [12:08:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:08:58] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [12:09:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P46515 and previous config saved to /var/cache/conftool/dbconfig/20230412-120903-ladsgroup.json [12:09:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:09:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [12:09:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [12:09:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T333332)', diff saved to https://phabricator.wikimedia.org/P46516 and previous config saved to /var/cache/conftool/dbconfig/20230412-120943-ladsgroup.json [12:11:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 
(T333332)', diff saved to https://phabricator.wikimedia.org/P46517 and previous config saved to /var/cache/conftool/dbconfig/20230412-121157-ladsgroup.json [12:12:09] (03CR) 10Jelto: "one question in-line regarding manage_host_keys. I guess we want to either run ssh-keygen or import keys from private puppet. But from loo" [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [12:14:01] (03CR) 10Muehlenhoff: [C: 03+2] kartotherian: Stop passing use_nodejs10 [puppet] - 10https://gerrit.wikimedia.org/r/908196 (owner: 10Muehlenhoff) [12:14:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 T334580', diff saved to https://phabricator.wikimedia.org/P46518 and previous config saved to /var/cache/conftool/dbconfig/20230412-121420-marostegui.json [12:14:25] T334580: decommission db1120.eqiad.wmnet - https://phabricator.wikimedia.org/T334580 [12:16:28] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10MoritzMuehlenhoff) [12:19:25] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10MoritzMuehlenhoff) [12:23:03] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10ayounsi) Yep, the list of servers on the task description is up to date. 
[12:23:15] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10MoritzMuehlenhoff) [12:23:37] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10ayounsi) [12:24:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T333332)', diff saved to https://phabricator.wikimedia.org/P46519 and previous config saved to /var/cache/conftool/dbconfig/20230412-122409-ladsgroup.json [12:24:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance [12:24:15] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [12:24:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance [12:24:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46520 and previous config saved to /var/cache/conftool/dbconfig/20230412-122433-ladsgroup.json [12:25:35] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10ayounsi) > deploy1002 will need to be scheduled well in advance and/or failed over to deploy2002 as it is the canonical deployment host. @akosiaris As we're in the DC switchover and 2002 is... 
[12:26:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46521 and previous config saved to /var/cache/conftool/dbconfig/20230412-122645-ladsgroup.json [12:27:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P46522 and previous config saved to /var/cache/conftool/dbconfig/20230412-122703-ladsgroup.json [12:29:16] (03PS1) 10Muehlenhoff: service::node: Remove use_nodejs10 [puppet] - 10https://gerrit.wikimedia.org/r/908226 [12:31:10] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10ayounsi) [12:31:24] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ayounsi) [12:31:34] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [12:31:42] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10ayounsi) [12:31:54] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi) [12:32:08] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10ayounsi) [12:34:49] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ayounsi) [12:35:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:40] !log installing intel-microcode security updates [12:35:42] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ayounsi) >>! In T333377#8775126, @Marostegui wrote: > @ayounsi we are placing new DB hosts in production, can you run the same query you ran to gather th... [12:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:11] (03PS1) 10Marostegui: mariadb: Place db1223 into s3 [puppet] - 10https://gerrit.wikimedia.org/r/908227 (https://phabricator.wikimedia.org/T326669) [12:38:19] (03CR) 10Marostegui: [C: 03+2] mariadb: Place db1223 into s3 [puppet] - 10https://gerrit.wikimedia.org/r/908227 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [12:38:28] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui) Thank you, nothing changes from our DB side! [12:39:16] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10MoritzMuehlenhoff) [12:40:35] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10SLyngshede-WMF) [12:41:13] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10SLyngshede-WMF) idm servers have the module installed, but not enabled. 
[12:41:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P46523 and previous config saved to /var/cache/conftool/dbconfig/20230412-124151-ladsgroup.json [12:42:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P46524 and previous config saved to /var/cache/conftool/dbconfig/20230412-124210-ladsgroup.json [12:45:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:50:15] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10MoritzMuehlenhoff) [12:50:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46525 and previous config saved to /var/cache/conftool/dbconfig/20230412-125016-root.json [12:50:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [12:52:47] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10jbond) >>! In T334577#8775710, @SLyngshede-WMF wrote: > idm servers have the module installed, but not enabled. the apache2 package installs the file so t... 
[12:54:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 230k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [12:55:29] (03CR) 10Elukey: [C: 03+2] aptrepo: import AMD ROCm 5.4 to bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/908208 (https://phabricator.wikimedia.org/T295661) (owner: 10Elukey) [12:56:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P46526 and previous config saved to /var/cache/conftool/dbconfig/20230412-125658-ladsgroup.json [12:57:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T333332)', diff saved to https://phabricator.wikimedia.org/P46527 and previous config saved to /var/cache/conftool/dbconfig/20230412-125716-ladsgroup.json [12:57:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance [12:57:21] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [12:57:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idm1001.wikimedia.org [12:57:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance [12:57:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T333332)', diff saved to https://phabricator.wikimedia.org/P46528 and previous config saved to /var/cache/conftool/dbconfig/20230412-125739-ladsgroup.json [12:59:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T333332)', diff saved to https://phabricator.wikimedia.org/P46529 and previous config saved to 
/var/cache/conftool/dbconfig/20230412-125953-ladsgroup.json [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1300). [13:00:05] subbu, Sergi0, and herzog: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] (unable to deploy today FYI) [13:00:16] hello [13:00:26] o/ [13:00:30] o/ [13:00:35] o/ [13:00:56] no patches from me today though, just a maintenance script run request [13:00:56] * Lucas_WMDE looks at the calendar [13:00:59] Lucas_WMDE: can you deploy or should I? [13:01:06] I can do it [13:01:11] thanks! [13:01:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idm1001.wikimedia.org [13:01:43] hi subbu! ready to start with your change? [13:02:15] o/ give me a couple mins to wake up fully. :) [13:02:19] ok sure [13:02:24] I can do another change first :) [13:02:40] sounds good. 
:) [13:03:10] !log installing nodejs security updates on buster [13:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:26] Puppet, Infrastructure-Foundations, Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (MoritzMuehlenhoff) [13:04:33] (CR) Lucas Werkmeister (WMDE): [C: +1] "list of wikis matches the two tasks" [mediawiki-config] - https://gerrit.wikimedia.org/r/907899 (https://phabricator.wikimedia.org/T304551) (owner: Sergio Gimeno) [13:04:55] (PS9) EoghanGaffney: Add keys for sshd-gitlab from the secrets repo [puppet] - https://gerrit.wikimedia.org/r/907878 [13:05:16] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/907899 (https://phabricator.wikimedia.org/T304551) (owner: Sergio Gimeno) [13:05:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46530 and previous config saved to /var/cache/conftool/dbconfig/20230412-130521-root.json [13:06:12] (Merged) jenkins-bot: GrowthExperiments: enable add link frontend in 7,8th round wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/907899 (https://phabricator.wikimedia.org/T304551) (owner: Sergio Gimeno) [13:06:52] * Lucas_WMDE watches scap/git fetch REL1_38 and branch_cut_pretest in a bunch of submodules [13:07:11] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:907899|GrowthExperiments: enable add link frontend in 7,8th round wikis (T304551 T308133)]] [13:07:17] T308133: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 [13:07:17] T304551: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 [13:07:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host 
cloudvirtlocal1001.eqiad.wmnet with OS bullseye [13:07:42] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [13:08:33] !log lucaswerkmeister-wmde@deploy2002 sgimeno and lucaswerkmeister-wmde: Backport for [[gerrit:907899|GrowthExperiments: enable add link frontend in 7,8th round wikis (T304551 T308133)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [13:08:48] sergi0: can you test the change? [13:08:51] Puppet, Infrastructure-Foundations, Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (MoritzMuehlenhoff) So it turns out none of our Apache installs which had it running actually needs it; these 11 cases must all have been caused by random d... 
[13:09:14] Lucas_WMDE: sure I'll need ~3-5 min, gonna test 8-10 of them [13:09:19] ok sure [13:11:46] (CR) Stevemunene: [C: +1] airflow: Make Data Engineering primary contact [puppet] - https://gerrit.wikimedia.org/r/907992 (https://phabricator.wikimedia.org/T334522) (owner: Bking) [13:12:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46531 and previous config saved to /var/cache/conftool/dbconfig/20230412-131204-ladsgroup.json [13:12:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance [13:12:09] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [13:12:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance [13:12:24] (PS4) Elukey: role::dse_k8s::worker: add AMD GPU support [puppet] - https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009) [13:12:26] (PS1) Elukey: aptrepo: fix rocm update rule for bullseye-wikimedia [puppet] - https://gerrit.wikimedia.org/r/908229 [13:12:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46532 and previous config saved to /var/cache/conftool/dbconfig/20230412-131227-ladsgroup.json [13:13:03] 2 more and done [13:13:15] herzog: do you know if there’s a standard way to phaste the maintenance script output? [13:13:25] otherwise I would probably go for | tee >(phaste) [13:13:33] Lucas_WMDE: I think [script] | phaste [13:13:36] e.g. 
[13:13:57] mwscript namespaceDupes.php kswiki | phaste [13:14:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 202.4k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [13:14:07] but then I don’t get to see the output myself, I assume [13:14:11] but I don't know for sure, is anyone here familiar with that? [13:14:28] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40617/console" [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [13:14:28] otherwise we can resort to the usual copy & paste terminal method :) [13:14:32] ^^ [13:14:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46533 and previous config saved to /var/cache/conftool/dbconfig/20230412-131440-ladsgroup.json [13:14:42] | tee -a | phaste ? [13:14:58] no docs on Wikitech re phaste that I can see [13:14:59] Ah, no that won't work [13:15:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P46535 and previous config saved to /var/cache/conftool/dbconfig/20230412-131459-ladsgroup.json [13:15:06] Lucas_WMDE: looking good from my side [13:15:09] I think | tee >(phaste) should work [13:15:11] sergi0: ok thanks! 
[13:15:26] I’ll try it ^^ [13:15:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/908229 (owner: 10Elukey) [13:15:37] Lucas_WMDE: Yeah, that should work better [13:15:39] cgoubert@deploy2002:/srv/mediawiki$ echo plop | tee -a >(phaste) [13:15:41] plop [13:15:41] https://wikitech.wikimedia.org/wiki/Phabricator/Conduit_API_Tokens#ProdPasteBot [13:15:43] cgoubert@deploy2002:/srv/mediawiki$ https://phabricator.wikimedia.org/P46536 [13:15:53] heh ^^ [13:15:54] (03CR) 10Elukey: [C: 03+2] aptrepo: fix rocm update rule for bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/908229 (owner: 10Elukey) [13:16:11] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [13:16:52] claime: and just | phaste won't work? [13:17:10] or you wouldn't see the terminal output [13:17:16] (03CR) 10EoghanGaffney: Add keys for sshd-gitlab from the secrets repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [13:17:21] herzog: It'll work, you won't see the output [13:17:31] thanks :) [13:20:21] Lucas_WMDE: how's the script going? 
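Note on the pipeline discussed above: `script | phaste` hands all output to the paste tool, so nothing reaches the terminal, while `script | tee >(phaste)` copies each line to both places. The following is a minimal Python sketch of that tee behaviour. It is only an analogy: `tee_lines` and the in-memory buffers are hypothetical stand-ins for the terminal and for the paste tool's input, not part of any Wikimedia tooling.

```python
import io
import sys
from typing import Iterable, TextIO


def tee_lines(lines: Iterable[str], *sinks: TextIO) -> int:
    """Write every line to every sink, like `tee`; return the line count."""
    count = 0
    for line in lines:
        for sink in sinks:
            sink.write(line)
        count += 1
    return count


# Hypothetical stand-ins: `terminal` plays the operator's stdout,
# `paste_input` plays whatever the paste tool would read on stdin.
terminal = io.StringIO()
paste_input = io.StringIO()

script_output = ["plop\n"]

# In the spirit of `echo plop | tee >(phaste)`: both sinks receive the text.
tee_lines(script_output, terminal, paste_input)

# In the spirit of `echo plop | phaste`: only the paste tool receives it.
paste_only = io.StringIO()
tee_lines(script_output, paste_only)

if __name__ == "__main__":
    # Show what the operator would still see on the terminal.
    sys.stdout.write(terminal.getvalue())
```

With a plain pipe there is simply no second sink, which matches the observation in the channel that the command works but the output is not visible.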
[13:20:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46537 and previous config saved to /var/cache/conftool/dbconfig/20230412-132026-root.json [13:20:34] haven’t started it yet [13:20:39] ah, ok [13:20:39] scap is still running for the other change :) [13:20:42] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:907899|GrowthExperiments: enable add link frontend in 7,8th round wikis (T304551 T308133)]] (duration: 13m 30s) [13:20:47] T308133: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 [13:20:47] T304551: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 [13:21:02] ugh, the same error again [13:22:04] !log lucaswerkmeister-wmde@mwmaint2002:~$ mwscript namespaceDupes kswiki --fix | tee >(phaste -t T334277) # P46538; errors on stderr, cf. T328634 [13:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:10] T328634: Lost pages after deployed addtional namespaces on shn.wikibooks - https://phabricator.wikimedia.org/T328634 [13:22:10] T334277: Run namespaceDupes.php for kswiki - https://phabricator.wikimedia.org/T334277 [13:22:46] (03PS2) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade (esams) [puppet] - 10https://gerrit.wikimedia.org/r/907931 (https://phabricator.wikimedia.org/T321309) [13:25:36] so one link to fix manually it seems [13:25:48] I pasted the dry run output [13:25:50] nine pages in total [13:26:11] !log upload AMD ROCm 5.4 debian packages to wikimedia-bullseye:thirdparty/amd-rocm54 - T295661 [13:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:15] T295661: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 [13:26:34] (03CR) 10Elukey: [C: 03+2] role::dse_k8s::worker: add AMD GPU support [puppet] - 10https://gerrit.wikimedia.org/r/908210 
(https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [13:26:46] Lucas_WMDE: yep, I'll purge these via API [13:26:50] ok thanks [13:26:58] with both options [13:27:04] and see if that solves the issue [13:27:13] I don’t think recursive should be needed, but probably won’t hurt either [13:27:22] I can run another dry run later to check if it’s done [13:27:31] subbu: how are you feeling now? :) [13:27:41] Ready. :) [13:27:45] ok \o/ [13:27:58] (03PS6) 10Lucas Werkmeister (WMDE): Make VE on officewiki use Parsoid directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [13:28:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [13:28:12] !log Stopping puppet on gitlab hosts to slow-rollout puppet ssh key management - T333840 [13:28:13] (03PS2) 10Slyngshede: LDAP attribute editor [software/bitu] - 10https://gerrit.wikimedia.org/r/900621 (https://phabricator.wikimedia.org/T179463) [13:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:17] T333840: Move gitlab ssh host keys to private puppet - https://phabricator.wikimedia.org/T333840 [13:28:24] Lucas_WMDE: just a q, which is the bad link, the one on the left or the right? 
The namespace name being in a script I can't read doesn't help :) [13:28:36] let me see [13:29:07] (03Merged) 10jenkins-bot: Make VE on officewiki use Parsoid directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [13:29:27] I think this is the title mentioned in the first line https://ks.wikipedia.org/wiki/%D9%85%D8%A7%DA%88%DB%8C%D9%88%D9%97%D9%84:Citation/CS1/Configuration/%D8%AF%D9%8E%D8%B3%D8%AA%D8%A7%D9%88%DB%8C%D9%96%D8%B2 [13:29:31] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:896104|Make VE on officewiki use Parsoid directly (T320529 T333402)]] [13:29:37] T333402: Switching from source editing to visual editing mode is broken with the REST API - https://phabricator.wikimedia.org/T333402 [13:29:38] T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529 [13:29:42] (03PS1) 10Elukey: Fix dse_gpu_hosts format in regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/908231 [13:29:45] (03PS3) 10Slyngshede: LDAP attribute editor [software/bitu] - 10https://gerrit.wikimedia.org/r/900621 (https://phabricator.wikimedia.org/T179463) [13:29:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P46539 and previous config saved to /var/cache/conftool/dbconfig/20230412-132946-ladsgroup.json [13:30:00] hang on, no, that doesn’t make sense to specify as a URL [13:30:04] because that probably normalizes it anyway [13:30:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P46540 and previous config saved to /var/cache/conftool/dbconfig/20230412-133006-ladsgroup.json [13:30:29] (03CR) 10Slyngshede: "I'll create another patch that sets up Sphinx, so documentation can get started." 
[software/bitu] - 10https://gerrit.wikimedia.org/r/900621 (https://phabricator.wikimedia.org/T179463) (owner: 10Slyngshede) [13:30:54] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and daniel: Backport for [[gerrit:896104|Make VE on officewiki use Parsoid directly (T320529 T333402)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:31:04] purge sent [13:31:07] for that one [13:31:12] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40619/console" [puppet] - 10https://gerrit.wikimedia.org/r/908231 (owner: 10Elukey) [13:31:16] subbu: should be ready to test now :) [13:31:22] Lucas_WMDE: TIL about the `phaste` command o_O [13:31:34] Lucas_WMDE, ty. is it mwdebug on codfw? [13:31:35] same tbh ^^ [13:31:38] 10SRE, 10Infrastructure-Foundations, 10netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (10jbond) the logic we use in puppet is mostly the same as [[ https://phabricator.wikimedia.org/P46511 | this script ]] which would be a good template to use for a cookbook [13:31:40] should be yeah [13:31:43] ok. 
[13:31:52] * TheresNoTime has been manually copy/pasting (: [13:32:31] herzog: I tried it in the API sandbox myself but got nine `missing`s :/ must’ve done something wrong [13:32:31] (03CR) 10Elukey: [V: 03+1 C: 03+2] Fix dse_gpu_hosts format in regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/908231 (owner: 10Elukey) [13:32:45] Lucas_WMDE: yep, "missing": true [13:33:01] or maybe it's the other title [13:33:09] (03CR) 10EoghanGaffney: [C: 03+2] Add keys for sshd-gitlab from the secrets repo [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [13:33:12] action=info for that page returns page_id = 0 [13:33:15] which is not right [13:33:25] https://ks.wikipedia.org/w/index.php?title=%D9%85%D8%A7%DA%88%DB%8C%D9%88%D9%97%D9%84:Citation/CS1/Configuration/%D8%AF%D9%8E%D8%B3%D8%AA%D8%A7%D9%88%DB%8C%D9%96%D8%B2&action=info [13:33:47] well, it does not exist so indeed pageid = 0 [13:33:49] Lucas_WMDE, lgtm ... good to go. [13:33:54] ok thanks! [13:33:58] syncing [13:34:16] !log [puppetmaster] sudo /usr/local/sbin/puppet-facts-upload --proxy http://webproxy.eqiad.wmnet:8080 to update PCC [13:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:20] (03PS1) 10Jbond: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) [13:35:06] TheresNoTime: using the canonical NS name in English seems to work [13:35:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46541 and previous config saved to /var/cache/conftool/dbconfig/20230412-133531-root.json [13:35:36] Not been following the channel, what's up? 
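Note on the `"missing": true` replies mentioned above: this is how the MediaWiki action API marks a title that does not resolve to an existing page, which is consistent with action=info reporting page ID 0. A minimal sketch of picking such entries out of a reply, assuming a `formatversion=2` style response in which each entry carries a boolean `missing` field. The sample payload and the helper name `missing_titles` are made up for illustration and are not real output from kswiki.

```python
import json
from typing import Any


def missing_titles(entries: list[dict[str, Any]]) -> list[str]:
    """Return the titles of entries the API flagged as missing."""
    return [entry["title"] for entry in entries if entry.get("missing")]


# Hypothetical sample reply to a purge request; titles are placeholders.
sample = json.loads("""
{
  "batchcomplete": true,
  "purge": [
    {"ns": 828, "title": "Module:Example", "purged": true},
    {"ns": 828, "title": "Module:Does not exist", "missing": true}
  ]
}
""")

if __name__ == "__main__":
    print(missing_titles(sample["purge"]))
```

If every entry comes back as missing, the titles sent were probably not the ones stored in the database, which is why trying the canonical English namespace name was a reasonable next step.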
[13:35:45] I was thinking of a more brutal route and using action=purge with generator=allpages and gapnamespace=828 [13:35:57] would take a little bit though, since purges are rate limited [13:36:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:07] (but there are only 366 pages in the namespace so it wouldn’t be totally terrible either) [13:36:42] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on puppetdb2002.codfw.wmnet with reason: puppetdb maintenance [13:36:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on puppetdb2002.codfw.wmnet with reason: puppetdb maintenance [13:37:36] (03PS1) 10Phedenskog: perf: PaintTiming metrics is now sent in the navtiming event. [alerts] - 10https://gerrit.wikimedia.org/r/908234 (https://phabricator.wikimedia.org/T328256) [13:38:23] PROBLEM - Check systemd state on dse-k8s-worker1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_amd_rocm_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:39:20] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:896104|Make VE on officewiki use Parsoid directly (T320529 T333402)]] (duration: 09m 48s) [13:39:26] T333402: Switching from source editing to visual editing mode is broken with the REST API - https://phabricator.wikimedia.org/T333402 [13:39:26] T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529 [13:39:27] subbu: should be done [13:39:33] ty! 
:) [13:39:40] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10jbond) awesome thanks @MoritzMuehlenhoff [13:40:44] (03PS2) 10Jbond: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) [13:41:38] herzog: 7 links left now [13:41:55] (03CR) 10Cathal Mooney: "Thanks for the review, will reformat and submit a new patchset." [homer/public] - 10https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281) (owner: 10Cathal Mooney) [13:43:02] !log UTC afternoon backport+config window done [13:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:09] Lucas_WMDE: great - I sent purge requests for the page ids listed there as well [13:43:13] just in case [13:43:19] I'll continue with the titles [13:43:20] !log stop Puppet in codfw/edges for puppetdb maintenance [13:43:22] oh right, I missed that the output had page IDs [13:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:24] that looks great [13:43:44] but if 7 links still remain, does that mean it's not fixing all of them? 
[13:43:59] maybe the job queue will need some time to run [13:44:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P46542 and previous config saved to /var/cache/conftool/dbconfig/20230412-134453-ladsgroup.json [13:45:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T333332)', diff saved to https://phabricator.wikimedia.org/P46543 and previous config saved to /var/cache/conftool/dbconfig/20230412-134512-ladsgroup.json [13:45:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance [13:45:17] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [13:45:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance [13:45:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T333332)', diff saved to https://phabricator.wikimedia.org/P46544 and previous config saved to /var/cache/conftool/dbconfig/20230412-134535-ladsgroup.json [13:45:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:55] (03PS3) 10Jbond: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) [13:47:12] (03PS1) 10Jelto: Revert "install_server: change device names in gitlab-raid1" [puppet] - 10https://gerrit.wikimedia.org/r/908045 [13:47:47] (03CR) 10CI reject: [V: 04-1] Revert "install_server: change device names in gitlab-raid1" [puppet] - 10https://gerrit.wikimedia.org/r/908045 (owner: 10Jelto) [13:47:47] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - 
https://phabricator.wikimedia.org/T334432 (10Ottomata) Approved [13:47:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T333332)', diff saved to https://phabricator.wikimedia.org/P46545 and previous config saved to /var/cache/conftool/dbconfig/20230412-134749-ladsgroup.json [13:48:32] (03PS2) 10Jelto: Revert "install_server: change device names in gitlab-raid1" [puppet] - 10https://gerrit.wikimedia.org/r/908045 [13:49:42] (03PS6) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed [puppet] - 10https://gerrit.wikimedia.org/r/905243 [13:49:52] herzog: I cobbled together a small python script to purge all pages in namespace 828 after all [13:49:58] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40624/console" [puppet] - 10https://gerrit.wikimedia.org/r/907931 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:50:30] (03CR) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [13:50:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46546 and previous config saved to /var/cache/conftool/dbconfig/20230412-135035-root.json [13:50:37] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 275 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [13:50:55] (03CR) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [13:51:04] (03CR) 10Jelto: "the different device names did not help growing the raid. Let's try the old naming and a different maximum size. 
See also T333674#8775979" [puppet] - 10https://gerrit.wikimedia.org/r/908045 (owner: 10Jelto) [13:53:05] (03PS4) 10Jbond: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) [13:53:28] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: lvs/balancer: unify hiera post bullseye upgrade (esams) [puppet] - 10https://gerrit.wikimedia.org/r/907931 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:53:32] (JobUnavailable) firing: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:54:43] PROBLEM - Check systemd state on dse-k8s-worker1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_amd_rocm_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:35] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [13:55:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... 
[13:57:10] (03PS5) 10Jbond: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) [13:57:41] (03CR) 10CI reject: [V: 04-1] ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: 10Jbond) [13:59:26] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Increase varnish max_connections to ats-be on eqsin|ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/907912 (https://phabricator.wikimedia.org/T288106) (owner: 10Vgutierrez) [13:59:53] Lucas_WMDE: awesome, thanks [14:00:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46547 and previous config saved to /var/cache/conftool/dbconfig/20230412-135959-ladsgroup.json [14:00:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:00:08] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [14:00:10] not sure if it's the script that's broken but namespaceDupes should be fixed :) [14:00:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:00:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2147.codfw.wmnet with reason: Maintenance [14:00:27] only 4 links to fix now [14:00:29] * herzog returns to tax filings [14:00:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2147.codfw.wmnet with reason: Maintenance [14:00:39] leaving, thanks for the assistance Lucas_WMDE! 
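Note on the purge script: the script itself was not pasted in the channel. The following is a minimal sketch of what such a script could look like, not the one that was actually run. It assumes the public endpoint `https://ks.wikipedia.org/w/api.php`, a batch size of 50 and a fixed pause between batches; those three values and all function names are assumptions. It combines `action=purge` with `generator=allpages` and `gapnamespace=828` as suggested earlier and follows the API's `continue` values. Only `forcelinkupdate` is set, since the recursive variant was said not to be needed.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Any, Optional

API_URL = "https://ks.wikipedia.org/w/api.php"  # assumed endpoint
NAMESPACE = 828  # namespace number from the discussion


def build_params(cont: Optional[dict[str, str]] = None) -> dict[str, str]:
    """Build the POST parameters for one batch of purges."""
    params = {
        "action": "purge",
        "format": "json",
        "formatversion": "2",
        "generator": "allpages",
        "gapnamespace": str(NAMESPACE),
        "gaplimit": "50",  # assumed batch size
        "forcelinkupdate": "1",
    }
    if cont:
        # Carry over gapcontinue/continue from the previous reply.
        params.update(cont)
    return params


def purge_batch(params: dict[str, str]) -> dict[str, Any]:
    """POST one purge request and return the decoded JSON reply."""
    data = urllib.parse.urlencode(params).encode("utf-8")
    request = urllib.request.Request(
        API_URL,
        data=data,  # supplying data makes this a POST, which purge requires
        headers={"User-Agent": "purge-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def purge_namespace(pause_seconds: float = 5.0) -> int:
    """Purge every page in NAMESPACE; return how many entries were reported."""
    total = 0
    cont: Optional[dict[str, str]] = None
    while True:
        reply = purge_batch(build_params(cont))
        total += len(reply.get("purge", []))
        cont = reply.get("continue")
        if not cont:
            return total
        time.sleep(pause_seconds)  # purges are rate limited


if __name__ == "__main__":
    # Print the first request's parameters without touching the network.
    print(build_params())
```

Calling `purge_namespace()` would send the real requests. With the 366 pages mentioned above and the assumed batch size, that would be eight batches.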
[14:00:43] well, namespaceDupes shouldn’t have a problem anymore once the links table migration is done I assume [14:00:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T333332)', diff saved to https://phabricator.wikimedia.org/P46548 and previous config saved to /var/cache/conftool/dbconfig/20230412-140045-ladsgroup.json [14:00:54] not sure it’s worth fixing it until then, when I looked at it last time it wasn’t trivial [14:01:57] (03PS6) 10Jbond: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) [14:01:59] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2002 is OK: HTTP OK: HTTP/1.1 200 OK - 58711 bytes in 4.350 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [14:02:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P46549 and previous config saved to /var/cache/conftool/dbconfig/20230412-140255-ladsgroup.json [14:03:28] (03CR) 10Kamila Součková: [C: 03+2] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/907928 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [14:03:32] (JobUnavailable) resolved: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:04:06] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: increase cpu limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/908225 (owner: 10DCausse) [14:04:24] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Ottomata) > FWIW I cannot access those either Makes sense, user `brett` would have to be in 
analytics-privatedata-users Unless...what do you mean by 'access'? You sho... [14:05:21] (03PS7) 10Jbond: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) [14:05:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46550 and previous config saved to /var/cache/conftool/dbconfig/20230412-140540-root.json [14:05:55] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:06:03] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:07:06] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [14:07:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [14:07:41] !log re-enabled Puppet in codfw/edges after puppetdb maintenance [14:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:34] (03Merged) 10jenkins-bot: rest-gateway: fix lua handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/907928 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [14:08:49] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10lbowmaker) [14:10:27] !log lucaswerkmeister-wmde@mwmaint2002:~$ mwscript namespaceDupes kswiki --fix # T334277, fixed the one remaining link [14:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:31] T334277: Run namespaceDupes.php for 
kswiki - https://phabricator.wikimedia.org/T334277 [14:13:03] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:16:41] (03PS8) 10Jbond: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) [14:18:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P46552 and previous config saved to /var/cache/conftool/dbconfig/20230412-141801-ladsgroup.json [14:18:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P46553 and previous config saved to /var/cache/conftool/dbconfig/20230412-141802-ladsgroup.json [14:19:29] (03PS3) 10Muehlenhoff: Create a separate Hiera variable of KDCs specifically for use in client config [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) [14:20:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:20:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46554 and previous config saved to /var/cache/conftool/dbconfig/20230412-142045-root.json [14:22:32] (03PS4) 10Cathal Mooney: Add generic way to create static routes on switches [homer/public] - 10https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281) [14:22:58] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [14:23:09] !log kamila@deploy2002 helmfile [staging] DONE 
helmfile.d/services/rest-gateway: apply [14:25:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:25:13] (03PS1) 10Hnowlan: rest-gateway: fix direct_response [deployment-charts] - 10https://gerrit.wikimedia.org/r/908242 (https://phabricator.wikimedia.org/T326321) [14:25:55] (03CR) 10Raymond Ndibe: [C: 03+1] maintain_dbusers: move all the files under service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906637 (owner: 10David Caro) [14:26:27] (03CR) 10Kamila Součková: [C: 03+2] "LGTM again!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908242 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [14:27:57] (03PS1) 10Ottomata: Update all eventgate clusters to same image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/908244 (https://phabricator.wikimedia.org/T334510) [14:29:07] (03PS9) 10Jbond: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) [14:30:14] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [14:30:20] (03CR) 10Ayounsi: [C: 03+1] "lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281) (owner: 10Cathal Mooney) [14:30:34] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281 (10cmooney) FWIW I've submitted a new patchset with a different format for defining the routes in YAML (at Arzhel's suggestion). ` static... 
[14:31:17] (03Merged) 10jenkins-bot: rest-gateway: fix direct_response [deployment-charts] - 10https://gerrit.wikimedia.org/r/908242 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [14:31:40] (03PS1) 10Vgutierrez: hiera: merge: hash for profile::cache::varnish::frontend::cache_be_opts [puppet] - 10https://gerrit.wikimedia.org/r/908245 (https://phabricator.wikimedia.org/T288106) [14:32:15] (03PS10) 10Jbond: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) [14:32:32] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:33:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T333332)', diff saved to https://phabricator.wikimedia.org/P46556 and previous config saved to /var/cache/conftool/dbconfig/20230412-143308-ladsgroup.json [14:33:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P46557 and previous config saved to /var/cache/conftool/dbconfig/20230412-143309-ladsgroup.json [14:33:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance [14:33:13] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [14:33:17] (03CR) 10Ottomata: [C: 03+2] Update all eventgate clusters to same image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/908244 (https://phabricator.wikimedia.org/T334510) (owner: 10Ottomata) [14:33:24] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40635/console" [puppet] - 10https://gerrit.wikimedia.org/r/908245 (https://phabricator.wikimedia.org/T288106) (owner: 10Vgutierrez) [14:33:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance [14:33:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T333332)', diff saved to https://phabricator.wikimedia.org/P46558 and previous config saved to /var/cache/conftool/dbconfig/20230412-143331-ladsgroup.json [14:35:11] (03PS11) 10Jbond: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) [14:35:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T333332)', diff saved to https://phabricator.wikimedia.org/P46559 and previous config saved to /var/cache/conftool/dbconfig/20230412-143545-ladsgroup.json [14:36:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40637/console" [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: 10Jbond) [14:36:36] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:36:39] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Update all eventgate clusters to same image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/908244 (https://phabricator.wikimedia.org/T334510) (owner: 10Ottomata) [14:36:42] (03CR) 10Jbond: [V: 03+1] "10 patches later and we finnaly have a pcc 😊" [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: 10Jbond) [14:36:43] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:37:47] jouncebot: next [14:37:48] In 2 hour(s) and 22 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1700) [14:38:04] (03CR) 10Filippo Giunchedi: [C: 03+2] Rename cadvisor_exporter to cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/908215 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo 
Giunchedi) [14:38:18] jbond: merging your change too [14:38:27] for private.git that is [14:38:32] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:36] !log installing apache security updates on gerrit1001 [14:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:48] jbond: labs/private.git [14:40:25] !log installing apache security updates on phab1004 (phabricator.wikimedia.org) [14:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:46] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [14:42:15] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:42:20] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [14:42:34] jouncebot: nowandnext [14:42:34] No deployments scheduled for the next 2 hour(s) and 17 minute(s) [14:42:34] In 2 hour(s) and 17 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1700) [14:43:04] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [14:43:23] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [14:43:32] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [14:43:32] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:44:11] !log otto@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/eventgate-analytics: apply [14:44:56] (03PS1) 10Urbanecm: [Growth] beta: Enable Personalized praise everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908249 [14:45:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908249 (owner: 10Urbanecm) [14:46:15] (03CR) 10Jbond: [C: 03+1] Create a separate Hiera variable of KDCs specifically for use in client config [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [14:46:30] (03Merged) 10jenkins-bot: [Growth] beta: Enable Personalized praise everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908249 (owner: 10Urbanecm) [14:47:27] (03PS1) 10Ottomata: eventgate - remove deprecated all_settings stream config param [deployment-charts] - 10https://gerrit.wikimedia.org/r/908251 (https://phabricator.wikimedia.org/T286344) [14:47:44] (03PS5) 10David Caro: openstack: encapi: new id-based api for Terraform [puppet] - 10https://gerrit.wikimedia.org/r/874812 (https://phabricator.wikimedia.org/T317478) (owner: 10Majavah) [14:47:52] (03CR) 10David Caro: openstack: encapi: new id-based api for Terraform (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/874812 (https://phabricator.wikimedia.org/T317478) (owner: 10Majavah) [14:48:09] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate - remove deprecated all_settings stream config param [deployment-charts] - 10https://gerrit.wikimedia.org/r/908251 (https://phabricator.wikimedia.org/T286344) (owner: 10Ottomata) [14:48:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T333332)', diff saved to https://phabricator.wikimedia.org/P46560 and previous config saved to /var/cache/conftool/dbconfig/20230412-144815-ladsgroup.json [14:48:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2155.codfw.wmnet with reason: 
Maintenance [14:48:20] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [14:48:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2155.codfw.wmnet with reason: Maintenance [14:48:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [14:48:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [14:48:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T333332)', diff saved to https://phabricator.wikimedia.org/P46561 and previous config saved to /var/cache/conftool/dbconfig/20230412-144856-ladsgroup.json [14:48:57] (03PS1) 10Jbond: puppet-diffs: add project members to the access list [puppet] - 10https://gerrit.wikimedia.org/r/908252 [14:49:13] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet-diffs: add project members to the access list [puppet] - 10https://gerrit.wikimedia.org/r/908252 (owner: 10Jbond) [14:49:58] (03PS1) 10Ayounsi: Allow cumin host to reach gNMI on sonic switches [homer/public] - 10https://gerrit.wikimedia.org/r/908253 [14:50:02] (03CR) 10Muehlenhoff: [C: 03+1] "Let's give it a shot" [puppet] - 10https://gerrit.wikimedia.org/r/908045 (owner: 10Jelto) [14:50:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P46562 and previous config saved to /var/cache/conftool/dbconfig/20230412-145051-ladsgroup.json [14:51:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:51:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 
(T333332)', diff saved to https://phabricator.wikimedia.org/P46563 and previous config saved to /var/cache/conftool/dbconfig/20230412-145108-ladsgroup.json [14:51:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:55] (03CR) 10David Caro: [C: 03+2] openstack: encapi: new id-based api for Terraform [puppet] - 10https://gerrit.wikimedia.org/r/874812 (https://phabricator.wikimedia.org/T317478) (owner: 10Majavah) [14:51:58] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2022/2023-Q4): WMCS Cookbook Automation FY2022-23 Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) p:05Medium→03High [14:52:05] (03CR) 10Ayounsi: [C: 03+2] Allow cumin host to reach gNMI on sonic switches [homer/public] - 10https://gerrit.wikimedia.org/r/908253 (owner: 10Ayounsi) [14:52:07] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [14:52:12] (03CR) 10Majavah: [C: 04-1] openstack: encapi: new id-based api for Terraform [puppet] - 10https://gerrit.wikimedia.org/r/874812 (https://phabricator.wikimedia.org/T317478) (owner: 10Majavah) [14:52:15] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2022/2023-Q4): WMCS Cookbook Automation FY2022-23 Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) [14:52:20] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:52:35] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [14:52:38] (03Merged) 10jenkins-bot: Allow cumin host to reach gNMI on sonic switches [homer/public] - 10https://gerrit.wikimedia.org/r/908253 (owner: 10Ayounsi) [14:52:48] (03PS6) 10David Caro: openstack: encapi: open up write access [puppet] - 10https://gerrit.wikimedia.org/r/874813 
(https://phabricator.wikimedia.org/T317478) (owner: 10Majavah) [14:53:16] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [14:53:19] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [14:53:35] 10SRE, 10serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10Dzahn) [14:53:44] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [14:54:28] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [14:55:10] (03PS1) 10Ottomata: eventgate - remove irrelevant comment about all_settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/908255 [14:55:11] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [14:55:28] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate - remove irrelevant comment about all_settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/908255 (owner: 10Ottomata) [14:55:49] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [14:56:06] (03PS2) 10David Caro: cloudlib: support https for fetching data [puppet] - 10https://gerrit.wikimedia.org/r/875896 (owner: 10Majavah) [14:56:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:56:07] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [14:56:13] (03PS3) 10David Caro: hieradata: use port 443 for enc access [puppet] - 10https://gerrit.wikimedia.org/r/874894 (owner: 10Majavah) [14:56:19] (03PS6) 10David Caro: openstack: encapi: drop legacy ports [puppet] - 10https://gerrit.wikimedia.org/r/874814 (owner: 10Majavah) 
[14:56:19] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [14:57:15] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [14:57:42] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [14:57:54] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [14:57:59] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:58:22] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [14:58:40] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [14:58:42] (03CR) 10Muehlenhoff: [C: 03+2] Create a separate Hiera variable of KDCs specifically for use in client config [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [14:58:49] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:59:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:59:08] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [14:59:20] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [14:59:36] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [14:59:53] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [15:00:15] !log 
otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [15:00:25] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [15:00:35] (03PS13) 10David Caro: maintain-dbusers: use click for cli definition [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) [15:01:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:13] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [15:02:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye execute... 
[15:03:03] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Jclark-ctr) Created raid 1 for 2 ssd @elukey [15:04:29] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [15:04:38] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [15:05:11] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.981 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:05:35] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [15:05:44] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [15:05:57] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.399 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:05:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P46564 and previous config saved to /var/cache/conftool/dbconfig/20230412-150557-ladsgroup.json [15:06:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P46565 and previous config saved to /var/cache/conftool/dbconfig/20230412-150614-ladsgroup.json [15:09:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 211.8k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [15:09:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:09:51] (03PS1) 10Ottomata: flink - Allow for conditionally disabling jemalloc [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/908256 [15:10:42] 10SRE, 10MediaWiki-General: The script file run.php cannot be executed using MaintenanceRunner - https://phabricator.wikimedia.org/T334484 (10TheresNoTime) 05Open→03Resolved [[ https://wikitech.wikimedia.org/w/index.php?title=Maintenance_server&diff=prev&oldid=2068462&diffmode=source | Updated the docs ]],... [15:13:34] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10BCornwall) @Ottomata yes, I just meant those dashboards, which I do get privilege errors. I'll wait for @FNavas-foundation to clarify their issues. Thanks! [15:13:36] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40638/console" [puppet] - 10https://gerrit.wikimedia.org/r/906017 (https://phabricator.wikimedia.org/T334092) (owner: 10Stevemunene) [15:14:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 204.1k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [15:14:19] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [15:14:23] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [15:15:41] (03CR) 10Elukey: [V: 03+1 C: 03+1] "Given the state of the host, I am in favor of remove it from hadoop so we can reimage it completely (HDDs as well), and then we can re-add" [puppet] - 
10https://gerrit.wikimedia.org/r/906017 (https://phabricator.wikimedia.org/T334092) (owner: 10Stevemunene) [15:16:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:18:16] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (10nettrom_WMF) >>! In T334432#8773117, @BCornwall wrote: > @nettrom_WMF Are you the approving party (manager) of @KMorgan-WMF? No, I think that would be eit... [15:18:45] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40639/console" [puppet] - 10https://gerrit.wikimedia.org/r/906637 (owner: 10David Caro) [15:19:04] elukey: o/ am trying to deploy eventgate-main, and it looks like maybe i'm getting a kafka ssl error in staging [15:19:18] (03PS2) 10Jdlrobson: Set Vector 2022 as default skin on Welsh Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907539 (https://phabricator.wikimedia.org/T334279) [15:19:23] i already deployed all other eventgates, (that don't talk to kafka main) those were fine. [15:19:45] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: merge: hash for profile::cache::varnish::frontend::cache_be_opts [puppet] - 10https://gerrit.wikimedia.org/r/908245 (https://phabricator.wikimedia.org/T288106) (owner: 10Vgutierrez) [15:19:45] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (10BCornwall) Hi, @DMburugu and @KStoller-WMF! Could either/both of you review the description of this ticket and approve/deny the request, please? A simple c... 
[15:19:47] BGP alerts in codfw expected [15:19:57] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:20:01] ottomata: ouch, checking [15:20:13] (03CR) 10Jelto: [C: 03+2] Revert "install_server: change device names in gitlab-raid1" [puppet] - 10https://gerrit.wikimedia.org/r/908045 (owner: 10Jelto) [15:20:41] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (10BCornwall) [15:20:59] {"name":"eventgate-wikimedia","hostname":"eventgate-production-67474fd66b-wbxwm","pid":141,"producer_type":"GuaranteedProducer","level":"ERROR","error":{"origin":"local","message":"ssl error","code":-1,"errno":-1,"stack":"Error: Local: SSL error"},"msg":"Encountered rdkafka error event: ssl error","time":"2023-04-12T15:20:52.719Z","v":0} [15:21:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T333332)', diff saved to https://phabricator.wikimedia.org/P46566 and previous config saved to /var/cache/conftool/dbconfig/20230412-152104-ladsgroup.json [15:21:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:21:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:21:09] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [15:21:16] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (10BCornwall) [15:21:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:21:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P46567 and previous config saved to /var/cache/conftool/dbconfig/20230412-152120-ladsgroup.json [15:21:24] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (10Trizek-WMF) Thank you everyone! [15:21:25] (03PS2) 10Jdlrobson: Drop unused VectorPageTools feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907511 (https://phabricator.wikimedia.org/T332090) [15:21:31] hm [15:21:32] ssl.ca.location":"/etc/eventgate/puppetca.crt.pem [15:21:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance [15:21:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance [15:22:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance [15:22:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance [15:22:36] ottomata: let's go to #sre [15:22:37] .19 [15:22:39] k [15:22:44] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2010.codfw.wmnet with OS bullseye [15:23:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2108.codfw.wmnet with reason: Maintenance [15:23:01] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2010.codfw.wmnet with OS bullseye [15:23:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 
db2108.codfw.wmnet with reason: Maintenance [15:23:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T333332)', diff saved to https://phabricator.wikimedia.org/P46568 and previous config saved to /var/cache/conftool/dbconfig/20230412-152320-ladsgroup.json [15:23:50] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10fgiunchedi) [15:24:25] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:24:58] ^ expected [15:25:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T333332)', diff saved to https://phabricator.wikimedia.org/P46569 and previous config saved to /var/cache/conftool/dbconfig/20230412-152553-ladsgroup.json [15:26:36] (03PS3) 10Jelto: buildkitd: Isolate build container user/process/network namespaces [puppet] - 10https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: 10Dduvall) [15:26:43] (03PS1) 10Ottomata: eventgate-main - set common Kafka ssl settings in values.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/908258 [15:27:25] (03CR) 10Jelto: [C: 03+2] buildkitd: Isolate build container user/process/network namespaces [puppet] - 10https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: 10Dduvall) [15:28:09] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40640/console" [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [15:30:02] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:30:10] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:32:34] (HelmReleaseBadStatus) 
firing: Helm release eventgate-main/production on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-main - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:32:51] (03CR) 10Elukey: [C: 03+1] eventgate-main - set common Kafka ssl settings in values.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/908258 (owner: 10Ottomata) [15:34:35] (03PS1) 10Ssingh: hiera: lvs2010: update iface names for bullseye (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/908270 (https://phabricator.wikimedia.org/T321309) [15:35:15] (03PS1) 10Dzahn: add gerrit-new service IP [dns] - 10https://gerrit.wikimedia.org/r/908271 (https://phabricator.wikimedia.org/T334524) [15:35:49] (03CR) 10Gmodena: [C: 03+1] "LGTM." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/908256 (owner: 10Ottomata) [15:36:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:36:11] (03CR) 10CI reject: [V: 04-1] add gerrit-new service IP [dns] - 10https://gerrit.wikimedia.org/r/908271 (https://phabricator.wikimedia.org/T334524) (owner: 10Dzahn) [15:36:25] (03PS2) 10Ssingh: hiera: lvs2010: update iface names for bullseye (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/908270 (https://phabricator.wikimedia.org/T321309) [15:36:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T333332)', diff saved to https://phabricator.wikimedia.org/P46570 and previous config saved to /var/cache/conftool/dbconfig/20230412-153627-ladsgroup.json [15:36:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
8:00:00 on db2172.codfw.wmnet with reason: Maintenance [15:36:33] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [15:36:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2172.codfw.wmnet with reason: Maintenance [15:36:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T333332)', diff saved to https://phabricator.wikimedia.org/P46571 and previous config saved to /var/cache/conftool/dbconfig/20230412-153651-ladsgroup.json [15:38:46] (03PS1) 10Snwachukwu: Add referer_name field to druid pageviews hourly and daily tables [puppet] - 10https://gerrit.wikimedia.org/r/908272 (https://phabricator.wikimedia.org/T334224) [15:39:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T333332)', diff saved to https://phabricator.wikimedia.org/P46572 and previous config saved to /var/cache/conftool/dbconfig/20230412-153903-ladsgroup.json [15:39:12] (03PS2) 10Dzahn: add gerrit-new service IP [dns] - 10https://gerrit.wikimedia.org/r/908271 (https://phabricator.wikimedia.org/T334524) [15:41:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P46573 and previous config saved to /var/cache/conftool/dbconfig/20230412-154100-ladsgroup.json [15:41:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:42:32] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/908271 (https://phabricator.wikimedia.org/T334524) (owner: 10Dzahn) [15:44:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2010.codfw.wmnet with reason: host reimage [15:44:07] (ProbeDown) 
firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:44:27] (03CR) 10Dzahn: [C: 03+2] add gerrit-new service IP [dns] - 10https://gerrit.wikimedia.org/r/908271 (https://phabricator.wikimedia.org/T334524) (owner: 10Dzahn) [15:44:40] (03CR) 10Dzahn: [C: 03+2] "thanks for the review" [dns] - 10https://gerrit.wikimedia.org/r/908271 (https://phabricator.wikimedia.org/T334524) (owner: 10Dzahn) [15:44:58] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:45:04] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:45:17] (03PS1) 10Majavah: aptrepo: Drop kubernetes 1.21 components [puppet] - 10https://gerrit.wikimedia.org/r/908275 (https://phabricator.wikimedia.org/T286856) [15:45:19] (03PS1) 10Majavah: aptrepo: Add kubeadm 1.23 component [puppet] - 10https://gerrit.wikimedia.org/r/908276 (https://phabricator.wikimedia.org/T298005) [15:46:08] (03CR) 10Ssingh: [C: 03+2] hiera: lvs2010: update iface names for bullseye (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/908270 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:46:19] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10Michaelcochez) Do we have a timeline for this move already? Is it better to not update the gerrit repo at the moment? Our main development... 
[15:46:59] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2010.codfw.wmnet with reason: host reimage
[15:47:02] SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn)
[15:47:05] (PS2) Ottomata: eventgate - set default kafka ssl.ca.location in chart values.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/908258
[15:47:08] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:47:16] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:47:48] SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn) 208.80.154.151 / 2620:0:861:2:208:80:154:151 has been selected as the service IP for the new host in the subtask above
[15:48:09] SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn) p:Medium→High
[15:49:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:49:25] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:49:33] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:52:16] SRE, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (calbon)
[15:52:50] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:52:57] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:53:00] SRE, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (leila)
[15:54:07] (PS1) Dzahn: site: add role(gerrit::migration) to host gerrit1003 [puppet] - https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368)
[15:54:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P46575 and previous config saved to /var/cache/conftool/dbconfig/20230412-155410-ladsgroup.json
[15:56:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P46576 and previous config saved to /var/cache/conftool/dbconfig/20230412-155606-ladsgroup.json
[15:57:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:57:38] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:57:46] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:58:35] (PS5) Cathal Mooney: Add generic way to create static routes on switches [homer/public] - https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281)
[15:58:52] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet
[15:59:11] (CR) Ottomata: [C: +2] eventgate - set default kafka ssl.ca.location in chart values.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/908258 (owner: Ottomata)
[15:59:14] (CR) Cathal Mooney: [C: +2] Add generic way to create static routes on switches [homer/public] - https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281) (owner: Cathal Mooney)
[15:59:48] (Merged) jenkins-bot: Add generic way to create static routes on switches [homer/public] - https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281) (owner: Cathal Mooney)
[16:02:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:02:45] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[16:03:00] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[16:03:29] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:03:40] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[16:04:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2010.codfw.wmnet with OS bullseye
[16:04:13] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[16:04:20] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply
[16:04:20] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2010.codfw.wmnet with OS bullseye completed: - lvs2010 (**PASS**) - Downtimed on Icinga/Aler...
[16:04:49] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
[16:05:30] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[16:05:39] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[16:07:34] (HelmReleaseBadStatus) resolved: Helm release eventgate-main/production on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-main - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[16:07:51] hm
[16:08:22] oh resolved, right cool.
[16:09:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P46577 and previous config saved to /var/cache/conftool/dbconfig/20230412-160916-ladsgroup.json
[16:09:49] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[16:09:59] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[16:11:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T333332)', diff saved to https://phabricator.wikimedia.org/P46578 and previous config saved to /var/cache/conftool/dbconfig/20230412-161112-ladsgroup.json
[16:11:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2120.codfw.wmnet with reason: Maintenance
[16:11:17] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[16:11:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2120.codfw.wmnet with reason: Maintenance
[16:11:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T333332)', diff saved to https://phabricator.wikimedia.org/P46579 and previous config saved to /var/cache/conftool/dbconfig/20230412-161135-ladsgroup.json
[16:14:09] SRE, serviceops-collab, Patch-For-Review: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn)
[16:14:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T333332)', diff saved to https://phabricator.wikimedia.org/P46580 and previous config saved to /var/cache/conftool/dbconfig/20230412-161409-ladsgroup.json
[16:15:33] (CR) DCausse: [C: +1] flink - Allow for conditionally disabling jemalloc [docker-images/production-images] - https://gerrit.wikimedia.org/r/908256 (owner: Ottomata)
[16:15:56] (CR) Dzahn: [V: -1] "https://puppet-compiler.wmflabs.org/output/908278/40641/gerrit1003.wikimedia.org/change.gerrit1003.wikimedia.org.err" [puppet] - https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[16:16:53] (CR) Ottomata: [C: +2] flink - Allow for conditionally disabling jemalloc [docker-images/production-images] - https://gerrit.wikimedia.org/r/908256 (owner: Ottomata)
[16:17:02] (CR) Ottomata: [V: +2 C: +2] flink - Allow for conditionally disabling jemalloc [docker-images/production-images] - https://gerrit.wikimedia.org/r/908256 (owner: Ottomata)
[16:18:58] SRE, Infrastructure-Foundations, netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (cmooney) I'd consider client auth a "stretch goal" for now, nice to have but not sure we want to have all that extra complexity. In terms of an intermediate CA just for network...
[16:22:29] SRE, Infrastructure-Foundations, Traffic: Set NEL `success_fraction: 1.0` on HTTP responses for measurement domains - https://phabricator.wikimedia.org/T334608 (CDanis)
[16:24:18] SRE-tools, Infrastructure-Foundations, cloud-services-team (FY2022/2023-Q4): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (fnegri)
[16:24:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T333332)', diff saved to https://phabricator.wikimedia.org/P46581 and previous config saved to /var/cache/conftool/dbconfig/20230412-162422-ladsgroup.json
[16:24:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2179.codfw.wmnet with reason: Maintenance
[16:24:28] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[16:24:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2179.codfw.wmnet with reason: Maintenance
[16:24:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T333332)', diff saved to https://phabricator.wikimedia.org/P46582 and previous config saved to /var/cache/conftool/dbconfig/20230412-162448-ladsgroup.json
[16:25:42] (PS2) Dzahn: site: add role(gerrit::migration) to gerrit1003 and fix code [puppet] - https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368)
[16:27:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T333332)', diff saved to https://phabricator.wikimedia.org/P46583 and previous config saved to /var/cache/conftool/dbconfig/20230412-162700-ladsgroup.json
[16:27:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:28:07] (PS3) Dzahn: site: add role(gerrit::migration) to gerrit1003 and fix code [puppet] - https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368)
[16:29:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P46584 and previous config saved to /var/cache/conftool/dbconfig/20230412-162915-ladsgroup.json
[16:30:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:30:40] (PS1) Andrew Bogott: Update partman for cloudvirtlocal100[1-3] [puppet] - https://gerrit.wikimedia.org/r/908283 (https://phabricator.wikimedia.org/T329863)
[16:33:11] (CR) Andrew Bogott: [C: +2] Update partman for cloudvirtlocal100[1-3] [puppet] - https://gerrit.wikimedia.org/r/908283 (https://phabricator.wikimedia.org/T329863) (owner: Andrew Bogott)
[16:36:07] SRE, Infrastructure-Foundations, Traffic: Set NEL `success_fraction: 1.0` on HTTP responses for measurement domains - https://phabricator.wikimedia.org/T334608 (CDanis)
[16:40:31] SRE-swift-storage, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (Ottomata) > Flink doc does suggest that their k8s HA implementation could wor...
[16:42:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P46585 and previous config saved to /var/cache/conftool/dbconfig/20230412-164206-ladsgroup.json
[16:43:11] (PS4) Dzahn: site: add role(gerrit::migration) to gerrit1003 and fix code [puppet] - https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368)
[16:44:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P46586 and previous config saved to /var/cache/conftool/dbconfig/20230412-164422-ladsgroup.json
[16:44:55] SRE, SRE-Access-Requests: Update SSH key for Mikhail Popov - https://phabricator.wikimedia.org/T334423 (mpopov) Thank you @BCornwall! I can confirm that I can SSH from the new laptop with the new key.
[16:47:01] (CR) Andrew Bogott: [C: +2] Add database config for disable_tool process [puppet] - https://gerrit.wikimedia.org/r/907983 (https://phabricator.wikimedia.org/T332514) (owner: Andrew Bogott)
[16:48:38] (PS1) Cathal Mooney: Fix incorrect next-hop for IPv6 default route on cloudsw2-c8-eqiad [homer/public] - https://gerrit.wikimedia.org/r/908285 (https://phabricator.wikimedia.org/T334281)
[16:49:58] (CR) Cathal Mooney: [C: +2] Fix incorrect next-hop for IPv6 default route on cloudsw2-c8-eqiad [homer/public] - https://gerrit.wikimedia.org/r/908285 (https://phabricator.wikimedia.org/T334281) (owner: Cathal Mooney)
[16:50:15] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[16:50:27] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet wit...
[16:51:30] !log Updating routing-options on Eqiad lsw1 switches to add empty rib inet6 stanza T334281
[16:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:34] T334281: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281
[16:51:52] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[16:52:01] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet...
[16:53:54] (Merged) jenkins-bot: Fix incorrect next-hop for IPv6 default route on cloudsw2-c8-eqiad [homer/public] - https://gerrit.wikimedia.org/r/908285 (https://phabricator.wikimedia.org/T334281) (owner: Cathal Mooney)
[16:54:10] !log Updating routing-options on drmrs asw switches to add empty rib inet6 stanza T334281
[16:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P46587 and previous config saved to /var/cache/conftool/dbconfig/20230412-165712-ladsgroup.json
[16:58:55] (PS5) Dzahn: site: add role(gerrit::migration) to gerrit1003 and fix code [puppet] - https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368)
[16:59:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T333332)', diff saved to https://phabricator.wikimedia.org/P46588 and previous config saved to /var/cache/conftool/dbconfig/20230412-165928-ladsgroup.json
[16:59:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance
[16:59:33] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[16:59:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance
[16:59:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T333332)', diff saved to https://phabricator.wikimedia.org/P46589 and previous config saved to /var/cache/conftool/dbconfig/20230412-165951-ladsgroup.json
[17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1700)
[17:02:00] (CR) Dzahn: "cool, thanks for this. let's link these to https://phabricator.wikimedia.org/T324659" [puppet] - https://gerrit.wikimedia.org/r/893483 (owner: Hashar)
[17:02:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T333332)', diff saved to https://phabricator.wikimedia.org/P46590 and previous config saved to /var/cache/conftool/dbconfig/20230412-170224-ladsgroup.json
[17:02:45] SRE, Continuous-Integration-Infrastructure, serviceops-collab, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (Dzahn) Hashar did "contint: manage dsh target from Puppet DB" -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/893483
[17:05:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 213.6k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[17:10:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 202.5k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[17:12:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T333332)', diff saved to https://phabricator.wikimedia.org/P46591 and previous config saved to /var/cache/conftool/dbconfig/20230412-171219-ladsgroup.json
[17:12:24] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[17:17:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P46592 and previous config saved to /var/cache/conftool/dbconfig/20230412-171730-ladsgroup.json
[17:24:49] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[17:25:07] (PS1) Jbond: hieradata: move overrides to role/site part of hiera [puppet] - https://gerrit.wikimedia.org/r/908308
[17:25:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:26:32] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40644/console" [puppet] - https://gerrit.wikimedia.org/r/908308 (owner: Jbond)
[17:28:18] (CR) Hashar: "That is amazing John thank you! I will jump on it tomorrow morning :)" [puppet] - https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: Jbond)
[17:28:58] SRE, Commons, Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (Umar) For more than a month I have not seen new versions of files. https://commons.wikimedia.org/wiki/File:Vake_District.svg
[17:30:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:32:00] (CR) Joal: [C: +1] "LGTM :)" [puppet] - https://gerrit.wikimedia.org/r/908272 (https://phabricator.wikimedia.org/T334224) (owner: Snwachukwu)
[17:32:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P46593 and previous config saved to /var/cache/conftool/dbconfig/20230412-173237-ladsgroup.json
[17:37:06] (CR) Krinkle: [C: +2] perf: PaintTiming metrics is now sent in the navtiming event. [alerts] - https://gerrit.wikimedia.org/r/908234 (https://phabricator.wikimedia.org/T328256) (owner: Phedenskog)
[17:38:10] (PS1) Ottomata: flink-operator - set default resource limits and requests [deployment-charts] - https://gerrit.wikimedia.org/r/908310 (https://phabricator.wikimedia.org/T333464)
[17:39:23] (Merged) jenkins-bot: perf: PaintTiming metrics is now sent in the navtiming event. [alerts] - https://gerrit.wikimedia.org/r/908234 (https://phabricator.wikimedia.org/T328256) (owner: Phedenskog)
[17:44:37] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye
[17:44:44] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye
[17:45:26] (CR) Ottomata: [C: +2] flink-operator - set default resource limits and requests [deployment-charts] - https://gerrit.wikimedia.org/r/908310 (https://phabricator.wikimedia.org/T333464) (owner: Ottomata)
[17:46:59] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:47:04] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[17:47:28] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:47:34] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:47:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T333332)', diff saved to https://phabricator.wikimedia.org/P46594 and previous config saved to /var/cache/conftool/dbconfig/20230412-174743-ladsgroup.json
[17:47:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance
[17:47:48] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[17:48:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance
[17:48:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T333332)', diff saved to https://phabricator.wikimedia.org/P46595 and previous config saved to /var/cache/conftool/dbconfig/20230412-174806-ladsgroup.json
[17:48:22] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:52:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T333332)', diff saved to https://phabricator.wikimedia.org/P46596 and previous config saved to /var/cache/conftool/dbconfig/20230412-175240-ladsgroup.json
[17:54:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:00:06] ^demon and hashar: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1800).
[18:00:06] ^demon and hashar: OwO what's this, a deployment window?? MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1800). nyaa~
[18:00:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:01:16] I'm running the train today.
[18:02:15] (PS1) TrainBranchBot: group1 wikis to 1.41.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/908314 (https://phabricator.wikimedia.org/T330210)
[18:02:17] (CR) TrainBranchBot: [C: +2] group1 wikis to 1.41.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/908314 (https://phabricator.wikimedia.org/T330210) (owner: TrainBranchBot)
[18:03:58] (Merged) jenkins-bot: group1 wikis to 1.41.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/908314 (https://phabricator.wikimedia.org/T330210) (owner: TrainBranchBot)
[18:07:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P46597 and previous config saved to /var/cache/conftool/dbconfig/20230412-180746-ladsgroup.json
[18:10:27] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.4 refs T330210
[18:10:31] T330210: 1.41.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T330210
[18:14:53] (PS1) Andrew Bogott: Fix partman for cloudvirtlocal100[1-3] [puppet] - https://gerrit.wikimedia.org/r/908316 (https://phabricator.wikimedia.org/T329863)
[18:15:34] (CR) Andrew Bogott: [C: +2] Fix partman for cloudvirtlocal100[1-3] [puppet] - https://gerrit.wikimedia.org/r/908316 (https://phabricator.wikimedia.org/T329863) (owner: Andrew Bogott)
[18:16:29] !log dancy@deploy2002 Synchronized php: group1 wikis to 1.41.0-wmf.4 refs T330210 (duration: 06m 02s)
[18:16:36] T330210: 1.41.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T330210
[18:18:30] (PS1) Andrew Bogott: profile::toolforge::disable_tool: fix a couple of param names [puppet] - https://gerrit.wikimedia.org/r/908317 (https://phabricator.wikimedia.org/T332514)
[18:20:31] (CR) Andrew Bogott: [C: +2] profile::toolforge::disable_tool: fix a couple of param names [puppet] - https://gerrit.wikimedia.org/r/908317 (https://phabricator.wikimedia.org/T332514) (owner: Andrew Bogott)
[18:22:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P46598 and previous config saved to /var/cache/conftool/dbconfig/20230412-182252-ladsgroup.json
[18:23:27] SRE, Infrastructure-Foundations, netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (jbond) >I would worry about how we deal with the security / key management aspects of it. Just to expand on this a bit the reason why there may be a need for an additional inte...
[18:25:45] (CR) Dzahn: [V: +1 C: +1] "https://puppet-compiler.wmflabs.org/output/908278/40645/" [puppet] - https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[18:37:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T333332)', diff saved to https://phabricator.wikimedia.org/P46599 and previous config saved to /var/cache/conftool/dbconfig/20230412-183758-ladsgroup.json
[18:38:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance
[18:38:05] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[18:38:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance
[18:38:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T333332)', diff saved to https://phabricator.wikimedia.org/P46600 and previous config saved to /var/cache/conftool/dbconfig/20230412-183822-ladsgroup.json
[18:39:07] (PS1) Andrew Bogott: profile::toolforge::disable_tool: include python3-pymysql [puppet] - https://gerrit.wikimedia.org/r/908319 (https://phabricator.wikimedia.org/T332514)
[18:39:44] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye
[18:41:02] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye
[18:41:09] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye
[18:41:21] (CR) Andrew Bogott: [C: +2] profile::toolforge::disable_tool: include python3-pymysql [puppet] - https://gerrit.wikimedia.org/r/908319 (https://phabricator.wikimedia.org/T332514) (owner: Andrew Bogott)
[18:42:34] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[18:42:40] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[18:42:46] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[18:42:52] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[18:42:57] (PS1) Jforrester: Ensure ApiHelp correctly types values in TOCData objects [core] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/908289 (https://phabricator.wikimedia.org/T334551)
[18:43:06] (PS1) Jforrester: Ensure ApiHelp correctly types values in TOCData objects [core] (wmf/1.41.0-wmf.3) - https://gerrit.wikimedia.org/r/908290 (https://phabricator.wikimedia.org/T334551)
[18:57:51] (CR) CI reject: [V: -1] Ensure ApiHelp correctly types values in TOCData objects [core] (wmf/1.41.0-wmf.3) - https://gerrit.wikimedia.org/r/908290 (https://phabricator.wikimedia.org/T334551) (owner: Jforrester)
[19:00:30] jouncebot: nowandnext
[19:00:30] For the next 0 hour(s) and 59 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1800)
[19:00:31] In 0 hour(s) and 59 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T2000)
[19:00:45] Train has already been advanced so you're welcome to do stuff.
[19:00:58] thanks :)
[19:01:18] (PS1) BCornwall: hiera: lvs2007: update iface names for bullseye [puppet] - https://gerrit.wikimedia.org/r/908322 (https://phabricator.wikimedia.org/T321309)
[19:04:46] (PS1) Cathal Mooney: Expose interface VRF association to templates if present in Netbox [software/homer/deploy] - https://gerrit.wikimedia.org/r/908325 (https://phabricator.wikimedia.org/T312635)
[19:05:29] (PS1) Zabe: composer.json: Explicitly pin psr/http-message to 1.0.1 [core] (wmf/1.41.0-wmf.3) - https://gerrit.wikimedia.org/r/908291 (https://phabricator.wikimedia.org/T333993)
[19:06:03] (PS2) Zabe: Ensure ApiHelp correctly types values in TOCData objects [core] (wmf/1.41.0-wmf.3) - https://gerrit.wikimedia.org/r/908290 (https://phabricator.wikimedia.org/T334551) (owner: Jforrester)
[19:07:45] (PS2) Jbond: environment: add environment.conf file and remove environments dir [puppet] - https://gerrit.wikimedia.org/r/907991
[19:07:47] (PS4) Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841)
[19:07:49] (PS29) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356
[19:07:51] (PS1) Jbond: core_modules: add core modules [puppet] - https://gerrit.wikimedia.org/r/908326
[19:08:41] (CR) CI reject: [V: -1] environment: add environment.conf file and remove environments dir [puppet] - https://gerrit.wikimedia.org/r/907991 (owner: Jbond)
[19:08:59] (CR) CI reject: [V: -1] core_modules: add core modules [puppet] - https://gerrit.wikimedia.org/r/908326 (owner: Jbond)
[19:09:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T333332)', diff saved to https://phabricator.wikimedia.org/P46601 and previous config saved to /var/cache/conftool/dbconfig/20230412-190904-ladsgroup.json
[19:09:10] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[19:10:24] (CR) Zabe: [C: +2] composer.json: Explicitly pin psr/http-message to 1.0.1 [core] (wmf/1.41.0-wmf.3) - https://gerrit.wikimedia.org/r/908291 (https://phabricator.wikimedia.org/T333993) (owner: Zabe)
[19:10:26] (CR) Zabe: [C: +2] Ensure ApiHelp correctly types values in TOCData objects [core] (wmf/1.41.0-wmf.3) - https://gerrit.wikimedia.org/r/908290 (https://phabricator.wikimedia.org/T334551) (owner: Jforrester)
[19:10:32] (CR) Zabe: [C: +2] Ensure ApiHelp correctly types values in TOCData objects [core] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/908289 (https://phabricator.wikimedia.org/T334551) (owner: Jforrester)
[19:12:51] (CR) CI reject: [V: -1] wmflib: updat ipresolv to work with puppet7 [puppet] - https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: Jbond)
[19:13:14] (PS1) Andrew Bogott: Add profile::toolforge::nfs_disable_tool [puppet] - https://gerrit.wikimedia.org/r/908327
[19:13:33] (PS4) Krinkle: Set "s3" as the default section name [mediawiki-config] - https://gerrit.wikimedia.org/r/893834 (owner: Aaron Schulz)
[19:15:36] (PS1) Jameel Kaisar: Set NEL 'success_fraction: 1.0' on HTTP responses for measurement domains [puppet] - https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608)
[19:16:02] (CR) CI reject: [V: -1] Set NEL 'success_fraction: 1.0' on HTTP responses for measurement domains [puppet] - https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608) (owner: Jameel Kaisar)
[19:16:37] (CR) Andrew Bogott: [C: +2] Add profile::toolforge::nfs_disable_tool [puppet] - https://gerrit.wikimedia.org/r/908327 (owner: Andrew Bogott)
[19:16:53] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on sessionstore1001.eqiad.wmnet with reason: Reproducing dissonant cluster state
[19:17:09] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on sessionstore1001.eqiad.wmnet with reason: Reproducing dissonant cluster state
[19:17:19] (PS1) Cwhite: opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732)
[19:17:50] (PS2) BCornwall: hiera: lvs2007: update iface names for bullseye [puppet] - https://gerrit.wikimedia.org/r/908322 (https://phabricator.wikimedia.org/T321309)
[19:17:52] (CR) CI reject: [V: -1] opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite)
[19:18:58] (CR) Krinkle: Set "s3" as the default section name (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/893834 (owner: Aaron Schulz)
[19:19:35] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[19:20:20] (PS2) Jameel Kaisar: Set NEL 'success_fraction: 1.0' on HTTP responses for measurement domains [puppet] - https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608)
[19:20:39] SRE, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (thcipriani)
[19:21:23] (PS1) Andrew Bogott: disable_tool: update nfs patchs for tool archiving [puppet] - https://gerrit.wikimedia.org/r/908329
[19:23:28] (PS1) Eevans: sessionstore: disable sessionstore1001 native transport [puppet] - https://gerrit.wikimedia.org/r/908330 (https://phabricator.wikimedia.org/T327954)
[19:23:54] SRE, ops-eqiad, serviceops-collab, GitLab (Infrastructure): Install additional SSDs on gitlab1004.wikimedia.org (B1) -
https://phabricator.wikimedia.org/T333997 (10Jclark-ctr) 05Open→03Resolved T330172 drives where installed and commented on this ticket. procurement ticket listed servers gitl... [19:24:05] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/908330 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [19:24:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P46602 and previous config saved to /var/cache/conftool/dbconfig/20230412-192411-ladsgroup.json [19:24:19] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:25:46] (03CR) 10Eevans: [C: 03+2] sessionstore: disable sessionstore1001 native transport [puppet] - 10https://gerrit.wikimedia.org/r/908330 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [19:26:37] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:28:01] (03PS1) 10Dzahn: vrts: do not use /srv/sqldata as mariadb datadir (cloud, devtools) [puppet] - 10https://gerrit.wikimedia.org/r/908331 [19:28:24] !log restart Cassandra —sessionstore1001— to disable native transport for testing — T327954 [19:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:28] T327954: session storage: dissonant cluster status after reboot (was: 'cannot achieve consistency level' errors) - https://phabricator.wikimedia.org/T327954 [19:28:31] (03Merged) 10jenkins-bot: composer.json: Explicitly pin psr/http-message to 1.0.1 [core] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/908291 (https://phabricator.wikimedia.org/T333993) (owner: 10Zabe) [19:28:37] (03Merged) 10jenkins-bot: Ensure ApiHelp correctly types values in TOCData objects [core] 
(wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/908290 (https://phabricator.wikimedia.org/T334551) (owner: 10Jforrester) [19:28:42] (03Merged) 10jenkins-bot: Ensure ApiHelp correctly types values in TOCData objects [core] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/908289 (https://phabricator.wikimedia.org/T334551) (owner: 10Jforrester) [19:29:06] (03PS2) 10Cwhite: opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) [19:29:19] (03PS2) 10Snwachukwu: Add referer_name field to druid pageviews hourly and daily tables turnilo [puppet] - 10https://gerrit.wikimedia.org/r/908272 (https://phabricator.wikimedia.org/T334224) [19:29:59] (03CR) 10CI reject: [V: 04-1] opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [19:30:24] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10ARM support: Adoption of aarch64 (aka arm64) in WMF production? (SRE Summit 2022 Session) - https://phabricator.wikimedia.org/T320811 (10Ladsgroup) This might be interesting, specially in choosing a manufacturer: https://www.hetzner.com/press-release/arm... 
[19:30:51] !log zabe@deploy2002 Started scap: Backport for [[gerrit:908291|composer.json: Explicitly pin psr/http-message to 1.0.1 (T333993)]], [[gerrit:908290|Ensure ApiHelp correctly types values in TOCData objects (T334551)]], [[gerrit:908289|Ensure ApiHelp correctly types values in TOCData objects (T334551)]] [19:30:57] T334551: action=help&toc=1: Caught exception of type TypeError - https://phabricator.wikimedia.org/T334551 [19:30:57] T333993: Explicitly pin psr/http-message to 1.0.1 in composer.json - https://phabricator.wikimedia.org/T333993 [19:31:43] (03PS2) 10Dzahn: vrts: do not use /srv/sqldata as mariadb datadir (cloud, devtools) [puppet] - 10https://gerrit.wikimedia.org/r/908331 (https://phabricator.wikimedia.org/T329571) [19:32:12] !log zabe@deploy2002 jforrester and zabe: Backport for [[gerrit:908291|composer.json: Explicitly pin psr/http-message to 1.0.1 (T333993)]], [[gerrit:908290|Ensure ApiHelp correctly types values in TOCData objects (T334551)]], [[gerrit:908289|Ensure ApiHelp correctly types values in TOCData objects (T334551)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [19:35:46] !log zabe@deploy2002 Sync cancelled.
[19:36:14] (03PS1) 10Zabe: Revert "Ensure ApiHelp correctly types values in TOCData objects" [core] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/908292 [19:36:19] (03PS1) 10Zabe: Revert "Ensure ApiHelp correctly types values in TOCData objects" [core] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/908293 [19:36:20] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye [19:36:21] (03CR) 10Zabe: [V: 03+2 C: 03+2] Revert "Ensure ApiHelp correctly types values in TOCData objects" [core] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/908292 (owner: 10Zabe) [19:36:26] (03CR) 10Zabe: [V: 03+2 C: 03+2] Revert "Ensure ApiHelp correctly types values in TOCData objects" [core] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/908293 (owner: 10Zabe) [19:37:02] !log sessionstore1001: systemctl stop cassandra-a.service && systemctl start cassandra-a.service — T327954 [19:37:04] !log zabe@deploy2002 Started scap: Backport for [[gerrit:908292|Revert "Ensure ApiHelp correctly types values in TOCData objects"]], [[gerrit:908293|Revert "Ensure ApiHelp correctly types values in TOCData objects"]] [19:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:06] T327954: session storage: dissonant cluster status after reboot (was: 'cannot achieve consistency level' errors) - https://phabricator.wikimedia.org/T327954 [19:37:49] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [19:37:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... 
[19:38:26] !log zabe@deploy2002 zabe: Backport for [[gerrit:908292|Revert "Ensure ApiHelp correctly types values in TOCData objects"]], [[gerrit:908293|Revert "Ensure ApiHelp correctly types values in TOCData objects"]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [19:39:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P46603 and previous config saved to /var/cache/conftool/dbconfig/20230412-193917-ladsgroup.json [19:39:33] (03PS1) 10Ottomata: flink-operator - set default resource limits and requests in operatorPod [deployment-charts] - 10https://gerrit.wikimedia.org/r/908334 (https://phabricator.wikimedia.org/T333464) [19:39:43] (03PS2) 10Ottomata: flink-operator - set default resource limits and requests in operatorPod [deployment-charts] - 10https://gerrit.wikimedia.org/r/908334 (https://phabricator.wikimedia.org/T333464) [19:39:55] (03CR) 10Ottomata: [V: 03+2 C: 03+2] flink-operator - set default resource limits and requests in operatorPod [deployment-charts] - 10https://gerrit.wikimedia.org/r/908334 (https://phabricator.wikimedia.org/T333464) (owner: 10Ottomata) [19:40:15] (03PS2) 10Andrew Bogott: disable_tool: update nfs patchs for tool archiving [puppet] - 10https://gerrit.wikimedia.org/r/908329 [19:40:29] !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [19:40:41] (03PS3) 10Andrew Bogott: disable_tool: update nfs paths for tool archiving [puppet] - 10https://gerrit.wikimedia.org/r/908329 [19:41:09] !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[19:41:54] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [19:42:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [19:42:49] (03CR) 10Andrew Bogott: [C: 03+2] disable_tool: update nfs paths for tool archiving [puppet] - 10https://gerrit.wikimedia.org/r/908329 (owner: 10Andrew Bogott) [19:43:44] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:908292|Revert "Ensure ApiHelp correctly types values in TOCData objects"]], [[gerrit:908293|Revert "Ensure ApiHelp correctly types values in TOCData objects"]] (duration: 06m 40s) [19:46:35] (03PS1) 10Arlolra: Remove unused parsoidSettings, nativeGalleryEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908337 [19:48:20] (03PS3) 10Cwhite: opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) [19:48:29] (03CR) 10Ssingh: "Looks good! Let's wait on merging this as I think we should also set a higher BGP med for lvs2007 so that it has a lower priority than lvs" [puppet] - 10https://gerrit.wikimedia.org/r/908322 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [19:48:52] (03CR) 10CI reject: [V: 04-1] opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [19:49:15] (03CR) 10Dzahn: "Arnold, this should fix the issue you described to me about the DB on vrts-1001 in devtools. But right now puppet is disabled. 
Please conf" [puppet] - 10https://gerrit.wikimedia.org/r/908331 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [19:50:38] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10bd808) [19:51:48] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye [19:51:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye [19:54:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T333332)', diff saved to https://phabricator.wikimedia.org/P46604 and previous config saved to /var/cache/conftool/dbconfig/20230412-195423-ladsgroup.json [19:54:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance [19:54:28] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [19:54:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance [19:54:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [19:54:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [19:54:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T333332)', diff saved to https://phabricator.wikimedia.org/P46605 and previous config saved to /var/cache/conftool/dbconfig/20230412-195453-ladsgroup.json [19:57:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: 
The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:58:20] (03PS4) 10Cwhite: opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) [19:58:52] (03CR) 10CI reject: [V: 04-1] opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [19:59:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T333332)', diff saved to https://phabricator.wikimedia.org/P46606 and previous config saved to /var/cache/conftool/dbconfig/20230412-195926-ladsgroup.json [19:59:31] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T2000). [20:00:05] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:59] here [20:02:29] I can deploy [20:03:00] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (10KStoller-WMF) @KMorgan-WMF is an engineer on the Growth team, and this has my approval as the Product Manager of Growth. But if this need the approval of... [20:03:17] Jdlrobson: is it okay to merge, test and sync your two patches together? 
[20:03:35] (03CR) 10Zabe: [C: 03+2] Drop unused VectorPageTools feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907511 (https://phabricator.wikimedia.org/T332090) (owner: 10Jdlrobson) [20:03:37] yep [20:03:46] (03PS3) 10Zabe: Set Vector 2022 as default skin on Welsh Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907539 (https://phabricator.wikimedia.org/T334279) (owner: 10Jdlrobson) [20:03:50] (03CR) 10Zabe: [C: 03+2] Set Vector 2022 as default skin on Welsh Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907539 (https://phabricator.wikimedia.org/T334279) (owner: 10Jdlrobson) [20:04:36] (03Merged) 10jenkins-bot: Drop unused VectorPageTools feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907511 (https://phabricator.wikimedia.org/T332090) (owner: 10Jdlrobson) [20:04:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907539 (https://phabricator.wikimedia.org/T334279) (owner: 10Jdlrobson) [20:04:42] (03PS5) 10Cwhite: opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) [20:04:44] (03Merged) 10jenkins-bot: Set Vector 2022 as default skin on Welsh Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907539 (https://phabricator.wikimedia.org/T334279) (owner: 10Jdlrobson) [20:05:07] !log zabe@deploy2002 Started scap: Backport for [[gerrit:907511|Drop unused VectorPageTools feature flag (T332090)]], [[gerrit:907539|Set Vector 2022 as default skin on Welsh Wikipedia (T334279)]] [20:05:13] T332090: Post page tools cleanup: Remove page tools disabled code - https://phabricator.wikimedia.org/T332090 [20:05:13] T334279: Deploy Vector 2022 on Welsh Wikipedia - https://phabricator.wikimedia.org/T334279 [20:05:15] (03CR) 10CI reject: [V: 04-1] opensearch_dashboards: add package provider [puppet] - 
10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [20:06:26] !log zabe@deploy2002 zabe and jdlrobson: Backport for [[gerrit:907511|Drop unused VectorPageTools feature flag (T332090)]], [[gerrit:907539|Set Vector 2022 as default skin on Welsh Wikipedia (T334279)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [20:06:40] (03PS3) 10BCornwall: hiera: lvs2007: update iface names for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/908322 (https://phabricator.wikimedia.org/T321309) [20:06:45] (03PS1) 10Andrew Bogott: profile::toolforge::grid::exec_environ: use ensure_packages on pymysql [puppet] - 10https://gerrit.wikimedia.org/r/908345 [20:07:15] (03PS1) 10Cathal Mooney: Automate DHCP forwarding on Juniper L3 Swithces [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) [20:08:31] Jdlrobson: please test :) [20:09:09] (03CR) 10Andrew Bogott: [C: 03+2] profile::toolforge::grid::exec_environ: use ensure_packages on pymysql [puppet] - 10https://gerrit.wikimedia.org/r/908345 (owner: 10Andrew Bogott) [20:09:13] zabe: on it.. 
[20:09:42] zabe: LGTM [20:09:45] please sync [20:10:49] (03CR) 10Atieno: "b" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/906575 (https://phabricator.wikimedia.org/T334205) (owner: 10Atieno) [20:11:03] (03PS3) 10Jameel Kaisar: Set NEL 'success_fraction: 1.0' on HTTP responses for measurement domains [puppet] - 10https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608) [20:12:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:13:06] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40646/console" [puppet] - 10https://gerrit.wikimedia.org/r/908322 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [20:14:23] (03PS6) 10Cwhite: opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) [20:14:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P46608 and previous config saved to /var/cache/conftool/dbconfig/20230412-201432-ladsgroup.json [20:14:55] (03CR) 10CI reject: [V: 04-1] opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [20:15:09] (03CR) 10Ssingh: [C: 03+1] hiera: lvs2007: update iface names for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/908322 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [20:15:27] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:907511|Drop unused VectorPageTools feature flag (T332090)]], [[gerrit:907539|Set Vector 2022 as default skin on Welsh Wikipedia (T334279)]] (duration: 10m 19s) [20:15:28] Jdlrobson: should be live [20:15:32] T332090: Post 
page tools cleanup: Remove page tools disabled code - https://phabricator.wikimedia.org/T332090 [20:15:32] T334279: Deploy Vector 2022 on Welsh Wikipedia - https://phabricator.wikimedia.org/T334279 [20:15:37] thanks Zabe! [20:15:45] yw [20:15:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:16:38] (03CR) 10Dzahn: [C: 04-1] "Arnold said things work when datadir is set to /srv/sqldata and after service was restarted." [puppet] - 10https://gerrit.wikimedia.org/r/908331 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [20:20:11] (03PS7) 10Cwhite: opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) [20:20:43] (03CR) 10CI reject: [V: 04-1] opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [20:27:41] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/908278/40645/" [puppet] - 10https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [20:29:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P46609 and previous config saved to /var/cache/conftool/dbconfig/20230412-202939-ladsgroup.json [20:35:17] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop confirmed on gerrit2002, gerrit1002 prod servers" [puppet] - 10https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [20:36:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:38:12] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage 
(exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [20:38:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [20:44:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T333332)', diff saved to https://phabricator.wikimedia.org/P46610 and previous config saved to /var/cache/conftool/dbconfig/20230412-204445-ladsgroup.json [20:44:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance [20:44:51] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [20:45:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance [20:45:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46611 and previous config saved to /var/cache/conftool/dbconfig/20230412-204508-ladsgroup.json [20:45:23] 10SRE, 10serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10Dzahn) with the merge above there is now a "gerrit2" user and group on gerrit1003, rsyncd is running and ready to be pushed to from gerrit1001.. 
and releng users got shell access [20:46:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:43] (03PS8) 10Cwhite: opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) [20:47:04] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye [20:47:17] (03CR) 10CI reject: [V: 04-1] opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [20:47:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46612 and previous config saved to /var/cache/conftool/dbconfig/20230412-204742-ladsgroup.json [20:49:23] (03CR) 10Krinkle: arclamp: serve SVGs, compressed logs from Swift (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [20:50:17] (03PS9) 10Cwhite: opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) [20:50:51] (03CR) 10CI reject: [V: 04-1] opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [20:57:08] 10SRE, 10Commons, 10Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (10Lionel_Scheepmans) It seems that here in Phabricator, no new is bad new. 
[20:58:16] (03CR) 10BCornwall: [V: 03+1 C: 03+2] hiera: lvs2007: update iface names for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/908322 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [20:58:53] !log Disable Puppet/PyBal on lvs2007 in preparation for reimaging - T321309 [20:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:58] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [20:59:37] (03CR) 10EoghanGaffney: [C: 03+2] Add keys for sshd-gitlab from the secrets repo (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [21:01:28] PROBLEM - pybal on lvs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [21:01:38] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:01:42] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2007.codfw.wmnet with OS bullseye [21:01:53] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs2007.codfw.wmnet with OS bullseye [21:01:55] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host lvs2007.codfw.wmnet with OS bullseye [21:02:05] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs2007.codfw.wmnet with OS bullseye executed with errors: - lvs2007 (**FAIL**) - **The reimage... 
[21:02:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P46613 and previous config saved to /var/cache/conftool/dbconfig/20230412-210249-ladsgroup.json [21:04:17] !log gerrit1001 - pushing data over to gerrit1003 via rsync, with bwlimit option: rsync -avp --bwlimit=1m /srv/gerrit/ rsync://gerrit1003.wikimedia.org/gerrit-data/ (T326368) [21:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:22] T326368: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 [21:05:20] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:16:10] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2007.codfw.wmnet with OS bullseye [21:16:17] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs2007.codfw.wmnet with OS bullseye [21:17:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P46614 and previous config saved to /var/cache/conftool/dbconfig/20230412-211755-ladsgroup.json [21:24:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:28:36] (03PS1) 10Eevans: Revert "sessionstore: disable sessionstore1001 native transport" [puppet] - 10https://gerrit.wikimedia.org/r/908294 [21:29:17] (03CR) 10Eevans: [C: 03+2] Revert "sessionstore: disable sessionstore1001 native transport" [puppet] - 10https://gerrit.wikimedia.org/r/908294 (owner: 10Eevans) [21:33:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db2168:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46615 and previous config saved to /var/cache/conftool/dbconfig/20230412-213301-ladsgroup.json [21:33:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance [21:33:07] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [21:33:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance [21:33:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46616 and previous config saved to /var/cache/conftool/dbconfig/20230412-213325-ladsgroup.json [21:35:36] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2007.codfw.wmnet with reason: host reimage [21:35:50] !log restarting Cassandra —sessionstore1001— to reenable native transport — T327954 [21:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:54] T327954: session storage: dissonant cluster status after reboot (was: 'cannot achieve consistency level' errors) - https://phabricator.wikimedia.org/T327954 [21:35:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46617 and previous config saved to /var/cache/conftool/dbconfig/20230412-213558-ladsgroup.json [21:38:54] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2007.codfw.wmnet with reason: host reimage [21:45:25] (CR) Dzahn: ci: split contint hosts to different roles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: Hashar) [21:47:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:38] (CR) Dzahn: "Why would this be a special case that warrants finding new patterns? Including one profile in 2 roles is standard." [puppet] - https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: Hashar) [21:48:37] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:51:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P46618 and previous config saved to /var/cache/conftool/dbconfig/20230412-215104-ladsgroup.json [21:52:49] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for sessionstore1001.eqiad.wmnet [21:52:49] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for sessionstore1001.eqiad.wmnet [21:54:34] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:56:05] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2007.codfw.wmnet with OS bullseye [21:56:10] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs2007.codfw.wmnet with OS bullseye completed: - lvs2007 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled...
[22:06:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P46619 and previous config saved to /var/cache/conftool/dbconfig/20230412-220611-ladsgroup.json [22:09:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:15:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:21:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46620 and previous config saved to /var/cache/conftool/dbconfig/20230412-222117-ladsgroup.json [22:21:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance [22:21:23] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [22:21:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance [22:21:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T333332)', diff saved to https://phabricator.wikimedia.org/P46621 and previous config saved to /var/cache/conftool/dbconfig/20230412-222141-ladsgroup.json [22:24:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T333332)', diff saved to https://phabricator.wikimedia.org/P46622 and previous config saved to /var/cache/conftool/dbconfig/20230412-222414-ladsgroup.json [22:29:39] (PS1) Urbanecm: [Growth] Prepare for a Personalized praise config variable change [mediawiki-config] - https://gerrit.wikimedia.org/r/908365 (https://phabricator.wikimedia.org/T334630) [22:31:47] (CR)
AOkoth: [C: +2] exim: fix hard-coded vrts hostname [puppet] - https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: AOkoth) [22:38:56] (PS1) Urbanecm: [Growth] Finish Personalized praise variable rename [mediawiki-config] - https://gerrit.wikimedia.org/r/908367 (https://phabricator.wikimedia.org/T334630) [22:39:19] (PS2) Urbanecm: [Growth] Finish Personalized praise variable rename [mediawiki-config] - https://gerrit.wikimedia.org/r/908367 (https://phabricator.wikimedia.org/T334630) [22:39:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P46623 and previous config saved to /var/cache/conftool/dbconfig/20230412-223921-ladsgroup.json [22:39:49] (CR) Urbanecm: [C: -2] "Not yet." [mediawiki-config] - https://gerrit.wikimedia.org/r/908367 (https://phabricator.wikimedia.org/T334630) (owner: Urbanecm) [22:54:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P46624 and previous config saved to /var/cache/conftool/dbconfig/20230412-225427-ladsgroup.json [22:56:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:59:32] (PS1) Raymond Ndibe: tools-webservice: set default for buildservice-image [software/tools-webservice] - https://gerrit.wikimedia.org/r/908369 (https://phabricator.wikimedia.org/T334586) [23:01:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:09:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T333332)', diff saved to https://phabricator.wikimedia.org/P46625 and previous config saved to
/var/cache/conftool/dbconfig/20230412-230933-ladsgroup.json [23:09:39] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [23:30:24] (CR) Andrea Denisse: [V: +1 C: +2] prometheus: Apply prometheus::pop role to prometheus3002 [puppet] - https://gerrit.wikimedia.org/r/905705 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [23:55:35] (JobUnavailable) firing: (4) Reduced availability for job pint in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:56:22] (JobUnavailable) firing: Reduced availability for job blackbox/pingthing in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable