[00:10:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P46368 and previous config saved to /var/cache/conftool/dbconfig/20230412-001051-ladsgroup.json
[00:16:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:23:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T333332)', diff saved to https://phabricator.wikimedia.org/P46369 and previous config saved to /var/cache/conftool/dbconfig/20230412-002312-ladsgroup.json
[00:23:19] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[00:25:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T333332)', diff saved to https://phabricator.wikimedia.org/P46370 and previous config saved to /var/cache/conftool/dbconfig/20230412-002557-ladsgroup.json
[00:25:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[00:26:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[00:26:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T333332)', diff saved to https://phabricator.wikimedia.org/P46371 and previous config saved to /var/cache/conftool/dbconfig/20230412-002620-ladsgroup.json
[00:26:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:28:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T333332)', diff saved to https://phabricator.wikimedia.org/P46372 and previous config saved to /var/cache/conftool/dbconfig/20230412-002829-ladsgroup.json
[00:28:34] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[00:38:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P46373 and previous config saved to /var/cache/conftool/dbconfig/20230412-003819-ladsgroup.json
[00:39:24] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/907831
[00:39:26] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/907831 (owner: TrainBranchBot)
[00:40:16] <wikibugs>	 (PS4) Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed [puppet] - https://gerrit.wikimedia.org/r/905243
[00:42:49] <wikibugs>	 (CR) CI reject: [V: -1] maintain-dbusers: ensure get_global_wiki_user is only called when needed [puppet] - https://gerrit.wikimedia.org/r/905243 (owner: Raymond Ndibe)
[00:43:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P46374 and previous config saved to /var/cache/conftool/dbconfig/20230412-004335-ladsgroup.json
[00:46:59] <icinga-wm_>	 RECOVERY - PHP opcache health on mw2351 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:49:11] <wikibugs>	 (CR) Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (3 comments) [puppet] - https://gerrit.wikimedia.org/r/905243 (owner: Raymond Ndibe)
[00:53:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P46375 and previous config saved to /var/cache/conftool/dbconfig/20230412-005325-ladsgroup.json
[00:57:13] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/907831 (owner: TrainBranchBot)
[00:57:19] <wikibugs>	 (CR) Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/905243 (owner: Raymond Ndibe)
[00:58:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P46376 and previous config saved to /var/cache/conftool/dbconfig/20230412-005841-ladsgroup.json
[01:00:21] <wikibugs>	 (PS5) Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed [puppet] - https://gerrit.wikimedia.org/r/905243
[01:02:32] <wikibugs>	 (CR) CI reject: [V: -1] maintain-dbusers: ensure get_global_wiki_user is only called when needed [puppet] - https://gerrit.wikimedia.org/r/905243 (owner: Raymond Ndibe)
[01:07:10] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (Krinkle)
[01:08:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T333332)', diff saved to https://phabricator.wikimedia.org/P46377 and previous config saved to /var/cache/conftool/dbconfig/20230412-010832-ladsgroup.json
[01:08:37] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[01:13:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T333332)', diff saved to https://phabricator.wikimedia.org/P46378 and previous config saved to /var/cache/conftool/dbconfig/20230412-011348-ladsgroup.json
[01:13:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[01:13:53] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[01:14:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[01:14:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T333332)', diff saved to https://phabricator.wikimedia.org/P46379 and previous config saved to /var/cache/conftool/dbconfig/20230412-011411-ladsgroup.json
[01:16:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T333332)', diff saved to https://phabricator.wikimedia.org/P46380 and previous config saved to /var/cache/conftool/dbconfig/20230412-011619-ladsgroup.json
[01:16:29] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P46381 and previous config saved to /var/cache/conftool/dbconfig/20230412-013126-ladsgroup.json
[01:37:19] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:42:00] <wikibugs>	 SRE, MediaWiki-extensions-OAuth, The-Wikipedia-Library, Datacenter-Switchover, and 2 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (jsn.sherman) I wanted to followup on the library side;...
[01:45:25] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:46:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P46382 and previous config saved to /var/cache/conftool/dbconfig/20230412-014632-ladsgroup.json
[01:46:33] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:58:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:01:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T333332)', diff saved to https://phabricator.wikimedia.org/P46383 and previous config saved to /var/cache/conftool/dbconfig/20230412-020138-ladsgroup.json
[02:01:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[02:01:44] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[02:01:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[02:02:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T333332)', diff saved to https://phabricator.wikimedia.org/P46384 and previous config saved to /var/cache/conftool/dbconfig/20230412-020201-ladsgroup.json
[02:03:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:04:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T333332)', diff saved to https://phabricator.wikimedia.org/P46385 and previous config saved to /var/cache/conftool/dbconfig/20230412-020410-ladsgroup.json
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P46386 and previous config saved to /var/cache/conftool/dbconfig/20230412-021916-ladsgroup.json
[02:26:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P46387 and previous config saved to /var/cache/conftool/dbconfig/20230412-023422-ladsgroup.json
[02:49:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T333332)', diff saved to https://phabricator.wikimedia.org/P46388 and previous config saved to /var/cache/conftool/dbconfig/20230412-024929-ladsgroup.json
[02:49:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[02:49:34] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[02:49:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[02:49:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T333332)', diff saved to https://phabricator.wikimedia.org/P46389 and previous config saved to /var/cache/conftool/dbconfig/20230412-024952-ladsgroup.json
[02:52:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T333332)', diff saved to https://phabricator.wikimedia.org/P46390 and previous config saved to /var/cache/conftool/dbconfig/20230412-025200-ladsgroup.json
[03:06:05] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P46391 and previous config saved to /var/cache/conftool/dbconfig/20230412-030707-ladsgroup.json
[03:10:45] <icinga-wm_>	 RECOVERY - PHP opcache health on mw2353 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:15:39] <icinga-wm_>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:15:47] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:18:51] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:22:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P46392 and previous config saved to /var/cache/conftool/dbconfig/20230412-032213-ladsgroup.json
[03:37:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T333332)', diff saved to https://phabricator.wikimedia.org/P46393 and previous config saved to /var/cache/conftool/dbconfig/20230412-033719-ladsgroup.json
[03:37:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[03:37:25] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[03:37:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[03:37:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46394 and previous config saved to /var/cache/conftool/dbconfig/20230412-033742-ladsgroup.json
[03:39:00] <wikibugs>	 (CR) Kevin Bazira: [C: +1] httpbb: remove tests from liftwing production [puppet] - https://gerrit.wikimedia.org/r/907809 (owner: Elukey)
[03:39:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46395 and previous config saved to /var/cache/conftool/dbconfig/20230412-033951-ladsgroup.json
[03:54:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P46396 and previous config saved to /var/cache/conftool/dbconfig/20230412-035457-ladsgroup.json
[04:05:53] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:10:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P46397 and previous config saved to /var/cache/conftool/dbconfig/20230412-041003-ladsgroup.json
[04:13:43] <icinga-wm_>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:35] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:22:31] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:25:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46398 and previous config saved to /var/cache/conftool/dbconfig/20230412-042510-ladsgroup.json
[04:25:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[04:25:17] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[04:25:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[04:25:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[04:25:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[04:25:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46399 and previous config saved to /var/cache/conftool/dbconfig/20230412-042552-ladsgroup.json
[04:28:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46400 and previous config saved to /var/cache/conftool/dbconfig/20230412-042800-ladsgroup.json
[04:36:41] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:43:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P46401 and previous config saved to /var/cache/conftool/dbconfig/20230412-044306-ladsgroup.json
[04:46:23] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:58:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P46402 and previous config saved to /var/cache/conftool/dbconfig/20230412-045813-ladsgroup.json
[05:11:29] <wikibugs>	 (PS1) Krinkle: objectcache: Disable cool-off bounce feature [core] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/908027
[05:11:35] <wikibugs>	 (PS3) Krinkle: objectcache: Disable cool-off bounce feature [core] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/902376
[05:11:39] <wikibugs>	 (Abandoned) Krinkle: objectcache: Disable cool-off bounce feature [core] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/902376 (owner: Krinkle)
[05:12:15] <wikibugs>	 (CR) Krinkle: [C: +2] objectcache: Disable cool-off bounce feature [core] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/908027 (owner: Krinkle)
[05:13:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46403 and previous config saved to /var/cache/conftool/dbconfig/20230412-051319-ladsgroup.json
[05:13:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[05:13:24] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[05:13:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[05:13:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T333332)', diff saved to https://phabricator.wikimedia.org/P46404 and previous config saved to /var/cache/conftool/dbconfig/20230412-051342-ladsgroup.json
[05:15:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T333332)', diff saved to https://phabricator.wikimedia.org/P46405 and previous config saved to /var/cache/conftool/dbconfig/20230412-051550-ladsgroup.json
[05:21:15] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:23:05] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1222 to dbctl [puppet] - https://gerrit.wikimedia.org/r/908016 (https://phabricator.wikimedia.org/T326669)
[05:23:43] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1222 to dbctl [puppet] - https://gerrit.wikimedia.org/r/908016 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[05:25:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1222 to dbctl T326669', diff saved to https://phabricator.wikimedia.org/P46406 and previous config saved to /var/cache/conftool/dbconfig/20230412-052504-marostegui.json
[05:25:10] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[05:26:27] <wikibugs>	 (PS1) Marostegui: db1222: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/908017 (https://phabricator.wikimedia.org/T326669)
[05:27:02] <wikibugs>	 (CR) Marostegui: [C: +2] db1222: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/908017 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[05:27:06] <wikibugs>	 (Merged) jenkins-bot: objectcache: Disable cool-off bounce feature [core] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/908027 (owner: Krinkle)
[05:27:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 1%: Pooling', diff saved to https://phabricator.wikimedia.org/P46407 and previous config saved to /var/cache/conftool/dbconfig/20230412-052733-root.json
[05:29:05] * Krinkle testing on mwdebug2001
[05:30:45] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:30:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P46408 and previous config saved to /var/cache/conftool/dbconfig/20230412-053057-ladsgroup.json
[05:34:29] <wikibugs>	 (PS1) Marostegui: db1218: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/908114 (https://phabricator.wikimedia.org/T326669)
[05:35:17] <wikibugs>	 (CR) Marostegui: [C: +2] db1218: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/908114 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[05:37:32] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1218 to dbctl [puppet] - https://gerrit.wikimedia.org/r/908148 (https://phabricator.wikimedia.org/T326669)
[05:38:57] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1218 to dbctl [puppet] - https://gerrit.wikimedia.org/r/908148 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[05:40:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1218 to dbctl T326669', diff saved to https://phabricator.wikimedia.org/P46409 and previous config saved to /var/cache/conftool/dbconfig/20230412-054024-marostegui.json
[05:40:30] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[05:41:17] <logmsgbot>	 !log krinkle@deploy2002 Synchronized php-1.41.0-wmf.4/includes/libs/objectcache/: Ie3a2215d33: disable WANCache cool-off feature (duration: 06m 00s)
[05:41:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 1%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46410 and previous config saved to /var/cache/conftool/dbconfig/20230412-054120-root.json
[05:42:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 2%: Pooling', diff saved to https://phabricator.wikimedia.org/P46411 and previous config saved to /var/cache/conftool/dbconfig/20230412-054238-root.json
[05:42:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110 to clone db1210 T326669', diff saved to https://phabricator.wikimedia.org/P46412 and previous config saved to /var/cache/conftool/dbconfig/20230412-054258-marostegui.json
[05:46:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P46414 and previous config saved to /var/cache/conftool/dbconfig/20230412-054603-ladsgroup.json
[05:56:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 2%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46415 and previous config saved to /var/cache/conftool/dbconfig/20230412-055624-root.json
[05:56:30] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[05:57:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 3%: Pooling', diff saved to https://phabricator.wikimedia.org/P46416 and previous config saved to /var/cache/conftool/dbconfig/20230412-055743-root.json
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T0600)
[06:01:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T333332)', diff saved to https://phabricator.wikimedia.org/P46417 and previous config saved to /var/cache/conftool/dbconfig/20230412-060109-ladsgroup.json
[06:01:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[06:01:15] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[06:01:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[06:01:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T333332)', diff saved to https://phabricator.wikimedia.org/P46418 and previous config saved to /var/cache/conftool/dbconfig/20230412-060133-ladsgroup.json
[06:01:54] <wikibugs>	 (PS1) Marostegui: kormat/bashrc.wmf: Change alias location [puppet] - https://gerrit.wikimedia.org/r/908157 (https://phabricator.wikimedia.org/T334455)
[06:02:41] <wikibugs>	 (CR) Marostegui: [C: +2] kormat/bashrc.wmf: Change alias location [puppet] - https://gerrit.wikimedia.org/r/908157 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui)
[06:02:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T333332)', diff saved to https://phabricator.wikimedia.org/P46419 and previous config saved to /var/cache/conftool/dbconfig/20230412-060241-ladsgroup.json
[06:11:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 3%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46420 and previous config saved to /var/cache/conftool/dbconfig/20230412-061129-root.json
[06:11:35] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[06:12:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 4%: Pooling', diff saved to https://phabricator.wikimedia.org/P46421 and previous config saved to /var/cache/conftool/dbconfig/20230412-061248-root.json
[06:14:13] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db1210 [puppet] - https://gerrit.wikimedia.org/r/908158 (https://phabricator.wikimedia.org/T326669)
[06:15:09] <wikibugs>	 (PS13) Ilias Sarantopoulos: ml-services: FastAPI chart using sextant for ores-legacy service [deployment-charts] - https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414)
[06:16:50] <wikibugs>	 (CR) Ilias Sarantopoulos: ml-services: FastAPI chart using sextant for ores-legacy service (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[06:17:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P46422 and previous config saved to /var/cache/conftool/dbconfig/20230412-061747-ladsgroup.json
[06:21:31] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Productionize db1210 [puppet] - https://gerrit.wikimedia.org/r/908158 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[06:22:43] <wikibugs>	 SRE, Deployments, Traffic-Icebox, Regression, and 2 others: [Regression] PHP files in /static (and /w/static) on text domains should not execute - https://phabricator.wikimedia.org/T106732 (Krinkle) Thanks @BCornwall. This is indeed resolved. The paths did change a bit so in this case the (expect...
[06:22:55] <wikibugs>	 (CR) Krinkle: [C: +2] static.php: Restore short cache for temporary 'mismatch' response (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle)
[06:23:11] <wikibugs>	 SRE, Deployments, Traffic-Icebox, Regression, and 2 others: [Regression] PHP files in /static (and /w/static) on text domains should not execute - https://phabricator.wikimedia.org/T106732 (Krinkle)
[06:23:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46423 and previous config saved to /var/cache/conftool/dbconfig/20230412-062353-root.json
[06:26:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 4%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46424 and previous config saved to /var/cache/conftool/dbconfig/20230412-062634-root.json
[06:26:39] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[06:27:12] <wikibugs>	 (PS14) Ilias Sarantopoulos: ml-services: FastAPI chart using sextant for ores-legacy service [deployment-charts] - https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414)
[06:27:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 5%: Pooling', diff saved to https://phabricator.wikimedia.org/P46425 and previous config saved to /var/cache/conftool/dbconfig/20230412-062752-root.json
[06:28:16] <icinga-wm_>	 PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 59 probes of 774 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:32:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121 to clone db1221 T326669', diff saved to https://phabricator.wikimedia.org/P46426 and previous config saved to /var/cache/conftool/dbconfig/20230412-063224-marostegui.json
[06:32:30] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[06:32:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P46427 and previous config saved to /var/cache/conftool/dbconfig/20230412-063253-ladsgroup.json
[06:33:55] <marostegui>	 !log Stop mariadb on db1121 to clone db1221; this will generate lag on clouddb replicas for s4 T326669
[06:33:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:31] <wikibugs>	 (PS1) Marostegui: db1221: Place it in s4 [puppet] - https://gerrit.wikimedia.org/r/908160 (https://phabricator.wikimedia.org/T326669)
[06:35:54] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:36:13] <wikibugs>	 (CR) Marostegui: [C: +2] db1221: Place it in s4 [puppet] - https://gerrit.wikimedia.org/r/908160 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[06:38:04] <vgutierrez>	 !log restart haproxy on cp2035 - T334448
[06:38:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:08] <stashbot>	 T334448: HAProxy 2.6.12 segfaults on cp2033 - https://phabricator.wikimedia.org/T334448
[06:38:42] <icinga-wm_>	 RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 7 probes of 774 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:38:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46429 and previous config saved to /var/cache/conftool/dbconfig/20230412-063858-root.json
[06:41:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 5%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46430 and previous config saved to /var/cache/conftool/dbconfig/20230412-064139-root.json
[06:41:44] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[06:42:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 10%: Pooling', diff saved to https://phabricator.wikimedia.org/P46431 and previous config saved to /var/cache/conftool/dbconfig/20230412-064257-root.json
[06:45:14] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:48:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T333332)', diff saved to https://phabricator.wikimedia.org/P46432 and previous config saved to /var/cache/conftool/dbconfig/20230412-064800-ladsgroup.json
[06:48:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[06:48:05] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[06:48:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[06:48:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T333332)', diff saved to https://phabricator.wikimedia.org/P46433 and previous config saved to /var/cache/conftool/dbconfig/20230412-064823-ladsgroup.json
[06:50:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T333332)', diff saved to https://phabricator.wikimedia.org/P46434 and previous config saved to /var/cache/conftool/dbconfig/20230412-065032-ladsgroup.json
[06:51:10] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:54:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46435 and previous config saved to /var/cache/conftool/dbconfig/20230412-065402-root.json
[06:54:29] <wikibugs>	 (CR) Slyngshede: [C: +2] P:url_downloader send squid logs to Logstash [puppet] - https://gerrit.wikimedia.org/r/904783 (https://phabricator.wikimedia.org/T333676) (owner: Slyngshede)
[06:54:46] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: deployment of ores-legacy app in staging [deployment-charts] - https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414)
[06:56:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 10%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46436 and previous config saved to /var/cache/conftool/dbconfig/20230412-065644-root.json
[06:56:49] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[06:58:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 25%: Pooling', diff saved to https://phabricator.wikimedia.org/P46437 and previous config saved to /var/cache/conftool/dbconfig/20230412-065802-root.json
[06:59:15] <wikibugs>	 (CR) CI reject: [V: -1] ml-services: deployment of ores-legacy app in staging [deployment-charts] - https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:23] <taavi>	 o/ nothing to do it seems
[07:00:30] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:05:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P46438 and previous config saved to /var/cache/conftool/dbconfig/20230412-070538-ladsgroup.json
[07:06:32] <wikibugs>	 (CR) Muehlenhoff: C:httpd move htcacheclean to httpd class (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede)
[07:09:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46439 and previous config saved to /var/cache/conftool/dbconfig/20230412-070907-root.json
[07:11:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 25%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46440 and previous config saved to /var/cache/conftool/dbconfig/20230412-071149-root.json
[07:11:54] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:12:49] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm, lets test the new cookbook on the replicas 🎉" [cookbooks] - https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: EoghanGaffney)
[07:13:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 50%: Pooling', diff saved to https://phabricator.wikimedia.org/P46441 and previous config saved to /var/cache/conftool/dbconfig/20230412-071307-root.json
[07:16:34] <marostegui>	 !log Drop flaggedrevs tables from ptwikisource T332594
[07:16:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:39] <stashbot>	 T332594: Drop FlaggedRevs tables in database for ptwikisource - https://phabricator.wikimedia.org/T332594
[07:19:52] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops: Adjust routing policy to increase SSH session speed from East Asia to toolforge - https://phabricator.wikimedia.org/T334530 (ayounsi) Thanks for the report.  This is because we advertise our "customer" prefixes from all our POPs to improve the use...
[07:20:36] <wikibugs>	 (CR) Hashar: devtools: change gerrit hostname to use wmcloud, not wmflabs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888808 (https://phabricator.wikimedia.org/T329444) (owner: Dzahn)
[07:20:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P46442 and previous config saved to /var/cache/conftool/dbconfig/20230412-072044-ladsgroup.json
[07:20:56] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:21:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:21:16] <moritzm>	 !log installing xen security updates
[07:21:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:31] <wikibugs>	 (CR) Muehlenhoff: [C: +2] aqs: Remove use_nodejs10 [puppet] - https://gerrit.wikimedia.org/r/907718 (owner: Muehlenhoff)
[07:24:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46443 and previous config saved to /var/cache/conftool/dbconfig/20230412-072412-root.json
[07:24:15] <wikibugs>	 (PS1) Marostegui: change_ptrp_tags_update_T334536.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536)
[07:25:50] <wikibugs>	 (CR) Marostegui: "This still awaits for clarification on why it is only needed on s1: https://phabricator.wikimedia.org/T334536#8774369" [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui)
[07:26:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:26:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 50%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46444 and previous config saved to /var/cache/conftool/dbconfig/20230412-072654-root.json
[07:26:59] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:27:09] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm. I agree stopping puppet and rolling this change out one by one makes sense so we don't wipe the ssh keys worst case." [puppet] - https://gerrit.wikimedia.org/r/907878 (https://phabricator.wikimedia.org/T333840) (owner: EoghanGaffney)
[07:28:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 75%: Pooling', diff saved to https://phabricator.wikimedia.org/P46445 and previous config saved to /var/cache/conftool/dbconfig/20230412-072812-root.json
[07:28:19] <wikibugs>	 (PS1) Muehlenhoff: kartotherian: Stop passing use_nodejs10 [puppet] - https://gerrit.wikimedia.org/r/908196
[07:30:31] <logmsgbot>	 !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye
[07:30:36] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:31:59] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/908196 (owner: Muehlenhoff)
[07:32:26] <icinga-wm_>	 PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[07:33:49] <wikibugs>	 SRE-swift-storage, Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (MatthewVernon) Thanks. Interestingly, `codfw` and `eqiad` have different creation dates and sizes: ` root@ms-fe2009:/home/mvernon# swift list -l wikipedia-en-local-pub...
[07:34:54] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove db1107 from dbctl [puppet] - https://gerrit.wikimedia.org/r/908197 (https://phabricator.wikimedia.org/T334447)
[07:35:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T333332)', diff saved to https://phabricator.wikimedia.org/P46446 and previous config saved to /var/cache/conftool/dbconfig/20230412-073550-ladsgroup.json
[07:35:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[07:35:56] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[07:36:07] <moritzm>	 !log installing python-cryptography security updates
[07:36:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[07:36:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[07:36:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[07:36:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T333332)', diff saved to https://phabricator.wikimedia.org/P46447 and previous config saved to /var/cache/conftool/dbconfig/20230412-073633-ladsgroup.json
[07:38:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T333332)', diff saved to https://phabricator.wikimedia.org/P46448 and previous config saved to /var/cache/conftool/dbconfig/20230412-073841-ladsgroup.json
[07:38:52] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Remove db1107 from dbctl [puppet] - https://gerrit.wikimedia.org/r/908197 (https://phabricator.wikimedia.org/T334447) (owner: Marostegui)
[07:39:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46449 and previous config saved to /var/cache/conftool/dbconfig/20230412-073917-root.json
[07:39:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1107 from dbctl T334447', diff saved to https://phabricator.wikimedia.org/P46450 and previous config saved to /var/cache/conftool/dbconfig/20230412-073921-marostegui.json
[07:39:25] <stashbot>	 T334447: decommission db1107.eqiad.wmnet - https://phabricator.wikimedia.org/T334447
[07:41:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 75%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46451 and previous config saved to /var/cache/conftool/dbconfig/20230412-074158-root.json
[07:42:04] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:43:21] <logmsgbot>	 !log jelto@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage
[07:43:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1222 (re)pooling @ 100%: Pooling', diff saved to https://phabricator.wikimedia.org/P46452 and previous config saved to /var/cache/conftool/dbconfig/20230412-074317-root.json
[07:45:58] <logmsgbot>	 !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage
[07:50:39] <wikibugs>	 (PS16) Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102
[07:51:03] <wikibugs>	 (CR) CI reject: [V: -1] C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede)
[07:53:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P46453 and previous config saved to /var/cache/conftool/dbconfig/20230412-075348-ladsgroup.json
[07:54:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46454 and previous config saved to /var/cache/conftool/dbconfig/20230412-075422-root.json
[07:54:23] <wikibugs>	 (PS17) Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102
[07:55:59] <wikibugs>	 (CR) Marostegui: "Amir, even if it can be run with replication directly, please take a look so I can merge and add this to the repo" [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui)
[07:56:17] <wikibugs>	 (CR) CI reject: [V: -1] C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede)
[07:57:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1218 (re)pooling @ 100%: Pooling db1218 T326669', diff saved to https://phabricator.wikimedia.org/P46455 and previous config saved to /var/cache/conftool/dbconfig/20230412-075703-root.json
[07:57:08] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:57:25] <wikibugs>	 (CR) Ladsgroup: change_ptrp_tags_update_T334536.py: New schema change (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui)
[07:57:50] <wikibugs>	 (CR) Elukey: [C: +2] httpbb: remove tests from liftwing production [puppet] - https://gerrit.wikimedia.org/r/907809 (owner: Elukey)
[07:57:54] <wikibugs>	 (CR) Marostegui: change_ptrp_tags_update_T334536.py: New schema change (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui)
[07:59:03] <wikibugs>	 (CR) Ladsgroup: [C: +1] "small nitpick." [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui)
[07:59:39] <wikibugs>	 (PS2) Marostegui: change_ptrp_tags_update_T334536.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536)
[07:59:41] <wikibugs>	 (CR) Ladsgroup: change_ptrp_tags_update_T334536.py: New schema change (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui)
[07:59:43] <wikibugs>	 (CR) Marostegui: change_ptrp_tags_update_T334536.py: New schema change (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui)
[08:00:00] <wikibugs>	 (CR) Ladsgroup: [C: +1] change_ptrp_tags_update_T334536.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui)
[08:00:05] <jouncebot>	 ^demon and hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T0800).
[08:00:12] <wikibugs>	 (CR) Marostegui: [C: +2] change_ptrp_tags_update_T334536.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui)
[08:00:38] <wikibugs>	 (Merged) jenkins-bot: change_ptrp_tags_update_T334536.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/908195 (https://phabricator.wikimedia.org/T334536) (owner: Marostegui)
[08:01:26] <logmsgbot>	 !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye
[08:01:48] <wikibugs>	 SRE, Machine-Learning-Team, serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (elukey)
[08:02:47] <wikibugs>	 SRE, Machine-Learning-Team, serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (elukey) Rollout to ml-serve/aux/dse completed. To keep archives happy, I used ` istioctl-1.15.7 upgrade -f config.yaml`  Last step: rollout to wikikube clusters
[08:03:04] <wikibugs>	 SRE, Machine-Learning-Team, serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (elukey) a:elukey→JMeybohm
[08:03:31] <marostegui>	 !log dbmaint Deploy schema change on s3 codfw with replication enabled (only for testwiki and test2wiki) T334536
[08:03:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:36] <stashbot>	 T334536: Schema changes: Make ptrp_tags_updated NULLABLE  - https://phabricator.wikimedia.org/T334536
[08:03:51] <wikibugs>	 SRE-swift-storage, Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (jcrespo) eqiad backups: ` This is the list of 2 files found with the given criteria:   0) wiki                 | commonswiki title                | The_Collected_Works...
[08:06:10] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P46456 and previous config saved to /var/cache/conftool/dbconfig/20230412-080854-ladsgroup.json
[08:09:36] <wikibugs>	 SRE-swift-storage, Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (jcrespo) These look to me as leftovers- check the paths of production_container + production_path, that is where mw thinks they should be (only). They must have failed...
[08:10:05] <wikibugs>	 (CR) Hashar: [C: +2] [Beta Cluster] Replicate WebResponseSetCookie wgHooks migration here too [mediawiki-config] - https://gerrit.wikimedia.org/r/907933 (https://phabricator.wikimedia.org/T333926) (owner: Jforrester)
[08:10:57] <wikibugs>	 (Merged) jenkins-bot: [Beta Cluster] Replicate WebResponseSetCookie wgHooks migration here too [mediawiki-config] - https://gerrit.wikimedia.org/r/907933 (https://phabricator.wikimedia.org/T333926) (owner: Jforrester)
[08:14:12] <wikibugs>	 SRE-swift-storage, Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (MatthewVernon) Yes, that fits with the "this file has been deleted" page, so I think that object is good to clear up in both clusters. Thank you!  I'll be interested t...
[08:14:58] <wikibugs>	 (CR) Hashar: [C: +2] "Thanks! I have synced it in production." [mediawiki-config] - https://gerrit.wikimedia.org/r/907933 (https://phabricator.wikimedia.org/T333926) (owner: Jforrester)
[08:15:48] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:35] <wikibugs>	 SRE-swift-storage, Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (jcrespo) >>! In T327253#8774731, @MatthewVernon wrote: > I'll be interested to hear about the other objects when you've some time :)  The 22 at wikipedia-ja-local-publ...
[08:17:37] <logmsgbot>	 !log hashar@deploy2002 Synchronized wmf-config/CommonSettings-labs.php: [Beta Cluster] Replicate WebResponseSetCookie wgHooks migration here too - T333926 (duration: 05m 51s)
[08:17:40] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:17:41] <wikibugs>	 SRE, ops-eqiad, Data-Engineering, Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (elukey) @Jclark-ctr progress! I was able to reimage, but the two disks in the flex bay seem in `Firmware state: Unconfigured(good), Spun Up`, so the OS got installed on...
[08:17:41] <stashbot>	 T333926: PHP Deprecated: Accessing $wgHooks directly is deprecated, use HookContainer::getHandlers() or HookContainer::register() instead. [Called from {closure}] - https://phabricator.wikimedia.org/T333926
[08:17:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] prometheus: Apply prometheus::pop role to prometheus3002 [puppet] - https://gerrit.wikimedia.org/r/905705 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse)
[08:17:54] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] prometheus: Apply prometheus::pop role to prometheus4002 [puppet] - https://gerrit.wikimedia.org/r/907984 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse)
[08:17:58] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] prometheus: Apply prometheus::pop role to prometheus6002 [puppet] - https://gerrit.wikimedia.org/r/907987 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse)
[08:18:40] <wikibugs>	 (CR) Clément Goubert: [C: +2] P:httpbb: Add monitoring for kubernetes services [puppet] - https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) (owner: Clément Goubert)
[08:20:06] <wikibugs>	 SRE-swift-storage, Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (MatthewVernon) >>! In T327253#8774732, @jcrespo wrote: >>>! In T327253#8774731, @MatthewVernon wrote: >> I'll be interested to hear about the other objects when you've...
[08:20:55] <wikibugs>	 (CR) Elukey: [C: +2] "Great work Ilias!" [deployment-charts] - https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[08:22:16] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:22:30] <icinga-wm_>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:24:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T333332)', diff saved to https://phabricator.wikimedia.org/P46457 and previous config saved to /var/cache/conftool/dbconfig/20230412-082400-ladsgroup.json
[08:24:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[08:24:06] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[08:24:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[08:24:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1199 (T333332)', diff saved to https://phabricator.wikimedia.org/P46458 and previous config saved to /var/cache/conftool/dbconfig/20230412-082424-ladsgroup.json
[08:24:44] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review, User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (ItamarWMDE) Hello @MoritzMuehlenhoff and @BCornwall, apologies for the delay in the response. I am just back from holidays.  I am not so well ve...
[08:25:22] <icinga-wm_>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:18] <wikibugs>	 (CR) Clément Goubert: [C: +2] "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/908201 (owner: Clément Goubert)
[08:26:26] <wikibugs>	 (Merged) jenkins-bot: ml-services: FastAPI chart using sextant for ores-legacy service [deployment-charts] - https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[08:26:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T333332)', diff saved to https://phabricator.wikimedia.org/P46459 and previous config saved to /var/cache/conftool/dbconfig/20230412-082632-ladsgroup.json
[08:26:48] <claime>	 The httpbb CRITs are my bad, greedy replace messed up the path, pushing a fix
[08:27:23] <wikibugs>	 (CR) Clément Goubert: [V: +2 C: +2] P:httpbb: Fix wrong test directory [puppet] - https://gerrit.wikimedia.org/r/908201 (owner: Clément Goubert)
[08:29:07] <icinga-wm_>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:35] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:43] <icinga-wm_>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:02] <wikibugs>	 (03PS5) 10Clément Goubert: P:httpbb: Remove absented httpbb_kubernetes_hourly [puppet] - 10https://gerrit.wikimedia.org/r/907848 (https://phabricator.wikimedia.org/T334456)
[08:32:03] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:32:11] <icinga-wm_>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:33:09] <icinga-wm_>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:33:20] <aqu>	 !log About to deploy analytics/refinery in production
[08:33:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:15] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@f3389dc]: Deploy analytics_refinery in production [analytics/refinery@f3389dc]
[08:34:56] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@f3389dc]: Deploy analytics_refinery in production [analytics/refinery@f3389dc] (duration: 00m 41s)
[08:35:34] <moritzm>	 !log imported puppet 5.5.22-2+deb12u1 for bookworm-wikimedia component/puppet5 T330495
[08:35:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:38] <stashbot>	 T330495: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495
[08:35:38] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@f3389dc] (thin): Deploy analytics_refinery in production thin [analytics/refinery@f3389dc]
[08:35:46] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@f3389dc] (thin): Deploy analytics_refinery in production thin [analytics/refinery@f3389dc] (duration: 00m 07s)
[08:37:26] <marostegui>	 !log dbmaint Deploy schema change on s1 codfw with replication T334536
[08:37:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:30] <stashbot>	 T334536: Schema changes: Make ptrp_tags_updated NULLABLE  - https://phabricator.wikimedia.org/T334536
[08:38:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:40:48] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] P:httpbb: Remove absented httpbb_kubernetes_hourly [puppet] - 10https://gerrit.wikimedia.org/r/907848 (https://phabricator.wikimedia.org/T334456) (owner: 10Clément Goubert)
[08:41:36] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "I have proposed a series of changes to rely on a PuppetDB query instead of a manually maintained list starting at https://gerrit.wikimedia." [puppet] - 10https://gerrit.wikimedia.org/r/867670 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn)
[08:41:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P46460 and previous config saved to /var/cache/conftool/dbconfig/20230412-084138-ladsgroup.json
[08:42:49] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:42:51] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert)
[08:43:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:44:07] <icinga-wm_>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:46:23] <wikibugs>	 (03PS1) 10Muehlenhoff: Install Puppet 5.5 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495)
[08:46:34] <wikibugs>	 (03PS2) 10Muehlenhoff: Install Puppet 5.5 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495)
[08:48:10] <wikibugs>	 (03PS4) 10Hashar: doc: upgrade php from 7.3 to 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357)
[08:48:22] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[08:48:29] <wikibugs>	 (03CR) 10Hashar: "Rebased to clear a conflict with Id989c18b783d1bd58e3935a3d6418fa02b4f5652" [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar)
[08:49:43] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Productionize db1221 [puppet] - 10https://gerrit.wikimedia.org/r/908204 (https://phabricator.wikimedia.org/T326669)
[08:50:08] <wikibugs>	 (03PS8) 10EoghanGaffney: Add keys for sshd-gitlab from the secrets repo [puppet] - 10https://gerrit.wikimedia.org/r/907878
[08:50:54] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1221 [puppet] - 10https://gerrit.wikimedia.org/r/908204 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui)
[08:50:55] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@18ae3be]: Deploy airflow-dags including webrequest load job - Analytics [airflow-dags@18ae3be]
[08:51:07] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@18ae3be]: Deploy airflow-dags including webrequest load job - Analytics [airflow-dags@18ae3be] (duration: 00m 12s)
[08:51:29] <logmsgbot>	 !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye
[08:54:15] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:54:26] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Use a single socket for haproxy/varnish on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/908205 (https://phabricator.wikimedia.org/T333965)
[08:56:18] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40609/console" [puppet] - 10https://gerrit.wikimedia.org/r/908205 (https://phabricator.wikimedia.org/T333965) (owner: 10Vgutierrez)
[08:56:29] <icinga-wm_>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:56:40] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40610/console" [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney)
[08:56:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P46462 and previous config saved to /var/cache/conftool/dbconfig/20230412-085644-ladsgroup.json
[08:56:48] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Use a single socket for haproxy/varnish on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/908205 (https://phabricator.wikimedia.org/T333965) (owner: 10Vgutierrez)
[08:57:15] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: report alert lint problems [alerts] - 10https://gerrit.wikimedia.org/r/908206 (https://phabricator.wikimedia.org/T309182)
[08:57:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[08:58:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[08:58:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T333332)', diff saved to https://phabricator.wikimedia.org/P46463 and previous config saved to /var/cache/conftool/dbconfig/20230412-085816-ladsgroup.json
[08:58:21] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[08:59:26] <wikibugs>	 10SRE, 10Machine-Learning-Team, 10serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (10JMeybohm) 05Open→03Resolved Thanks! Wikikube is done as well
[08:59:36] <wikibugs>	 10SRE, 10Machine-Learning-Team, 10serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (10JMeybohm)
[08:59:37] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:59:46] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: cloud_private_subnet: add route to public IPv4 range [puppet] - 10https://gerrit.wikimedia.org/r/903622 (https://phabricator.wikimedia.org/T324992)
[09:00:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T333332)', diff saved to https://phabricator.wikimedia.org/P46464 and previous config saved to /var/cache/conftool/dbconfig/20230412-090032-ladsgroup.json
[09:04:16] <logmsgbot>	 !log jelto@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage
[09:05:03] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] linkrecommendation: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905941 (https://phabricator.wikimedia.org/T334060) (owner: 10Clément Goubert)
[09:05:17] <icinga-wm_>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:05:20] <wikibugs>	 10SRE-swift-storage, 10Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10jcrespo) >>! In T327253#8774738, @MatthewVernon wrote: > Yes, and if you have any further thoughts on 8/80/Anotheryear.jpg  So backups records are not (and do not inte...
[09:05:41] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:14] <claime>	 !log Migrating cxserver to mw-api-int on kubernetes - T334204
[09:06:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:19] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] cxserver: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903646 (https://phabricator.wikimedia.org/T334204) (owner: 10Clément Goubert)
[09:06:20] <stashbot>	 T334204: Migrate cxserver to mw-api-int - https://phabricator.wikimedia.org/T334204
[09:06:44] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: ml-services: deployment of ores-legacy app in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414)
[09:07:39] <logmsgbot>	 !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage
[09:08:44] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:09:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ml-services: deployment of ores-legacy app in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos)
[09:11:08] <wikibugs>	 (03Merged) 10jenkins-bot: cxserver: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903646 (https://phabricator.wikimedia.org/T334204) (owner: 10Clément Goubert)
[09:11:31] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[09:11:46] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[09:11:50] <icinga-wm_>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:11:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T333332)', diff saved to https://phabricator.wikimedia.org/P46466 and previous config saved to /var/cache/conftool/dbconfig/20230412-091151-ladsgroup.json
[09:11:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[09:11:55] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[09:12:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[09:12:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2099.codfw.wmnet with reason: Maintenance
[09:12:26] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[09:12:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2099.codfw.wmnet with reason: Maintenance
[09:12:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2106.codfw.wmnet with reason: Maintenance
[09:12:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2106.codfw.wmnet with reason: Maintenance
[09:12:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T333332)', diff saved to https://phabricator.wikimedia.org/P46467 and previous config saved to /var/cache/conftool/dbconfig/20230412-091255-ladsgroup.json
[09:13:53] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[09:15:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T333332)', diff saved to https://phabricator.wikimedia.org/P46468 and previous config saved to /var/cache/conftool/dbconfig/20230412-091507-ladsgroup.json
[09:15:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P46469 and previous config saved to /var/cache/conftool/dbconfig/20230412-091539-ladsgroup.json
[09:15:48] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:19:28] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] ci: split contint hosts to different roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar)
[09:19:30] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40611/console" [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney)
[09:20:58] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[09:21:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[09:21:18] <logmsgbot>	 !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye
[09:21:56] <wikibugs>	 (03PS3) 10Ilias Sarantopoulos: ml-services: deployment of ores-legacy app in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414)
[09:22:28] <wikibugs>	 (03PS1) 10Clément Goubert: Revert "cxserver: Switch to mw-api-int-async on k8s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908032
[09:23:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46470 and previous config saved to /var/cache/conftool/dbconfig/20230412-092327-root.json
[09:27:11] <wikibugs>	 (03PS1) 10Elukey: aptrepo: import AMD ROCm 5.4 to bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/908208 (https://phabricator.wikimedia.org/T295661)
[09:27:48] <wikibugs>	 (03PS2) 10Sergio Gimeno: GrowthExperiments: enable add link frontend in 7,8th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907899 (https://phabricator.wikimedia.org/T304551)
[09:28:24] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] Revert "cxserver: Switch to mw-api-int-async on k8s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908032 (owner: 10Clément Goubert)
[09:28:56] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:30:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P46472 and previous config saved to /var/cache/conftool/dbconfig/20230412-093013-ladsgroup.json
[09:30:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P46473 and previous config saved to /var/cache/conftool/dbconfig/20230412-093045-ladsgroup.json
[09:31:50] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "see comments inline. tl;dr: I think we should switch back to the previous implementation.  Any systems that need cache_disk should declare" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede)
[09:32:02] <icinga-wm_>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:33:37] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "cxserver: Switch to mw-api-int-async on k8s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908032 (owner: 10Clément Goubert)
[09:34:12] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/903622 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[09:34:29] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[09:34:39] <wikibugs>	 (03PS1) 10Elukey: role::dse_k8s::worker: add AMD GPU support [puppet] - 10https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009)
[09:34:50] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[09:34:57] <claime>	 !log Reverted migrating cxserver to mw-api-int on kubernetes - T334204
[09:35:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:01] <stashbot>	 T334204: Migrate cxserver to mw-api-int - https://phabricator.wikimedia.org/T334204
[09:35:45] <wikibugs>	 (03PS2) 10Elukey: role::dse_k8s::worker: add AMD GPU support [puppet] - 10https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009)
[09:37:42] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40612/console" [puppet] - 10https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey)
[09:38:22] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:38:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46474 and previous config saved to /var/cache/conftool/dbconfig/20230412-093831-root.json
[09:39:24] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] kartotherian: Stop passing use_nodejs10 [puppet] - 10https://gerrit.wikimedia.org/r/908196 (owner: 10Muehlenhoff)
[09:41:23] <wikibugs>	 (03PS1) 10Filippo Giunchedi: alerting-host: toggle auto-restart for ircecho/icinga-am on failover [puppet] - 10https://gerrit.wikimedia.org/r/908211
[09:42:19] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] sre: report alert lint problems [alerts] - 10https://gerrit.wikimedia.org/r/908206 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[09:42:54] <wikibugs>	 (03PS18) 10Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102
[09:43:12] <icinga-wm_>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:45:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P46475 and previous config saved to /var/cache/conftool/dbconfig/20230412-094520-ladsgroup.json
[09:45:22] <wikibugs>	 (03PS1) 10Jaime Nuche: scap: block Scap deployments on inactive deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/908212 (https://phabricator.wikimedia.org/T330756)
[09:45:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T333332)', diff saved to https://phabricator.wikimedia.org/P46476 and previous config saved to /var/cache/conftool/dbconfig/20230412-094551-ladsgroup.json
[09:45:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[09:45:56] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[09:46:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[09:46:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T333332)', diff saved to https://phabricator.wikimedia.org/P46477 and previous config saved to /var/cache/conftool/dbconfig/20230412-094615-ladsgroup.json
[09:46:54] <wikibugs>	 (03PS2) 10Jaime Nuche: scap: block Scap deployments on inactive deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/908212 (https://phabricator.wikimedia.org/T330756)
[09:47:40] <wikibugs>	 (03CR) 10Jbond: "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[09:48:02] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:48:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T333332)', diff saved to https://phabricator.wikimedia.org/P46478 and previous config saved to /var/cache/conftool/dbconfig/20230412-094829-ladsgroup.json
[09:48:36] <wikibugs>	 (03CR) 10Slyngshede: C:httpd move htcacheclean to httpd class (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede)
[09:50:25] <wikibugs>	 (03CR) 10David Caro: maintain-dbusers: ensure get_global_wiki_user is only called when needed (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe)
[09:50:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] aptrepo: import AMD ROCm 5.4 to bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/908208 (https://phabricator.wikimedia.org/T295661) (owner: 10Elukey)
[09:51:14] <icinga-wm_>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:52:39] <wikibugs>	 (03PS1) 10Hashar: utils: rm hiera_lookup (replaced by puppet lookup) [puppet] - 10https://gerrit.wikimedia.org/r/908214
[09:53:36] <wikibugs>	 (03CR) 10Hashar: "There is no other reference to `hiera_lookup` in the Puppet repo :]" [puppet] - 10https://gerrit.wikimedia.org/r/908214 (owner: 10Hashar)
[09:53:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46479 and previous config saved to /var/cache/conftool/dbconfig/20230412-095336-root.json
[09:54:15] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Rename cadvisor_exporter to cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/908215 (https://phabricator.wikimedia.org/T108027)
[09:56:04] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:56:11] <wikibugs>	 (03CR) 10Jaime Nuche: "Once this patch has been merged, we should apply it to both deployments servers and run some simple scap sync command from the active one " [puppet] - 10https://gerrit.wikimedia.org/r/908212 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche)
[09:57:40] <icinga-wm_>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:58:03] <wikibugs>	 10SRE-swift-storage, 10Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10jcrespo) Regarding jawiki, there are no latest or archived (public) files with those names: ` root@db1140:~$ cat images.txt | while read image; do echo "SELECT * FR...
[09:58:04] <wikibugs>	 (03CR) 10Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/908212/40613/" [puppet] - 10https://gerrit.wikimedia.org/r/908212 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1000)
[10:00:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T333332)', diff saved to https://phabricator.wikimedia.org/P46480 and previous config saved to /var/cache/conftool/dbconfig/20230412-100026-ladsgroup.json
[10:00:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2110.codfw.wmnet with reason: Maintenance
[10:00:31] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[10:00:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2110.codfw.wmnet with reason: Maintenance
[10:00:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T333332)', diff saved to https://phabricator.wikimedia.org/P46481 and previous config saved to /var/cache/conftool/dbconfig/20230412-100049-ladsgroup.json
[10:01:00] <wikibugs>	 (03CR) 10Jbond: C:httpd move htcacheclean to httpd class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede)
[10:01:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1123 to clone db1223 T326669', diff saved to https://phabricator.wikimedia.org/P46482 and previous config saved to /var/cache/conftool/dbconfig/20230412-100111-marostegui.json
[10:01:16] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[10:02:19] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] scap: block Scap deployments on inactive deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/908212 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche)
[10:02:30] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:03:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T333332)', diff saved to https://phabricator.wikimedia.org/P46484 and previous config saved to /var/cache/conftool/dbconfig/20230412-100301-ladsgroup.json
[10:03:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm, would have been nice to update this but i wasn't able.  adding a few other puppet people in case they see an easy win or object" [puppet] - 10https://gerrit.wikimedia.org/r/908214 (owner: 10Hashar)
[10:03:17] <wikibugs>	 (03PS1) 10Marostegui: db1223: Place it in s3 [puppet] - 10https://gerrit.wikimedia.org/r/908216 (https://phabricator.wikimedia.org/T326669)
[10:03:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P46485 and previous config saved to /var/cache/conftool/dbconfig/20230412-100335-ladsgroup.json
[10:04:18] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1223: Place it in s3 [puppet] - 10https://gerrit.wikimedia.org/r/908216 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui)
[10:06:22] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[10:06:42] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:08:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46486 and previous config saved to /var/cache/conftool/dbconfig/20230412-100841-root.json
[10:08:56] <wikibugs>	 (03CR) 10David Caro: maintain-dbusers: ensure get_global_wiki_user is only called when needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe)
[10:08:58] <wikibugs>	 (03CR) 10Muehlenhoff: Install Puppet 5.5 on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[10:10:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/908211 (owner: 10Filippo Giunchedi)
[10:10:47] <logmsgbot>	 !log cgoubert@deploy2002 Synchronized README: (no justification provided) (duration: 05m 44s)
[10:11:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/908208 (https://phabricator.wikimedia.org/T295661) (owner: 10Elukey)
[10:12:10] <icinga-wm_>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:12:48] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] alerting-host: toggle auto-restart for ircecho/icinga-am on failover [puppet] - 10https://gerrit.wikimedia.org/r/908211 (owner: 10Filippo Giunchedi)
[10:13:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline." [puppet] - 10https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey)
[10:15:38] <wikibugs>	 10SRE-swift-storage, 10Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) Thanks, that is super helpful!  I agree that some tooling that's able to look up objects in backups //and// production would be really useful (if nothin...
[10:15:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Note that due to the change in exported resources name, the Prometheus configuration will converge after puppet has run on all affected ho" [puppet] - 10https://gerrit.wikimedia.org/r/908215 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi)
[10:16:22] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:58] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:18:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P46487 and previous config saved to /var/cache/conftool/dbconfig/20230412-101808-ladsgroup.json
[10:18:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P46488 and previous config saved to /var/cache/conftool/dbconfig/20230412-101841-ladsgroup.json
[10:18:46] <Emperor>	 !log clearing out 24 ghost objects from Swift T327253
[10:18:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:50] <stashbot>	 T327253: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253
[10:21:14] <wikibugs>	 SRE, LDAP-Access-Requests, User-MarcoAurelio: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (MarcoAurelio)
[10:22:00] <wikibugs>	 (PS2) Clément Goubert: contint: manage dsh target from Puppet DB [puppet] - https://gerrit.wikimedia.org/r/893483 (owner: Hashar)
[10:23:08] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloud_private_subnet: add route to public IPv4 range [puppet] - https://gerrit.wikimedia.org/r/903622 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez)
[10:23:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46489 and previous config saved to /var/cache/conftool/dbconfig/20230412-102346-root.json
[10:24:04] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40616/console" [puppet] - https://gerrit.wikimedia.org/r/893483 (owner: Hashar)
[10:26:05] <wikibugs>	 (CR) Clément Goubert: [V: +1 C: +2] contint: manage dsh target from Puppet DB [puppet] - https://gerrit.wikimedia.org/r/893483 (owner: Hashar)
[10:26:40] <icinga-wm_>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:27:01] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1220 to x1 master [puppet] - https://gerrit.wikimedia.org/r/907833 (https://phabricator.wikimedia.org/T334564)
[10:27:55] <wikibugs>	 (Abandoned) Marostegui: mariadb: Promote db1220 to x1 master [puppet] - https://gerrit.wikimedia.org/r/907833 (https://phabricator.wikimedia.org/T334564) (owner: Gerrit maintenance bot)
[10:28:41] <logmsgbot>	 !log hashar@deploy2002 Started deploy [zuul/deploy@4c6859c]: Dummy deploy with dsh file managed by Puppet
[10:28:44] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [zuul/deploy@4c6859c]: Dummy deploy with dsh file managed by Puppet (duration: 00m 02s)
[10:29:02] <wikibugs>	 (PS3) Muehlenhoff: Install Puppet 5.5 on Bookworm [puppet] - https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495)
[10:29:12] <wikibugs>	 (PS4) Muehlenhoff: Install Puppet 5.5 on Bookworm [puppet] - https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495)
[10:29:13] <logmsgbot>	 !log hashar@deploy2002 Started deploy [integration/docroot@ab848e3]: Dummy deploy with dsh file managed by Puppet
[10:29:16] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [integration/docroot@ab848e3]: Dummy deploy with dsh file managed by Puppet (duration: 00m 02s)
[10:29:20] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[10:29:29] <logmsgbot>	 !log hashar@deploy2002 Started deploy [integration/docroot@ab848e3]: Dummy deploy with dsh file managed by Puppet
[10:29:35] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [integration/docroot@ab848e3]: Dummy deploy with dsh file managed by Puppet (duration: 00m 06s)
[10:29:46] <logmsgbot>	 !log hashar@deploy2002 Started deploy [integration/docroot@ab848e3]: Dummy deploy with dsh file managed by Puppet
[10:29:51] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [integration/docroot@ab848e3]: Dummy deploy with dsh file managed by Puppet (duration: 00m 04s)
[10:31:28] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:32:00] <wikibugs>	 (PS2) Hashar: contint: manage jenkins-ci dsh group from Puppet DB [puppet] - https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920)
[10:32:13] <wikibugs>	 (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920) (owner: Hashar)
[10:32:37] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: hw troubleshooting:  CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (Clement_Goubert) Starting 10 min cpu stress test: ` cgoubert@mw2448:~$ stress -c 48 --timeout 600s...
[10:33:06] <icinga-wm_>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:33:14] <wikibugs>	 (PS2) Hashar: releases: manage jenkins-rel dsh group from Puppet DB [puppet] - https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909)
[10:33:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P46490 and previous config saved to /var/cache/conftool/dbconfig/20230412-103314-ladsgroup.json
[10:33:30] <wikibugs>	 (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909) (owner: Hashar)
[10:33:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T333332)', diff saved to https://phabricator.wikimedia.org/P46491 and previous config saved to /var/cache/conftool/dbconfig/20230412-103348-ladsgroup.json
[10:33:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[10:33:53] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[10:34:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[10:34:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:34:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:34:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46492 and previous config saved to /var/cache/conftool/dbconfig/20230412-103421-ladsgroup.json
[10:35:02] <wikibugs>	 (CR) Hashar: [C: -1] "https://puppet-compiler.wmflabs.org/output/893484/1724/deploy1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920) (owner: Hashar)
[10:35:55] <wikibugs>	 SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Marostegui) @ayounsi we are placing new DB hosts in production, can you run the same query you ran to gather the affected DBs just in case we have new on...
[10:36:11] <wikibugs>	 (PS5) Arturo Borrero Gonzalez: cloud_private_subnet: codfw: relocate some hiera [puppet] - https://gerrit.wikimedia.org/r/903623 (https://phabricator.wikimedia.org/T324992)
[10:36:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46493 and previous config saved to /var/cache/conftool/dbconfig/20230412-103635-ladsgroup.json
[10:38:12] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/908215 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[10:38:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46494 and previous config saved to /var/cache/conftool/dbconfig/20230412-103851-root.json
[10:41:03] <wikibugs>	 (PS3) Hashar: releases: manage jenkins-rel dsh group from Puppet DB [puppet] - https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909)
[10:41:16] <wikibugs>	 (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909) (owner: Hashar)
[10:41:54] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloud_private_subnet: codfw: relocate some hiera [puppet] - https://gerrit.wikimedia.org/r/903623 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez)
[10:42:34] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:42:34] <wikibugs>	 (PS3) Elukey: role::dse_k8s::worker: add AMD GPU support [puppet] - https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009)
[10:42:45] <wikibugs>	 (CR) Elukey: role::dse_k8s::worker: add AMD GPU support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009) (owner: Elukey)
[10:42:48] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/907834 (https://phabricator.wikimedia.org/T334567)
[10:43:34] <wikibugs>	 (Abandoned) Ladsgroup: mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/907834 (https://phabricator.wikimedia.org/T334567) (owner: Gerrit maintenance bot)
[10:43:54] <wikibugs>	 (PS4) Hashar: releases: manage jenkins-rel dsh group from Puppet DB [puppet] - https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909)
[10:44:00] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/907835 (https://phabricator.wikimedia.org/T334568)
[10:44:23] <wikibugs>	 (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909) (owner: Hashar)
[10:47:41] <wikibugs>	 (PS3) David Caro: maintain_dbusers: move all the files under service [puppet] - https://gerrit.wikimedia.org/r/906637
[10:47:50] <wikibugs>	 (CR) David Caro: maintain_dbusers: move all the files under service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/906637 (owner: David Caro)
[10:48:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T333332)', diff saved to https://phabricator.wikimedia.org/P46495 and previous config saved to /var/cache/conftool/dbconfig/20230412-104820-ladsgroup.json
[10:48:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2119.codfw.wmnet with reason: Maintenance
[10:48:26] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[10:48:26] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/907836 (https://phabricator.wikimedia.org/T334569)
[10:48:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2119.codfw.wmnet with reason: Maintenance
[10:48:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T333332)', diff saved to https://phabricator.wikimedia.org/P46496 and previous config saved to /var/cache/conftool/dbconfig/20230412-104843-ladsgroup.json
[10:49:41] <wikibugs>	 (Abandoned) Ladsgroup: mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/907836 (https://phabricator.wikimedia.org/T334569) (owner: Gerrit maintenance bot)
[10:49:43] <wikibugs>	 (CR) Hashar: [C: +1] "PCC https://puppet-compiler.wmflabs.org/output/893485/1727/deploy1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909) (owner: Hashar)
[10:49:51] <wikibugs>	 (Abandoned) Ladsgroup: mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/907835 (https://phabricator.wikimedia.org/T334568) (owner: Gerrit maintenance bot)
[10:50:27] <wikibugs>	 (CR) Clément Goubert: [C: +2] releases: manage jenkins-rel dsh group from Puppet DB [puppet] - https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909) (owner: Hashar)
[10:50:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T333332)', diff saved to https://phabricator.wikimedia.org/P46497 and previous config saved to /var/cache/conftool/dbconfig/20230412-105056-ladsgroup.json
[10:51:01] <wikibugs>	 (CR) Hashar: "The CI Jenkins do not have the scap::target['releng/jenkins-deploy'] yet for some reason. I have to investigate a bit more with Jaime." [puppet] - https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920) (owner: Hashar)
[10:51:09] <wikibugs>	 (CR) Hashar: [C: -1] contint: manage jenkins-ci dsh group from Puppet DB [puppet] - https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920) (owner: Hashar)
[10:51:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P46498 and previous config saved to /var/cache/conftool/dbconfig/20230412-105141-ladsgroup.json
[10:53:18] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST apiservices) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:53:49] <wikibugs>	 (CR) Muehlenhoff: C:httpd move htcacheclean to httpd class (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede)
[10:53:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46499 and previous config saved to /var/cache/conftool/dbconfig/20230412-105356-root.json
[10:55:04] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede)
[10:55:28] <icinga-wm_>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:55:34] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009) (owner: Elukey)
[10:55:55] <wikibugs>	 (Abandoned) Hashar: scap: add contint2002 to ci-docroot, jenkins, zuul deploy [puppet] - https://gerrit.wikimedia.org/r/867670 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn)
[10:56:16] <moritzm>	 !log installing apache2 security updates on Bullseye
[10:56:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:44] <moritzm>	 !log installing apache2 security updates on Buster
[10:56:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:10] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: hw troubleshooting:  CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (Clement_Goubert) Stress test went without issue, removing downtime and repooling host.
[10:58:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST apiservices) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:59:19] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw2448.codfw.wmnet
[10:59:20] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2448.codfw.wmnet
[10:59:45] <claime>	 !log repooling mw2448.codfw.wmnet - T334429
[10:59:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:49] <stashbot>	 T334429: hw troubleshooting:  CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429
[11:00:07] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw2448.*.codfw.wmnet
[11:00:10] <icinga-wm_>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale-full only: 1 (install1004), Fresh: 123 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:00:18] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:02:18] <wikibugs>	 (CR) Hashar: "So the "issue" is that the CI Jenkins are not yet using scap for deployment of the Jenkins configuration. It is an opt-in via:" [puppet] - https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920) (owner: Hashar)
[11:06:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P46500 and previous config saved to /var/cache/conftool/dbconfig/20230412-110602-ladsgroup.json
[11:06:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P46501 and previous config saved to /var/cache/conftool/dbconfig/20230412-110647-ladsgroup.json
[11:08:17] <wikibugs>	 SRE, Infrastructure-Foundations: IDM milestone 3 "Build-out for self service" - https://phabricator.wikimedia.org/T320801 (SLyngshede-WMF)
[11:08:23] <wikibugs>	 SRE, Infrastructure-Foundations: Figure out a captcha option for IDM - https://phabricator.wikimedia.org/T320809 (SLyngshede-WMF) In progress→Resolved a:SLyngshede-WMF
[11:08:24] <icinga-wm_>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:10:32] <icinga-wm_>	 RECOVERY - mediawiki-installation DSH group on mw2448 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[11:12:34] <moritzm>	 !log installing gnutls28 security updates on buster
[11:12:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:14] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:16:29] <wikibugs>	 (PS1) Marostegui: core_test.pp: Add mariadb 11.1 package [puppet] - https://gerrit.wikimedia.org/r/908221 (https://phabricator.wikimedia.org/T333289)
[11:18:33] <wikibugs>	 (CR) Marostegui: [C: +2] core_test.pp: Add mariadb 11.1 package [puppet] - https://gerrit.wikimedia.org/r/908221 (https://phabricator.wikimedia.org/T333289) (owner: Marostegui)
[11:20:28] <wikibugs>	 (PS1) Marostegui: db1106: Migrate to MariaDB 11.1 [puppet] - https://gerrit.wikimedia.org/r/908222 (https://phabricator.wikimedia.org/T333289)
[11:20:57] <wikibugs>	 (CR) Marostegui: [C: +2] db1106: Migrate to MariaDB 11.1 [puppet] - https://gerrit.wikimedia.org/r/908222 (https://phabricator.wikimedia.org/T333289) (owner: Marostegui)
[11:21:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P46502 and previous config saved to /var/cache/conftool/dbconfig/20230412-112108-ladsgroup.json
[11:21:42] <icinga-wm_>	 RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2023-04-11 00:00:06 (4300 GiB, +1.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[11:21:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46503 and previous config saved to /var/cache/conftool/dbconfig/20230412-112154-ladsgroup.json
[11:21:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[11:21:59] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[11:22:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[11:22:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46504 and previous config saved to /var/cache/conftool/dbconfig/20230412-112217-ladsgroup.json
[11:23:31] <marostegui>	 !log dbmaint Upgrade db1106 to mariadb 11.1 (eqiad) T333289
[11:23:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:35] <stashbot>	 T333289: Compile and package MariaDB 11.1.0 - https://phabricator.wikimedia.org/T333289
[11:23:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46505 and previous config saved to /var/cache/conftool/dbconfig/20230412-112334-ladsgroup.json
[11:36:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T333332)', diff saved to https://phabricator.wikimedia.org/P46506 and previous config saved to /var/cache/conftool/dbconfig/20230412-113615-ladsgroup.json
[11:36:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2136.codfw.wmnet with reason: Maintenance
[11:36:20] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[11:36:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2136.codfw.wmnet with reason: Maintenance
[11:36:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T333332)', diff saved to https://phabricator.wikimedia.org/P46507 and previous config saved to /var/cache/conftool/dbconfig/20230412-113638-ladsgroup.json
[11:38:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P46508 and previous config saved to /var/cache/conftool/dbconfig/20230412-113840-ladsgroup.json
[11:38:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T333332)', diff saved to https://phabricator.wikimedia.org/P46509 and previous config saved to /var/cache/conftool/dbconfig/20230412-113850-ladsgroup.json
[11:45:12] <icinga-wm_>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:50:02] <icinga-wm_>	 PROBLEM - SSH on wdqs2011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:51:20] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P46512 and previous config saved to /var/cache/conftool/dbconfig/20230412-115347-ladsgroup.json
[11:53:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P46513 and previous config saved to /var/cache/conftool/dbconfig/20230412-115357-ladsgroup.json
[11:59:14] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:59:20] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:59:49] <wikibugs>	 Puppet, Infrastructure-Foundations: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (jbond) p:Triage→Medium
[12:00:58] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:12] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.252 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:01:18] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.379 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:01:43] <wikibugs>	 (PS19) Jbond: C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 (https://phabricator.wikimedia.org/T334577) (owner: Slyngshede)
[12:01:56] <wikibugs>	 (CR) Jbond: [C: +1] C:httpd move htcacheclean to httpd class (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904102 (https://phabricator.wikimedia.org/T334577) (owner: Slyngshede)
[12:02:56] <herzog>	 jouncebot: now
[12:02:56] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 57 minute(s)
[12:03:02] <herzog>	 jouncebot: next
[12:03:02] <jouncebot>	 In 0 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1300)
[12:06:59] <wikibugs>	 (PS1) DCausse: rdf-streaming-updater: increase cpu limits [deployment-charts] - https://gerrit.wikimedia.org/r/908225
[12:07:19] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (Marostegui) Is this task still needed?
[12:08:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46514 and previous config saved to /var/cache/conftool/dbconfig/20230412-120853-ladsgroup.json
[12:08:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[12:08:58] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[12:09:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P46515 and previous config saved to /var/cache/conftool/dbconfig/20230412-120903-ladsgroup.json
[12:09:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[12:09:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[12:09:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[12:09:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T333332)', diff saved to https://phabricator.wikimedia.org/P46516 and previous config saved to /var/cache/conftool/dbconfig/20230412-120943-ladsgroup.json
[12:11:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T333332)', diff saved to https://phabricator.wikimedia.org/P46517 and previous config saved to /var/cache/conftool/dbconfig/20230412-121157-ladsgroup.json
[12:12:09] <wikibugs>	 (CR) Jelto: "one question in-line regarding manage_host_keys. I guess we want to either run ssh-keygen or import keys from private puppet. But from loo" [puppet] - https://gerrit.wikimedia.org/r/907878 (owner: EoghanGaffney)
[12:14:01] <wikibugs>	 (CR) Muehlenhoff: [C: +2] kartotherian: Stop passing use_nodejs10 [puppet] - https://gerrit.wikimedia.org/r/908196 (owner: Muehlenhoff)
[12:14:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 T334580', diff saved to https://phabricator.wikimedia.org/P46518 and previous config saved to /var/cache/conftool/dbconfig/20230412-121420-marostegui.json
[12:14:25] <stashbot>	 T334580: decommission db1120.eqiad.wmnet - https://phabricator.wikimedia.org/T334580
[12:16:28] <wikibugs>	 Puppet, Infrastructure-Foundations, Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (MoritzMuehlenhoff)
[12:19:25] <wikibugs>	 Puppet, Infrastructure-Foundations, Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (MoritzMuehlenhoff)
[12:23:03] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (ayounsi) Yep, the list of servers on the task description is up to date.
[12:23:15] <wikibugs>	 Puppet, Infrastructure-Foundations, Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (MoritzMuehlenhoff)
[12:23:37] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (ayounsi)
[12:24:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T333332)', diff saved to https://phabricator.wikimedia.org/P46519 and previous config saved to /var/cache/conftool/dbconfig/20230412-122409-ladsgroup.json
[12:24:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance
[12:24:15] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[12:24:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance
[12:24:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46520 and previous config saved to /var/cache/conftool/dbconfig/20230412-122433-ladsgroup.json
[12:25:35] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (ayounsi) > deploy1002 will need to be scheduled well in advance and/or failed over to deploy2002 as it is the canonical deployment host. @akosiaris  As we're in the DC switchover and 2002 is...
[12:26:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46521 and previous config saved to /var/cache/conftool/dbconfig/20230412-122645-ladsgroup.json
[12:27:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P46522 and previous config saved to /var/cache/conftool/dbconfig/20230412-122703-ladsgroup.json
[12:29:16] <wikibugs>	 (03PS1) 10Muehlenhoff: service::node: Remove use_nodejs10 [puppet] - 10https://gerrit.wikimedia.org/r/908226
[12:31:10] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10ayounsi)
[12:31:24] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ayounsi)
[12:31:34] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi)
[12:31:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10ayounsi)
[12:31:54] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi)
[12:32:08] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10ayounsi)
[12:34:49] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ayounsi)
[12:35:06] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:35:40] <moritzm>	 !log installing intel-microcode security updates
[12:35:42] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ayounsi) >>! In T333377#8775126, @Marostegui wrote: > @ayounsi we are placing new DB hosts in production, can you run the same query you ran to gather th...
[12:35:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:11] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Place db1223 into s3 [puppet] - 10https://gerrit.wikimedia.org/r/908227 (https://phabricator.wikimedia.org/T326669)
[12:38:19] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Place db1223 into s3 [puppet] - 10https://gerrit.wikimedia.org/r/908227 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui)
[12:38:28] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui) Thank you, nothing changes from our DB side!
[12:39:16] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10MoritzMuehlenhoff)
[12:40:35] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10SLyngshede-WMF)
[12:41:13] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10SLyngshede-WMF) idm servers have the module installed, but not enabled.
[12:41:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P46523 and previous config saved to /var/cache/conftool/dbconfig/20230412-124151-ladsgroup.json
[12:42:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P46524 and previous config saved to /var/cache/conftool/dbconfig/20230412-124210-ladsgroup.json
[12:45:24] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:50:15] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10MoritzMuehlenhoff)
[12:50:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46525 and previous config saved to /var/cache/conftool/dbconfig/20230412-125016-root.json
[12:50:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[12:52:47] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10jbond) >>! In T334577#8775710, @SLyngshede-WMF wrote: > idm servers have the module installed, but not enabled.  the apache2 package installs the file so t...
[12:54:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 230k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[12:55:29] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] aptrepo: import AMD ROCm 5.4 to bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/908208 (https://phabricator.wikimedia.org/T295661) (owner: 10Elukey)
[12:56:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P46526 and previous config saved to /var/cache/conftool/dbconfig/20230412-125658-ladsgroup.json
[12:57:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T333332)', diff saved to https://phabricator.wikimedia.org/P46527 and previous config saved to /var/cache/conftool/dbconfig/20230412-125716-ladsgroup.json
[12:57:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[12:57:21] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[12:57:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idm1001.wikimedia.org
[12:57:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[12:57:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T333332)', diff saved to https://phabricator.wikimedia.org/P46528 and previous config saved to /var/cache/conftool/dbconfig/20230412-125739-ladsgroup.json
[12:59:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T333332)', diff saved to https://phabricator.wikimedia.org/P46529 and previous config saved to /var/cache/conftool/dbconfig/20230412-125953-ladsgroup.json
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1300).
[13:00:05] <jouncebot>	 subbu, Sergi0, and herzog: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:07] <TheresNoTime>	 (unable to deploy today FYI)
[13:00:16] <sergi0>	 hello
[13:00:26] <Lucas_WMDE>	 o/
[13:00:30] <taavi>	 o/
[13:00:35] <herzog>	 o/
[13:00:56] <herzog>	 no patches from me today though, just a maintenance script run request
[13:00:56] * Lucas_WMDE looks at the calendar
[13:00:59] <taavi>	 Lucas_WMDE: can you deploy or should I?
[13:01:06] <Lucas_WMDE>	 I can do it
[13:01:11] <taavi>	 thanks!
[13:01:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idm1001.wikimedia.org
[13:01:43] <Lucas_WMDE>	 hi subbu! ready to start with your change?
[13:02:15] <subbu>	 o/ give me a couple mins to wake up fully. :)
[13:02:19] <Lucas_WMDE>	 ok sure
[13:02:24] <Lucas_WMDE>	 I can do another change first :)
[13:02:40] <subbu>	 sounds good. :)
[13:03:10] <moritzm>	 !log installing nodejs security updates on buster
[13:03:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:26] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10MoritzMuehlenhoff)
[13:04:33] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "list of wikis matches the two tasks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907899 (https://phabricator.wikimedia.org/T304551) (owner: 10Sergio Gimeno)
[13:04:55] <wikibugs>	 (03PS9) 10EoghanGaffney: Add keys for sshd-gitlab from the secrets repo [puppet] - 10https://gerrit.wikimedia.org/r/907878
[13:05:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907899 (https://phabricator.wikimedia.org/T304551) (owner: 10Sergio Gimeno)
[13:05:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46530 and previous config saved to /var/cache/conftool/dbconfig/20230412-130521-root.json
[13:06:12] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: enable add link frontend in 7,8th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907899 (https://phabricator.wikimedia.org/T304551) (owner: 10Sergio Gimeno)
[13:06:52] * Lucas_WMDE watches scap/git fetch REL1_38 and branch_cut_pretest in a bunch of submodules
[13:07:11] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:907899|GrowthExperiments: enable add link frontend in 7,8th round wikis (T304551 T308133)]]
[13:07:17] <stashbot>	 T308133: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133
[13:07:17] <stashbot>	 T304551: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551
[13:07:27] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[13:07:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[13:08:33] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 sgimeno and lucaswerkmeister-wmde: Backport for [[gerrit:907899|GrowthExperiments: enable add link frontend in 7,8th round wikis (T304551 T308133)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[13:08:48] <Lucas_WMDE>	 sergi0: can you test the change?
[13:08:51] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10MoritzMuehlenhoff) So it turns out none of our Apache installs which had it running actually needs it; these 11 cases must all have been caused by random d...
[13:09:14] <sergi0>	 Lucas_WMDE: sure I'll need ~3-5 min, gonna test 8-10 of them
[13:09:19] <Lucas_WMDE>	 ok sure
[13:11:46] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+1] airflow: Make Data Engineering primary contact [puppet] - 10https://gerrit.wikimedia.org/r/907992 (https://phabricator.wikimedia.org/T334522) (owner: 10Bking)
[13:12:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46531 and previous config saved to /var/cache/conftool/dbconfig/20230412-131204-ladsgroup.json
[13:12:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance
[13:12:09] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[13:12:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance
[13:12:24] <wikibugs>	 (03PS4) 10Elukey: role::dse_k8s::worker: add AMD GPU support [puppet] - 10https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009)
[13:12:26] <wikibugs>	 (03PS1) 10Elukey: aptrepo: fix rocm update rule for bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/908229
[13:12:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46532 and previous config saved to /var/cache/conftool/dbconfig/20230412-131227-ladsgroup.json
[13:13:03] <sergi0>	 2 more and done
[13:13:15] <Lucas_WMDE>	 herzog: do you know if there’s a standard way to phaste the maintenance script output?
[13:13:25] <Lucas_WMDE>	 otherwise I would probably go for | tee >(phaste)
[13:13:33] <herzog>	 Lucas_WMDE: I think [script] | phaste
[13:13:36] <herzog>	 e.g.
[13:13:57] <herzog>	 mwscript namespaceDupes.php kswiki | phaste
[13:14:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 202.4k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[13:14:07] <Lucas_WMDE>	 but then I don’t get to see the output myself, I assume
[13:14:11] <herzog>	 but I don't know for sure, is anyone here familiar with that?
[13:14:28] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40617/console" [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney)
[13:14:28] <herzog>	 otherwise we can resort to the usual copy & paste terminal method :)
[13:14:32] <Lucas_WMDE>	 ^^
[13:14:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46533 and previous config saved to /var/cache/conftool/dbconfig/20230412-131440-ladsgroup.json
[13:14:42] <claime>	 | tee -a | phaste ?
[13:14:58] <herzog>	 no docs on Wikitech re phaste that I can see
[13:14:59] <claime>	 Ah, no that won't work
[13:15:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P46535 and previous config saved to /var/cache/conftool/dbconfig/20230412-131459-ladsgroup.json
[13:15:06] <sergi0>	 Lucas_WMDE: looking good from my side
[13:15:09] <Lucas_WMDE>	 I think | tee >(phaste) should work
[13:15:11] <Lucas_WMDE>	 sergi0: ok thanks!
[13:15:26] <Lucas_WMDE>	 I’ll try it ^^
[13:15:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/908229 (owner: 10Elukey)
[13:15:37] <claime>	 Lucas_WMDE: Yeah, that should work better 
[13:15:39] <claime>	 cgoubert@deploy2002:/srv/mediawiki$ echo plop | tee -a >(phaste)
[13:15:41] <claime>	 plop
[13:15:41] <herzog>	 https://wikitech.wikimedia.org/wiki/Phabricator/Conduit_API_Tokens#ProdPasteBot
[13:15:43] <claime>	 cgoubert@deploy2002:/srv/mediawiki$ https://phabricator.wikimedia.org/P46536
[13:15:53] <Lucas_WMDE>	 heh ^^
[13:15:54] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] aptrepo: fix rocm update rule for bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/908229 (owner: 10Elukey)
[13:16:11] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney)
[13:16:52] <herzog>	 claime: and just | phaste won't work?
[13:17:10] <herzog>	 or you wouldn't see the terminal output
[13:17:16] <wikibugs>	 (03CR) 10EoghanGaffney: Add keys for sshd-gitlab from the secrets repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney)
[13:17:21] <claime>	 herzog: It'll work, you won't see the output
[13:17:31] <herzog>	 thanks :)
[13:20:21] <herzog>	 Lucas_WMDE: how's the script going?
[13:20:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46537 and previous config saved to /var/cache/conftool/dbconfig/20230412-132026-root.json
[13:20:34] <Lucas_WMDE>	 haven’t started it yet
[13:20:39] <herzog>	 ah, ok
[13:20:39] <Lucas_WMDE>	 scap is still running for the other change :)
[13:20:42] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:907899|GrowthExperiments: enable add link frontend in 7,8th round wikis (T304551 T308133)]] (duration: 13m 30s)
[13:20:47] <stashbot>	 T308133: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133
[13:20:47] <stashbot>	 T304551: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551
[13:21:02] <Lucas_WMDE>	 ugh, the same error again
[13:22:04] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@mwmaint2002:~$ mwscript namespaceDupes kswiki --fix | tee >(phaste -t T334277) # P46538; errors on stderr, cf. T328634
[13:22:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:10] <stashbot>	 T328634: Lost pages after deployed addtional namespaces on  shn.wikibooks - https://phabricator.wikimedia.org/T328634
[13:22:10] <stashbot>	 T334277: Run namespaceDupes.php for kswiki - https://phabricator.wikimedia.org/T334277
[13:22:46] <wikibugs>	 (03PS2) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade (esams) [puppet] - 10https://gerrit.wikimedia.org/r/907931 (https://phabricator.wikimedia.org/T321309)
[13:25:36] <herzog>	 so one link to fix manually it seems
[13:25:48] <Lucas_WMDE>	 I pasted the dry run output
[13:25:50] <Lucas_WMDE>	 nine pages in total
[13:26:11] <elukey>	 !log upload AMD ROCm 5.4 debian packages to wikimedia-bullseye:thirdparty/amd-rocm54 - T295661
[13:26:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:15] <stashbot>	 T295661: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661
[13:26:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::dse_k8s::worker: add AMD GPU support [puppet] - 10https://gerrit.wikimedia.org/r/908210 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey)
[13:26:46] <herzog>	 Lucas_WMDE: yep, I'll purge these via API
[13:26:50] <Lucas_WMDE>	 ok thanks
[13:26:58] <herzog>	 with both options
[13:27:04] <herzog>	 and see if that solves the issue
[13:27:13] <Lucas_WMDE>	 I don’t think recursive should be needed, but probably won’t hurt either
[13:27:22] <Lucas_WMDE>	 I can run another dry run later to check if it’s done
[13:27:31] <Lucas_WMDE>	 subbu: how are you feeling now? :)
[13:27:41] <subbu>	 Ready. :)
[13:27:45] <Lucas_WMDE>	 ok \o/
[13:27:58] <wikibugs>	 (03PS6) 10Lucas Werkmeister (WMDE): Make VE on officewiki use Parsoid directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[13:28:03] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[13:28:12] <eoghan>	 !log Stopping puppet on gitlab hosts to slow-rollout puppet ssh key management - T333840
[13:28:13] <wikibugs>	 (03PS2) 10Slyngshede: LDAP attribute editor [software/bitu] - 10https://gerrit.wikimedia.org/r/900621 (https://phabricator.wikimedia.org/T179463)
[13:28:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:17] <stashbot>	 T333840: Move gitlab ssh host keys to private puppet - https://phabricator.wikimedia.org/T333840
[13:28:24] <herzog>	 Lucas_WMDE: just a q, which is the bad link, the one on the left or the right? The namespace name being in a script I can't read doesn't help :)
[13:28:36] <Lucas_WMDE>	 let me see
[13:29:07] <wikibugs>	 (03Merged) 10jenkins-bot: Make VE on officewiki use Parsoid directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[13:29:27] <Lucas_WMDE>	 I think this is the title mentioned in the first line https://ks.wikipedia.org/wiki/%D9%85%D8%A7%DA%88%DB%8C%D9%88%D9%97%D9%84:Citation/CS1/Configuration/%D8%AF%D9%8E%D8%B3%D8%AA%D8%A7%D9%88%DB%8C%D9%96%D8%B2
[13:29:31] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:896104|Make VE on officewiki use Parsoid directly (T320529 T333402)]]
[13:29:37] <stashbot>	 T333402: Switching from source editing to visual editing mode is broken with the REST API  - https://phabricator.wikimedia.org/T333402
[13:29:38] <stashbot>	 T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529
[13:29:42] <wikibugs>	 (03PS1) 10Elukey: Fix dse_gpu_hosts format in regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/908231
[13:29:45] <wikibugs>	 (03PS3) 10Slyngshede: LDAP attribute editor [software/bitu] - 10https://gerrit.wikimedia.org/r/900621 (https://phabricator.wikimedia.org/T179463)
[13:29:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P46539 and previous config saved to /var/cache/conftool/dbconfig/20230412-132946-ladsgroup.json
[13:30:00] <Lucas_WMDE>	 hang on, no, that doesn’t make sense to specify as a URL
[13:30:04] <Lucas_WMDE>	 because that probably normalizes it anyway
[13:30:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P46540 and previous config saved to /var/cache/conftool/dbconfig/20230412-133006-ladsgroup.json
[13:30:29] <wikibugs>	 (03CR) 10Slyngshede: "I'll create another patch that sets up Sphinx, so documentation can get started." [software/bitu] - 10https://gerrit.wikimedia.org/r/900621 (https://phabricator.wikimedia.org/T179463) (owner: 10Slyngshede)
[13:30:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and daniel: Backport for [[gerrit:896104|Make VE on officewiki use Parsoid directly (T320529 T333402)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[13:31:04] <herzog>	 purge sent
[13:31:07] <herzog>	 for that one
[13:31:12] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40619/console" [puppet] - 10https://gerrit.wikimedia.org/r/908231 (owner: 10Elukey)
[13:31:16] <Lucas_WMDE>	 subbu: should be ready to test now :)
[13:31:22] <TheresNoTime>	 Lucas_WMDE: TIL about the `phaste` command o_O
[13:31:34] <subbu>	 Lucas_WMDE, ty. is it mwdebug on codfw?
[13:31:35] <Lucas_WMDE>	 same tbh ^^
[13:31:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (10jbond) the logic we use in puppet is mostly the same as [[ https://phabricator.wikimedia.org/P46511 | this script ]] which would be a good template to use for a cookbook
[13:31:40] <Lucas_WMDE>	 should be yeah
[13:31:43] <subbu>	 ok.
[13:31:52] * TheresNoTime has been manually copy/pasting (:
[13:32:31] <Lucas_WMDE>	 herzog: I tried it in the API sandbox myself but got nine `missing`s :/ must’ve done something wrong
[13:32:31] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] Fix dse_gpu_hosts format in regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/908231 (owner: 10Elukey)
[13:32:45] <herzog>	 Lucas_WMDE: yep, "missing": true
[13:33:01] <herzog>	 or maybe it's the other title
[13:33:09] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Add keys for sshd-gitlab from the secrets repo [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney)
[13:33:12] <herzog>	 action=info for that page returns page_id = 0
[13:33:15] <herzog>	 which is not right
[13:33:25] <herzog>	 https://ks.wikipedia.org/w/index.php?title=%D9%85%D8%A7%DA%88%DB%8C%D9%88%D9%97%D9%84:Citation/CS1/Configuration/%D8%AF%D9%8E%D8%B3%D8%AA%D8%A7%D9%88%DB%8C%D9%96%D8%B2&action=info
[13:33:47] <herzog>	 well, it does not exist so indeed pageid = 0
[13:33:49] <subbu>	 Lucas_WMDE, lgtm ... good to go.
[13:33:54] <Lucas_WMDE>	 ok thanks!
[13:33:58] <Lucas_WMDE>	 syncing
[13:34:16] <sukhe>	 !log [puppetmaster] sudo /usr/local/sbin/puppet-facts-upload --proxy http://webproxy.eqiad.wmnet:8080 to update PCC
[13:34:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:20] <wikibugs>	 (03PS1) 10Jbond: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659)
[13:35:06] <herzog>	 TheresNoTime: using the canonical NS name in English seems to work
[13:35:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46541 and previous config saved to /var/cache/conftool/dbconfig/20230412-133531-root.json
[13:35:36] <TheresNoTime>	 Not been following the channel, what's up?
[13:35:45] <Lucas_WMDE>	 I was thinking of a more brutal route and using action=purge with generator=allpages and gapnamespace=828
[13:35:57] <Lucas_WMDE>	 would take a little bit though, since purges are rate limited
[13:36:01] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:07] <Lucas_WMDE>	 (but there are only 366 pages in the namespace so it wouldn’t be totally terrible either)
[13:36:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on puppetdb2002.codfw.wmnet with reason: puppetdb maintenance
[13:36:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on puppetdb2002.codfw.wmnet with reason: puppetdb maintenance
[13:37:36] <wikibugs>	 (03PS1) 10Phedenskog: perf: PaintTiming metrics is now sent in the navtiming event. [alerts] - 10https://gerrit.wikimedia.org/r/908234 (https://phabricator.wikimedia.org/T328256)
[13:38:23] <icinga-wm_>	 PROBLEM - Check systemd state on dse-k8s-worker1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_amd_rocm_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:39:20] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:896104|Make VE on officewiki use Parsoid directly (T320529 T333402)]] (duration: 09m 48s)
[13:39:26] <stashbot>	 T333402: Switching from source editing to visual editing mode is broken with the REST API  - https://phabricator.wikimedia.org/T333402
[13:39:26] <stashbot>	 T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529
[13:39:27] <Lucas_WMDE>	 subbu: should be done
[13:39:33] <subbu>	 ty! :)
[13:39:40] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10jbond) awesome thanks @MoritzMuehlenhoff
[13:40:44] <wikibugs>	 (03PS2) 10Jbond: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659)
[13:41:38] <Lucas_WMDE>	 herzog: 7 links left now
[13:41:55] <wikibugs>	 (03CR) 10Cathal Mooney: "Thanks for the review, will reformat and submit a new patchset." [homer/public] - 10https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281) (owner: 10Cathal Mooney)
[13:43:02] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:43:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:09] <herzog>	 Lucas_WMDE: great - I sent purge requests for the page ids listed there as well
[13:43:13] <herzog>	 just in case
[13:43:19] <herzog>	 I'll continue with the titles
[13:43:20] <moritzm>	 !log stop Puppet in codfw/edges for puppetdb maintenance
[13:43:22] <Lucas_WMDE>	 oh right, I missed that the output had page IDs
[13:43:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:24] <Lucas_WMDE>	 that looks great
[13:43:44] <herzog>	 but if 7 links still remain, does that mean it's not fixing all?
[13:43:59] <Lucas_WMDE>	 maybe the job queue will need some time to run
[13:44:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P46542 and previous config saved to /var/cache/conftool/dbconfig/20230412-134453-ladsgroup.json
[13:45:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T333332)', diff saved to https://phabricator.wikimedia.org/P46543 and previous config saved to /var/cache/conftool/dbconfig/20230412-134512-ladsgroup.json
[13:45:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[13:45:17] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[13:45:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[13:45:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T333332)', diff saved to https://phabricator.wikimedia.org/P46544 and previous config saved to /var/cache/conftool/dbconfig/20230412-134535-ladsgroup.json
[13:45:41] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:46:55] <wikibugs>	 (03PS3) 10Jbond: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659)
[13:47:12] <wikibugs>	 (03PS1) 10Jelto: Revert "install_server: change device names in gitlab-raid1" [puppet] - 10https://gerrit.wikimedia.org/r/908045
[13:47:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "install_server: change device names in gitlab-raid1" [puppet] - 10https://gerrit.wikimedia.org/r/908045 (owner: 10Jelto)
[13:47:47] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (10Ottomata) Approved
[13:47:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T333332)', diff saved to https://phabricator.wikimedia.org/P46545 and previous config saved to /var/cache/conftool/dbconfig/20230412-134749-ladsgroup.json
[13:48:32] <wikibugs>	 (PS2) Jelto: Revert "install_server: change device names in gitlab-raid1" [puppet] - https://gerrit.wikimedia.org/r/908045
[13:49:42] <wikibugs>	 (PS6) Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed [puppet] - https://gerrit.wikimedia.org/r/905243
[13:49:52] <Lucas_WMDE>	 herzog: I cobbled together a small python script to purge all pages in namespace 828 after all
[13:49:58] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40624/console" [puppet] - https://gerrit.wikimedia.org/r/907931 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[13:50:30] <wikibugs>	 (CR) Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (2 comments) [puppet] - https://gerrit.wikimedia.org/r/905243 (owner: Raymond Ndibe)
[13:50:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46546 and previous config saved to /var/cache/conftool/dbconfig/20230412-135035-root.json
[13:50:37] <icinga-wm_>	 PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 275 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[13:50:55] <wikibugs>	 (CR) Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (2 comments) [puppet] - https://gerrit.wikimedia.org/r/905243 (owner: Raymond Ndibe)
[13:51:04] <wikibugs>	 (CR) Jelto: "the different device names did not help growing the raid. Let's try the old naming and a different maximum size. See also T333674#8775979" [puppet] - https://gerrit.wikimedia.org/r/908045 (owner: Jelto)
[13:53:05] <wikibugs>	 (PS4) Jbond: ci: indicate which server is the control server via a hiera param [puppet] - https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659)
[13:53:28] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] hiera: lvs/balancer: unify hiera post bullseye upgrade (esams) [puppet] - https://gerrit.wikimedia.org/r/907931 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[13:53:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:54:43] <icinga-wm_>	 PROBLEM - Check systemd state on dse-k8s-worker1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_amd_rocm_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:55:35] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[13:55:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[13:57:10] <wikibugs>	 (PS5) Jbond: ci: indicate which server is the control server via a hiera param [puppet] - https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659)
[13:57:41] <wikibugs>	 (CR) CI reject: [V: -1] ci: indicate which server is the control server via a hiera param [puppet] - https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: Jbond)
[13:59:26] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] hiera: Increase varnish max_connections to ats-be on eqsin|ulsfo [puppet] - https://gerrit.wikimedia.org/r/907912 (https://phabricator.wikimedia.org/T288106) (owner: Vgutierrez)
[13:59:53] <herzog>	 Lucas_WMDE: awesome, thanks
[14:00:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T333332)', diff saved to https://phabricator.wikimedia.org/P46547 and previous config saved to /var/cache/conftool/dbconfig/20230412-135959-ladsgroup.json
[14:00:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance
[14:00:08] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[14:00:10] <herzog>	 not sure if it's the script that's broken but namespaceDupes should be fixed :)
[14:00:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance
[14:00:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2147.codfw.wmnet with reason: Maintenance
[14:00:27] <Lucas_WMDE>	 only 4 links to fix now
[14:00:29] * herzog returns to tax filings
[14:00:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2147.codfw.wmnet with reason: Maintenance
[14:00:39] <sergi0>	 leaving, thanks for the assistance Lucas_WMDE!
[14:00:43] <Lucas_WMDE>	 well, namespaceDupes shouldn’t have a problem anymore once the links table migration is done I assume
[14:00:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T333332)', diff saved to https://phabricator.wikimedia.org/P46548 and previous config saved to /var/cache/conftool/dbconfig/20230412-140045-ladsgroup.json
[14:00:54] <Lucas_WMDE>	 not sure it’s worth fixing it until then, when I looked at it last time it wasn’t trivial
[14:01:57] <wikibugs>	 (PS6) Jbond: ci: indicate which server is the control server via a hiera param [puppet] - https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659)
[14:01:59] <icinga-wm_>	 RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2002 is OK: HTTP OK: HTTP/1.1 200 OK - 58711 bytes in 4.350 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[14:02:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P46549 and previous config saved to /var/cache/conftool/dbconfig/20230412-140255-ladsgroup.json
[14:03:28] <wikibugs>	 (CR) Kamila Součková: [C: +2] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/907928 (https://phabricator.wikimedia.org/T326321) (owner: Hnowlan)
[14:03:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:04:06] <wikibugs>	 (CR) Bking: [C: +2] rdf-streaming-updater: increase cpu limits [deployment-charts] - https://gerrit.wikimedia.org/r/908225 (owner: DCausse)
[14:04:24] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Ottomata) > FWIW I cannot access those either Makes sense, user `brett` would have to be in analytics-privatedata-users  Unless...what do you mean by 'access'?  You sho...
[14:05:21] <wikibugs>	 (PS7) Jbond: ci: indicate which server is the control server via a hiera param [puppet] - https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659)
[14:05:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46550 and previous config saved to /var/cache/conftool/dbconfig/20230412-140540-root.json
[14:05:55] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:06:03] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:07:06] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[14:07:22] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[14:07:41] <moritzm>	 !log re-enabled Puppet in codfw/edges after puppetdb maintenance
[14:07:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:34] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: fix lua handler [deployment-charts] - https://gerrit.wikimedia.org/r/907928 (https://phabricator.wikimedia.org/T326321) (owner: Hnowlan)
[14:08:49] <wikibugs>	 SRE, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (lbowmaker)
[14:10:27] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@mwmaint2002:~$ mwscript namespaceDupes kswiki --fix # T334277, fixed the one remaining link
[14:10:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:31] <stashbot>	 T334277: Run namespaceDupes.php for kswiki - https://phabricator.wikimedia.org/T334277
[14:13:03] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[14:16:41] <wikibugs>	 (PS8) Jbond: ci: indicate which server is the control server via a hiera param [puppet] - https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659)
[14:18:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P46552 and previous config saved to /var/cache/conftool/dbconfig/20230412-141801-ladsgroup.json
[14:18:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P46553 and previous config saved to /var/cache/conftool/dbconfig/20230412-141802-ladsgroup.json
[14:19:29] <wikibugs>	 (PS3) Muehlenhoff: Create a separate Hiera variable of KDCs specifically for use in client config [puppet] - https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695)
[14:20:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:20:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46554 and previous config saved to /var/cache/conftool/dbconfig/20230412-142045-root.json
[14:22:32] <wikibugs>	 (PS4) Cathal Mooney: Add generic way to create static routes on switches [homer/public] - https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281)
[14:22:58] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[14:23:09] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[14:25:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:25:13] <wikibugs>	 (PS1) Hnowlan: rest-gateway: fix direct_response [deployment-charts] - https://gerrit.wikimedia.org/r/908242 (https://phabricator.wikimedia.org/T326321)
[14:25:55] <wikibugs>	 (CR) Raymond Ndibe: [C: +1] maintain_dbusers: move all the files under service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/906637 (owner: David Caro)
[14:26:27] <wikibugs>	 (CR) Kamila Součková: [C: +2] "LGTM again!" [deployment-charts] - https://gerrit.wikimedia.org/r/908242 (https://phabricator.wikimedia.org/T326321) (owner: Hnowlan)
[14:27:57] <wikibugs>	 (PS1) Ottomata: Update all eventgate clusters to same image version [deployment-charts] - https://gerrit.wikimedia.org/r/908244 (https://phabricator.wikimedia.org/T334510)
[14:29:07] <wikibugs>	 (PS9) Jbond: ci: indicate which server is the control server via a hiera param [puppet] - https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659)
[14:30:14] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) (owner: Muehlenhoff)
[14:30:20] <wikibugs>	 (CR) Ayounsi: [C: +1] "lgtm!" [homer/public] - https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281) (owner: Cathal Mooney)
[14:30:34] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281 (cmooney) FWIW I've submitted a new patchset with a different format for defining the routes in YAML (at Arzhel's suggestion). `     static...
[14:31:17] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: fix direct_response [deployment-charts] - https://gerrit.wikimedia.org/r/908242 (https://phabricator.wikimedia.org/T326321) (owner: Hnowlan)
[14:31:40] <wikibugs>	 (PS1) Vgutierrez: hiera: merge: hash for profile::cache::varnish::frontend::cache_be_opts [puppet] - https://gerrit.wikimedia.org/r/908245 (https://phabricator.wikimedia.org/T288106)
[14:32:15] <wikibugs>	 (PS10) Jbond: ci: indicate which server is the control server via a hiera param [puppet] - https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659)
[14:32:32] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[14:33:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T333332)', diff saved to https://phabricator.wikimedia.org/P46556 and previous config saved to /var/cache/conftool/dbconfig/20230412-143308-ladsgroup.json
[14:33:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P46557 and previous config saved to /var/cache/conftool/dbconfig/20230412-143309-ladsgroup.json
[14:33:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[14:33:13] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[14:33:17] <wikibugs>	 (CR) Ottomata: [C: +2] Update all eventgate clusters to same image version [deployment-charts] - https://gerrit.wikimedia.org/r/908244 (https://phabricator.wikimedia.org/T334510) (owner: Ottomata)
[14:33:24] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40635/console" [puppet] - https://gerrit.wikimedia.org/r/908245 (https://phabricator.wikimedia.org/T288106) (owner: Vgutierrez)
[14:33:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[14:33:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T333332)', diff saved to https://phabricator.wikimedia.org/P46558 and previous config saved to /var/cache/conftool/dbconfig/20230412-143331-ladsgroup.json
[14:35:11] <wikibugs>	 (PS11) Jbond: ci: indicate which server is the control server via a hiera param [puppet] - https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659)
[14:35:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T333332)', diff saved to https://phabricator.wikimedia.org/P46559 and previous config saved to /var/cache/conftool/dbconfig/20230412-143545-ladsgroup.json
[14:36:14] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40637/console" [puppet] - https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: Jbond)
[14:36:36] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[14:36:39] <wikibugs>	 (CR) Ottomata: [V: +2 C: +2] Update all eventgate clusters to same image version [deployment-charts] - https://gerrit.wikimedia.org/r/908244 (https://phabricator.wikimedia.org/T334510) (owner: Ottomata)
[14:36:42] <wikibugs>	 (CR) Jbond: [V: +1] "10 patches later and we finally have a pcc 😊" [puppet] - https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: Jbond)
[14:36:43] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[14:37:47] <godog>	 jouncebot: next
[14:37:48] <jouncebot>	 In 2 hour(s) and 22 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1700)
[14:38:04] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Rename cadvisor_exporter to cadvisor [puppet] - https://gerrit.wikimedia.org/r/908215 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[14:38:18] <godog>	 jbond: merging your change too
[14:38:27] <godog>	 for private.git that is
[14:38:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:36] <moritzm>	 !log installing apache security updates on gerrit1001
[14:38:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:48] <godog>	 jbond: labs/private.git 
[14:40:25] <moritzm>	 !log installing apache security updates on phab1004 (phabricator.wikimedia.org)
[14:40:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:46] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[14:42:15] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[14:42:20] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[14:42:34] <urbanecm>	 jouncebot: nowandnext
[14:42:34] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 17 minute(s)
[14:42:34] <jouncebot>	 In 2 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1700)
[14:43:04] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[14:43:23] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[14:43:32] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
[14:43:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:44:11] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
[14:44:56] <wikibugs>	 (PS1) Urbanecm: [Growth] beta: Enable Personalized praise everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/908249
[14:45:30] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/908249 (owner: Urbanecm)
[14:46:15] <wikibugs>	 (CR) Jbond: [C: +1] Create a separate Hiera variable of KDCs specifically for use in client config [puppet] - https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) (owner: Muehlenhoff)
[14:46:30] <wikibugs>	 (Merged) jenkins-bot: [Growth] beta: Enable Personalized praise everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/908249 (owner: Urbanecm)
[14:47:27] <wikibugs>	 (PS1) Ottomata: eventgate - remove deprecated all_settings stream config param [deployment-charts] - https://gerrit.wikimedia.org/r/908251 (https://phabricator.wikimedia.org/T286344)
[14:47:44] <wikibugs>	 (PS5) David Caro: openstack: encapi: new id-based api for Terraform [puppet] - https://gerrit.wikimedia.org/r/874812 (https://phabricator.wikimedia.org/T317478) (owner: Majavah)
[14:47:52] <wikibugs>	 (CR) David Caro: openstack: encapi: new id-based api for Terraform (1 comment) [puppet] - https://gerrit.wikimedia.org/r/874812 (https://phabricator.wikimedia.org/T317478) (owner: Majavah)
[14:48:09] <wikibugs>	 (CR) Ottomata: [V: +2 C: +2] eventgate - remove deprecated all_settings stream config param [deployment-charts] - https://gerrit.wikimedia.org/r/908251 (https://phabricator.wikimedia.org/T286344) (owner: Ottomata)
[14:48:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T333332)', diff saved to https://phabricator.wikimedia.org/P46560 and previous config saved to /var/cache/conftool/dbconfig/20230412-144815-ladsgroup.json
[14:48:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2155.codfw.wmnet with reason: Maintenance
[14:48:20] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[14:48:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2155.codfw.wmnet with reason: Maintenance
[14:48:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance
[14:48:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance
[14:48:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T333332)', diff saved to https://phabricator.wikimedia.org/P46561 and previous config saved to /var/cache/conftool/dbconfig/20230412-144856-ladsgroup.json
[14:48:57] <wikibugs>	 (PS1) Jbond: puppet-diffs: add project members to the access list [puppet] - https://gerrit.wikimedia.org/r/908252
[14:49:13] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] puppet-diffs: add project members to the access list [puppet] - https://gerrit.wikimedia.org/r/908252 (owner: Jbond)
[14:49:58] <wikibugs>	 (PS1) Ayounsi: Allow cumin host to reach gNMI on sonic switches [homer/public] - https://gerrit.wikimedia.org/r/908253
[14:50:02] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Let's give it a shot" [puppet] - https://gerrit.wikimedia.org/r/908045 (owner: Jelto)
[14:50:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P46562 and previous config saved to /var/cache/conftool/dbconfig/20230412-145051-ladsgroup.json
[14:51:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:51:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T333332)', diff saved to https://phabricator.wikimedia.org/P46563 and previous config saved to /var/cache/conftool/dbconfig/20230412-145108-ladsgroup.json
[14:51:47] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:51:55] <wikibugs>	 (CR) David Caro: [C: +2] openstack: encapi: new id-based api for Terraform [puppet] - https://gerrit.wikimedia.org/r/874812 (https://phabricator.wikimedia.org/T317478) (owner: Majavah)
[14:51:58] <wikibugs>	 SRE-tools, Infrastructure-Foundations, cloud-services-team (FY2022/2023-Q4): WMCS Cookbook Automation FY2022-23 Q2 tracking task - https://phabricator.wikimedia.org/T319401 (fnegri) p:Medium→High
[14:52:05] <wikibugs>	 (CR) Ayounsi: [C: +2] Allow cumin host to reach gNMI on sonic switches [homer/public] - https://gerrit.wikimedia.org/r/908253 (owner: Ayounsi)
[14:52:07] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[14:52:12] <wikibugs>	 (CR) Majavah: [C: -1] openstack: encapi: new id-based api for Terraform [puppet] - https://gerrit.wikimedia.org/r/874812 (https://phabricator.wikimedia.org/T317478) (owner: Majavah)
[14:52:15] <wikibugs>	 SRE-tools, Infrastructure-Foundations, cloud-services-team (FY2022/2023-Q4): WMCS Cookbook Automation FY2022-23 Q2 tracking task - https://phabricator.wikimedia.org/T319401 (fnegri)
[14:52:20] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[14:52:35] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[14:52:38] <wikibugs>	 (Merged) jenkins-bot: Allow cumin host to reach gNMI on sonic switches [homer/public] - https://gerrit.wikimedia.org/r/908253 (owner: Ayounsi)
[14:52:48] <wikibugs>	 (PS6) David Caro: openstack: encapi: open up write access [puppet] - https://gerrit.wikimedia.org/r/874813 (https://phabricator.wikimedia.org/T317478) (owner: Majavah)
[14:53:16] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[14:53:19] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[14:53:35] <wikibugs>	 SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn)
[14:53:44] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
[14:54:28] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
[14:55:10] <wikibugs>	 (PS1) Ottomata: eventgate - remove irrelevant comment about all_settings [deployment-charts] - https://gerrit.wikimedia.org/r/908255
[14:55:11] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[14:55:28] <wikibugs>	 (CR) Ottomata: [V: +2 C: +2] eventgate - remove irrelevant comment about all_settings [deployment-charts] - https://gerrit.wikimedia.org/r/908255 (owner: Ottomata)
[14:55:49] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[14:56:06] <wikibugs>	 (PS2) David Caro: cloudlib: support https for fetching data [puppet] - https://gerrit.wikimedia.org/r/875896 (owner: Majavah)
[14:56:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:56:07] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[14:56:13] <wikibugs>	 (PS3) David Caro: hieradata: use port 443 for enc access [puppet] - https://gerrit.wikimedia.org/r/874894 (owner: Majavah)
[14:56:19] <wikibugs>	 (PS6) David Caro: openstack: encapi: drop legacy ports [puppet] - https://gerrit.wikimedia.org/r/874814 (owner: Majavah)
[14:56:19] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[14:57:15] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[14:57:42] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[14:57:54] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[14:57:59] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:58:22] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[14:58:40] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[14:58:42] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Create a separate Hiera variable of KDCs specifically for use in client config [puppet] - https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) (owner: Muehlenhoff)
[14:58:49] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:59:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:59:08] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[14:59:20] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[14:59:36] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[14:59:53] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[15:00:15] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply
[15:00:25] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[15:00:35] <wikibugs>	 (PS13) David Caro: maintain-dbusers: use click for cli definition [puppet] - https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955)
[15:01:37] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:02:13] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[15:02:18] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye execute...
[15:03:03] <wikibugs>	 SRE, ops-eqiad, Data-Engineering, Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (Jclark-ctr) Created raid 1 for 2 ssd @elukey
[15:04:29] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[15:04:38] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[15:05:11] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.981 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:05:35] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[15:05:44] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[15:05:57] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.399 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:05:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P46564 and previous config saved to /var/cache/conftool/dbconfig/20230412-150557-ladsgroup.json
[15:06:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P46565 and previous config saved to /var/cache/conftool/dbconfig/20230412-150614-ladsgroup.json
[15:09:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 211.8k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[15:09:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:09:51] <wikibugs>	 (PS1) Ottomata: flink - Allow for conditionally disabling jemalloc [docker-images/production-images] - https://gerrit.wikimedia.org/r/908256
[15:10:42] <wikibugs>	 SRE, MediaWiki-General: The script file run.php cannot be executed using MaintenanceRunner - https://phabricator.wikimedia.org/T334484 (TheresNoTime) Open→Resolved [[ https://wikitech.wikimedia.org/w/index.php?title=Maintenance_server&diff=prev&oldid=2068462&diffmode=source | Updated the docs ]],...
[15:13:34] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (BCornwall) @Ottomata yes, I just meant those dashboards, which I do get privilege errors. I'll wait for @FNavas-foundation to clarify their issues. Thanks!
[15:13:36] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40638/console" [puppet] - https://gerrit.wikimedia.org/r/906017 (https://phabricator.wikimedia.org/T334092) (owner: Stevemunene)
[15:14:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 204.1k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[15:14:19] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[15:14:23] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[15:15:41] <wikibugs>	 (CR) Elukey: [V: +1 C: +1] "Given the state of the host, I am in favor of remove it from hadoop so we can reimage it completely (HDDs as well), and then we can re-add" [puppet] - https://gerrit.wikimedia.org/r/906017 (https://phabricator.wikimedia.org/T334092) (owner: Stevemunene)
[15:16:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:18:16] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (nettrom_WMF) >>! In T334432#8773117, @BCornwall wrote: > @nettrom_WMF Are you the approving party (manager) of @KMorgan-WMF?  No, I think that would be eit...
[15:18:45] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40639/console" [puppet] - https://gerrit.wikimedia.org/r/906637 (owner: David Caro)
[15:19:04] <ottomata>	 elukey:  o/ am trying to deploy eventgate-main, and it looks like maybe i'm getting a kafka ssl error in staging
[15:19:18] <wikibugs>	 (PS2) Jdlrobson: Set Vector 2022 as default skin on Welsh Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/907539 (https://phabricator.wikimedia.org/T334279)
[15:19:23] <ottomata>	 i already deployed all other eventgates,  (that don't talk to kafka main) those were fine.
[15:19:45] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] hiera: merge: hash for profile::cache::varnish::frontend::cache_be_opts [puppet] - https://gerrit.wikimedia.org/r/908245 (https://phabricator.wikimedia.org/T288106) (owner: Vgutierrez)
[15:19:45] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (BCornwall) Hi, @DMburugu and @KStoller-WMF! Could either/both of you review the description of this ticket and approve/deny the request, please? A simple c...
[15:19:47] <sukhe>	 BGP alerts in codfw expected
[15:19:57] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:20:01] <elukey>	 ottomata: ouch, checking
[15:20:13] <wikibugs>	 (CR) Jelto: [C: +2] Revert "install_server: change device names in gitlab-raid1" [puppet] - https://gerrit.wikimedia.org/r/908045 (owner: Jelto)
[15:20:41] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (BCornwall)
[15:20:59] <ottomata>	 {"name":"eventgate-wikimedia","hostname":"eventgate-production-67474fd66b-wbxwm","pid":141,"producer_type":"GuaranteedProducer","level":"ERROR","error":{"origin":"local","message":"ssl error","code":-1,"errno":-1,"stack":"Error: Local: SSL error"},"msg":"Encountered rdkafka error event: ssl error","time":"2023-04-12T15:20:52.719Z","v":0}
[15:21:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T333332)', diff saved to https://phabricator.wikimedia.org/P46566 and previous config saved to /var/cache/conftool/dbconfig/20230412-152104-ladsgroup.json
[15:21:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[15:21:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:21:09] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[15:21:16] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (BCornwall)
[15:21:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[15:21:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P46567 and previous config saved to /var/cache/conftool/dbconfig/20230412-152120-ladsgroup.json
[15:21:24] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (Trizek-WMF) Thank you everyone!
[15:21:25] <wikibugs>	 (PS2) Jdlrobson: Drop unused VectorPageTools feature flag [mediawiki-config] - https://gerrit.wikimedia.org/r/907511 (https://phabricator.wikimedia.org/T332090)
[15:21:31] <ottomata>	 hm 
[15:21:32] <ottomata>	 ssl.ca.location":"/etc/eventgate/puppetca.crt.pem
[15:21:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[15:21:48] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[15:22:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance
[15:22:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance
[15:22:36] <elukey>	 ottomata: let's go to #sre
[15:22:37] <elukey>	 .19
[15:22:39] <ottomata>	 k
[15:22:44] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2010.codfw.wmnet with OS bullseye
[15:23:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2108.codfw.wmnet with reason: Maintenance
[15:23:01] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2010.codfw.wmnet with OS bullseye
[15:23:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2108.codfw.wmnet with reason: Maintenance
[15:23:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T333332)', diff saved to https://phabricator.wikimedia.org/P46568 and previous config saved to /var/cache/conftool/dbconfig/20230412-152320-ladsgroup.json
[15:23:50] <wikibugs>	 SRE, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (fgiunchedi)
[15:24:25] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:24:58] <sukhe>	 ^ expected
[15:25:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T333332)', diff saved to https://phabricator.wikimedia.org/P46569 and previous config saved to /var/cache/conftool/dbconfig/20230412-152553-ladsgroup.json
[15:26:36] <wikibugs>	 (PS3) Jelto: buildkitd: Isolate build container user/process/network namespaces [puppet] - https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: Dduvall)
[15:26:43] <wikibugs>	 (PS1) Ottomata: eventgate-main - set common Kafka ssl settings in values.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/908258
[15:27:25] <wikibugs>	 (CR) Jelto: [C: +2] buildkitd: Isolate build container user/process/network namespaces [puppet] - https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: Dduvall)
[15:28:09] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40640/console" [puppet] - https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: David Caro)
[15:30:02] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:30:10] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:32:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release eventgate-main/production on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-main - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:32:51] <wikibugs>	 (CR) Elukey: [C: +1] eventgate-main - set common Kafka ssl settings in values.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/908258 (owner: Ottomata)
[15:34:35] <wikibugs>	 (PS1) Ssingh: hiera: lvs2010: update iface names for bullseye (codfw) [puppet] - https://gerrit.wikimedia.org/r/908270 (https://phabricator.wikimedia.org/T321309)
[15:35:15] <wikibugs>	 (PS1) Dzahn: add gerrit-new service IP [dns] - https://gerrit.wikimedia.org/r/908271 (https://phabricator.wikimedia.org/T334524)
[15:35:49] <wikibugs>	 (CR) Gmodena: [C: +1] "LGTM." [docker-images/production-images] - https://gerrit.wikimedia.org/r/908256 (owner: Ottomata)
[15:36:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:36:11] <wikibugs>	 (CR) CI reject: [V: -1] add gerrit-new service IP [dns] - https://gerrit.wikimedia.org/r/908271 (https://phabricator.wikimedia.org/T334524) (owner: Dzahn)
[15:36:25] <wikibugs>	 (PS2) Ssingh: hiera: lvs2010: update iface names for bullseye (codfw) [puppet] - https://gerrit.wikimedia.org/r/908270 (https://phabricator.wikimedia.org/T321309)
[15:36:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T333332)', diff saved to https://phabricator.wikimedia.org/P46570 and previous config saved to /var/cache/conftool/dbconfig/20230412-153627-ladsgroup.json
[15:36:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2172.codfw.wmnet with reason: Maintenance
[15:36:33] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[15:36:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2172.codfw.wmnet with reason: Maintenance
[15:36:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T333332)', diff saved to https://phabricator.wikimedia.org/P46571 and previous config saved to /var/cache/conftool/dbconfig/20230412-153651-ladsgroup.json
[15:38:46] <wikibugs>	 (PS1) Snwachukwu: Add referer_name field to druid pageviews hourly and daily tables [puppet] - https://gerrit.wikimedia.org/r/908272 (https://phabricator.wikimedia.org/T334224)
[15:39:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T333332)', diff saved to https://phabricator.wikimedia.org/P46572 and previous config saved to /var/cache/conftool/dbconfig/20230412-153903-ladsgroup.json
[15:39:12] <wikibugs>	 (PS2) Dzahn: add gerrit-new service IP [dns] - https://gerrit.wikimedia.org/r/908271 (https://phabricator.wikimedia.org/T334524)
[15:41:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P46573 and previous config saved to /var/cache/conftool/dbconfig/20230412-154100-ladsgroup.json
[15:41:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:42:32] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/908271 (https://phabricator.wikimedia.org/T334524) (owner: Dzahn)
[15:44:01] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2010.codfw.wmnet with reason: host reimage
[15:44:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:44:27] <wikibugs>	 (CR) Dzahn: [C: +2] add gerrit-new service IP [dns] - https://gerrit.wikimedia.org/r/908271 (https://phabricator.wikimedia.org/T334524) (owner: Dzahn)
[15:44:40] <wikibugs>	 (CR) Dzahn: [C: +2] "thanks for the review" [dns] - https://gerrit.wikimedia.org/r/908271 (https://phabricator.wikimedia.org/T334524) (owner: Dzahn)
[15:44:58] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:45:04] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:45:17] <wikibugs>	 (PS1) Majavah: aptrepo: Drop kubernetes 1.21 components [puppet] - https://gerrit.wikimedia.org/r/908275 (https://phabricator.wikimedia.org/T286856)
[15:45:19] <wikibugs>	 (PS1) Majavah: aptrepo: Add kubeadm 1.23 component [puppet] - https://gerrit.wikimedia.org/r/908276 (https://phabricator.wikimedia.org/T298005)
[15:46:08] <wikibugs>	 (CR) Ssingh: [C: +2] hiera: lvs2010: update iface names for bullseye (codfw) [puppet] - https://gerrit.wikimedia.org/r/908270 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[15:46:19] <wikibugs>	 SRE, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (Michaelcochez) Do we have a timeline for this move already?  Is it better to not update the gerrit repo at the moment? Our main development...
[15:46:59] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2010.codfw.wmnet with reason: host reimage
[15:47:02] <wikibugs>	 SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn)
[15:47:05] <wikibugs>	 (PS2) Ottomata: eventgate - set default kafka ssl.ca.location in chart values.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/908258
[15:47:08] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:47:16] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:47:48] <wikibugs>	 SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn) 208.80.154.151 / 2620:0:861:2:208:80:154:151  has been selected as the service IP for the new host in the subtask above
[15:48:09] <wikibugs>	 SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn) p:Medium→High
[15:49:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:49:25] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:49:33] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:52:16] <wikibugs>	 SRE, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (calbon)
[15:52:50] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:52:57] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:53:00] <wikibugs>	 SRE, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (leila)
[15:54:07] <wikibugs>	 (PS1) Dzahn: site: add role(gerrit::migration) to host gerrit1003 [puppet] - https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368)
[15:54:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P46575 and previous config saved to /var/cache/conftool/dbconfig/20230412-155410-ladsgroup.json
[15:56:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P46576 and previous config saved to /var/cache/conftool/dbconfig/20230412-155606-ladsgroup.json
[15:57:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:57:38] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:57:46] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:58:35] <wikibugs>	 (PS5) Cathal Mooney: Add generic way to create static routes on switches [homer/public] - https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281)
[15:58:52] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet
[15:59:11] <wikibugs>	 (CR) Ottomata: [C: +2] eventgate - set default kafka ssl.ca.location in chart values.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/908258 (owner: Ottomata)
[15:59:14] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Add generic way to create static routes on switches [homer/public] - https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281) (owner: Cathal Mooney)
[15:59:48] <wikibugs>	 (Merged) jenkins-bot: Add generic way to create static routes on switches [homer/public] - https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281) (owner: Cathal Mooney)
[16:02:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:02:45] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[16:03:00] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[16:03:29] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:03:40] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[16:04:10] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2010.codfw.wmnet with OS bullseye
[16:04:13] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[16:04:20] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply
[16:04:20] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2010.codfw.wmnet with OS bullseye completed: - lvs2010 (**PASS**)   - Downtimed on Icinga/Aler...
[16:04:49] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
[16:05:30] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[16:05:39] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[16:07:34] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release eventgate-main/production on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-main - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[16:07:51] <ottomata>	 hm
[16:08:22] <ottomata>	 oh resolved, right cool.
[16:09:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P46577 and previous config saved to /var/cache/conftool/dbconfig/20230412-160916-ladsgroup.json
[16:09:49] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[16:09:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[16:11:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T333332)', diff saved to https://phabricator.wikimedia.org/P46578 and previous config saved to /var/cache/conftool/dbconfig/20230412-161112-ladsgroup.json
[16:11:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2120.codfw.wmnet with reason: Maintenance
[16:11:17] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[16:11:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2120.codfw.wmnet with reason: Maintenance
[16:11:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T333332)', diff saved to https://phabricator.wikimedia.org/P46579 and previous config saved to /var/cache/conftool/dbconfig/20230412-161135-ladsgroup.json
[16:14:09] <wikibugs>	 SRE, serviceops-collab, Patch-For-Review: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn)
[16:14:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T333332)', diff saved to https://phabricator.wikimedia.org/P46580 and previous config saved to /var/cache/conftool/dbconfig/20230412-161409-ladsgroup.json
[16:15:33] <wikibugs>	 (CR) DCausse: [C: +1] flink - Allow for conditionally disabling jemalloc [docker-images/production-images] - https://gerrit.wikimedia.org/r/908256 (owner: Ottomata)
[16:15:56] <wikibugs>	 (CR) Dzahn: [V: -1] "https://puppet-compiler.wmflabs.org/output/908278/40641/gerrit1003.wikimedia.org/change.gerrit1003.wikimedia.org.err" [puppet] - https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[16:16:53] <wikibugs>	 (CR) Ottomata: [C: +2] flink - Allow for conditionally disabling jemalloc [docker-images/production-images] - https://gerrit.wikimedia.org/r/908256 (owner: Ottomata)
[16:17:02] <wikibugs>	 (CR) Ottomata: [V: +2 C: +2] flink - Allow for conditionally disabling jemalloc [docker-images/production-images] - https://gerrit.wikimedia.org/r/908256 (owner: Ottomata)
[16:18:58] <wikibugs>	 SRE, Infrastructure-Foundations, netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (cmooney) I'd consider client auth a "stretch goal" for now, nice to have but not sure we want to have all that extra complexity.  In terms of an intermediate CA just for network...
[16:22:29] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic: Set NEL `success_fraction: 1.0` on HTTP responses for measurement domains - https://phabricator.wikimedia.org/T334608 (CDanis)
[16:24:18] <wikibugs>	 SRE-tools, Infrastructure-Foundations, cloud-services-team (FY2022/2023-Q4): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (fnegri)
[16:24:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T333332)', diff saved to https://phabricator.wikimedia.org/P46581 and previous config saved to /var/cache/conftool/dbconfig/20230412-162422-ladsgroup.json
[16:24:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2179.codfw.wmnet with reason: Maintenance
[16:24:28] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[16:24:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2179.codfw.wmnet with reason: Maintenance
[16:24:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T333332)', diff saved to https://phabricator.wikimedia.org/P46582 and previous config saved to /var/cache/conftool/dbconfig/20230412-162448-ladsgroup.json
[16:25:42] <wikibugs>	 (PS2) Dzahn: site: add role(gerrit::migration) to gerrit1003 and fix code [puppet] - https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368)
[16:27:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T333332)', diff saved to https://phabricator.wikimedia.org/P46583 and previous config saved to /var/cache/conftool/dbconfig/20230412-162700-ladsgroup.json
[16:27:31] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:28:07] <wikibugs>	 (PS3) Dzahn: site: add role(gerrit::migration) to gerrit1003 and fix code [puppet] - https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368)
[16:29:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P46584 and previous config saved to /var/cache/conftool/dbconfig/20230412-162915-ladsgroup.json
[16:30:37] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:30:40] <wikibugs>	 (03PS1) 10Andrew Bogott: Update partman for cloudvirtlocal100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/908283 (https://phabricator.wikimedia.org/T329863)
[16:33:11] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Update partman for cloudvirtlocal100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/908283 (https://phabricator.wikimedia.org/T329863) (owner: 10Andrew Bogott)
[16:36:07] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic: Set NEL `success_fraction: 1.0` on HTTP responses for measurement domains - https://phabricator.wikimedia.org/T334608 (10CDanis)
[16:40:31] <wikibugs>	 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) > Flink doc does suggest that their k8s HA implementation could wor...
[16:42:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P46585 and previous config saved to /var/cache/conftool/dbconfig/20230412-164206-ladsgroup.json
[16:43:11] <wikibugs>	 (03PS4) 10Dzahn: site: add role(gerrit::migration) to gerrit1003 and fix code [puppet] - 10https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368)
[16:44:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P46586 and previous config saved to /var/cache/conftool/dbconfig/20230412-164422-ladsgroup.json
[16:44:55] <wikibugs>	 10SRE, 10SRE-Access-Requests: Update SSH key for Mikhail Popov - https://phabricator.wikimedia.org/T334423 (10mpopov) Thank you @BCornwall! I can confirm that I can SSH from the new laptop with the new key.
[16:47:01] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add database config for disable_tool process [puppet] - 10https://gerrit.wikimedia.org/r/907983 (https://phabricator.wikimedia.org/T332514) (owner: 10Andrew Bogott)
[16:48:38] <wikibugs>	 (03PS1) 10Cathal Mooney: Fix incorrect next-hop for IPv6 default route on cloudsw2-c8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/908285 (https://phabricator.wikimedia.org/T334281)
[16:49:58] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Fix incorrect next-hop for IPv6 default route on cloudsw2-c8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/908285 (https://phabricator.wikimedia.org/T334281) (owner: 10Cathal Mooney)
[16:50:15] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[16:50:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet wit...
[16:51:30] <topranks>	 !log Updating routing-options on Eqiad lsw1 switches to add empty rib inet6 stanza T334281
[16:51:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:34] <stashbot>	 T334281: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281
[16:51:52] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[16:52:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet...
[16:53:54] <wikibugs>	 (03Merged) 10jenkins-bot: Fix incorrect next-hop for IPv6 default route on cloudsw2-c8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/908285 (https://phabricator.wikimedia.org/T334281) (owner: 10Cathal Mooney)
[16:54:10] <topranks>	 !log Updating routing-options on drmrs asw switches to add empty rib inet6 stanza T334281
[16:54:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P46587 and previous config saved to /var/cache/conftool/dbconfig/20230412-165712-ladsgroup.json
[16:58:55] <wikibugs>	 (03PS5) 10Dzahn: site: add role(gerrit::migration) to gerrit1003 and fix code [puppet] - 10https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368)
[16:59:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T333332)', diff saved to https://phabricator.wikimedia.org/P46588 and previous config saved to /var/cache/conftool/dbconfig/20230412-165928-ladsgroup.json
[16:59:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance
[16:59:33] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[16:59:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance
[16:59:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T333332)', diff saved to https://phabricator.wikimedia.org/P46589 and previous config saved to /var/cache/conftool/dbconfig/20230412-165951-ladsgroup.json
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1700)
[17:02:00] <wikibugs>	 (03CR) 10Dzahn: "cool, thanks for this. let's link these to https://phabricator.wikimedia.org/T324659" [puppet] - 10https://gerrit.wikimedia.org/r/893483 (owner: 10Hashar)
[17:02:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T333332)', diff saved to https://phabricator.wikimedia.org/P46590 and previous config saved to /var/cache/conftool/dbconfig/20230412-170224-ladsgroup.json
[17:02:45] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Dzahn) Hashar did "contint: manage dsh target from Puppet DB" -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/893483
[17:05:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 213.6k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[17:10:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 202.5k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[17:12:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T333332)', diff saved to https://phabricator.wikimedia.org/P46591 and previous config saved to /var/cache/conftool/dbconfig/20230412-171219-ladsgroup.json
[17:12:24] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[17:17:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P46592 and previous config saved to /var/cache/conftool/dbconfig/20230412-171730-ladsgroup.json
[17:24:49] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh)
[17:25:07] <wikibugs>	 (03PS1) 10Jbond: hieradata: move overrides to role/site part of hiera [puppet] - 10https://gerrit.wikimedia.org/r/908308
[17:25:13] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:26:32] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40644/console" [puppet] - 10https://gerrit.wikimedia.org/r/908308 (owner: 10Jbond)
[17:28:18] <wikibugs>	 (03CR) 10Hashar: "That is amazing John thank you! I will jump on it tomorrow morning :)" [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: 10Jbond)
[17:28:58] <wikibugs>	 10SRE, 10Commons, 10Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (10Umar) For more than a month I have not seen new versions of files.  https://commons.wikimedia.org/wiki/File:Vake_District.svg
[17:30:07] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:32:00] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/908272 (https://phabricator.wikimedia.org/T334224) (owner: 10Snwachukwu)
[17:32:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P46593 and previous config saved to /var/cache/conftool/dbconfig/20230412-173237-ladsgroup.json
[17:37:06] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] perf: PaintTiming metrics is now sent in the navtiming event. [alerts] - 10https://gerrit.wikimedia.org/r/908234 (https://phabricator.wikimedia.org/T328256) (owner: 10Phedenskog)
[17:38:10] <wikibugs>	 (03PS1) 10Ottomata: flink-operator - set default resource limits and requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/908310 (https://phabricator.wikimedia.org/T333464)
[17:39:23] <wikibugs>	 (03Merged) 10jenkins-bot: perf: PaintTiming metrics is now sent in the navtiming event. [alerts] - 10https://gerrit.wikimedia.org/r/908234 (https://phabricator.wikimedia.org/T328256) (owner: 10Phedenskog)
[17:44:37] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye
[17:44:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye
[17:45:26] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] flink-operator - set default resource limits and requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/908310 (https://phabricator.wikimedia.org/T333464) (owner: 10Ottomata)
[17:46:59] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:47:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[17:47:28] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:47:34] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:47:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T333332)', diff saved to https://phabricator.wikimedia.org/P46594 and previous config saved to /var/cache/conftool/dbconfig/20230412-174743-ladsgroup.json
[17:47:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance
[17:47:48] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[17:48:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance
[17:48:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T333332)', diff saved to https://phabricator.wikimedia.org/P46595 and previous config saved to /var/cache/conftool/dbconfig/20230412-174806-ladsgroup.json
[17:48:22] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:52:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T333332)', diff saved to https://phabricator.wikimedia.org/P46596 and previous config saved to /var/cache/conftool/dbconfig/20230412-175240-ladsgroup.json
[17:54:33] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:00:06] <jouncebot>	 ^demon and hashar: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1800).
[18:00:06] <jouncebot>	 ^demon and hashar: OwO what's this, a deployment window?? MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1800). nyaa~
[18:00:45] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:01:16] <dancy>	 I'm running the train today.
[18:02:15] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908314 (https://phabricator.wikimedia.org/T330210)
[18:02:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908314 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot)
[18:03:58] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908314 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot)
[18:07:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P46597 and previous config saved to /var/cache/conftool/dbconfig/20230412-180746-ladsgroup.json
[18:10:27] <logmsgbot>	 !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.4  refs T330210
[18:10:31] <stashbot>	 T330210: 1.41.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T330210
[18:14:53] <wikibugs>	 (03PS1) 10Andrew Bogott: Fix partman for cloudvirtlocal100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/908316 (https://phabricator.wikimedia.org/T329863)
[18:15:34] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Fix partman for cloudvirtlocal100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/908316 (https://phabricator.wikimedia.org/T329863) (owner: 10Andrew Bogott)
[18:16:29] <logmsgbot>	 !log dancy@deploy2002 Synchronized php: group1 wikis to 1.41.0-wmf.4  refs T330210 (duration: 06m 02s)
[18:16:36] <stashbot>	 T330210: 1.41.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T330210
[18:18:30] <wikibugs>	 (03PS1) 10Andrew Bogott: profile::toolforge::disable_tool: fix a couple of param names [puppet] - 10https://gerrit.wikimedia.org/r/908317 (https://phabricator.wikimedia.org/T332514)
[18:20:31] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] profile::toolforge::disable_tool: fix a couple of param names [puppet] - 10https://gerrit.wikimedia.org/r/908317 (https://phabricator.wikimedia.org/T332514) (owner: 10Andrew Bogott)
[18:22:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P46598 and previous config saved to /var/cache/conftool/dbconfig/20230412-182252-ladsgroup.json
[18:23:27] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (10jbond) >I would worry about how we deal with the security / key management aspects of it.   Just to expand on this a bit the reason why there may be a need for an additional inte...
[18:25:45] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/output/908278/40645/" [puppet] - 10https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn)
[18:37:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T333332)', diff saved to https://phabricator.wikimedia.org/P46599 and previous config saved to /var/cache/conftool/dbconfig/20230412-183758-ladsgroup.json
[18:38:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance
[18:38:05] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[18:38:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance
[18:38:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T333332)', diff saved to https://phabricator.wikimedia.org/P46600 and previous config saved to /var/cache/conftool/dbconfig/20230412-183822-ladsgroup.json
[18:39:07] <wikibugs>	 (03PS1) 10Andrew Bogott: profile::toolforge::disable_tool: include python3-pymysql [puppet] - 10https://gerrit.wikimedia.org/r/908319 (https://phabricator.wikimedia.org/T332514)
[18:39:44] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye
[18:41:02] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye
[18:41:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye
[18:41:21] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] profile::toolforge::disable_tool: include python3-pymysql [puppet] - 10https://gerrit.wikimedia.org/r/908319 (https://phabricator.wikimedia.org/T332514) (owner: 10Andrew Bogott)
[18:42:34] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[18:42:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[18:42:46] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[18:42:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[18:42:57] <wikibugs>	 (03PS1) 10Jforrester: Ensure ApiHelp correctly types values in TOCData objects [core] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/908289 (https://phabricator.wikimedia.org/T334551)
[18:43:06] <wikibugs>	 (03PS1) 10Jforrester: Ensure ApiHelp correctly types values in TOCData objects [core] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/908290 (https://phabricator.wikimedia.org/T334551)
[18:57:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Ensure ApiHelp correctly types values in TOCData objects [core] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/908290 (https://phabricator.wikimedia.org/T334551) (owner: 10Jforrester)
[19:00:30] <zabe>	 jouncebot: nowandnext
[19:00:30] <jouncebot>	 For the next 0 hour(s) and 59 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T1800)
[19:00:31] <jouncebot>	 In 0 hour(s) and 59 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T2000)
[19:00:45] <dancy>	 Train has already been advanced so you're welcome to do stuff.
[19:00:58] <zabe>	 thanks :)
[19:01:18] <wikibugs>	 (03PS1) 10BCornwall: hiera: lvs2007: update iface names for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/908322 (https://phabricator.wikimedia.org/T321309)
[19:04:46] <wikibugs>	 (03PS1) 10Cathal Mooney: Expose interface VRF association to templates if present in Netbox [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/908325 (https://phabricator.wikimedia.org/T312635)
[19:05:29] <wikibugs>	 (03PS1) 10Zabe: composer.json: Explicitly pin psr/http-message to 1.0.1 [core] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/908291 (https://phabricator.wikimedia.org/T333993)
[19:06:03] <wikibugs>	 (03PS2) 10Zabe: Ensure ApiHelp correctly types values in TOCData objects [core] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/908290 (https://phabricator.wikimedia.org/T334551) (owner: 10Jforrester)
[19:07:45] <wikibugs>	 (03PS2) 10Jbond: environment: add environment.conf file and remove environments dir [puppet] - 10https://gerrit.wikimedia.org/r/907991
[19:07:47] <wikibugs>	 (03PS4) 10Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841)
[19:07:49] <wikibugs>	 (03PS29) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356
[19:07:51] <wikibugs>	 (03PS1) 10Jbond: core_modules: add core modules [puppet] - 10https://gerrit.wikimedia.org/r/908326
[19:08:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] environment: add environment.conf file and remove environments dir [puppet] - 10https://gerrit.wikimedia.org/r/907991 (owner: 10Jbond)
[19:08:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] core_modules: add core modules [puppet] - 10https://gerrit.wikimedia.org/r/908326 (owner: 10Jbond)
[19:09:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T333332)', diff saved to https://phabricator.wikimedia.org/P46601 and previous config saved to /var/cache/conftool/dbconfig/20230412-190904-ladsgroup.json
[19:09:10] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[19:10:24] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] composer.json: Explicitly pin psr/http-message to 1.0.1 [core] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/908291 (https://phabricator.wikimedia.org/T333993) (owner: 10Zabe)
[19:10:26] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Ensure ApiHelp correctly types values in TOCData objects [core] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/908290 (https://phabricator.wikimedia.org/T334551) (owner: 10Jforrester)
[19:10:32] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Ensure ApiHelp correctly types values in TOCData objects [core] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/908289 (https://phabricator.wikimedia.org/T334551) (owner: 10Jforrester)
[19:12:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond)
[19:13:14] <wikibugs>	 (03PS1) 10Andrew Bogott: Add profile::toolforge::nfs_disable_tool [puppet] - 10https://gerrit.wikimedia.org/r/908327
[19:13:33] <wikibugs>	 (03PS4) 10Krinkle: Set "s3" as the default section name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893834 (owner: 10Aaron Schulz)
[19:15:36] <wikibugs>	 (03PS1) 10Jameel Kaisar: Set NEL 'success_fraction: 1.0' on HTTP responses for measurement domains [puppet] - 10https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608)
[19:16:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Set NEL 'success_fraction: 1.0' on HTTP responses for measurement domains [puppet] - 10https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608) (owner: 10Jameel Kaisar)
[19:16:37] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add profile::toolforge::nfs_disable_tool [puppet] - 10https://gerrit.wikimedia.org/r/908327 (owner: 10Andrew Bogott)
[19:16:53] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on sessionstore1001.eqiad.wmnet with reason: Reproducing dissonant cluster state
[19:17:09] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on sessionstore1001.eqiad.wmnet with reason: Reproducing dissonant cluster state
[19:17:19] <wikibugs>	 (03PS1) 10Cwhite: opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732)
[19:17:50] <wikibugs>	 (03PS2) 10BCornwall: hiera: lvs2007: update iface names for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/908322 (https://phabricator.wikimedia.org/T321309)
[19:17:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite)
[19:18:58] <wikibugs>	 (03CR) 10Krinkle: Set "s3" as the default section name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893834 (owner: 10Aaron Schulz)
[19:19:35] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[19:20:20] <wikibugs>	 (03PS2) 10Jameel Kaisar: Set NEL 'success_fraction: 1.0' on HTTP responses for measurement domains [puppet] - 10https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608)
[19:20:39] <wikibugs>	 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani)
[19:21:23] <wikibugs>	 (03PS1) 10Andrew Bogott: disable_tool: update nfs patchs for tool archiving [puppet] - 10https://gerrit.wikimedia.org/r/908329
[19:23:28] <wikibugs>	 (03PS1) 10Eevans: sessionstore: disable sessionstore1001 native transport [puppet] - 10https://gerrit.wikimedia.org/r/908330 (https://phabricator.wikimedia.org/T327954)
[19:23:54] <wikibugs>	 10SRE, 10ops-eqiad, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab1004.wikimedia.org (B1) - https://phabricator.wikimedia.org/T333997 (10Jclark-ctr) 05Open→03Resolved T330172 drives where installed and commented on this ticket.   procurement ticket listed servers gitl...
[19:24:05] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/908330 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans)
[19:24:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P46602 and previous config saved to /var/cache/conftool/dbconfig/20230412-192411-ladsgroup.json
[19:24:19] <wikibugs>	 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani)
[19:25:46] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] sessionstore: disable sessionstore1001 native transport [puppet] - 10https://gerrit.wikimedia.org/r/908330 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans)
[19:26:37] <wikibugs>	 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani)
[19:28:01] <wikibugs>	 (03PS1) 10Dzahn: vrts: do not use /srv/sqldata as mariadb datadir (cloud, devtools) [puppet] - 10https://gerrit.wikimedia.org/r/908331
[19:28:24] <urandom>	 !log restart Cassandra —sessionstore1001— to disable native transport for testing — T327954
[19:28:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:28] <stashbot>	 T327954: session storage: dissonant cluster status after reboot (was: 'cannot achieve consistency level' errors) - https://phabricator.wikimedia.org/T327954
[19:28:31] <wikibugs>	 (03Merged) 10jenkins-bot: composer.json: Explicitly pin psr/http-message to 1.0.1 [core] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/908291 (https://phabricator.wikimedia.org/T333993) (owner: 10Zabe)
[19:28:37] <wikibugs>	 (03Merged) 10jenkins-bot: Ensure ApiHelp correctly types values in TOCData objects [core] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/908290 (https://phabricator.wikimedia.org/T334551) (owner: 10Jforrester)
[19:28:42] <wikibugs>	 (03Merged) 10jenkins-bot: Ensure ApiHelp correctly types values in TOCData objects [core] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/908289 (https://phabricator.wikimedia.org/T334551) (owner: 10Jforrester)
[19:29:06] <wikibugs>	 (03PS2) 10Cwhite: opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732)
[19:29:19] <wikibugs>	 (03PS2) 10Snwachukwu: Add referer_name field to druid pageviews hourly and daily tables turnilo [puppet] - 10https://gerrit.wikimedia.org/r/908272 (https://phabricator.wikimedia.org/T334224)
[19:29:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite)
[19:30:24] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10serviceops, 10ARM support: Adoption of aarch64 (aka arm64) in WMF production? (SRE Summit 2022 Session) - https://phabricator.wikimedia.org/T320811 (10Ladsgroup) This might be interesting, specially in choosing a manufacturer: https://www.hetzner.com/press-release/arm...
[19:30:51] <logmsgbot>	 !log zabe@deploy2002 Started scap: Backport for [[gerrit:908291|composer.json: Explicitly pin psr/http-message to 1.0.1 (T333993)]], [[gerrit:908290|Ensure ApiHelp correctly types values in TOCData objects (T334551)]], [[gerrit:908289|Ensure ApiHelp correctly types values in TOCData objects (T334551)]]
[19:30:57] <stashbot>	 T334551: action=help&toc=1: Caught exception of type TypeError - https://phabricator.wikimedia.org/T334551
[19:30:57] <stashbot>	 T333993: Explicitly pin psr/http-message to 1.0.1 in composer.json - https://phabricator.wikimedia.org/T333993
[19:31:43] <wikibugs>	 (PS2) Dzahn: vrts: do not use /srv/sqldata as mariadb datadir (cloud, devtools) [puppet] - https://gerrit.wikimedia.org/r/908331 (https://phabricator.wikimedia.org/T329571)
[19:32:12] <logmsgbot>	 !log zabe@deploy2002 jforrester and zabe: Backport for [[gerrit:908291|composer.json: Explicitly pin psr/http-message to 1.0.1 (T333993)]], [[gerrit:908290|Ensure ApiHelp correctly types values in TOCData objects (T334551)]], [[gerrit:908289|Ensure ApiHelp correctly types values in TOCData objects (T334551)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[19:35:46] <logmsgbot>	 !log zabe@deploy2002 Sync cancelled.
[19:36:14] <wikibugs>	 (PS1) Zabe: Revert "Ensure ApiHelp correctly types values in TOCData objects" [core] (wmf/1.41.0-wmf.3) - https://gerrit.wikimedia.org/r/908292
[19:36:19] <wikibugs>	 (PS1) Zabe: Revert "Ensure ApiHelp correctly types values in TOCData objects" [core] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/908293
[19:36:20] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye
[19:36:21] <wikibugs>	 (CR) Zabe: [V: +2 C: +2] Revert "Ensure ApiHelp correctly types values in TOCData objects" [core] (wmf/1.41.0-wmf.3) - https://gerrit.wikimedia.org/r/908292 (owner: Zabe)
[19:36:26] <wikibugs>	 (CR) Zabe: [V: +2 C: +2] Revert "Ensure ApiHelp correctly types values in TOCData objects" [core] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/908293 (owner: Zabe)
[19:37:02] <urandom>	 !log sessionstore1001: systemctl stop cassandra-a.service && systemctl start cassandra-a.service — T327954
[19:37:04] <logmsgbot>	 !log zabe@deploy2002 Started scap: Backport for [[gerrit:908292|Revert "Ensure ApiHelp correctly types values in TOCData objects"]], [[gerrit:908293|Revert "Ensure ApiHelp correctly types values in TOCData objects"]]
[19:37:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:06] <stashbot>	 T327954: session storage: dissonant cluster status after reboot (was: 'cannot achieve consistency level' errors) - https://phabricator.wikimedia.org/T327954
[19:37:49] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[19:37:56] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[19:38:26] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:908292|Revert "Ensure ApiHelp correctly types values in TOCData objects"]], [[gerrit:908293|Revert "Ensure ApiHelp correctly types values in TOCData objects"]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[19:39:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P46603 and previous config saved to /var/cache/conftool/dbconfig/20230412-193917-ladsgroup.json
[19:39:33] <wikibugs>	 (PS1) Ottomata: flink-operator - set default resource limits and requests in operatorPod [deployment-charts] - https://gerrit.wikimedia.org/r/908334 (https://phabricator.wikimedia.org/T333464)
[19:39:43] <wikibugs>	 (PS2) Ottomata: flink-operator - set default resource limits and requests in operatorPod [deployment-charts] - https://gerrit.wikimedia.org/r/908334 (https://phabricator.wikimedia.org/T333464)
[19:39:55] <wikibugs>	 (CR) Ottomata: [V: +2 C: +2] flink-operator - set default resource limits and requests in operatorPod [deployment-charts] - https://gerrit.wikimedia.org/r/908334 (https://phabricator.wikimedia.org/T333464) (owner: Ottomata)
[19:40:15] <wikibugs>	 (PS2) Andrew Bogott: disable_tool: update nfs patchs for tool archiving [puppet] - https://gerrit.wikimedia.org/r/908329
[19:40:29] <logmsgbot>	 !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[19:40:41] <wikibugs>	 (PS3) Andrew Bogott: disable_tool: update nfs paths for tool archiving [puppet] - https://gerrit.wikimedia.org/r/908329
[19:41:09] <logmsgbot>	 !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[19:41:54] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[19:42:00] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[19:42:49] <wikibugs>	 (CR) Andrew Bogott: [C: +2] disable_tool: update nfs paths for tool archiving [puppet] - https://gerrit.wikimedia.org/r/908329 (owner: Andrew Bogott)
[19:43:44] <logmsgbot>	 !log zabe@deploy2002 Finished scap: Backport for [[gerrit:908292|Revert "Ensure ApiHelp correctly types values in TOCData objects"]], [[gerrit:908293|Revert "Ensure ApiHelp correctly types values in TOCData objects"]] (duration: 06m 40s)
[19:46:35] <wikibugs>	 (PS1) Arlolra: Remove unused parsoidSettings, nativeGalleryEnabled [mediawiki-config] - https://gerrit.wikimedia.org/r/908337
[19:48:20] <wikibugs>	 (PS3) Cwhite: opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732)
[19:48:29] <wikibugs>	 (CR) Ssingh: "Looks good! Let's wait on merging this as I think we should also set a higher BGP med for lvs2007 so that it has a lower priority than lvs" [puppet] - https://gerrit.wikimedia.org/r/908322 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall)
[19:48:52] <wikibugs>	 (CR) CI reject: [V: -1] opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite)
[19:49:15] <wikibugs>	 (CR) Dzahn: "Arnold, this should fix the issue you described to me about the DB on vrts-1001 in devtools. But right now puppet is disabled. Please conf" [puppet] - https://gerrit.wikimedia.org/r/908331 (https://phabricator.wikimedia.org/T329571) (owner: Dzahn)
[19:50:38] <wikibugs>	 SRE, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (bd808)
[19:51:48] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye
[19:51:55] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye
[19:54:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T333332)', diff saved to https://phabricator.wikimedia.org/P46604 and previous config saved to /var/cache/conftool/dbconfig/20230412-195423-ladsgroup.json
[19:54:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance
[19:54:28] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[19:54:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance
[19:54:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance
[19:54:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance
[19:54:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T333332)', diff saved to https://phabricator.wikimedia.org/P46605 and previous config saved to /var/cache/conftool/dbconfig/20230412-195453-ladsgroup.json
[19:57:07] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:58:20] <wikibugs>	 (PS4) Cwhite: opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732)
[19:58:52] <wikibugs>	 (CR) CI reject: [V: -1] opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite)
[19:59:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T333332)', diff saved to https://phabricator.wikimedia.org/P46606 and previous config saved to /var/cache/conftool/dbconfig/20230412-195926-ladsgroup.json
[19:59:31] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230412T2000).
[20:00:05] <jouncebot>	 Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:15] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:59] <Jdlrobson>	 here
[20:02:29] <zabe>	 I can deploy
[20:03:00] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (KStoller-WMF) @KMorgan-WMF is an engineer on the Growth team, and this has my approval as the Product Manager of Growth.  But if this need the approval of...
[20:03:17] <zabe>	 Jdlrobson: is it okay to merge, test and sync your two patches together?
[20:03:35] <wikibugs>	 (CR) Zabe: [C: +2] Drop unused VectorPageTools feature flag [mediawiki-config] - https://gerrit.wikimedia.org/r/907511 (https://phabricator.wikimedia.org/T332090) (owner: Jdlrobson)
[20:03:37] <Jdlrobson>	 yep
[20:03:46] <wikibugs>	 (PS3) Zabe: Set Vector 2022 as default skin on Welsh Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/907539 (https://phabricator.wikimedia.org/T334279) (owner: Jdlrobson)
[20:03:50] <wikibugs>	 (CR) Zabe: [C: +2] Set Vector 2022 as default skin on Welsh Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/907539 (https://phabricator.wikimedia.org/T334279) (owner: Jdlrobson)
[20:04:36] <wikibugs>	 (Merged) jenkins-bot: Drop unused VectorPageTools feature flag [mediawiki-config] - https://gerrit.wikimedia.org/r/907511 (https://phabricator.wikimedia.org/T332090) (owner: Jdlrobson)
[20:04:40] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by zabe@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/907539 (https://phabricator.wikimedia.org/T334279) (owner: Jdlrobson)
[20:04:42] <wikibugs>	 (PS5) Cwhite: opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732)
[20:04:44] <wikibugs>	 (Merged) jenkins-bot: Set Vector 2022 as default skin on Welsh Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/907539 (https://phabricator.wikimedia.org/T334279) (owner: Jdlrobson)
[20:05:07] <logmsgbot>	 !log zabe@deploy2002 Started scap: Backport for [[gerrit:907511|Drop unused VectorPageTools feature flag (T332090)]], [[gerrit:907539|Set Vector 2022 as default skin on Welsh Wikipedia (T334279)]]
[20:05:13] <stashbot>	 T332090: Post page tools cleanup: Remove page tools disabled code - https://phabricator.wikimedia.org/T332090
[20:05:13] <stashbot>	 T334279: Deploy Vector 2022 on Welsh Wikipedia - https://phabricator.wikimedia.org/T334279
[20:05:15] <wikibugs>	 (CR) CI reject: [V: -1] opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite)
[20:06:26] <logmsgbot>	 !log zabe@deploy2002 zabe and jdlrobson: Backport for [[gerrit:907511|Drop unused VectorPageTools feature flag (T332090)]], [[gerrit:907539|Set Vector 2022 as default skin on Welsh Wikipedia (T334279)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[20:06:40] <wikibugs>	 (PS3) BCornwall: hiera: lvs2007: update iface names for bullseye [puppet] - https://gerrit.wikimedia.org/r/908322 (https://phabricator.wikimedia.org/T321309)
[20:06:45] <wikibugs>	 (PS1) Andrew Bogott: profile::toolforge::grid::exec_environ: use ensure_packages on pymysql [puppet] - https://gerrit.wikimedia.org/r/908345
[20:07:15] <wikibugs>	 (PS1) Cathal Mooney: Automate DHCP forwarding on Juniper L3 Swithces [homer/public] - https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635)
[20:08:31] <zabe>	 Jdlrobson: please test :)
[20:09:09] <wikibugs>	 (CR) Andrew Bogott: [C: +2] profile::toolforge::grid::exec_environ: use ensure_packages on pymysql [puppet] - https://gerrit.wikimedia.org/r/908345 (owner: Andrew Bogott)
[20:09:13] <Jdlrobson>	 zabe: on it..
[20:09:42] <Jdlrobson>	 zabe: LGTM
[20:09:45] <Jdlrobson>	 please sync
[20:10:49] <wikibugs>	 (CR) Atieno: "b" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/906575 (https://phabricator.wikimedia.org/T334205) (owner: Atieno)
[20:11:03] <wikibugs>	 (PS3) Jameel Kaisar: Set NEL 'success_fraction: 1.0' on HTTP responses for measurement domains [puppet] - https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608)
[20:12:49] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:13:06] <wikibugs>	 (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40646/console" [puppet] - https://gerrit.wikimedia.org/r/908322 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall)
[20:14:23] <wikibugs>	 (PS6) Cwhite: opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732)
[20:14:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P46608 and previous config saved to /var/cache/conftool/dbconfig/20230412-201432-ladsgroup.json
[20:14:55] <wikibugs>	 (CR) CI reject: [V: -1] opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite)
[20:15:09] <wikibugs>	 (CR) Ssingh: [C: +1] hiera: lvs2007: update iface names for bullseye [puppet] - https://gerrit.wikimedia.org/r/908322 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall)
[20:15:27] <logmsgbot>	 !log zabe@deploy2002 Finished scap: Backport for [[gerrit:907511|Drop unused VectorPageTools feature flag (T332090)]], [[gerrit:907539|Set Vector 2022 as default skin on Welsh Wikipedia (T334279)]] (duration: 10m 19s)
[20:15:28] <zabe>	 Jdlrobson: should be live
[20:15:32] <stashbot>	 T332090: Post page tools cleanup: Remove page tools disabled code - https://phabricator.wikimedia.org/T332090
[20:15:32] <stashbot>	 T334279: Deploy Vector 2022 on Welsh Wikipedia - https://phabricator.wikimedia.org/T334279
[20:15:37] <Jdlrobson>	 thanks Zabe!
[20:15:45] <zabe>	 yw
[20:15:59] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:16:38] <wikibugs>	 (CR) Dzahn: [C: -1] "Arnold said things work when datadir is set to /srv/sqldata and after service was restarted." [puppet] - https://gerrit.wikimedia.org/r/908331 (https://phabricator.wikimedia.org/T329571) (owner: Dzahn)
[20:20:11] <wikibugs>	 (PS7) Cwhite: opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732)
[20:20:43] <wikibugs>	 (CR) CI reject: [V: -1] opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite)
[20:27:41] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/908278/40645/" [puppet] - https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[20:29:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P46609 and previous config saved to /var/cache/conftool/dbconfig/20230412-202939-ladsgroup.json
[20:35:17] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "noop confirmed on gerrit2002, gerrit1002 prod servers" [puppet] - https://gerrit.wikimedia.org/r/908278 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[20:36:47] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:38:12] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[20:38:18] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[20:44:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T333332)', diff saved to https://phabricator.wikimedia.org/P46610 and previous config saved to /var/cache/conftool/dbconfig/20230412-204445-ladsgroup.json
[20:44:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance
[20:44:51] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[20:45:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance
[20:45:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46611 and previous config saved to /var/cache/conftool/dbconfig/20230412-204508-ladsgroup.json
[20:45:23] <wikibugs>	 SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn) with the merge above there is now a "gerrit2" user and group on gerrit1003, rsyncd is running and ready to be pushed to from gerrit1001.. and releng users got shell access
[20:46:27] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:46:43] <wikibugs>	 (PS8) Cwhite: opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732)
[20:47:04] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye
[20:47:17] <wikibugs>	 (CR) CI reject: [V: -1] opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite)
[20:47:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46612 and previous config saved to /var/cache/conftool/dbconfig/20230412-204742-ladsgroup.json
[20:49:23] <wikibugs>	 (CR) Krinkle: arclamp: serve SVGs, compressed logs from Swift (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[20:50:17] <wikibugs>	 (PS9) Cwhite: opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732)
[20:50:51] <wikibugs>	 (CR) CI reject: [V: -1] opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite)
[20:57:08] <wikibugs>	 SRE, Commons, Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (Lionel_Scheepmans) It seems that here in Phabricator, no new is bad new.
[20:58:16] <wikibugs>	 (CR) BCornwall: [V: +1 C: +2] hiera: lvs2007: update iface names for bullseye [puppet] - https://gerrit.wikimedia.org/r/908322 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall)
[20:58:53] <brett>	 !log Disable Puppet/PyBal on lvs2007 in preparation for reimaging - T321309
[20:58:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:58] <stashbot>	 T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[20:59:37] <wikibugs>	 (CR) EoghanGaffney: [C: +2] Add keys for sshd-gitlab from the secrets repo (2 comments) [puppet] - https://gerrit.wikimedia.org/r/907878 (owner: EoghanGaffney)
[21:01:28] <icinga-wm_>	 PROBLEM - pybal on lvs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[21:01:38] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:01:42] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2007.codfw.wmnet with OS bullseye
[21:01:53] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs2007.codfw.wmnet with OS bullseye
[21:01:55] <logmsgbot>	 !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host lvs2007.codfw.wmnet with OS bullseye
[21:02:05] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs2007.codfw.wmnet with OS bullseye executed with errors: - lvs2007 (**FAIL**)   - **The reimage...
[21:02:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P46613 and previous config saved to /var/cache/conftool/dbconfig/20230412-210249-ladsgroup.json
[21:04:17] <mutante>	 !log gerrit1001 - pushing data over to gerrit1003 via rsync, with bwlimit option: rsync -avp --bwlimit=1m /srv/gerrit/ rsync://gerrit1003.wikimedia.org/gerrit-data/  (T326368)
[21:04:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:22] <stashbot>	 T326368: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368
[21:05:20] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:16:10] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2007.codfw.wmnet with OS bullseye
[21:16:17] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs2007.codfw.wmnet with OS bullseye
[21:17:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P46614 and previous config saved to /var/cache/conftool/dbconfig/20230412-211755-ladsgroup.json
[21:24:40] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:28:36] <wikibugs>	 (PS1) Eevans: Revert "sessionstore: disable sessionstore1001 native transport" [puppet] - https://gerrit.wikimedia.org/r/908294
[21:29:17] <wikibugs>	 (CR) Eevans: [C: +2] Revert "sessionstore: disable sessionstore1001 native transport" [puppet] - https://gerrit.wikimedia.org/r/908294 (owner: Eevans)
[21:33:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46615 and previous config saved to /var/cache/conftool/dbconfig/20230412-213301-ladsgroup.json
[21:33:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance
[21:33:07] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[21:33:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance
[21:33:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46616 and previous config saved to /var/cache/conftool/dbconfig/20230412-213325-ladsgroup.json
[21:35:36] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2007.codfw.wmnet with reason: host reimage
[21:35:50] <urandom>	 !log restarting Cassandra —sessionstore1001— to reenable native transport — T327954
[21:35:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:54] <stashbot>	 T327954: session storage: dissonant cluster status after reboot (was: 'cannot achieve consistency level' errors) - https://phabricator.wikimedia.org/T327954
[21:35:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46617 and previous config saved to /var/cache/conftool/dbconfig/20230412-213558-ladsgroup.json
[21:38:54] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2007.codfw.wmnet with reason: host reimage
[21:45:25] <wikibugs>	 (CR) Dzahn: ci: split contint hosts to different roles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[21:47:34] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:47:38] <wikibugs>	 (CR) Dzahn: "Why would this be a special case that warrants finding new patterns? Including one profile in 2 roles is standard." [puppet] - https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[21:48:37] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:51:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P46618 and previous config saved to /var/cache/conftool/dbconfig/20230412-215104-ladsgroup.json
[21:52:49] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for sessionstore1001.eqiad.wmnet
[21:52:49] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for sessionstore1001.eqiad.wmnet
[21:54:34] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:56:05] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2007.codfw.wmnet with OS bullseye
[21:56:10] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs2007.codfw.wmnet with OS bullseye completed: - lvs2007 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled...
[22:06:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P46619 and previous config saved to /var/cache/conftool/dbconfig/20230412-220611-ladsgroup.json
[22:09:10] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:15:28] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:21:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46620 and previous config saved to /var/cache/conftool/dbconfig/20230412-222117-ladsgroup.json
[22:21:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance
[22:21:23] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[22:21:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance
[22:21:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T333332)', diff saved to https://phabricator.wikimedia.org/P46621 and previous config saved to /var/cache/conftool/dbconfig/20230412-222141-ladsgroup.json
[22:24:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T333332)', diff saved to https://phabricator.wikimedia.org/P46622 and previous config saved to /var/cache/conftool/dbconfig/20230412-222414-ladsgroup.json
[22:29:39] <wikibugs>	 (PS1) Urbanecm: [Growth] Prepare for a Personalized praise config variable change [mediawiki-config] - https://gerrit.wikimedia.org/r/908365 (https://phabricator.wikimedia.org/T334630)
[22:31:47] <wikibugs>	 (CR) AOkoth: [C: +2] exim: fix hard-coded vrts hostname [puppet] - https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: AOkoth)
[22:38:56] <wikibugs>	 (PS1) Urbanecm: [Growth] Finish Personalized praise variable rename [mediawiki-config] - https://gerrit.wikimedia.org/r/908367 (https://phabricator.wikimedia.org/T334630)
[22:39:19] <wikibugs>	 (PS2) Urbanecm: [Growth] Finish Personalized praise variable rename [mediawiki-config] - https://gerrit.wikimedia.org/r/908367 (https://phabricator.wikimedia.org/T334630)
[22:39:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P46623 and previous config saved to /var/cache/conftool/dbconfig/20230412-223921-ladsgroup.json
[22:39:49] <wikibugs>	 (CR) Urbanecm: [C: -2] "Not yet." [mediawiki-config] - https://gerrit.wikimedia.org/r/908367 (https://phabricator.wikimedia.org/T334630) (owner: Urbanecm)
[22:54:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P46624 and previous config saved to /var/cache/conftool/dbconfig/20230412-225427-ladsgroup.json
[22:56:46] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:59:32] <wikibugs>	 (PS1) Raymond Ndibe: tools-webservice: set default for buildservice-image [software/tools-webservice] - https://gerrit.wikimedia.org/r/908369 (https://phabricator.wikimedia.org/T334586)
[23:01:30] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:09:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T333332)', diff saved to https://phabricator.wikimedia.org/P46625 and previous config saved to /var/cache/conftool/dbconfig/20230412-230933-ladsgroup.json
[23:09:39] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[23:30:24] <wikibugs>	 (CR) Andrea Denisse: [V: +1 C: +2] prometheus: Apply prometheus::pop role to prometheus3002 [puppet] - https://gerrit.wikimedia.org/r/905705 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse)
[23:55:35] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job pint in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:56:22] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job blackbox/pingthing in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable