[00:03:18] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1017079 (owner: TrainBranchBot) [00:07:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T356166)', diff saved to https://phabricator.wikimedia.org/P59613 and previous config saved to /var/cache/conftool/dbconfig/20240405-000755-marostegui.json [00:07:59] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [00:12:03] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:13:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P59614 and previous config saved to /var/cache/conftool/dbconfig/20240405-001350-arnaudb.json [00:23:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P59615 and previous config saved to /var/cache/conftool/dbconfig/20240405-002303-marostegui.json [00:28:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P59616 and previous config saved to /var/cache/conftool/dbconfig/20240405-002857-arnaudb.json [00:38:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P59617 and previous config saved to /var/cache/conftool/dbconfig/20240405-003810-marostegui.json [00:40:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.527s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:44:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T360332)', diff saved to https://phabricator.wikimedia.org/P59618 and previous config saved to /var/cache/conftool/dbconfig/20240405-004405-arnaudb.json [00:44:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [00:44:09] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [00:44:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [00:44:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T360332)', diff saved to https://phabricator.wikimedia.org/P59619 and previous config saved to /var/cache/conftool/dbconfig/20240405-004428-arnaudb.json [00:45:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.527s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:47:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T360332)', diff saved to https://phabricator.wikimedia.org/P59620 and previous config saved to /var/cache/conftool/dbconfig/20240405-004705-arnaudb.json [00:53:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T356166)', diff saved to https://phabricator.wikimedia.org/P59621 and previous config saved to /var/cache/conftool/dbconfig/20240405-005318-marostegui.json 
[00:53:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1228.eqiad.wmnet with reason: Maintenance [00:53:21] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [00:53:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1228.eqiad.wmnet with reason: Maintenance [00:53:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1228 (T356166)', diff saved to https://phabricator.wikimedia.org/P59622 and previous config saved to /var/cache/conftool/dbconfig/20240405-005341-marostegui.json [01:02:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P59623 and previous config saved to /var/cache/conftool/dbconfig/20240405-010212-arnaudb.json [01:06:50] SRE, Infrastructure-Foundations, Mail, MediaWiki-Email: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9691323 (DannyS712) @Xover I created https://phabricator.wikimedia.org/P59624 that is restricted to WMF-NDA members and you, which should be secure... 
[01:17:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P59625 and previous config saved to /var/cache/conftool/dbconfig/20240405-011720-arnaudb.json [01:32:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T360332)', diff saved to https://phabricator.wikimedia.org/P59626 and previous config saved to /var/cache/conftool/dbconfig/20240405-013227-arnaudb.json [01:32:31] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2198.codfw.wmnet with reason: Maintenance [01:32:31] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [01:32:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2198.codfw.wmnet with reason: Maintenance [01:33:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2208.codfw.wmnet with reason: Maintenance [01:33:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2208.codfw.wmnet with reason: Maintenance [01:33:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T360332)', diff saved to https://phabricator.wikimedia.org/P59627 and previous config saved to /var/cache/conftool/dbconfig/20240405-013336-arnaudb.json [01:36:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T360332)', diff saved to https://phabricator.wikimedia.org/P59628 and previous config saved to /var/cache/conftool/dbconfig/20240405-013615-arnaudb.json [01:51:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P59629 and previous config saved to /var/cache/conftool/dbconfig/20240405-015123-arnaudb.json [01:55:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.187s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:05:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.278s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:06:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P59630 and previous config saved to /var/cache/conftool/dbconfig/20240405-020630-arnaudb.json [02:21:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T360332)', diff saved to https://phabricator.wikimedia.org/P59631 and previous config saved to /var/cache/conftool/dbconfig/20240405-022138-arnaudb.json [02:21:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2218.codfw.wmnet with reason: Maintenance [02:21:42] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [02:21:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2218.codfw.wmnet with reason: Maintenance [02:22:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2218 (T360332)', diff saved to https://phabricator.wikimedia.org/P59632 and previous config saved to /var/cache/conftool/dbconfig/20240405-022201-arnaudb.json [02:24:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T360332)', diff saved to https://phabricator.wikimedia.org/P59633 and previous 
config saved to /var/cache/conftool/dbconfig/20240405-022442-arnaudb.json [02:38:27] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P59634 and previous config saved to /var/cache/conftool/dbconfig/20240405-023949-arnaudb.json [02:54:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P59635 and previous config saved to /var/cache/conftool/dbconfig/20240405-025458-arnaudb.json [02:58:27] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:08:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T356166)', diff saved to https://phabricator.wikimedia.org/P59636 and previous config saved to /var/cache/conftool/dbconfig/20240405-030809-marostegui.json [03:08:12] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [03:10:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T360332)', diff saved to https://phabricator.wikimedia.org/P59637 and previous config saved to /var/cache/conftool/dbconfig/20240405-031005-arnaudb.json [03:10:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2220.codfw.wmnet with reason: Maintenance [03:10:09] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 
[03:10:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2220.codfw.wmnet with reason: Maintenance [03:10:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T360332)', diff saved to https://phabricator.wikimedia.org/P59638 and previous config saved to /var/cache/conftool/dbconfig/20240405-031028-arnaudb.json [03:12:55] (PS6) Krinkle: codesearch: Enable network=host and set CODESEARCH_HOUND_BASE [puppet] - https://gerrit.wikimedia.org/r/1016480 [03:12:55] (PS1) Krinkle: codesearch: Allow local containers to talk to local Hound proxy [puppet] - https://gerrit.wikimedia.org/r/1017179 (https://phabricator.wikimedia.org/T361899) [03:13:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T360332)', diff saved to https://phabricator.wikimedia.org/P59639 and previous config saved to /var/cache/conftool/dbconfig/20240405-031307-arnaudb.json [03:13:22] (PS7) Krinkle: codesearch: Set CODESEARCH_HOUND_BASE to enable local Hound connections [puppet] - https://gerrit.wikimedia.org/r/1016480 [03:13:36] (PS8) Krinkle: codesearch: Set CODESEARCH_HOUND_BASE to enable local Hound connections [puppet] - https://gerrit.wikimedia.org/r/1016480 (https://phabricator.wikimedia.org/T361899) [03:19:09] (CR) Tim Starling: "This is ready to go whenever" [mediawiki-config] - https://gerrit.wikimedia.org/r/1006181 (https://phabricator.wikimedia.org/T355034) (owner: Tim Starling) [03:23:00] (CR) Krinkle: "Ack" [puppet] - https://gerrit.wikimedia.org/r/1016480 (https://phabricator.wikimedia.org/T361899) (owner: Krinkle) [03:23:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P59640 and previous config saved to /var/cache/conftool/dbconfig/20240405-032316-marostegui.json [03:28:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff 
saved to https://phabricator.wikimedia.org/P59641 and previous config saved to /var/cache/conftool/dbconfig/20240405-032814-arnaudb.json [03:38:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P59642 and previous config saved to /var/cache/conftool/dbconfig/20240405-033823-marostegui.json [03:43:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P59643 and previous config saved to /var/cache/conftool/dbconfig/20240405-034322-arnaudb.json [03:53:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T356166)', diff saved to https://phabricator.wikimedia.org/P59644 and previous config saved to /var/cache/conftool/dbconfig/20240405-035331-marostegui.json [03:53:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1232.eqiad.wmnet with reason: Maintenance [03:53:35] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [03:53:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1232.eqiad.wmnet with reason: Maintenance [03:53:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T356166)', diff saved to https://phabricator.wikimedia.org/P59645 and previous config saved to /var/cache/conftool/dbconfig/20240405-035353-marostegui.json [03:55:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T356166)', diff saved to https://phabricator.wikimedia.org/P59646 and previous config saved to /var/cache/conftool/dbconfig/20240405-035503-marostegui.json [03:58:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - 
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [03:58:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T360332)', diff saved to https://phabricator.wikimedia.org/P59647 and previous config saved to /var/cache/conftool/dbconfig/20240405-035829-arnaudb.json [03:58:33] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [04:10:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P59648 and previous config saved to /var/cache/conftool/dbconfig/20240405-041010-marostegui.json [04:12:03] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:25:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P59649 and previous config saved to /var/cache/conftool/dbconfig/20240405-042517-marostegui.json [04:40:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T356166)', diff saved to https://phabricator.wikimedia.org/P59650 and previous config saved to /var/cache/conftool/dbconfig/20240405-044025-marostegui.json [04:40:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1234.eqiad.wmnet with reason: Maintenance [04:40:29] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [04:40:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1234.eqiad.wmnet with reason: Maintenance [04:40:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T356166)', diff saved to https://phabricator.wikimedia.org/P59651 
and previous config saved to /var/cache/conftool/dbconfig/20240405-044048-marostegui.json [05:20:36] (PS1) Marostegui: installserver: Do not format es2039 [puppet] - https://gerrit.wikimedia.org/r/1017185 [05:23:56] (CR) Marostegui: [C:+2] installserver: Do not format es2039 [puppet] - https://gerrit.wikimedia.org/r/1017185 (owner: Marostegui) [05:33:40] (PS1) Marostegui: mariadb: Prepare dbproxy installations [puppet] - https://gerrit.wikimedia.org/r/1017186 (https://phabricator.wikimedia.org/T361352) [05:34:10] (CR) CI reject: [V:-1] mariadb: Prepare dbproxy installations [puppet] - https://gerrit.wikimedia.org/r/1017186 (https://phabricator.wikimedia.org/T361352) (owner: Marostegui) [05:36:48] (PS2) Marostegui: mariadb: Prepare dbproxy installations [puppet] - https://gerrit.wikimedia.org/r/1017186 (https://phabricator.wikimedia.org/T361352) [05:37:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 810.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:42:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 810.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:43:26] (RoutinatorRsyncErrors) 
firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240405T0600) [06:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:06:53] (PS3) Jdlrobson: Enable desktop watchlist on beta cluster, clean up old references [mediawiki-config] - https://gerrit.wikimedia.org/r/1016022 (https://phabricator.wikimedia.org/T109277) [06:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:13:39] SRE, Infrastructure-Foundations, Mail, MediaWiki-Email: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9691600 (Xover) >>! In T361860#9691323, @DannyS712 wrote: > @Xover I created https://phabricator.wikimedia.org/P59624 that is restricted to WMF-NDA... [06:17:05] SRE, SRE-swift-storage, Data-Persistence-Backup, media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668#9691601 (jcrespo) Cloning speed for 133 GB / 28K objects: ` # rclone copy -P backup2007:mediabackups/commonswiki/fff backup2011:... 
[06:21:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:22:29] SRE, Infrastructure-Foundations, Mail, MediaWiki-Email: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9691625 (Peachey88) For ceeations, I think we just need to add you into phab trusted-contribs, I can do it when I'm back at my laptop if someone doe... [06:34:50] (CR) Arnaudb: [C:+1] mariadb: Prepare dbproxy installations [puppet] - https://gerrit.wikimedia.org/r/1017186 (https://phabricator.wikimedia.org/T361352) (owner: Marostegui) [06:35:02] (CR) Marostegui: [C:+2] mariadb: Prepare dbproxy installations [puppet] - https://gerrit.wikimedia.org/r/1017186 (https://phabricator.wikimedia.org/T361352) (owner: Marostegui) [06:35:04] !log ayounsi@cumin1002 START - Cookbook sre.network.debug for Netbox circuit ID 108 [06:35:16] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 108 [06:48:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240405T0700) [07:18:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:28:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T356166)', diff saved to https://phabricator.wikimedia.org/P59652 and previous config saved to /var/cache/conftool/dbconfig/20240405-072816-marostegui.json [07:28:20] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [07:43:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P59653 and previous config saved to /var/cache/conftool/dbconfig/20240405-074323-marostegui.json [07:54:34] (PS1) Slyngshede: API: Username validation API. [software/bitu] - https://gerrit.wikimedia.org/r/1017244 (https://phabricator.wikimedia.org/T361066) [07:55:25] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1017056 (owner: Slyngshede) [07:55:45] (CR) Slyngshede: [C:+2] Update error pages to Codex design. 
[software/bitu] - https://gerrit.wikimedia.org/r/1017056 (owner: Slyngshede) [07:56:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance [07:56:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance [07:56:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:56:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:56:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T360332)', diff saved to https://phabricator.wikimedia.org/P59654 and previous config saved to /var/cache/conftool/dbconfig/20240405-075646-arnaudb.json [07:56:49] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [07:57:07] (Merged) jenkins-bot: Update error pages to Codex design. [software/bitu] - https://gerrit.wikimedia.org/r/1017056 (owner: Slyngshede) [07:58:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P59655 and previous config saved to /var/cache/conftool/dbconfig/20240405-075831-marostegui.json [07:59:39] SRE, Infrastructure-Foundations, Mail, MediaWiki-Email: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9691793 (Aklapper) >>! In T361860#9691600, @Xover wrote: > @Aklapper is this by design, or just permissions accidentally set too tightly? @Xover: H... 
[08:02:57] ops-codfw, DC-Ops: hw troubleshooting: unidentified for db2214.codfw.wmnet - https://phabricator.wikimedia.org/T361911 (ABran-WMF) NEW [08:03:47] ops-codfw, SRE, Infrastructure-Foundations, netops: codfw: use old asw switches from row A and B as msw switches in row C and D - https://phabricator.wikimedia.org/T361871#9691816 (ayounsi) We first need to discuss if we want to start using managed switches for management switches (except the agg... [08:06:24] SRE-swift-storage: Swift TLS certificates will expire soon (14 April) - https://phabricator.wikimedia.org/T361844#9691825 (elukey) I have updated the [[ https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate | docs ]] for the renewal use case, I don't think that we need to change anything in the cer... [08:10:13] !log depool cp4037 to test new Benthos configuration (T361845) [08:10:35] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.esams.wmnet [08:10:55] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [08:12:03] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:13:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T356166)', diff saved to https://phabricator.wikimedia.org/P59656 and previous config saved to /var/cache/conftool/dbconfig/20240405-081340-marostegui.json [08:13:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1235.eqiad.wmnet with reason: Maintenance [08:13:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1235.eqiad.wmnet with reason: Maintenance [08:14:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T356166)', diff saved to 
https://phabricator.wikimedia.org/P59657 and previous config saved to /var/cache/conftool/dbconfig/20240405-081402-marostegui.json [08:14:12] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab-runner2004.codfw.wmnet [08:20:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner2004.codfw.wmnet [08:31:26] (PS3) Fabfur: benthos: add metric for ttfb [puppet] - https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) [08:38:27] (PS1) Muehlenhoff: Move cloudcephosd2001-dev to nftables [puppet] - https://gerrit.wikimedia.org/r/1017248 (https://phabricator.wikimedia.org/T361913) [08:39:25] (SystemdUnitFailed) firing: systemd-timedated.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:40:41] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1017248 (https://phabricator.wikimedia.org/T361913) (owner: Muehlenhoff) [08:43:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [08:44:19] (CR) Cathal Mooney: [C:+1] "LGTM!" 
[puppet] - https://gerrit.wikimedia.org/r/1017179 (https://phabricator.wikimedia.org/T361899) (owner: Krinkle) [08:45:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T360332)', diff saved to https://phabricator.wikimedia.org/P59658 and previous config saved to /var/cache/conftool/dbconfig/20240405-084515-arnaudb.json [08:45:19] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [08:46:11] (CR) Filippo Giunchedi: "LGTM overall, see inline" [puppet] - https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) (owner: Fabfur) [08:48:04] (CR) DCausse: "other flink jobs do properly restart themselve in general, wondering if we should tune the restart strategy of the cirrus jobs first?" [deployment-charts] - https://gerrit.wikimedia.org/r/1017115 (https://phabricator.wikimedia.org/T361870) (owner: Bking) [08:52:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T356166)', diff saved to https://phabricator.wikimedia.org/P59659 and previous config saved to /var/cache/conftool/dbconfig/20240405-085222-marostegui.json [08:52:26] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [08:59:25] (SystemdUnitFailed) resolved: systemd-timedated.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:00:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P59660 and previous config saved to /var/cache/conftool/dbconfig/20240405-090023-arnaudb.json [09:03:03] (PS1) Filippo Giunchedi: aptrepo: upgrade to Grafana 9.5 [puppet] - https://gerrit.wikimedia.org/r/1017251 
(https://phabricator.wikimedia.org/T361830) [09:07:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P59661 and previous config saved to /var/cache/conftool/dbconfig/20240405-090730-marostegui.json [09:08:36] (CR) Filippo Giunchedi: [C:+2] sre: disable pint promql/series for EnvoyRuntimeAdminOverrides [alerts] - https://gerrit.wikimedia.org/r/1016786 (https://phabricator.wikimedia.org/T359633) (owner: Filippo Giunchedi) [09:15:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P59662 and previous config saved to /var/cache/conftool/dbconfig/20240405-091531-arnaudb.json [09:15:35] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1017251 (https://phabricator.wikimedia.org/T361830) (owner: Filippo Giunchedi) [09:19:31] !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts testvm2006.codfw.wmnet [09:19:38] (CR) Filippo Giunchedi: [C:+2] aptrepo: upgrade to Grafana 9.5 [puppet] - https://gerrit.wikimedia.org/r/1017251 (https://phabricator.wikimedia.org/T361830) (owner: Filippo Giunchedi) [09:22:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P59663 and previous config saved to /var/cache/conftool/dbconfig/20240405-092237-marostegui.json [09:22:39] (CR) Alexandros Kosiaris: "-1 due to https://phabricator.wikimedia.org/T328036#9688704. 
Once Kiwix has finished their migration to new endpoints, we can pick this ba" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017054 (https://phabricator.wikimedia.org/T361483) (owner: 10Alexandros Kosiaris) [09:24:07] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [09:28:57] 10ops-codfw, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9692041 (10fgiunchedi) Thank you @Jhancock.wm @herron ! I think the easiest in this case would be to: * have titan2001 match titan2002 (i.e. remo... [09:30:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T360332)', diff saved to https://phabricator.wikimedia.org/P59664 and previous config saved to /var/cache/conftool/dbconfig/20240405-093038-arnaudb.json [09:30:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [09:30:42] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [09:30:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [09:31:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance [09:31:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance [09:31:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T360332)', diff saved to https://phabricator.wikimedia.org/P59665 and previous config saved to /var/cache/conftool/dbconfig/20240405-093124-arnaudb.json [09:33:46] (03PS4) 10Fabfur: benthos: add metric for ttfb [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) [09:36:17] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage 
cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9692080 (10cmooney) >>! In T350179#9690121, @ssingh wrote: > Any other opinions/thoughts on how we can try and fix this and where? I am... [09:37:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T356166)', diff saved to https://phabricator.wikimedia.org/P59666 and previous config saved to /var/cache/conftool/dbconfig/20240405-093745-marostegui.json [09:37:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1239.eqiad.wmnet with reason: Maintenance [09:37:48] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [09:38:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1239.eqiad.wmnet with reason: Maintenance [09:38:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1240.eqiad.wmnet with reason: Maintenance [09:38:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1240.eqiad.wmnet with reason: Maintenance [09:38:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [09:38:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [09:45:37] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9692099 (10cmooney) >>! In T350179#9586432, @ayounsi wrote: > Last maybe we could explore relying less on PXE, for example is it possibl... 
[09:51:13] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [09:59:16] (03Abandoned) 10Lucas Werkmeister (WMDE): Add debug code for entity usage logic issue [extensions/Wikibase] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/989125 (https://phabricator.wikimedia.org/T255706) (owner: 10Lucas Werkmeister (WMDE)) [10:11:45] (03Abandoned) 10MSantos: add maps beta to dsh targets [puppet] - 10https://gerrit.wikimedia.org/r/803894 (owner: 10MSantos) [10:12:05] (03Abandoned) 10MSantos: maps: script to send zoom level expiration events [puppet] - 10https://gerrit.wikimedia.org/r/740236 (owner: 10MSantos) [10:13:56] (03PS5) 10Fabfur: benthos: add metric for ttfb [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) [10:14:32] (03CR) 10Fabfur: benthos: add metric for ttfb (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [10:21:07] (03PS1) 10JMeybohm: Update apertium chart to mesh.deployment:1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017258 (https://phabricator.wikimedia.org/T346638) [10:23:15] (03PS1) 10JMeybohm: Update blubberoid chart to mesh.deployment:1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017259 (https://phabricator.wikimedia.org/T346638) [10:25:57] (03PS1) 10Muehlenhoff: os-reports: Update puppetdb query [puppet] - 10https://gerrit.wikimedia.org/r/1017260 (https://phabricator.wikimedia.org/T355924) [10:28:57] (03CR) 10CI reject: [V:04-1] os-reports: Update puppetdb query [puppet] - 10https://gerrit.wikimedia.org/r/1017260 (https://phabricator.wikimedia.org/T355924) (owner: 10Muehlenhoff) [10:31:12] (03PS2) 10Muehlenhoff: os-reports: Update puppetdb query [puppet] - 10https://gerrit.wikimedia.org/r/1017260 
(https://phabricator.wikimedia.org/T355924) [10:31:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T360332)', diff saved to https://phabricator.wikimedia.org/P59667 and previous config saved to /var/cache/conftool/dbconfig/20240405-103149-arnaudb.json [10:31:54] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [10:36:24] (03CR) 10Filippo Giunchedi: "LGTM, only a comment re: histogram_buckets" [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [10:42:07] (03PS1) 10Muehlenhoff: barbican: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1017261 [10:44:10] (03PS6) 10Fabfur: benthos: add metric for ttfb [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) [10:44:45] (03CR) 10Fabfur: benthos: add metric for ttfb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [10:46:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P59668 and previous config saved to /var/cache/conftool/dbconfig/20240405-104657-arnaudb.json [10:47:42] (03PS1) 10Muehlenhoff: mx: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1017263 [10:48:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1017261 (owner: 10Muehlenhoff) [10:50:36] (03CR) 10Majavah: [C:03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1017261 (owner: 10Muehlenhoff) [10:53:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1017263 (owner: 10Muehlenhoff) [10:53:46] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9692338 (10MoritzMuehlenhoff) We could also consider to pass this over to Dell support? [10:59:30] (03PS1) 10Muehlenhoff: Tighten data type for profile::icinga::partners [puppet] - 10https://gerrit.wikimedia.org/r/1017265 [11:00:53] (03CR) 10Filippo Giunchedi: [C:03+1] benthos: add metric for ttfb [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [11:02:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P59669 and previous config saved to /var/cache/conftool/dbconfig/20240405-110204-arnaudb.json [11:05:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1017265 (owner: 10Muehlenhoff) [11:11:31] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: codfw: use old asw switches from row A and B as msw switches in row C and D - https://phabricator.wikimedia.org/T361871#9692392 (10cmooney) >>! In T361871#9691816, @ayounsi wrote: > We first need to discuss if we want to start using managed switch... 
[11:15:01] (03PS7) 10Fabfur: benthos: add metric for ttfb [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) [11:17:06] (03PS2) 10Muehlenhoff: Tighten data type for profile::icinga::partners [puppet] - 10https://gerrit.wikimedia.org/r/1017265 [11:17:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T360332)', diff saved to https://phabricator.wikimedia.org/P59670 and previous config saved to /var/cache/conftool/dbconfig/20240405-111713-arnaudb.json [11:17:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [11:17:17] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [11:17:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [11:17:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T360332)', diff saved to https://phabricator.wikimedia.org/P59671 and previous config saved to /var/cache/conftool/dbconfig/20240405-111736-arnaudb.json [11:22:20] (03PS8) 10Fabfur: benthos: add metric for ttfb [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) [11:27:30] (03CR) 10Slyngshede: [C:03+1] os-reports: Update puppetdb query (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017260 (https://phabricator.wikimedia.org/T355924) (owner: 10Muehlenhoff) [11:31:39] (03PS2) 10Slyngshede: API: Username validation API. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1017244 (https://phabricator.wikimedia.org/T361066) [11:40:33] (03CR) 10Muehlenhoff: os-reports: Update puppetdb query (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017260 (https://phabricator.wikimedia.org/T355924) (owner: 10Muehlenhoff) [11:40:36] (03CR) 10Muehlenhoff: [C:03+2] os-reports: Update puppetdb query [puppet] - 10https://gerrit.wikimedia.org/r/1017260 (https://phabricator.wikimedia.org/T355924) (owner: 10Muehlenhoff) [11:46:57] (03PS1) 10Isabelle Hurbain-Palatin: Add Kartographer Parsoid support to hewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017268 (https://phabricator.wikimedia.org/T342871) [11:48:25] (03CR) 10Jgiannelos: [C:03+1] Add Kartographer Parsoid support to hewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017268 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [11:48:26] (03CR) 10Hnowlan: "cassandra-devel? Links it to the hostname but distinguishes it from the hostname." 
[puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [11:54:10] (SystemdUnitFailed) resolved: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:10:26] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [12:10:26] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:10:27] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts testvm2006.codfw.wmnet [12:10:38] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: 14Investigate Ganeti in routed mode - 14https://phabricator.wikimedia.org/T300152#9692520 (10ops-monitoring-bot) 14cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `testvm2006.codfw.wmnet` - testvm2... 
[12:12:28] (03PS1) 10Muehlenhoff: Deprecate system::role for IF services (batch two) [puppet] - 10https://gerrit.wikimedia.org/r/1017269 [12:17:37] (03CR) 10Filippo Giunchedi: [C:03+1] "nice, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1017265 (owner: 10Muehlenhoff) [12:18:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T360332)', diff saved to https://phabricator.wikimedia.org/P59672 and previous config saved to /var/cache/conftool/dbconfig/20240405-121801-arnaudb.json [12:18:05] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [12:18:35] (03PS9) 10Fabfur: benthos: add metric for ttfb [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) [12:21:32] (03PS1) 10Majavah: team-wmcs: Add wiki replica exclude to one more alert [alerts] - 10https://gerrit.wikimedia.org/r/1017273 [12:21:33] (03CR) 10Filippo Giunchedi: "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [12:22:38] (03CR) 10David Caro: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1017273 (owner: 10Majavah) [12:22:53] (03CR) 10Majavah: [C:03+2] team-wmcs: Add wiki replica exclude to one more alert [alerts] - 10https://gerrit.wikimedia.org/r/1017273 (owner: 10Majavah) [12:23:02] (03CR) 10CI reject: [V:04-1] team-wmcs: Add wiki replica exclude to one more alert [alerts] - 10https://gerrit.wikimedia.org/r/1017273 (owner: 10Majavah) [12:23:36] (03PS10) 10Fabfur: benthos: add metric for ttfb [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) [12:23:38] (03PS2) 10Majavah: team-wmcs: Add wiki replica exclude to one more alert [alerts] - 10https://gerrit.wikimedia.org/r/1017273 [12:23:45] (03PS1) 10Muehlenhoff: Deprecate system::role for Collaboration services (batch one) [puppet] - 10https://gerrit.wikimedia.org/r/1017274 [12:23:56] (03CR) 
10Majavah: [C:03+2] team-wmcs: Add wiki replica exclude to one more alert [alerts] - 10https://gerrit.wikimedia.org/r/1017273 (owner: 10Majavah) [12:24:09] (03CR) 10Fabfur: benthos: add metric for ttfb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [12:25:40] (03Merged) 10jenkins-bot: team-wmcs: Add wiki replica exclude to one more alert [alerts] - 10https://gerrit.wikimedia.org/r/1017273 (owner: 10Majavah) [12:32:30] !log repool cp4037 (T361845) [12:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:33] T361845: Add metrics to Benthos - https://phabricator.wikimedia.org/T361845 [12:32:38] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [12:33:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P59673 and previous config saved to /var/cache/conftool/dbconfig/20240405-123309-arnaudb.json [12:43:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:48:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P59674 and previous config saved to /var/cache/conftool/dbconfig/20240405-124816-arnaudb.json [12:50:40] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2036:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2036 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:03:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T360332)', diff saved to 
https://phabricator.wikimedia.org/P59675 and previous config saved to /var/cache/conftool/dbconfig/20240405-130324-arnaudb.json [13:03:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [13:03:34] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [13:03:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [13:03:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T360332)', diff saved to https://phabricator.wikimedia.org/P59676 and previous config saved to /var/cache/conftool/dbconfig/20240405-130347-arnaudb.json [13:15:57] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [13:15:59] (03CR) 10Fabfur: [C:03+2] benthos: add metric for ttfb [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [13:16:49] (03PS1) 10Muehlenhoff: Uninstall eject on production VMs [puppet] - 10https://gerrit.wikimedia.org/r/1017275 [13:17:33] (03CR) 10Andrew Bogott: [C:03+2] "A bunch of CA certs expired by surprise; it should be fixed now. If not, you can fix it by moving the ca.pem out of the way." 
[puppet] - 10https://gerrit.wikimedia.org/r/1015382 (https://phabricator.wikimedia.org/T351452) (owner: 10Andrew Bogott) [13:17:54] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1802/co" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017274 (owner: 10Muehlenhoff) [13:18:14] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm, nit in-line" [puppet] - 10https://gerrit.wikimedia.org/r/1017274 (owner: 10Muehlenhoff) [13:18:26] 10ops-codfw, 06SRE, 06DC-Ops: 14hw troubleshooting: unidentified for db2214.codfw.wmnet - 14https://phabricator.wikimedia.org/T361911#9692595 (10ABran-WMF) 05Open→03Invalid [13:18:38] (03CR) 10Elukey: [C:03+1] Update apertium chart to mesh.deployment:1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017258 (https://phabricator.wikimedia.org/T346638) (owner: 10JMeybohm) [13:19:10] (03CR) 10Elukey: [C:03+1] Update blubberoid chart to mesh.deployment:1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017259 (https://phabricator.wikimedia.org/T346638) (owner: 10JMeybohm) [13:21:15] (03PS1) 10Muehlenhoff: installserver: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1017279 [13:22:09] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1017265 (owner: 10Muehlenhoff) [13:23:10] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: codfw: use old asw switches from row A and B as msw switches in row C and D - https://phabricator.wikimedia.org/T361871#9692687 (10Papaul) @ayounsi @cmooney thanks for all the inputs. What I am asking is to use the Juniper old switches as dummies... 
[13:23:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1017279 (owner: 10Muehlenhoff) [13:24:06] (03CR) 10Muehlenhoff: Deprecate system::role for Collaboration services (batch one) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017274 (owner: 10Muehlenhoff) [13:24:26] (03PS1) 10Elukey: kserve: update fixtures to avoid listing the Puppet CA cert/bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017284 [13:24:30] (03PS1) 10Majavah: hieradata: eqiad1: bastion allow connections from bastion-restricted-eqiad1-3 [puppet] - 10https://gerrit.wikimedia.org/r/1017282 (https://phabricator.wikimedia.org/T361831) [13:24:34] (03PS1) 10Majavah: P:cumin: cloud_ssh_config: fix logging onto the restricted bastions [puppet] - 10https://gerrit.wikimedia.org/r/1017283 (https://phabricator.wikimedia.org/T361831) [13:27:04] (03PS1) 10Arnaudb: mariadb: toggle notifications for db2214 [puppet] - 10https://gerrit.wikimedia.org/r/1017081 (https://phabricator.wikimedia.org/T361851) [13:29:41] (03PS2) 10Arnaudb: mariadb: toggle notifications for db2214 [puppet] - 10https://gerrit.wikimedia.org/r/1017081 (https://phabricator.wikimedia.org/T361851) [13:42:39] (03CR) 10Arnaudb: [C:03+2] mariadb: toggle notifications for db2214 [puppet] - 10https://gerrit.wikimedia.org/r/1017081 (https://phabricator.wikimedia.org/T361851) (owner: 10Arnaudb) [13:45:14] 06SRE, 06Infrastructure-Foundations, 10Mail, 10MediaWiki-Email: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9692785 (10Xover) And now another two ticked in. 
[13:45:15] 10ops-codfw, 06DBA, 13Patch-For-Review: db2214 crashed - https://phabricator.wikimedia.org/T361851#9692786 (10ABran-WMF) [13:45:40] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2036:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2036 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:57:28] (03CR) 10Muehlenhoff: "We usually just keep components around; external people might use them from apt.wikimedia.org after all." [puppet] - 10https://gerrit.wikimedia.org/r/1016425 (https://phabricator.wikimedia.org/T354670) (owner: 10Ryan Kemper) [13:57:42] (03CR) 10Andrew Bogott: [C:03+1] Uninstall eject on production VMs [puppet] - 10https://gerrit.wikimedia.org/r/1017275 (owner: 10Muehlenhoff) [13:58:08] (03CR) 10Bking: [C:03+1] "Per our last few pairing sessions, the removal of the curator library in Spicerack (T361647) is not related to the removal of the curator " [puppet] - 10https://gerrit.wikimedia.org/r/1016425 (https://phabricator.wikimedia.org/T354670) (owner: 10Ryan Kemper) [14:00:20] (03PS4) 10Bking: elastic: remove wmf 3rd party curator [puppet] - 10https://gerrit.wikimedia.org/r/1016425 (https://phabricator.wikimedia.org/T354670) (owner: 10Ryan Kemper) [14:00:56] (03CR) 10CI reject: [V:04-1] elastic: remove wmf 3rd party curator [puppet] - 10https://gerrit.wikimedia.org/r/1016425 (https://phabricator.wikimedia.org/T354670) (owner: 10Ryan Kemper) [14:02:00] (03PS2) 10Lucas Werkmeister (WMDE): termbox: update to 2024-03-14-121904-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007643 (https://phabricator.wikimedia.org/T343239) [14:02:59] (03CR) 10Lucas Werkmeister (WMDE): "I tested the image locally:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007643 (https://phabricator.wikimedia.org/T343239) (owner: 10Lucas Werkmeister (WMDE)) [14:04:12] !log arnaudb@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db1178 (T360332)', diff saved to https://phabricator.wikimedia.org/P59677 and previous config saved to /var/cache/conftool/dbconfig/20240405-140412-arnaudb.json [14:04:17] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [14:05:41] (03CR) 10JHathaway: [C:03+1] Uninstall eject on production VMs [puppet] - 10https://gerrit.wikimedia.org/r/1017275 (owner: 10Muehlenhoff) [14:05:46] (03CR) 10Lucas Werkmeister (WMDE): "I also decided to update the versions in a single change; the [deployment instructions](https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007643 (https://phabricator.wikimedia.org/T343239) (owner: 10Lucas Werkmeister (WMDE)) [14:07:14] TIL the helm-lint job console for operations/deployment-charts changes contains a diff for deployments, similar to diffConfig for operations/mediawiki-config – nice! https://integration.wikimedia.org/ci/job/helm-lint/16562/console [14:12:24] (03Abandoned) 10Bking: elastic: remove wmf 3rd party curator [puppet] - 10https://gerrit.wikimedia.org/r/1016425 (https://phabricator.wikimedia.org/T354670) (owner: 10Ryan Kemper) [14:12:36] (03PS1) 10Elukey: ml-services: update RR ML/Wikidata's Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017292 (https://phabricator.wikimedia.org/T360111) [14:15:28] !log taavi@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet [14:18:27] !log taavi@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet [14:19:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P59678 and previous config saved to /var/cache/conftool/dbconfig/20240405-141919-arnaudb.json [14:19:23] (03CR) 10Marostegui: [C:03+1] mariadb: toggle notifications for db2214 [puppet] - 10https://gerrit.wikimedia.org/r/1017081 
(https://phabricator.wikimedia.org/T361851) (owner: 10Arnaudb) [14:20:31] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9692885 (10ssingh) >>! In T350179#9692080, @cmooney wrote: >>>! In T350179#9690121, @ssingh wrote: >> Any other opinions/thoughts on how... [14:20:39] !log taavi@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet [14:22:23] !log taavi@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet [14:22:45] !log taavi@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet [14:25:13] !log taavi@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet [14:25:40] !log taavi@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet [14:26:36] !log taavi@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet [14:29:26] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9692933 (10ssingh) >>! In T350179#9692338, @MoritzMuehlenhoff wrote: > We could also consider to pass this over to Dell support? My onl... 
[14:29:37] !log taavi@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet [14:33:47] !log taavi@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet [14:33:52] !log taavi@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1014.eqiad.wmnet [14:34:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P59679 and previous config saved to /var/cache/conftool/dbconfig/20240405-143427-arnaudb.json [14:34:30] !log taavi@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet [14:34:45] 10ops-codfw, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9692962 (10Jhancock.wm) follow up: still going back and forth with Dell. [14:34:47] (03PS1) 10Bking: cirrus-streaming-updater: Increase taskManager memory for cloudelastic job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017296 (https://phabricator.wikimedia.org/T361870) [14:34:51] !log taavi@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet [14:35:40] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[14:35:54] !log taavi@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet [14:36:02] !log taavi@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet [14:36:41] (03CR) 10JHathaway: [C:03+1] mx: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1017263 (owner: 10Muehlenhoff) [14:36:41] !log taavi@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet [14:38:27] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:35] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9692988 (10cmooney) >>! In T350179#9692885, @ssingh wrote: > All cp hosts in eqiad are in rows A, B, C, and D, so that does look worth t... 
[14:44:40] (03PS2) 10Jon Harald Søby: Add 'mainpage-title-loggedin' to $wgForceUIMsgAsContentMsg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015151 (https://phabricator.wikimedia.org/T361171) [14:49:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T360332)', diff saved to https://phabricator.wikimedia.org/P59680 and previous config saved to /var/cache/conftool/dbconfig/20240405-144934-arnaudb.json [14:49:37] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1192.eqiad.wmnet with reason: Maintenance [14:49:43] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [14:49:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1192.eqiad.wmnet with reason: Maintenance [14:49:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T360332)', diff saved to https://phabricator.wikimedia.org/P59681 and previous config saved to /var/cache/conftool/dbconfig/20240405-144957-arnaudb.json [14:52:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T360332)', diff saved to https://phabricator.wikimedia.org/P59682 and previous config saved to /var/cache/conftool/dbconfig/20240405-145213-arnaudb.json [14:55:36] !log dancy@deploy1002 Installing scap version "4.75.0" for 353 hosts [14:56:28] !log dancy@deploy1002 Installation of scap version "4.75.0" completed for 353 hosts [14:58:27] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:34] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm [15:07:05] 06SRE, 06Infrastructure-Foundations, 10Mail, 10MediaWiki-Email: Old "Email this 
user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9693054 (10DannyS712) >>! In T361860#9691600, @Xover wrote: >>>! In T361860#9691323, @DannyS712 wrote: >> @Xover I created https://phabricator.wikimed... [15:07:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P59683 and previous config saved to /var/cache/conftool/dbconfig/20240405-150721-arnaudb.json [15:11:38] (03PS2) 10Bking: cirrus-streaming-updater: Increase taskManager memory for cloudelastic job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017296 (https://phabricator.wikimedia.org/T361870) [15:11:41] 10ops-codfw, 06SRE, 06DBA, 13Patch-For-Review: db2214 crashed - https://phabricator.wikimedia.org/T361851#9693059 (10Jhancock.wm) a:03Jhancock.wm Here are some more logs. 2024-04-04 16:29:06 SYS1000 System is turning on. 2024-04-04 16:29:00 SWC5019 Unable to authenticate the BIOS image file bec... [15:14:53] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [15:16:56] (03CR) 10Eevans: [C:03+1] cassandra::instance: add the tls_use_pki_keep_old_ca parameter [puppet] - 10https://gerrit.wikimedia.org/r/1013571 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:17:01] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage [15:18:05] (03CR) 10JHathaway: [C:03+1] Deprecate system::role for IF services (batch two) [puppet] - 10https://gerrit.wikimedia.org/r/1017269 (owner: 10Muehlenhoff) [15:19:25] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9693083 (10ssingh) Continuing to trying to isolate the possible causes of this, I noticed when dumping the facter output between the dif... 
[15:19:59] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage [15:22:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P59684 and previous config saved to /var/cache/conftool/dbconfig/20240405-152228-arnaudb.json [15:34:40] new page for db2214, this is from yesterday [15:34:41] resolving it [15:34:44] (03PS1) 10Elukey: Update kubernetes' svc ipv6 ranges for AUX and DSE [puppet] - 10https://gerrit.wikimedia.org/r/1017311 (https://phabricator.wikimedia.org/T353705) [15:34:45] (03PS1) 10Elukey: network::data: update all kubesvc's ipv6 ranges [puppet] - 10https://gerrit.wikimedia.org/r/1017312 (https://phabricator.wikimedia.org/T353705) [15:36:30] (03CR) 10Elukey: [V:03+1 C:03+2] cassandra::instance: add the tls_use_pki_keep_old_ca parameter [puppet] - 10https://gerrit.wikimedia.org/r/1013571 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:36:46] !incidents [15:36:47] No incidents occurred in the past 24 hours for team SRE [15:36:55] good vibes sirenbot [15:37:33] (03PS1) 10CDanis: jaeger ui: two week lookback [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017314 [15:37:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T360332)', diff saved to https://phabricator.wikimedia.org/P59685 and previous config saved to /var/cache/conftool/dbconfig/20240405-153736-arnaudb.json [15:37:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1193.eqiad.wmnet with reason: Maintenance [15:37:40] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [15:37:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1193.eqiad.wmnet with reason: Maintenance [15:37:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling 
db1193 (T360332)', diff saved to https://phabricator.wikimedia.org/P59686 and previous config saved to /var/cache/conftool/dbconfig/20240405-153759-arnaudb.json [15:40:49] (03CR) 10Eevans: "I was optimizing for "short", but I reckon that most folks will run `cqlsh` interactively, thus only have to type it once (and even then u" [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [15:41:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T360332)', diff saved to https://phabricator.wikimedia.org/P59687 and previous config saved to /var/cache/conftool/dbconfig/20240405-154115-arnaudb.json [15:45:49] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm [15:53:42] (03CR) 10DCausse: [C:03+1] cirrus-streaming-updater: Increase taskManager memory for cloudelastic job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017296 (https://phabricator.wikimedia.org/T361870) (owner: 10Bking) [15:54:50] (03CR) 10Brouberol: [C:03+1] "LGTM after our exchange on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1017311 (https://phabricator.wikimedia.org/T353705) (owner: 10Elukey) [15:56:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P59688 and previous config saved to /var/cache/conftool/dbconfig/20240405-155622-arnaudb.json [15:57:59] (03PS1) 10JHathaway: email: add node definitions for mx-out boxes [puppet] - 10https://gerrit.wikimedia.org/r/1017318 [16:03:18] !log jhathaway@cumin1002 START - Cookbook sre.ganeti.makevm for new host mx-out1001.wikimedia.org [16:03:20] !log jhathaway@cumin1002 START - Cookbook sre.dns.netbox [16:05:16] (03CR) 10AikoChou: [C:03+1] kserve: update fixtures to avoid listing the Puppet CA cert/bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017284 (owner: 10Elukey) 
[16:07:06] !log jhathaway@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM mx-out1001.wikimedia.org - jhathaway@cumin1002" [16:11:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P59689 and previous config saved to /var/cache/conftool/dbconfig/20240405-161130-arnaudb.json [16:14:07] (03PS2) 10JHathaway: email: add node definitions for mx-out boxes [puppet] - 10https://gerrit.wikimedia.org/r/1017318 (https://phabricator.wikimedia.org/T361750) [16:14:59] (03CR) 10Bking: [C:03+2] cirrus-streaming-updater: Increase taskManager memory for cloudelastic job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017296 (https://phabricator.wikimedia.org/T361870) (owner: 10Bking) [16:16:12] (03CR) 10Elukey: [C:03+2] kserve: update fixtures to avoid listing the Puppet CA cert/bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017284 (owner: 10Elukey) [16:16:24] (03Merged) 10jenkins-bot: cirrus-streaming-updater: Increase taskManager memory for cloudelastic job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017296 (https://phabricator.wikimedia.org/T361870) (owner: 10Bking) [16:18:15] !log bking@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:18:21] !log bking@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:24:18] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM mx-out1001.wikimedia.org - jhathaway@cumin1002" [16:24:18] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:24:18] !log jhathaway@cumin1002 START - Cookbook sre.dns.wipe-cache mx-out1001.wikimedia.org on all recursors [16:24:21] !log jhathaway@cumin1002 END (PASS) - Cookbook 
sre.dns.wipe-cache (exit_code=0) mx-out1001.wikimedia.org on all recursors [16:24:45] !log jhathaway@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM mx-out1001.wikimedia.org - jhathaway@cumin1002" [16:25:37] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM mx-out1001.wikimedia.org - jhathaway@cumin1002" [16:26:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T360332)', diff saved to https://phabricator.wikimedia.org/P59691 and previous config saved to /var/cache/conftool/dbconfig/20240405-162637-arnaudb.json [16:26:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1203.eqiad.wmnet with reason: Maintenance [16:26:41] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [16:26:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1203.eqiad.wmnet with reason: Maintenance [16:27:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T360332)', diff saved to https://phabricator.wikimedia.org/P59692 and previous config saved to /var/cache/conftool/dbconfig/20240405-162700-arnaudb.json [16:27:32] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host mx-out1001.wikimedia.org with OS bookworm [16:27:46] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: eqiad, codfw 2 VM request for postfix mx-out - https://phabricator.wikimedia.org/T361750#9693279 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin1002 for host mx-out1001.wikimedia.org wit... 
[16:29:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T360332)', diff saved to https://phabricator.wikimedia.org/P59693 and previous config saved to /var/cache/conftool/dbconfig/20240405-162916-arnaudb.json [16:40:05] (03PS1) 10Andrew Bogott: role:cinder_backups: include full env scripts in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1017319 (https://phabricator.wikimedia.org/T358855) [16:40:49] (03PS1) 10Elukey: services: set up TLS validation experiment for sessionstore in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017320 (https://phabricator.wikimedia.org/T352647) [16:42:22] (03PS2) 10Andrew Bogott: role:cinder_backups: include full env scripts in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1017319 (https://phabricator.wikimedia.org/T358855) [16:42:45] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1017319 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [16:42:45] (03PS1) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 [16:43:12] (03CR) 10CI reject: [V:04-1] purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 (owner: 10CDobbins) [16:43:16] (03CR) 10JHathaway: [C:03+2] email: add node definitions for mx-out boxes [puppet] - 10https://gerrit.wikimedia.org/r/1017318 (https://phabricator.wikimedia.org/T361750) (owner: 10JHathaway) [16:43:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:44:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P59694 and previous config saved to /var/cache/conftool/dbconfig/20240405-164424-arnaudb.json [16:44:52] (03CR) 10Andrew 
Bogott: [C:03+2] role:cinder_backups: include full env scripts in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1017319 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [16:59:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P59695 and previous config saved to /var/cache/conftool/dbconfig/20240405-165931-arnaudb.json [17:11:09] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mx-out1001.wikimedia.org with reason: host reimage [17:14:15] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mx-out1001.wikimedia.org with reason: host reimage [17:14:16] (03PS1) 10Andrew Bogott: cloudbackup100[12]-dev: include ceph admin creds [puppet] - 10https://gerrit.wikimedia.org/r/1017332 (https://phabricator.wikimedia.org/T358855) [17:14:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T360332)', diff saved to https://phabricator.wikimedia.org/P59696 and previous config saved to /var/cache/conftool/dbconfig/20240405-171439-arnaudb.json [17:14:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1211.eqiad.wmnet with reason: Maintenance [17:14:44] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [17:14:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1211.eqiad.wmnet with reason: Maintenance [17:15:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T360332)', diff saved to https://phabricator.wikimedia.org/P59697 and previous config saved to /var/cache/conftool/dbconfig/20240405-171502-arnaudb.json [17:16:26] (03PS2) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 [17:17:08] (03CR) 10CI reject: [V:04-1] purged: add PKI cert handling [puppet] - 
10https://gerrit.wikimedia.org/r/1017321 (owner: 10CDobbins) [17:17:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T360332)', diff saved to https://phabricator.wikimedia.org/P59698 and previous config saved to /var/cache/conftool/dbconfig/20240405-171719-arnaudb.json [17:22:28] (03PS3) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 [17:22:56] (03CR) 10CI reject: [V:04-1] purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 (owner: 10CDobbins) [17:24:36] (03PS4) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 [17:25:04] (03CR) 10CI reject: [V:04-1] purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 (owner: 10CDobbins) [17:32:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P59699 and previous config saved to /var/cache/conftool/dbconfig/20240405-173227-arnaudb.json [17:39:54] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1017282 (https://phabricator.wikimedia.org/T361831) (owner: 10Majavah) [17:40:22] 06SRE, 06Infrastructure-Foundations, 10Mail, 10MediaWiki-Email: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9693475 (10Xover) >>! In T361860#9693188, @jhathaway wrote: > @Xover if you could paste the headers of two of the messages that would help, the whole... 
[17:43:54] (03CR) 10FNegri: P:cumin: cloud_ssh_config: fix logging onto the restricted bastions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017283 (https://phabricator.wikimedia.org/T361831) (owner: 10Majavah) [17:44:36] (03CR) 10Majavah: [C:03+2] hieradata: eqiad1: bastion allow connections from bastion-restricted-eqiad1-3 [puppet] - 10https://gerrit.wikimedia.org/r/1017282 (https://phabricator.wikimedia.org/T361831) (owner: 10Majavah) [17:45:53] (03CR) 10Majavah: P:cumin: cloud_ssh_config: fix logging onto the restricted bastions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017283 (https://phabricator.wikimedia.org/T361831) (owner: 10Majavah) [17:46:23] (03CR) 10Andrew Bogott: [C:03+2] cloudbackup100[12]-dev: include ceph admin creds [puppet] - 10https://gerrit.wikimedia.org/r/1017332 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [17:47:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P59700 and previous config saved to /var/cache/conftool/dbconfig/20240405-174735-arnaudb.json [17:53:50] (03CR) 10FNegri: P:cumin: cloud_ssh_config: fix logging onto the restricted bastions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017283 (https://phabricator.wikimedia.org/T361831) (owner: 10Majavah) [17:56:26] (03PS2) 10Majavah: P:cumin: cloud_ssh_config: fix logging onto the restricted bastions [puppet] - 10https://gerrit.wikimedia.org/r/1017283 (https://phabricator.wikimedia.org/T361831) [17:56:50] (03CR) 10Majavah: P:cumin: cloud_ssh_config: fix logging onto the restricted bastions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017283 (https://phabricator.wikimedia.org/T361831) (owner: 10Majavah) [17:58:14] (03CR) 10FNegri: [C:03+1] "Nice one!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1017283 (https://phabricator.wikimedia.org/T361831) (owner: 10Majavah) [17:59:13] (03CR) 10FNegri: P:cumin: cloud_ssh_config: fix logging onto the restricted bastions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017283 (https://phabricator.wikimedia.org/T361831) (owner: 10Majavah) [18:00:18] (03PS3) 10Majavah: P:cumin: cloud_ssh_config: fix logging onto the restricted bastions [puppet] - 10https://gerrit.wikimedia.org/r/1017283 (https://phabricator.wikimedia.org/T361831) [18:01:15] !log depooling db1246 which went down and paged [18:01:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:21] db host down, depooling [18:01:28] sukhe: already ran the command [18:01:32] mutante: nice thanks! [18:01:34] first time for me [18:01:45] (03CR) 10FNegri: [C:03+1] P:cumin: cloud_ssh_config: fix logging onto the restricted bastions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017283 (https://phabricator.wikimedia.org/T361831) (owner: 10Majavah) [18:02:03] mutante: changes committed? usually it logs here [18:02:07] doesn't depooling a replica usually result in a log here? 
[18:03:20] !log dzahn@cumin2002 dbctl commit (dc=all): 'depool db1246', diff saved to https://phabricator.wikimedia.org/P59701 and previous config saved to /var/cache/conftool/dbconfig/20240405-180319-dzahn.json [18:03:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T360332)', diff saved to https://phabricator.wikimedia.org/P59702 and previous config saved to /var/cache/conftool/dbconfig/20240405-180330-arnaudb.json [18:03:32] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1214.eqiad.wmnet with reason: Maintenance [18:03:33] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [18:03:35] (03CR) 10Majavah: [C:03+2] P:cumin: cloud_ssh_config: fix logging onto the restricted bastions [puppet] - 10https://gerrit.wikimedia.org/r/1017283 (https://phabricator.wikimedia.org/T361831) (owner: 10Majavah) [18:03:45] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1214.eqiad.wmnet with reason: Maintenance [18:03:51] mutante: thanks, downtiming [18:03:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T360332)', diff saved to https://phabricator.wikimedia.org/P59703 and previous config saved to /var/cache/conftool/dbconfig/20240405-180352-arnaudb.json [18:03:59] oh great, thanks arnaudb [18:04:47] arnaudb: out of curiosity, this was expected? [18:04:50] ah, maintenance!
[18:05:08] taavi: needed sudo [18:05:19] but only for the commit, not the depool [18:06:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T360332)', diff saved to https://phabricator.wikimedia.org/P59704 and previous config saved to /var/cache/conftool/dbconfig/20240405-180608-arnaudb.json [18:06:15] (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:06:57] this is an automation that runs in a tmux pane on cumin1002, it was not expected for it to trigger any error 🤔 [18:07:45] ok, so your downtime for maintenance was you responding to the page but not a separate event, right? [18:08:17] I was not responding to any page haha I just was passing by and saw the ping [18:09:01] I guess the downtime took some time to be effective [18:09:01] ok thanks :) [18:09:33] arnaudb: something went wrong, it alerted us and the error rate went up [18:09:48] hm this is weird [18:10:06] like was a cookbook supposed to depool it but didnt? [18:10:33] yep, it's supposed [18:10:37] https://gitlab.wikimedia.org/repos/sre/schema-changes/-/blob/main/2024/alter_cu_private_event_T360332.py [18:10:56] getsel looks clear [18:11:15] (MediaWikiHighErrorRate) resolved: (4) Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:11:25] as for the logic: https://wikitech.wikimedia.org/wiki/Auto_schema#Depooling [18:12:08] I am happy to see that error rate resolved [18:12:16] and it's already repooled.. so seems we are good [18:12:38] amazing! enjoy your weekends then! [18:12:56] we should remove the downtime if it is repooled [18:13:03] removing [18:13:06] arnaudb: thank you, is there going to be more hosts today ? 
[18:13:27] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for db1214.eqiad.wmnet [18:13:28] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1214.eqiad.wmnet [18:13:42] does the "all_dbs = True" mean it affected many hosts? [18:14:02] or just all databases in one cluster I assume? [18:14:17] seeing the s6 part [18:17:23] (03PS5) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 [18:18:59] 10ops-eqiad, 06SRE, 10decommission-hardware: decommission restbase10[19-27] - https://phabricator.wikimedia.org/T361372#9693509 (10VRiley-WMF) a:03VRiley-WMF [18:20:13] (03CR) 10CI reject: [V:04-1] purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 (owner: 10CDobbins) [18:21:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P59705 and previous config saved to /var/cache/conftool/dbconfig/20240405-182115-arnaudb.json [18:23:43] (03PS6) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 [18:29:40] 10ops-eqiad, 06SRE, 10decommission-hardware: decommission restbase10[19-27] - https://phabricator.wikimedia.org/T361372#9693545 (10VRiley-WMF) [18:30:30] 10ops-eqiad, 06SRE, 10decommission-hardware: decommission restbase10[19-27] - https://phabricator.wikimedia.org/T361372#9693547 (10VRiley-WMF) These servers have been removed and the decomm script has been run. 
[18:30:42] 10ops-eqiad, 06SRE, 10decommission-hardware: 14decommission restbase10[19-27] - 14https://phabricator.wikimedia.org/T361372#9693548 (10VRiley-WMF) 05Open→03Resolved [18:34:23] 10ops-eqiad, 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Decommission puppetmaster1002 - https://phabricator.wikimedia.org/T357093#9693553 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [18:36:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P59706 and previous config saved to /var/cache/conftool/dbconfig/20240405-183623-arnaudb.json [18:36:49] (03PS1) 10Andrew Bogott: cloudbackup: some host-specific enable_v2_messenger overrides [puppet] - 10https://gerrit.wikimedia.org/r/1017342 (https://phabricator.wikimedia.org/T358855) [18:37:37] 10ops-eqiad, 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Decommission puppetmaster1002 - https://phabricator.wikimedia.org/T357093#9693562 (10VRiley-WMF) [18:37:40] (03PS2) 10Andrew Bogott: cloudbackup: remove some host-specific enable_v2_messenger overrides [puppet] - 10https://gerrit.wikimedia.org/r/1017342 (https://phabricator.wikimedia.org/T358855) [18:37:43] (03PS2) 10Krinkle: codesearch: Allow local containers to talk to local Hound proxy [puppet] - 10https://gerrit.wikimedia.org/r/1017179 (https://phabricator.wikimedia.org/T361899) [18:37:50] (03PS9) 10Krinkle: codesearch: Set CODESEARCH_HOUND_BASE to enable local Hound connections [puppet] - 10https://gerrit.wikimedia.org/r/1016480 (https://phabricator.wikimedia.org/T361899) [18:37:53] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1017342 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [18:38:00] 10ops-eqiad, 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Decommission puppetmaster1002 - https://phabricator.wikimedia.org/T357093#9693563 
(10VRiley-WMF) This has been unracked and decommission script has been run. [18:38:10] (03CR) 10CI reject: [V:04-1] cloudbackup: remove some host-specific enable_v2_messenger overrides [puppet] - 10https://gerrit.wikimedia.org/r/1017342 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [18:38:10] 10ops-eqiad, 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: 14Decommission puppetmaster1002 - 14https://phabricator.wikimedia.org/T357093#9693564 (10VRiley-WMF) 05Open→03Resolved [18:39:20] (03PS3) 10Andrew Bogott: cloudbackup: remove some host-specific enable_v2_messenger overrides [puppet] - 10https://gerrit.wikimedia.org/r/1017342 (https://phabricator.wikimedia.org/T358855) [18:39:35] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1017342 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [18:41:18] (03CR) 10Andrew Bogott: [C:03+2] cloudbackup: remove some host-specific enable_v2_messenger overrides [puppet] - 10https://gerrit.wikimedia.org/r/1017342 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [18:43:25] (03CR) 10Dzahn: [C:03+2] "I think this was part of https://phabricator.wikimedia.org/T149804 back then" [puppet] - 10https://gerrit.wikimedia.org/r/607647 (owner: 10Dzahn) [18:44:25] (03CR) 10Dzahn: [C:03+2] "as far as I remember the reason to add the srange was not specific to this project but just part of https://phabricator.wikimedia.org/T149" [puppet] - 10https://gerrit.wikimedia.org/r/1017179 (https://phabricator.wikimedia.org/T361899) (owner: 10Krinkle) [18:47:48] (03CR) 10Dzahn: [C:03+2] codesearch: Set CODESEARCH_HOUND_BASE to enable local Hound connections [puppet] - 10https://gerrit.wikimedia.org/r/1016480 (https://phabricator.wikimedia.org/T361899) (owner: 10Krinkle) [18:51:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T360332)', diff saved to 
https://phabricator.wikimedia.org/P59707 and previous config saved to /var/cache/conftool/dbconfig/20240405-185131-arnaudb.json [18:51:33] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [18:51:35] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [18:51:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [18:51:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1226.eqiad.wmnet with reason: Maintenance [18:52:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1226.eqiad.wmnet with reason: Maintenance [18:52:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T360332)', diff saved to https://phabricator.wikimedia.org/P59708 and previous config saved to /var/cache/conftool/dbconfig/20240405-185216-arnaudb.json [18:55:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T360332)', diff saved to https://phabricator.wikimedia.org/P59709 and previous config saved to /var/cache/conftool/dbconfig/20240405-185533-arnaudb.json [18:57:32] (03PS1) 10Jdrewniak: [beta] Set Vector 2022 font-size to 16px on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017345 (https://phabricator.wikimedia.org/T360098) [18:58:52] (03PS1) 10Scott French: Release etcd-mirror 0.0.11 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1017346 (https://phabricator.wikimedia.org/T358636) [18:59:11] (03PS1) 10Andrew Bogott: cinder_backups: add some real work to codfw1dev backups [puppet] - 10https://gerrit.wikimedia.org/r/1017347 (https://phabricator.wikimedia.org/T358855) [18:59:44] (03CR) 10Andrew Bogott: [C:03+2] cinder_backups: add some real work to codfw1dev backups [puppet] - 10https://gerrit.wikimedia.org/r/1017347 
(https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [19:02:25] !log codesearch - puppet trying to restart hound-search after deploying gerrit:1017179 and gerrit:1016480 [19:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P59710 and previous config saved to /var/cache/conftool/dbconfig/20240405-191042-arnaudb.json [19:13:16] So db1246 is down for real? [19:13:20] arnaudb: ^ [19:13:25] Is there a task for it? [19:15:43] (03PS1) 10Dzahn: Revert "codesearch: Set CODESEARCH_HOUND_BASE to enable local Hound connections" [puppet] - 10https://gerrit.wikimedia.org/r/1017144 [19:16:05] marostegui: eh, it was repooled and the errors went away but it is indeed down for me now [19:16:18] I don't understand [19:16:19] but also it didnt page again and downtime was removed.. eh... [19:16:21] So it was down? [19:16:25] But got repooled? 
[19:16:51] I cannot even ssh to it [19:17:09] marostegui: it went down, paged us, then was repooled by the maintenance script that was running [19:17:23] it was like it was planned maintenance except it wasnt downtimed [19:17:39] But mariadb is down and the host is down [19:17:40] after repool the raised error rate recovered [19:21:45] I am a bit lost on what has happened earlier, the current situation is that the host is hard down and depooled [19:25:20] mutante: https://phabricator.wikimedia.org/T361968 [19:25:21] 10ops-eqiad, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T361968 (10Marostegui) 03NEW [19:25:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P59711 and previous config saved to /var/cache/conftool/dbconfig/20240405-192549-arnaudb.json [19:26:20] 10ops-eqiad, 06SRE, 10decommission-hardware, 13Patch-For-Review: decommission logstash101[012] - https://phabricator.wikimedia.org/T360950#9693677 (10VRiley-WMF) a:03VRiley-WMF [19:27:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: Host down [19:27:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: Host down [19:28:00] (03PS1) 10Andrew Bogott: cinder_backups: apply codfw1dev env files for eqiad backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/1017351 (https://phabricator.wikimedia.org/T358855) [19:28:45] (03CR) 10Andrew Bogott: [C:03+2] cinder_backups: apply codfw1dev env files for eqiad backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/1017351 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [19:30:29] (03PS1) 10Marostegui: db1246: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1017352 (https://phabricator.wikimedia.org/T361968) [19:30:51] (03CR) 10Dzahn: [C:03+2] Revert "codesearch: Set 
CODESEARCH_HOUND_BASE to enable local Hound connections" [puppet] - 10https://gerrit.wikimedia.org/r/1017144 (owner: 10Dzahn) [19:31:02] 10ops-eqiad, 06SRE, 06DBA, 13Patch-For-Review: db1246 crashed - https://phabricator.wikimedia.org/T361968#9693697 (10Marostegui) #ops-eqiad this is the second time this host has the same error see T359940 [19:32:34] marostegui: ACK,thank you [19:35:59] 10ops-eqiad, 06SRE, 06DBA, 13Patch-For-Review: db1246 crashed - https://phabricator.wikimedia.org/T361968#9693720 (10VRiley-WMF) a:03VRiley-WMF [19:37:43] 10ops-eqiad, 06SRE, 06DBA, 13Patch-For-Review: db1246 crashed - https://phabricator.wikimedia.org/T361968#9693721 (10VRiley-WMF) Hey @Marostegui I am currently looking at this unit. I checked both power supplied and logged into the machine. However, it seems to be healthy and it doesn't seem to be reportin... [19:38:22] (03PS1) 10Andrew Bogott: designate: move codfw1dev settings to 'common' for cross-site access [puppet] - 10https://gerrit.wikimedia.org/r/1017354 (https://phabricator.wikimedia.org/T358855) [19:38:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:38:46] (03PS2) 10Andrew Bogott: designate: move codfw1dev settings to 'common' for cross-site access [puppet] - 10https://gerrit.wikimedia.org/r/1017354 (https://phabricator.wikimedia.org/T358855) [19:38:55] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1017354 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [19:40:28] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mx-out1001.wikimedia.org with OS bookworm [19:40:28] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host mx-out1001.wikimedia.org [19:40:37] 06SRE, 
06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: eqiad, codfw 2 VM request for postfix mx-out - https://phabricator.wikimedia.org/T361750#9693746 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin1002 for host mx-out1001.wikimedia.org with OS... [19:40:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T360332)', diff saved to https://phabricator.wikimedia.org/P59712 and previous config saved to /var/cache/conftool/dbconfig/20240405-194057-arnaudb.json [19:41:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [19:41:05] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [19:41:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [19:41:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [19:41:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [19:42:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2152.codfw.wmnet with reason: Maintenance [19:42:03] 10ops-eqiad, 06SRE, 06DBA, 13Patch-For-Review: db1246 crashed - https://phabricator.wikimedia.org/T361968#9693750 (10Marostegui) @VRiley-WMF the host is hard down for us, we cannot ssh to it. As I mentioned, this is the second time the host crashed for the same reason, do you think we could escalate this t... 
[19:42:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2152.codfw.wmnet with reason: Maintenance [19:42:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T360332)', diff saved to https://phabricator.wikimedia.org/P59713 and previous config saved to /var/cache/conftool/dbconfig/20240405-194221-arnaudb.json [19:43:21] 10ops-eqiad, 06SRE, 06DBA, 13Patch-For-Review: db1246 crashed - https://phabricator.wikimedia.org/T361968#9693755 (10VRiley-WMF) Sure thing, I will be reaching out to them. [19:43:37] (03PS1) 10Krinkle: codesearch: Set CODESEARCH_HOUND_BASE to local Hound proxy (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/1017355 (https://phabricator.wikimedia.org/T361899) [19:43:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T360332)', diff saved to https://phabricator.wikimedia.org/P59714 and previous config saved to /var/cache/conftool/dbconfig/20240405-194339-arnaudb.json [19:44:43] (03CR) 10Dzahn: [C:03+2] codesearch: Set CODESEARCH_HOUND_BASE to local Hound proxy (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/1017355 (https://phabricator.wikimedia.org/T361899) (owner: 10Krinkle) [19:44:48] 10ops-eqiad, 06SRE, 06DBA, 13Patch-For-Review: db1246 crashed - https://phabricator.wikimedia.org/T361968#9693762 (10Marostegui) Could you reboot it to see if it comes back for us? 
[19:45:00] (03PS1) 10Andrew Bogott: openstack: move a bunch of codfw1dev passwords from 'codfw' to 'common' [labs/private] - 10https://gerrit.wikimedia.org/r/1017356 (https://phabricator.wikimedia.org/T358855) [19:45:22] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] openstack: move a bunch of codfw1dev passwords from 'codfw' to 'common' [labs/private] - 10https://gerrit.wikimedia.org/r/1017356 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [19:45:33] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host mx-out1001.wikimedia.org with OS bookworm [19:45:42] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: eqiad, codfw 2 VM request for postfix mx-out - https://phabricator.wikimedia.org/T361750#9693769 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin1002 for host mx-out1001.wikimedia.org wit... [19:46:40] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1017354 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [19:49:16] 10ops-eqiad, 06SRE, 10decommission-hardware, 13Patch-For-Review: 14decommission logstash101[012] - 14https://phabricator.wikimedia.org/T360950#9693787 (10VRiley-WMF) [19:49:19] 10ops-eqiad, 06SRE, 10decommission-hardware, 13Patch-For-Review: 14decommission logstash101[012] - 14https://phabricator.wikimedia.org/T360950#9693788 (10VRiley-WMF) 05Open→03Resolved [19:50:11] (03PS3) 10Andrew Bogott: designate: move codfw1dev settings to 'common' for cross-site access [puppet] - 10https://gerrit.wikimedia.org/r/1017354 (https://phabricator.wikimedia.org/T358855) [19:50:20] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1017354 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [19:50:30] (03CR) 10Marostegui: [C:03+2] db1246: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1017352 (https://phabricator.wikimedia.org/T361968) 
(owner: 10Marostegui) [19:54:05] (03PS4) 10Andrew Bogott: designate: move codfw1dev settings to 'common' for cross-site access [puppet] - 10https://gerrit.wikimedia.org/r/1017354 (https://phabricator.wikimedia.org/T358855) [19:57:24] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mx-out1001.wikimedia.org with reason: host reimage [19:58:13] (03CR) 10Andrew Bogott: [C:03+2] designate: move codfw1dev settings to 'common' for cross-site access [puppet] - 10https://gerrit.wikimedia.org/r/1017354 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [19:58:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P59715 and previous config saved to /var/cache/conftool/dbconfig/20240405-195847-arnaudb.json [20:02:07] (03PS2) 10Scott French: Release etcd-mirror 0.0.11 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1017346 (https://phabricator.wikimedia.org/T358636) [20:02:07] (03PS1) 10Scott French: Bump etcdmirror package version: 0.0.11 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1017358 (https://phabricator.wikimedia.org/T358636) [20:02:07] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mx-out1001.wikimedia.org with reason: host reimage [20:06:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 22.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:10:02] (03PS1) 10Andrew Bogott: wmcs-backup.py: don't crash if no backup host is assigned to ALLOTHERS [puppet] - 10https://gerrit.wikimedia.org/r/1017361 [20:10:29] (03PS2) 10Andrew Bogott: wmcs-backup.py: don't crash if no backup host is assigned to ALLOTHERS [puppet] - 
10https://gerrit.wikimedia.org/r/1017361 [20:11:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 27.53% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:13:38] !log jhathaway@cumin2002 START - Cookbook sre.ganeti.makevm for new host mx-out2001.wikimedia.org [20:13:40] !log jhathaway@cumin2002 START - Cookbook sre.dns.netbox [20:13:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P59716 and previous config saved to /var/cache/conftool/dbconfig/20240405-201354-arnaudb.json [20:14:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.103s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:15:25] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mx-out1001.wikimedia.org with OS bookworm [20:15:40] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: eqiad, codfw 2 VM request for postfix mx-out - https://phabricator.wikimedia.org/T361750#9693977 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin1002 for host mx-out1001.wikimedia.org with OS... 
[20:16:00] (03PS1) 10Dzahn: base: add a firewall alias for the default docker network [puppet] - 10https://gerrit.wikimedia.org/r/1017367 [20:16:12] (03CR) 10Andrew Bogott: [C:03+2] wmcs-backup.py: don't crash if no backup host is assigned to ALLOTHERS [puppet] - 10https://gerrit.wikimedia.org/r/1017361 (owner: 10Andrew Bogott) [20:16:57] !log jhathaway@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM mx-out2001.wikimedia.org - jhathaway@cumin2002" [20:18:24] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM mx-out2001.wikimedia.org - jhathaway@cumin2002" [20:18:24] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:18:25] !log jhathaway@cumin2002 START - Cookbook sre.dns.wipe-cache mx-out2001.wikimedia.org on all recursors [20:18:28] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mx-out2001.wikimedia.org on all recursors [20:18:56] !log jhathaway@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM mx-out2001.wikimedia.org - jhathaway@cumin2002" [20:19:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.002s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:19:46] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM mx-out2001.wikimedia.org - jhathaway@cumin2002" [20:20:15] !log 
jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host mx-out2001.wikimedia.org with OS bookworm [20:20:28] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: eqiad, codfw 2 VM request for postfix mx-out - https://phabricator.wikimedia.org/T361750#9693989 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host mx-out2001.wikimedia.org wit... [20:20:56] (03PS2) 10Dzahn: base: add a firewall alias for the default docker network [puppet] - 10https://gerrit.wikimedia.org/r/1017367 [20:24:41] (03PS3) 10Dzahn: base: add a firewall alias for the default docker network [puppet] - 10https://gerrit.wikimedia.org/r/1017367 [20:26:12] (03CR) 10Krinkle: base: add a firewall alias for the default docker network (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017367 (owner: 10Dzahn) [20:28:07] (03CR) 10Dzahn: base: add a firewall alias for the default docker network (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017367 (owner: 10Dzahn) [20:29:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T360332)', diff saved to https://phabricator.wikimedia.org/P59717 and previous config saved to /var/cache/conftool/dbconfig/20240405-202901-arnaudb.json [20:29:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2154.codfw.wmnet with reason: Maintenance [20:29:06] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [20:29:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2154.codfw.wmnet with reason: Maintenance [20:29:20] (03CR) 10Dzahn: base: add a firewall alias for the default docker network (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017367 (owner: 10Dzahn) [20:29:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T360332)', diff saved to https://phabricator.wikimedia.org/P59718 and 
previous config saved to /var/cache/conftool/dbconfig/20240405-202925-arnaudb.json [20:30:02] (03PS4) 10Dzahn: base: add a firewall alias for the default docker network [puppet] - 10https://gerrit.wikimedia.org/r/1017367 [20:31:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T360332)', diff saved to https://phabricator.wikimedia.org/P59719 and previous config saved to /var/cache/conftool/dbconfig/20240405-203143-arnaudb.json [20:37:54] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mx-out2001.wikimedia.org with reason: host reimage [20:38:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:40:19] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mx-out2001.wikimedia.org with reason: host reimage [20:44:25] 06SRE, 10AQS2.0, 10Cassandra, 06serviceops, 07Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855#9694089 (10Eevans) 05Open→03Stalled [20:45:00] 06SRE, 10AQS2.0, 10Cassandra, 06serviceops, 07Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855#9694091 (10Eevans) p:05Triage→03Medium [20:46:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P59720 and previous config saved to /var/cache/conftool/dbconfig/20240405-204650-arnaudb.json [20:59:09] (03PS1) 10Kimberly Sarabia: Remove sampling rate in config for MP events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017378 (https://phabricator.wikimedia.org/T361962) [21:01:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to 
https://phabricator.wikimedia.org/P59721 and previous config saved to /var/cache/conftool/dbconfig/20240405-210157-arnaudb.json [21:17:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T360332)', diff saved to https://phabricator.wikimedia.org/P59722 and previous config saved to /var/cache/conftool/dbconfig/20240405-211705-arnaudb.json [21:17:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2161.codfw.wmnet with reason: Maintenance [21:17:09] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [21:17:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2161.codfw.wmnet with reason: Maintenance [21:17:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T360332)', diff saved to https://phabricator.wikimedia.org/P59723 and previous config saved to /var/cache/conftool/dbconfig/20240405-211728-arnaudb.json [21:19:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T360332)', diff saved to https://phabricator.wikimedia.org/P59724 and previous config saved to /var/cache/conftool/dbconfig/20240405-211946-arnaudb.json [21:27:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 928.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:32:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 894.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:34:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P59725 and previous config saved to /var/cache/conftool/dbconfig/20240405-213454-arnaudb.json [21:48:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:50:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P59727 and previous config saved to /var/cache/conftool/dbconfig/20240405-215001-arnaudb.json [22:05:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T360332)', diff saved to https://phabricator.wikimedia.org/P59728 and previous config saved to /var/cache/conftool/dbconfig/20240405-220510-arnaudb.json [22:05:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2162.codfw.wmnet with reason: Maintenance [22:05:19] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [22:05:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2162.codfw.wmnet with reason: Maintenance [22:05:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T360332)', diff saved to https://phabricator.wikimedia.org/P59729 and previous config saved to /var/cache/conftool/dbconfig/20240405-220533-arnaudb.json [22:07:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T360332)', diff saved to https://phabricator.wikimedia.org/P59730 and previous config saved to /var/cache/conftool/dbconfig/20240405-220751-arnaudb.json [22:21:25] (SystemdUnitFailed) firing: netbox_report_accounting_run.service on 
netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:22:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P59731 and previous config saved to /var/cache/conftool/dbconfig/20240405-222259-arnaudb.json [22:38:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P59732 and previous config saved to /var/cache/conftool/dbconfig/20240405-223806-arnaudb.json [22:40:48] (PuppetFailure) firing: Puppet has failed on mx-out2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:46:25] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:48:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:53:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T360332)', diff saved to https://phabricator.wikimedia.org/P59733 and previous config saved to /var/cache/conftool/dbconfig/20240405-225313-arnaudb.json [22:53:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance [22:53:19] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [22:53:30] !log arnaudb@cumin1002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance [22:53:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T360332)', diff saved to https://phabricator.wikimedia.org/P59734 and previous config saved to /var/cache/conftool/dbconfig/20240405-225336-arnaudb.json [22:55:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T360332)', diff saved to https://phabricator.wikimedia.org/P59735 and previous config saved to /var/cache/conftool/dbconfig/20240405-225554-arnaudb.json [23:11:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P59736 and previous config saved to /var/cache/conftool/dbconfig/20240405-231102-arnaudb.json [23:26:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P59737 and previous config saved to /var/cache/conftool/dbconfig/20240405-232609-arnaudb.json [23:35:48] (PuppetFailure) resolved: Puppet has failed on mx-out2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:41:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T360332)', diff saved to https://phabricator.wikimedia.org/P59738 and previous config saved to /var/cache/conftool/dbconfig/20240405-234117-arnaudb.json [23:41:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2164.codfw.wmnet with reason: Maintenance [23:41:21] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [23:41:34] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2164.codfw.wmnet with reason: Maintenance [23:41:36] !log arnaudb@cumin1002 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [23:41:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [23:41:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T360332)', diff saved to https://phabricator.wikimedia.org/P59739 and previous config saved to /var/cache/conftool/dbconfig/20240405-234156-arnaudb.json [23:44:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T360332)', diff saved to https://phabricator.wikimedia.org/P59740 and previous config saved to /var/cache/conftool/dbconfig/20240405-234413-arnaudb.json [23:46:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 976.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:48:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:51:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 944.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:59:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P59741 and previous config saved to 
/var/cache/conftool/dbconfig/20240405-235920-arnaudb.json