[00:01:00] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:01:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2236.codfw.wmnet with reason: host reimage
[00:02:23] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1057305 (owner: TrainBranchBot)
[00:04:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2236.codfw.wmnet with reason: host reimage
[00:05:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P66953 and previous config saved to /var/cache/conftool/dbconfig/20240727-000509-ladsgroup.json
[00:08:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:08:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2237.codfw.wmnet with OS bookworm
[00:08:17] ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019892 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2237.codfw.wmnet with OS bookworm completed: - db2237 (**PASS*...
[00:09:33] ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019894 (Papaul)
[00:10:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2235.codfw.wmnet with OS bookworm
[00:10:26] ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019896 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2235.codfw.wmnet with OS bookworm
[00:20:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T352010)', diff saved to https://phabricator.wikimedia.org/P66954 and previous config saved to /var/cache/conftool/dbconfig/20240727-002016-ladsgroup.json
[00:20:29] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[00:20:40] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:24:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2235.codfw.wmnet with reason: host reimage
[00:26:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2235.codfw.wmnet with reason: host reimage
[00:34:51] (PS2) Scott French: switchdc: mediawiki cache warmup now targets k8s [cookbooks] - https://gerrit.wikimedia.org/r/1057255 (https://phabricator.wikimedia.org/T369921)
[00:35:32] (CR) Scott French: "Thanks, Reuven!" [cookbooks] - https://gerrit.wikimedia.org/r/1057255 (https://phabricator.wikimedia.org/T369921) (owner: Scott French)
[00:42:33] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:50:26] SRE, Charts, serviceops, Shellbox: Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10019952 (CDanis) >>! In T370739#10019839, @Catrope wrote: > @akosiaris I'm trying to figure out how we should proceed based on your comment. Should...
[00:51:12] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:51:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2236.codfw.wmnet with OS bookworm
[00:51:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:51:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2235.codfw.wmnet with OS bookworm
[00:51:21] ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019953 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2236.codfw.wmnet with OS bookworm completed: - db2236 (**WARN*...
[00:51:24] ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019954 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2235.codfw.wmnet with OS bookworm completed: - db2235 (**PASS*...
[00:53:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2234.codfw.wmnet with OS bookworm
[00:53:26] ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019955 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2234.codfw.wmnet with OS bookworm
[00:57:30] SRE, Charts, serviceops, Shellbox: Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10019956 (Catrope) I'm not sure if we will -- this service would only be accessed from within MediaWiki when it parses wiki pages (which generally o...
[00:59:43] ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019960 (Papaul)
[01:06:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=ulsfo&var-cluster=text&var-origin=restbase.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[01:07:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2233.codfw.wmnet with OS bookworm
[01:07:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2234.codfw.wmnet with reason: host reimage
[01:07:33] ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019964 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2233.codfw.wmnet with OS bookworm
[01:10:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2234.codfw.wmnet with reason: host reimage
[01:13:45] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2233.codfw.wmnet with OS bookworm
[01:13:53] ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019966 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2233.codfw.wmnet with OS bookworm executed with errors: - db22...
[01:16:32] ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019971 (Papaul) @Jhancock.wm when you back onsite next week before me please check the cable on db2233 thank you. ` xe-0/0/30 up down db2233
[01:26:39] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:26:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=ulsfo&var-cluster=text&var-origin=restbase.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[01:54:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:54:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2234.codfw.wmnet with OS bookworm
[01:54:29] ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10020015 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2234.codfw.wmnet with OS bookworm completed: - db2234 (**PASS*...
[02:39:21] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:38] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:09:38] ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10020123 (Papaul)
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:23:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T367856)', diff saved to https://phabricator.wikimedia.org/P66955 and previous config saved to /var/cache/conftool/dbconfig/20240727-062317-marostegui.json
[06:23:23] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[06:38:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P66956 and previous config saved to /var/cache/conftool/dbconfig/20240727-063824-marostegui.json
[06:53:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P66957 and previous config saved to /var/cache/conftool/dbconfig/20240727-065332-marostegui.json
[07:06:40] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:08:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T367856)', diff saved to https://phabricator.wikimedia.org/P66958 and previous config saved to /var/cache/conftool/dbconfig/20240727-070839-marostegui.json
[07:08:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db2202.codfw.wmnet with reason: Maintenance
[07:08:46] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[07:08:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2202.codfw.wmnet with reason: Maintenance
[07:59:21] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:55:38] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:03:58] I depool
[10:04:51] Thank you
[10:05:05] I acked the alert
[10:05:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db1246, paged', diff saved to https://phabricator.wikimedia.org/P66959 and previous config saved to /var/cache/conftool/dbconfig/20240727-100533-ladsgroup.json
[10:07:23] Amir1: I'll create a follow up task for Monday
[10:08:13] oh thank you
[10:10:48] we should downtime it
[10:10:53] Indeed
[10:11:16] The placeholder task: https://phabricator.wikimedia.org/T371171
[10:11:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1246.eqiad.wmnet with reason: Sad
[10:11:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1246.eqiad.wmnet with reason: Sad
[10:12:53] I'll resolve the incidents as well so that they don't re-page tomorrow
[10:14:35] Done
[10:38:14] now we have another one, another replica
[10:38:14] <_joe_> uhhh
[10:38:37] <_joe_> Amir1: did mysql crash?
[10:38:50] I didn't check. I depooled
[10:38:55] let me check this one
[10:45:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P66960 and previous config saved to /var/cache/conftool/dbconfig/20240727-104502-ladsgroup.json
[11:00:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P66961 and previous config saved to /var/cache/conftool/dbconfig/20240727-110007-ladsgroup.json
[11:06:40] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:15:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P66962 and previous config saved to /var/cache/conftool/dbconfig/20240727-111512-ladsgroup.json
[11:30:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P66963 and previous config saved to /var/cache/conftool/dbconfig/20240727-113018-ladsgroup.json
[13:13:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T367856)', diff saved to https://phabricator.wikimedia.org/P66964 and previous config saved to /var/cache/conftool/dbconfig/20240727-131316-marostegui.json
[13:13:22] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[13:28:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P66965 and previous config saved to /var/cache/conftool/dbconfig/20240727-132824-marostegui.json
[13:43:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P66966 and previous config saved to /var/cache/conftool/dbconfig/20240727-134331-marostegui.json
[13:58:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T367856)', diff saved to https://phabricator.wikimedia.org/P66967 and previous config saved to /var/cache/conftool/dbconfig/20240727-135838-marostegui.json
[13:58:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[13:58:43] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[13:58:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[13:59:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T367856)', diff saved to https://phabricator.wikimedia.org/P66968 and previous config saved to /var/cache/conftool/dbconfig/20240727-135859-marostegui.json
[14:39:21] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:59:21] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:40] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:11:33] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10020696 (VRiley-WMF)
[19:06:40] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:15:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.63% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:25:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.34% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:06:40] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:38:22] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1057368
[23:38:22] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1057368 (owner: TrainBranchBot)