[00:01:42] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1039602 (owner: TrainBranchBot) [00:06:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T364069)', diff saved to https://phabricator.wikimedia.org/P64390 and previous config saved to /var/cache/conftool/dbconfig/20240609-000640-marostegui.json [00:06:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [00:06:45] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [00:06:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [00:06:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [00:07:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [00:07:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T364069)', diff saved to https://phabricator.wikimedia.org/P64391 and previous config saved to /var/cache/conftool/dbconfig/20240609-000718-marostegui.json [00:09:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:18:44] FIRING: 
SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:39:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T364299)', diff saved to https://phabricator.wikimedia.org/P64392 and previous config saved to /var/cache/conftool/dbconfig/20240609-003906-marostegui.json [00:39:10] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [00:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:54:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P64393 and previous config saved to /var/cache/conftool/dbconfig/20240609-005414-marostegui.json [01:09:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P64394 and previous config saved to /var/cache/conftool/dbconfig/20240609-010922-marostegui.json [01:24:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T364299)', diff saved to https://phabricator.wikimedia.org/P64395 and previous config saved to /var/cache/conftool/dbconfig/20240609-012432-marostegui.json [01:24:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 
6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:24:36] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [01:24:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:34:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [01:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:00:59] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance [02:01:12] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance [02:01:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T352010)', diff saved to https://phabricator.wikimedia.org/P64396 and previous config saved to /var/cache/conftool/dbconfig/20240609-020120-ladsgroup.json [02:01:24] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [02:13:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T364069)', diff saved to https://phabricator.wikimedia.org/P64397 and previous config saved to 
/var/cache/conftool/dbconfig/20240609-021333-marostegui.json [02:13:41] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [02:28:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P64398 and previous config saved to /var/cache/conftool/dbconfig/20240609-022840-marostegui.json [02:38:44] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P64399 and previous config saved to /var/cache/conftool/dbconfig/20240609-024349-marostegui.json [02:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:55:45] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:58:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T364069)', diff saved to https://phabricator.wikimedia.org/P64400 and previous config saved to 
/var/cache/conftool/dbconfig/20240609-025856-marostegui.json [02:59:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [02:59:03] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [02:59:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [02:59:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T364069)', diff saved to https://phabricator.wikimedia.org/P64401 and previous config saved to /var/cache/conftool/dbconfig/20240609-025921-marostegui.json [04:09:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:38:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T352010)', diff saved to https://phabricator.wikimedia.org/P64402 and previous config saved to /var/cache/conftool/dbconfig/20240609-043811-ladsgroup.json [04:38:14] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [04:53:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P64403 and previous config saved to /var/cache/conftool/dbconfig/20240609-045319-ladsgroup.json [05:02:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T364069)', diff saved to https://phabricator.wikimedia.org/P64404 and previous config saved to /var/cache/conftool/dbconfig/20240609-050245-marostegui.json [05:02:49] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [05:08:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P64405 and previous config saved to /var/cache/conftool/dbconfig/20240609-050826-ladsgroup.json [05:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:17:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P64406 and previous config saved to /var/cache/conftool/dbconfig/20240609-051753-marostegui.json [05:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:23:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T352010)', diff saved to https://phabricator.wikimedia.org/P64407 and previous config saved to /var/cache/conftool/dbconfig/20240609-052334-ladsgroup.json [05:23:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [05:23:38] T352010: 
Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [05:23:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [05:23:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T352010)', diff saved to https://phabricator.wikimedia.org/P64408 and previous config saved to /var/cache/conftool/dbconfig/20240609-052358-ladsgroup.json [05:33:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P64409 and previous config saved to /var/cache/conftool/dbconfig/20240609-053301-marostegui.json [05:34:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [05:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:48:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T364069)', diff saved to https://phabricator.wikimedia.org/P64410 and previous config saved to /var/cache/conftool/dbconfig/20240609-054809-marostegui.json [05:48:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [05:48:13] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [05:48:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [05:48:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T364069)', diff 
saved to https://phabricator.wikimedia.org/P64411 and previous config saved to /var/cache/conftool/dbconfig/20240609-054833-marostegui.json [05:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:50:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T352010)', diff saved to https://phabricator.wikimedia.org/P64412 and previous config saved to /var/cache/conftool/dbconfig/20240609-055017-ladsgroup.json [05:50:20] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [06:01:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T352010)', diff saved to https://phabricator.wikimedia.org/P64413 and previous config saved to /var/cache/conftool/dbconfig/20240609-060146-ladsgroup.json [06:01:50] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [06:03:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:05:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P64414 and previous config saved to /var/cache/conftool/dbconfig/20240609-060525-ladsgroup.json [06:08:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - 
https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:16:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P64415 and previous config saved to /var/cache/conftool/dbconfig/20240609-061653-ladsgroup.json [06:20:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P64416 and previous config saved to /var/cache/conftool/dbconfig/20240609-062033-ladsgroup.json [06:32:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P64417 and previous config saved to /var/cache/conftool/dbconfig/20240609-063201-ladsgroup.json [06:35:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T352010)', diff saved to https://phabricator.wikimedia.org/P64418 and previous config saved to /var/cache/conftool/dbconfig/20240609-063543-ladsgroup.json [06:35:46] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [06:35:48] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [06:35:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [06:36:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T352010)', diff saved to https://phabricator.wikimedia.org/P64419 and previous config saved to /var/cache/conftool/dbconfig/20240609-063607-ladsgroup.json [06:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:47:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T352010)', diff saved to https://phabricator.wikimedia.org/P64420 and previous config saved to /var/cache/conftool/dbconfig/20240609-064709-ladsgroup.json [06:47:12] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1234.eqiad.wmnet with reason: Maintenance [06:47:13] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [06:47:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1234.eqiad.wmnet with reason: Maintenance [06:47:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T352010)', diff saved to https://phabricator.wikimedia.org/P64421 and previous config saved to /var/cache/conftool/dbconfig/20240609-064733-ladsgroup.json [06:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240609T0700) [07:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:16:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T364069)', diff saved to https://phabricator.wikimedia.org/P64422 and previous config saved to /var/cache/conftool/dbconfig/20240609-071601-marostegui.json [07:16:05] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [07:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:31:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P64423 and previous config saved to /var/cache/conftool/dbconfig/20240609-073109-marostegui.json [07:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:46:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P64424 and previous config saved to /var/cache/conftool/dbconfig/20240609-074617-marostegui.json [07:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:54:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1221.eqiad.wmnet with reason: Maintenance [07:55:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1221.eqiad.wmnet with reason: Maintenance [07:55:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:55:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:55:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T364299)', diff saved to https://phabricator.wikimedia.org/P64425 and previous config saved to /var/cache/conftool/dbconfig/20240609-075533-marostegui.json [07:55:37] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [08:01:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T364069)', diff saved to https://phabricator.wikimedia.org/P64426 and previous config saved to /var/cache/conftool/dbconfig/20240609-080125-marostegui.json [08:01:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [08:01:29] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [08:01:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [08:01:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T364069)', diff saved to https://phabricator.wikimedia.org/P64427 and previous config saved to /var/cache/conftool/dbconfig/20240609-080149-marostegui.json [08:09:10] FIRING: [2x] SystemdUnitFailed: 
wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:13:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T364069)', diff saved to https://phabricator.wikimedia.org/P64428 and previous config saved to /var/cache/conftool/dbconfig/20240609-091329-marostegui.json [09:13:34] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [09:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:28:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P64429 and previous config saved to 
/var/cache/conftool/dbconfig/20240609-092837-marostegui.json [09:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:43:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P64430 and previous config saved to /var/cache/conftool/dbconfig/20240609-094346-marostegui.json [09:58:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T364069)', diff saved to https://phabricator.wikimedia.org/P64431 and previous config saved to /var/cache/conftool/dbconfig/20240609-095854-marostegui.json [09:58:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance [09:58:58] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [09:59:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance [10:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:23:12] (PS1) Ebrahim: errorpage: Add dark mode support to error page [puppet] - https://gerrit.wikimedia.org/r/1040809 [10:25:46] (PS2) Ebrahim: errorpage: Add dark mode support to error page [puppet] - 
https://gerrit.wikimedia.org/r/1040809 [10:33:25] FIRING: SystemdUnitFailed: imagecatalog_record.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:34:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T352010)', diff saved to https://phabricator.wikimedia.org/P64432 and previous config saved to /var/cache/conftool/dbconfig/20240609-103421-ladsgroup.json [10:34:26] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [10:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P64433 and previous config saved to /var/cache/conftool/dbconfig/20240609-104929-ladsgroup.json [11:04:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P64434 and previous config saved to /var/cache/conftool/dbconfig/20240609-110437-ladsgroup.json [11:07:30] FIRING: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:09:21] PROBLEM - WDQS SPARQL on wdqs1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:12:11] RECOVERY - WDQS SPARQL on wdqs1020 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:17:30] RESOLVED: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:19:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T352010)', diff saved to https://phabricator.wikimedia.org/P64435 and previous config saved to /var/cache/conftool/dbconfig/20240609-111945-ladsgroup.json [11:19:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [11:19:49] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:20:00] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [11:20:02] ops-eqiad, DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T367004 (phaultfinder) NEW [11:22:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T364299)', diff saved to https://phabricator.wikimedia.org/P64436 and previous config saved to /var/cache/conftool/dbconfig/20240609-112229-marostegui.json [11:22:34] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [11:25:14] (PS3) 
Ebrahim: errorpage: Add dark mode support to error page [puppet] - https://gerrit.wikimedia.org/r/1040809 [11:29:30] (PS4) Ebrahim: errorpage: Add dark mode support to error page [puppet] - https://gerrit.wikimedia.org/r/1040809 [11:33:25] RESOLVED: SystemdUnitFailed: imagecatalog_record.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:37:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P64437 and previous config saved to /var/cache/conftool/dbconfig/20240609-113737-marostegui.json [11:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:52:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P64438 and previous config saved to /var/cache/conftool/dbconfig/20240609-115245-marostegui.json [12:03:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2217.codfw.wmnet with reason: Maintenance [12:03:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2217.codfw.wmnet with reason: Maintenance [12:04:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T364069)', diff saved to 
https://phabricator.wikimedia.org/P64439 and previous config saved to /var/cache/conftool/dbconfig/20240609-120400-marostegui.json [12:04:06] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [12:07:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T364299)', diff saved to https://phabricator.wikimedia.org/P64440 and previous config saved to /var/cache/conftool/dbconfig/20240609-120753-marostegui.json [12:07:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1248.eqiad.wmnet with reason: Maintenance [12:07:58] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [12:08:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1248.eqiad.wmnet with reason: Maintenance [12:08:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T364299)', diff saved to https://phabricator.wikimedia.org/P64441 and previous config saved to /var/cache/conftool/dbconfig/20240609-120817-marostegui.json [12:09:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:45:46] RESOLVED: 
SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:58:55] ops-eqiad, SRE, DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T367004#9873795 (phaultfinder) [13:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:45:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T364069)', diff saved to https://phabricator.wikimedia.org/P64442 and previous config saved to /var/cache/conftool/dbconfig/20240609-134508-marostegui.json [13:45:12] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [13:45:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T352010)', diff saved to 
https://phabricator.wikimedia.org/P64443 and previous config saved to /var/cache/conftool/dbconfig/20240609-134541-ladsgroup.json [13:45:47] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:00:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P64444 and previous config saved to /var/cache/conftool/dbconfig/20240609-140016-marostegui.json [14:00:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P64445 and previous config saved to /var/cache/conftool/dbconfig/20240609-140049-ladsgroup.json [14:15:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P64446 and previous config saved to /var/cache/conftool/dbconfig/20240609-141524-marostegui.json [14:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:15:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P64447 and previous config saved to /var/cache/conftool/dbconfig/20240609-141557-ladsgroup.json [14:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:30:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T364069)', diff saved to https://phabricator.wikimedia.org/P64448 and previous config saved to 
/var/cache/conftool/dbconfig/20240609-143032-marostegui.json [14:30:37] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [14:31:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T352010)', diff saved to https://phabricator.wikimedia.org/P64449 and previous config saved to /var/cache/conftool/dbconfig/20240609-143105-ladsgroup.json [14:31:08] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [14:31:11] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:31:21] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [14:31:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T352010)', diff saved to https://phabricator.wikimedia.org/P64450 and previous config saved to /var/cache/conftool/dbconfig/20240609-143128-ladsgroup.json [14:34:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T352010)', diff saved to https://phabricator.wikimedia.org/P64451 and previous config saved to /var/cache/conftool/dbconfig/20240609-143432-ladsgroup.json [14:38:45] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:49:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P64452 and previous config saved to /var/cache/conftool/dbconfig/20240609-144940-ladsgroup.json [14:55:46] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P64453 and previous config saved to /var/cache/conftool/dbconfig/20240609-150448-ladsgroup.json [15:08:52] ops-eqiad, SRE, DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T367004#9873829 (phaultfinder) [15:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:18:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:19:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T352010)', diff saved to https://phabricator.wikimedia.org/P64454 and previous config saved to /var/cache/conftool/dbconfig/20240609-151956-ladsgroup.json [15:19:59] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 
day, 0:00:00 on db1235.eqiad.wmnet with reason: Maintenance [15:20:04] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:20:12] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1235.eqiad.wmnet with reason: Maintenance [15:20:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T352010)', diff saved to https://phabricator.wikimedia.org/P64455 and previous config saved to /var/cache/conftool/dbconfig/20240609-152020-ladsgroup.json [15:20:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T364299)', diff saved to https://phabricator.wikimedia.org/P64456 and previous config saved to /var/cache/conftool/dbconfig/20240609-152057-marostegui.json [15:21:01] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [15:36:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P64457 and previous config saved to /var/cache/conftool/dbconfig/20240609-153605-marostegui.json [15:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:51:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P64458 and previous config saved to /var/cache/conftool/dbconfig/20240609-155113-marostegui.json [16:06:22] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1248 (T364299)', diff saved to https://phabricator.wikimedia.org/P64459 and previous config saved to /var/cache/conftool/dbconfig/20240609-160621-marostegui.json [16:06:26] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [16:09:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - 
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:38:23] RECOVERY - Host elastic2099 is UP: PING WARNING - Packet loss = 80%, RTA = 0.58 ms [18:38:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T352010)', diff saved to https://phabricator.wikimedia.org/P64460 and previous config saved to /var/cache/conftool/dbconfig/20240609-183839-ladsgroup.json [18:38:43] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:39:07] PROBLEM - SSH on elastic2099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:44:47] PROBLEM - Host elastic2099 is DOWN: PING CRITICAL - Packet loss = 100% [18:53:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to 
https://phabricator.wikimedia.org/P64461 and previous config saved to /var/cache/conftool/dbconfig/20240609-185347-ladsgroup.json [19:08:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P64462 and previous config saved to /var/cache/conftool/dbconfig/20240609-190856-ladsgroup.json [19:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:24:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T352010)', diff saved to https://phabricator.wikimedia.org/P64463 and previous config saved to /var/cache/conftool/dbconfig/20240609-192404-ladsgroup.json [19:24:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance [19:24:09] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:24:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance [19:24:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T352010)', diff saved to https://phabricator.wikimedia.org/P64464 and previous config saved to /var/cache/conftool/dbconfig/20240609-192428-ladsgroup.json [19:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:58:56] ops-eqiad, SRE, DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T367004#9873908 (phaultfinder) [20:09:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:18:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:18:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:23:51] ops-eqiad, SRE, DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T367004#9873930 (phaultfinder) [22:43:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T352010)', diff saved to https://phabricator.wikimedia.org/P64465 and previous config saved to /var/cache/conftool/dbconfig/20240609-224357-ladsgroup.json [22:44:02] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [22:45:46] RESOLVED: SystemdUnitFailed: 
netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:55:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T352010)', diff saved to https://phabricator.wikimedia.org/P64466 and previous config saved to /var/cache/conftool/dbconfig/20240609-225523-ladsgroup.json [22:55:27] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [22:59:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P64467 and previous config saved to /var/cache/conftool/dbconfig/20240609-225905-ladsgroup.json [23:10:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P64468 and previous config saved to /var/cache/conftool/dbconfig/20240609-231031-ladsgroup.json [23:14:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P64469 and previous config saved to /var/cache/conftool/dbconfig/20240609-231413-ladsgroup.json [23:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:25:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P64470 and previous config saved to /var/cache/conftool/dbconfig/20240609-232539-ladsgroup.json [23:29:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T352010)', diff saved to https://phabricator.wikimedia.org/P64471 and previous config saved to /var/cache/conftool/dbconfig/20240609-232921-ladsgroup.json [23:29:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance [23:29:25] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:29:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance [23:38:15] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1039603 [23:38:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1039603 (owner: TrainBranchBot) [23:40:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T352010)', diff saved to https://phabricator.wikimedia.org/P64472 and previous config saved to /var/cache/conftool/dbconfig/20240609-234047-ladsgroup.json [23:40:50] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance [23:40:52] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:41:03] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance [23:41:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T352010)', diff saved to https://phabricator.wikimedia.org/P64473 and previous config saved to /var/cache/conftool/dbconfig/20240609-234110-ladsgroup.json