[00:01:46] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:02:03] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1033387 (owner: TrainBranchBot)
[00:02:46] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 15 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:03:46] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:03:55] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:21:45] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[00:58:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T352010)', diff saved to https://phabricator.wikimedia.org/P62639 and previous config saved to /var/cache/conftool/dbconfig/20240519-005811-ladsgroup.json
[00:58:18] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[01:13:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P62640 and previous config saved to /var/cache/conftool/dbconfig/20240519-011320-ladsgroup.json
[01:28:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P62641 and previous config saved to /var/cache/conftool/dbconfig/20240519-012827-ladsgroup.json
[01:43:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T352010)', diff saved to https://phabricator.wikimedia.org/P62642 and previous config saved to /var/cache/conftool/dbconfig/20240519-014335-ladsgroup.json
[01:43:38] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance
[01:43:40] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[01:43:51] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance
[02:36:45] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[03:01:45] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:51:45] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:06:45] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:21:46] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:10:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2214.codfw.wmnet with reason: Maintenance
[05:10:22] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2214.codfw.wmnet with reason: Maintenance
[05:10:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T352010)', diff saved to https://phabricator.wikimedia.org/P62643 and previous config saved to /var/cache/conftool/dbconfig/20240519-051029-ladsgroup.json
[05:10:37] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:32:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance
[05:32:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240519T0700)
[07:00:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T352010)', diff saved to https://phabricator.wikimedia.org/P62644 and previous config saved to /var/cache/conftool/dbconfig/20240519-070008-ladsgroup.json
[07:00:26] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[07:01:46] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:15:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P62645 and previous config saved to /var/cache/conftool/dbconfig/20240519-071517-ladsgroup.json
[07:30:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P62646 and previous config saved to /var/cache/conftool/dbconfig/20240519-073025-ladsgroup.json
[07:45:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T352010)', diff saved to https://phabricator.wikimedia.org/P62647 and previous config saved to /var/cache/conftool/dbconfig/20240519-074532-ladsgroup.json
[07:45:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2217.codfw.wmnet with reason: Maintenance
[07:45:39] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[07:45:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2217.codfw.wmnet with reason: Maintenance
[07:45:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T352010)', diff saved to https://phabricator.wikimedia.org/P62648 and previous config saved to /var/cache/conftool/dbconfig/20240519-074556-ladsgroup.json
[07:51:46] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:46] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:21:45] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:37:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T352010)', diff saved to https://phabricator.wikimedia.org/P62649 and previous config saved to /var/cache/conftool/dbconfig/20240519-093723-ladsgroup.json
[09:37:29] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[09:52:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P62650 and previous config saved to /var/cache/conftool/dbconfig/20240519-095231-ladsgroup.json
[10:07:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P62651 and previous config saved to /var/cache/conftool/dbconfig/20240519-100739-ladsgroup.json
[10:22:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T352010)', diff saved to https://phabricator.wikimedia.org/P62652 and previous config saved to /var/cache/conftool/dbconfig/20240519-102247-ladsgroup.json
[10:22:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[10:22:52] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[10:23:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[10:23:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[10:23:08] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[10:23:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T352010)', diff saved to https://phabricator.wikimedia.org/P62653 and previous config saved to /var/cache/conftool/dbconfig/20240519-102315-ladsgroup.json
[10:30:44] (PS1) NMW03: Enable wgMinervaShowCategories for English Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1033656 (https://phabricator.wikimedia.org/T365323)
[10:31:40] (PS70) AOkoth: prometheus: puppetise sql_exporter [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822)
[10:31:40] (PS1) AOkoth: vrts: aesthetic code improvements [puppet] - https://gerrit.wikimedia.org/r/1033657
[10:32:50] (CR) CI reject: [V:-1] vrts: aesthetic code improvements [puppet] - https://gerrit.wikimedia.org/r/1033657 (owner: AOkoth)
[10:35:45] (PS2) AOkoth: vrts: aesthetic code improvements [puppet] - https://gerrit.wikimedia.org/r/1033657
[10:36:06] (CR) CI reject: [V:-1] vrts: aesthetic code improvements [puppet] - https://gerrit.wikimedia.org/r/1033657 (owner: AOkoth)
[10:38:08] (PS3) AOkoth: vrts: aesthetic code improvements [puppet] - https://gerrit.wikimedia.org/r/1033657
[10:42:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T352010)', diff saved to https://phabricator.wikimedia.org/P62654 and previous config saved to /var/cache/conftool/dbconfig/20240519-104206-ladsgroup.json
[10:42:11] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[10:42:21] (CR) AOkoth: "Compile Results: https://puppet-compiler.wmflabs.org/output/1033657/2508/" [puppet] - https://gerrit.wikimedia.org/r/1033657 (owner: AOkoth)
[10:43:27] (PS4) AOkoth: vrts: aesthetic code improvements [puppet] - https://gerrit.wikimedia.org/r/1033657
[10:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[10:57:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P62655 and previous config saved to /var/cache/conftool/dbconfig/20240519-105714-ladsgroup.json
[11:01:46] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:02:02] (PS5) AOkoth: vrts: aesthetic code improvements [puppet] - https://gerrit.wikimedia.org/r/1033657
[11:07:22] (PS6) AOkoth: vrts: aesthetic code improvements [puppet] - https://gerrit.wikimedia.org/r/1033657
[11:07:35] (PS7) AOkoth: vrts: aesthetic code improvements [puppet] - https://gerrit.wikimedia.org/r/1033657
[11:12:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P62656 and previous config saved to /var/cache/conftool/dbconfig/20240519-111222-ladsgroup.json
[11:15:09] (CR) AOkoth: "Compiler Results: https://puppet-compiler.wmflabs.org/output/1033657/2510/" [puppet] - https://gerrit.wikimedia.org/r/1033657 (owner: AOkoth)
[11:27:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T352010)', diff saved to https://phabricator.wikimedia.org/P62657 and previous config saved to /var/cache/conftool/dbconfig/20240519-112730-ladsgroup.json
[11:27:34] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[11:51:46] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:06:46] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:21:46] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:36:46] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[14:58:55] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:51:46] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:06:46] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:21:46] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[16:38:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[16:38:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[16:38:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[16:38:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[16:38:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T364299)', diff saved to https://phabricator.wikimedia.org/P62658 and previous config saved to /var/cache/conftool/dbconfig/20240519-163855-marostegui.json
[16:39:01] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[16:39:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Schema change
[16:39:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Schema change
[16:51:18] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:21:18] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:39:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T364299)', diff saved to https://phabricator.wikimedia.org/P62660 and previous config saved to /var/cache/conftool/dbconfig/20240519-173923-marostegui.json
[17:39:28] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[17:54:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P62661 and previous config saved to /var/cache/conftool/dbconfig/20240519-175431-marostegui.json
[18:09:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P62662 and previous config saved to /var/cache/conftool/dbconfig/20240519-180939-marostegui.json
[18:24:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T364299)', diff saved to https://phabricator.wikimedia.org/P62663 and previous config saved to /var/cache/conftool/dbconfig/20240519-182447-marostegui.json
[18:24:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[18:24:51] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[18:25:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[18:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[19:01:46] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:02:37] (PS1) Vgutierrez: hiera: Set p::contacts::role_contacts for ncmonitor [puppet] - https://gerrit.wikimedia.org/r/1033705
[19:51:46] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:06:46] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:21:46] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[21:08:32] (PS2) David Martin: Add a stream for tracking the API of WikiLambda [mediawiki-config] - https://gerrit.wikimedia.org/r/1017962 (https://phabricator.wikimedia.org/T356228)
[21:49:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T352010)', diff saved to https://phabricator.wikimedia.org/P62664 and previous config saved to /var/cache/conftool/dbconfig/20240519-214936-ladsgroup.json
[21:49:42] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[22:04:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P62665 and previous config saved to /var/cache/conftool/dbconfig/20240519-220445-ladsgroup.json
[22:13:58] (PS3) RLazarus: tegola-vector-tiles: Dependency updates [deployment-charts] - https://gerrit.wikimedia.org/r/1032524 (https://phabricator.wikimedia.org/T362978)
[22:18:29] (PS4) RLazarus: tegola-vector-tiles: Add securityContext and update dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/1032524 (https://phabricator.wikimedia.org/T362978)
[22:19:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P62666 and previous config saved to /var/cache/conftool/dbconfig/20240519-221954-ladsgroup.json
[22:35:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T352010)', diff saved to https://phabricator.wikimedia.org/P62667 and previous config saved to /var/cache/conftool/dbconfig/20240519-223502-ladsgroup.json
[22:35:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[22:35:07] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[22:35:18] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[22:35:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T352010)', diff saved to https://phabricator.wikimedia.org/P62668 and previous config saved to /var/cache/conftool/dbconfig/20240519-223525-ladsgroup.json
[22:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[22:58:01] (CR) RLazarus: "Good catch." [deployment-charts] - https://gerrit.wikimedia.org/r/1032524 (https://phabricator.wikimedia.org/T362978) (owner: RLazarus)
[23:01:46] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:38:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1033388
[23:38:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1033388 (owner: TrainBranchBot)
[23:51:46] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:57:55] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1033388 (owner: TrainBranchBot)