[00:01:37] (CR) Andrea Denisse: "Apologies for the noise my linters added. Please let me know if you'd prefer me to disable the relevant Emacs packages that introduced tho" [puppet] - https://gerrit.wikimedia.org/r/1032608 (https://phabricator.wikimedia.org/T267664) (owner: Andrea Denisse)
[00:04:34] (CR) Cwhite: [C:-1] "This is unreviewable due to number of style changes." [puppet] - https://gerrit.wikimedia.org/r/1032608 (https://phabricator.wikimedia.org/T267664) (owner: Andrea Denisse)
[00:05:25] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:38] (CR) Andrea Denisse: "My apologies, I'll update send a new patchset." [puppet] - https://gerrit.wikimedia.org/r/1032608 (https://phabricator.wikimedia.org/T267664) (owner: Andrea Denisse)
[00:11:53] (PS3) Andrea Denisse: smart: Refine data collection to differentiate RAID and non-RAID disks [puppet] - https://gerrit.wikimedia.org/r/1032608 (https://phabricator.wikimedia.org/T267664)
[00:12:33] (CR) Andrea Denisse: "I've disabled the minor modes that made the style changes. Thanks."
[puppet] - https://gerrit.wikimedia.org/r/1032608 (https://phabricator.wikimedia.org/T267664) (owner: Andrea Denisse)
[00:20:31] (CR) Andrea Denisse: "Done" [mediawiki-config] - https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: Andrea Denisse)
[00:40:15] (PS1) Dzahn: gerrit: fix team name for https monitor alerting [puppet] - https://gerrit.wikimedia.org/r/1032609 (https://phabricator.wikimedia.org/T365148)
[00:47:01] (PS1) Dzahn: scap: remove snapshot1008 from dsh group mediawiki-installation [puppet] - https://gerrit.wikimedia.org/r/1032610 (https://phabricator.wikimedia.org/T325228)
[00:48:38] (CR) Dzahn: [C:+2] scap: remove snapshot1008 from dsh group mediawiki-installation [puppet] - https://gerrit.wikimedia.org/r/1032610 (https://phabricator.wikimedia.org/T325228) (owner: Dzahn)
[00:48:44] (PS2) Dzahn: scap: remove snapshot1008 from dsh group mediawiki-installation [puppet] - https://gerrit.wikimedia.org/r/1032610 (https://phabricator.wikimedia.org/T325228)
[00:49:38] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:52:04] (CR) Dzahn: [V:+2 C:+2] scap: remove snapshot1008 from dsh group mediawiki-installation [puppet] - https://gerrit.wikimedia.org/r/1032610 (https://phabricator.wikimedia.org/T325228) (owner: Dzahn)
[00:54:41] PROBLEM - MegaRAID on es2022 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:54:42] ACKNOWLEDGEMENT - MegaRAID on es2022 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T365213 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:54:53]
ops-codfw, SRE: Degraded RAID on es2022 - https://phabricator.wikimedia.org/T365213 (ops-monitoring-bot) NEW
[01:19:30] FIRING: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:24:30] RESOLVED: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:26:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T352010)', diff saved to https://phabricator.wikimedia.org/P62554 and previous config saved to /var/cache/conftool/dbconfig/20240517-012622-ladsgroup.json
[01:26:27] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[01:39:59] (PS2) Dzahn: gerrit: fix team name for https monitor alerting [puppet] - https://gerrit.wikimedia.org/r/1032609 (https://phabricator.wikimedia.org/T365148)
[01:41:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P62555 and previous config saved to /var/cache/conftool/dbconfig/20240517-014132-ladsgroup.json
[01:42:14] (CR) Dzahn: [C:+2] gerrit: fix team name for https monitor alerting [puppet] - https://gerrit.wikimedia.org/r/1032609 (https://phabricator.wikimedia.org/T365148) (owner: Dzahn)
[01:43:26] (CR) Dzahn: [C:+2] "In case this is confusing: the HTTPS check was already using this new name (see https://codesearch.wmcloud.org/search/?q=collaboration-ser" [puppet] - https://gerrit.wikimedia.org/r/1032609
(https://phabricator.wikimedia.org/T365148) (owner: Dzahn)
[01:56:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P62556 and previous config saved to /var/cache/conftool/dbconfig/20240517-015640-ladsgroup.json
[01:57:26] SRE, SRE-Access-Requests: Requesting access to fr-tech-devs for cstone - https://phabricator.wikimedia.org/T365214 (Cstone) NEW
[02:03:55] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:11:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T352010)', diff saved to https://phabricator.wikimedia.org/P62557 and previous config saved to /var/cache/conftool/dbconfig/20240517-021148-ladsgroup.json
[02:11:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[02:11:53] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[02:12:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[02:12:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T352010)', diff saved to https://phabricator.wikimedia.org/P62558 and previous config saved to /var/cache/conftool/dbconfig/20240517-021211-ladsgroup.json
[02:22:48] (CR) Dzahn: lists: move definition of primary and standby host to common hiera (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1032032 (owner: Dzahn)
[02:36:29] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:36:45] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:45] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:54:05] SRE, SRE-Access-Requests: Requesting access to crm for cstone - https://phabricator.wikimedia.org/T365214#9807584 (Cstone)
[02:58:55] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:01:45] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:33:55] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:36:29] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:06:45] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:31:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T352010)', diff saved to https://phabricator.wikimedia.org/P62559 and previous config saved to /var/cache/conftool/dbconfig/20240517-043134-ladsgroup.json
[04:31:47] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[04:46:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P62560 and previous config saved to /var/cache/conftool/dbconfig/20240517-044642-ladsgroup.json
[04:51:45] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:01:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P62561 and previous config saved to /var/cache/conftool/dbconfig/20240517-050150-ladsgroup.json
[05:05:46] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[05:06:00] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[05:16:35] PROBLEM - MediaWiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[05:16:59] !log ladsgroup@cumin1002 dbctl
commit (dc=all): 'Repooling after maintenance db1180 (T352010)', diff saved to https://phabricator.wikimedia.org/P62562 and previous config saved to /var/cache/conftool/dbconfig/20240517-051658-ladsgroup.json
[05:17:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[05:17:03] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:17:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[05:17:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T352010)', diff saved to https://phabricator.wikimedia.org/P62563 and previous config saved to /var/cache/conftool/dbconfig/20240517-051721-ladsgroup.json
[05:17:33] !log Restart wikibugs
[05:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:33] (PS1) Marostegui: site.pp: Reorganize es6 hosts [puppet] - https://gerrit.wikimedia.org/r/1032624
[05:24:04] (CR) Marostegui: [C:+2] site.pp: Reorganize es6 hosts [puppet] - https://gerrit.wikimedia.org/r/1032624 (owner: Marostegui)
[05:50:49] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 55 hosts with reason: T363975
[05:52:12] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 55 hosts with reason: T363975
[05:52:26] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T363975 eqiad cluster restart - ryankemper@cumin2002 - T363975
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240517T0600)
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter -
https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:06:45] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:08:55] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:59] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T363975 eqiad cluster restart - ryankemper@cumin2002 - T363975
[06:13:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T364299)', diff saved to https://phabricator.wikimedia.org/P62564 and previous config saved to /var/cache/conftool/dbconfig/20240517-061334-marostegui.json
[06:13:39] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[06:17:21] !log ryankemper@cumin2002 START - Cookbook sre.hosts.remove-downtime for 55 hosts
[06:18:01] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 55 hosts
[06:22:06] SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, DBA, Schema-change-in-production, Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9807715 (Marostegui) Stalled→Open...
[06:22:46] (PS1) BCornwall: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1032626
[06:28:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P62565 and previous config saved to /var/cache/conftool/dbconfig/20240517-062842-marostegui.json
[06:43:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P62566 and previous config saved to /var/cache/conftool/dbconfig/20240517-064350-marostegui.json
[06:44:52] SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, DBA, Schema-change-in-production, Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9807728 (Marostegui) Enabled slow query...
[06:50:51] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[06:58:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T364299)', diff saved to https://phabricator.wikimedia.org/P62567 and previous config saved to /var/cache/conftool/dbconfig/20240517-065857-marostegui.json
[06:59:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[06:59:01] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[06:59:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[06:59:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T364299)', diff saved to https://phabricator.wikimedia.org/P62568 and previous config saved to /var/cache/conftool/dbconfig/20240517-065920-marostegui.json
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240517T0700)
[07:01:18] (CR) Muehlenhoff: "What do you mean with limitations, are you running into the PageSize limit?
This can be handled transparently by the LDAP server, it can r" [software/bitu] - https://gerrit.wikimedia.org/r/1026919 (https://phabricator.wikimedia.org/T163478) (owner: Slyngshede)
[07:01:45] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:01:45] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:14:32] (CR) Muehlenhoff: [C:+2] Remove obsolete wmflabs certs [puppet] - https://gerrit.wikimedia.org/r/1031947 (owner: Muehlenhoff)
[07:17:13] (CR) Muehlenhoff: [C:+2] an-test-druid: Use firewall::service for Zookeeper firewall config [puppet] - https://gerrit.wikimedia.org/r/1031842 (owner: Muehlenhoff)
[07:18:21] (PS1) Marostegui: check_flags_per_dc: Add es6 and es7 [software] - https://gerrit.wikimedia.org/r/1032630
[07:18:26] (CR) CI reject: [V:-1] check_flags_per_dc: Add es6 and es7 [software] - https://gerrit.wikimedia.org/r/1032630 (owner: Marostegui)
[07:19:47] (CR) Filippo Giunchedi: [C:+1] postfix: prometheus ops config for mx-out boxes [puppet] - https://gerrit.wikimedia.org/r/1019116 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway)
[07:20:12] (PS1) Marostegui: check_flags_per_dc: Add es6 and es7 [software] - https://gerrit.wikimedia.org/r/1032631
[07:20:17] (CR) CI reject: [V:-1] check_flags_per_dc: Add es6 and es7 [software] - https://gerrit.wikimedia.org/r/1032631 (owner: Marostegui)
[07:22:25] SRE, Wikimedia-Incident: upstream connect error or disconnect/reset before headers.
reset reason: overflow - https://phabricator.wikimedia.org/T301505#9807768 (LSobanski) @akosiaris I would be inclined to close this task or would you prefer to leave it open as a pointer for users searching for the error...
[07:30:40] ops-codfw, SRE, DBA: Degraded RAID on es2022 - https://phabricator.wikimedia.org/T365213#9807774 (Marostegui)
[07:30:51] (PS1) Muehlenhoff: an-test-druid: Switch to use nftables instead of iptables [puppet] - https://gerrit.wikimedia.org/r/1032632
[07:32:09] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1032632 (owner: Muehlenhoff)
[07:34:29] PROBLEM - MediaWiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[07:40:48] (PS1) JMeybohm: Add kubestagemaster100[345] to the etcd-server SRV record [dns] - https://gerrit.wikimedia.org/r/1032633 (https://phabricator.wikimedia.org/T363307)
[07:40:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T352010)', diff saved to https://phabricator.wikimedia.org/P62570 and previous config saved to /var/cache/conftool/dbconfig/20240517-074050-ladsgroup.json
[07:40:54] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[07:41:41] (CR) CI reject: [V:-1] Add kubestagemaster100[345] to the etcd-server SRV record [dns] - https://gerrit.wikimedia.org/r/1032633 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm)
[07:43:02] (PS1) JMeybohm: Add kubestagemaster100[345] as master_stacked [puppet] - https://gerrit.wikimedia.org/r/1032634 (https://phabricator.wikimedia.org/T363307)
[07:45:44] (CR) Hashar: "recheck" [software] - https://gerrit.wikimedia.org/r/1032630 (owner: Marostegui)
[07:45:48] (PS1) Muehlenhoff:
Deprecate system::role for backup roles [puppet] - https://gerrit.wikimedia.org/r/1032636
[07:47:07] (CR) Marostegui: "<3" [software] - https://gerrit.wikimedia.org/r/1032630 (owner: Marostegui)
[07:49:42] (CR) Marostegui: [C:+2] check_flags_per_dc: Add es6 and es7 [software] - https://gerrit.wikimedia.org/r/1032630 (owner: Marostegui)
[07:50:25] (Merged) jenkins-bot: check_flags_per_dc: Add es6 and es7 [software] - https://gerrit.wikimedia.org/r/1032630 (owner: Marostegui)
[07:50:37] (PS2) JMeybohm: Add kubestagemaster100[345] to the etcd-server SRV record [dns] - https://gerrit.wikimedia.org/r/1032633 (https://phabricator.wikimedia.org/T363307)
[07:53:25] SRE, Infrastructure-Foundations, netops: magru network setup - https://phabricator.wikimedia.org/T362421#9807801 (ayounsi) Before advertising ns2, we need to do some traffic engineering. Telxius being part of Spain's main ISP, Telefonica ES prefers magru to drmrs : See https://w.wiki/A6qH {F53575207}...
[07:55:15] (PS1) JMeybohm: Decom kubestagemaster100[12] [puppet] - https://gerrit.wikimedia.org/r/1032706 (https://phabricator.wikimedia.org/T363307)
[07:55:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P62571 and previous config saved to /var/cache/conftool/dbconfig/20240517-075558-ladsgroup.json
[08:02:33] (PS2) JMeybohm: Add kubestagemaster100[345] as master_stacked [puppet] - https://gerrit.wikimedia.org/r/1032634 (https://phabricator.wikimedia.org/T363307)
[08:02:34] (PS2) JMeybohm: Decom kubestagemaster100[12] [puppet] - https://gerrit.wikimedia.org/r/1032706 (https://phabricator.wikimedia.org/T363307)
[08:02:34] (PS1) JMeybohm: Decom kubestagetcd200[123] [puppet] - https://gerrit.wikimedia.org/r/1032707
[08:02:59] (CR) CI reject: [V:-1] Decom kubestagetcd200[123] [puppet] - https://gerrit.wikimedia.org/r/1032707 (owner: JMeybohm)
[08:04:27] SRE, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505#9807808 (akosiaris) Open→Resolved a:akosiaris >>! In T301505#9807768, @LSobanski wrote: > @akosiaris I would be inclined to close this tas...
[08:04:30] (PS2) JMeybohm: Decom kubestagetcd200[123] [puppet] - https://gerrit.wikimedia.org/r/1032707 (https://phabricator.wikimedia.org/T363307)
[08:04:33] (PS3) JMeybohm: Add kubestagemaster100[345] as master_stacked [puppet] - https://gerrit.wikimedia.org/r/1032634 (https://phabricator.wikimedia.org/T363307)
[08:04:33] (PS3) JMeybohm: Decom kubestagemaster100[12] [puppet] - https://gerrit.wikimedia.org/r/1032706 (https://phabricator.wikimedia.org/T363307)
[08:04:41] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9807815 (MoritzMuehlenhoff)
[08:06:42] (PS1) JMeybohm: Remove kubestagetcd100[123] from etcd SRV records [dns] - https://gerrit.wikimedia.org/r/1032708 (https://phabricator.wikimedia.org/T363307)
[08:07:47] (PS1) JMeybohm: Decom kubestagetcd100[123] [puppet] - https://gerrit.wikimedia.org/r/1032709 (https://phabricator.wikimedia.org/T363307)
[08:08:08] (CR) JMeybohm: [C:+2] Decom kubestagetcd200[123] [puppet] - https://gerrit.wikimedia.org/r/1032707 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm)
[08:08:32] (CR) Gehel: [C:+2] hadoop: remove outdated ref to backup cluster [cookbooks] - https://gerrit.wikimedia.org/r/1032537 (owner: Ryan Kemper)
[08:09:25] (PS1) Muehlenhoff: Apply Puppet 7 for dumper_misc_crons_only on the role level [puppet] - https://gerrit.wikimedia.org/r/1032710 (https://phabricator.wikimedia.org/T349619)
[08:10:21] (CR) Muehlenhoff: [C:-1] "Needs one more change to the underlying firewall definitions actually, one middlemanager setting is still passed in ferm-syntax" [puppet] - https://gerrit.wikimedia.org/r/1032632 (owner: Muehlenhoff)
[08:10:53] (CR) JMeybohm: [C:+2] zotero: Ensure containers have a securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1032523 (https://phabricator.wikimedia.org/T362978) (owner:
JMeybohm)
[08:11:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P62572 and previous config saved to /var/cache/conftool/dbconfig/20240517-081105-ladsgroup.json
[08:11:45] (Merged) jenkins-bot: zotero: Ensure containers have a securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1032523 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[08:13:36] (CR) JMeybohm: [C:+1] cxserver: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1030195 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[08:14:52] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[08:15:20] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[08:16:28] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply
[08:16:57] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[08:17:21] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[08:17:51] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[08:18:40] (CR) JMeybohm: [C:+1] citoid: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1030191 (https://phabricator.wikimedia.org/T346638) (owner: Scott French)
[08:23:48] (PS1) Muehlenhoff: Pass Druid middle manager ports as port range [puppet] - https://gerrit.wikimedia.org/r/1032712
[08:26:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T352010)', diff saved to https://phabricator.wikimedia.org/P62573 and previous config saved to /var/cache/conftool/dbconfig/20240517-082613-ladsgroup.json
[08:26:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[08:26:18] T352010: Gradually drop
old pagelinks columns - https://phabricator.wikimedia.org/T352010
[08:26:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[08:26:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T352010)', diff saved to https://phabricator.wikimedia.org/P62574 and previous config saved to /var/cache/conftool/dbconfig/20240517-082636-ladsgroup.json
[08:27:01] (CR) JMeybohm: [C:+1] changeprop: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1030190 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[08:27:49] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:30:19] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1031605
[08:30:19] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1031605 (owner: TrainBranchBot)
[08:31:55] (PS1) Muehlenhoff: Remove obsolete wmflabs dummy certs [labs/private] - https://gerrit.wikimedia.org/r/1032713
[08:32:36] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1032712 (owner: Muehlenhoff)
[08:32:46] (PS2) Fabfur: benthos:cache: better parsing for path and query string [puppet] - https://gerrit.wikimedia.org/r/1031818 (https://phabricator.wikimedia.org/T358109)
[08:39:05] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:41:29] (PS1) JMeybohm: [WIP] Global update of test-service-checker template [deployment-charts] - https://gerrit.wikimedia.org/r/1032714
(https://phabricator.wikimedia.org/T362978)
[08:43:55] (CR) CI reject: [V:-1] [WIP] Global update of test-service-checker template [deployment-charts] - https://gerrit.wikimedia.org/r/1032714 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[08:45:12] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) (owner: Filippo Giunchedi)
[08:45:57] (PS1) Aqu: Run Gobblin later to let time for Canary events [puppet] - https://gerrit.wikimedia.org/r/1032715 (https://phabricator.wikimedia.org/T365223)
[08:51:45] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:52:58] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1031605 (owner: TrainBranchBot)
[08:54:42] (CR) Btullis: [C:+1] "Looks good to me. Thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1031497 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[08:57:42] (PS1) Muehlenhoff: Switch dumps::generation::worker::testbed to Puppet 7 on role level [puppet] - https://gerrit.wikimedia.org/r/1032717 (https://phabricator.wikimedia.org/T349619)
[08:58:05] (CR) Filippo Giunchedi: [C:+1] P:ganeti Prometheus monitoring of ganeti noded services.
[puppet] - https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:01:05] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host snapshot1015.eqiad.wmnet
[09:01:30] SRE, Data-Engineering, Dumps-Generation, Data-Platform-SRE (2024.05.06 - 2024.05.26): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9807924 (ops-monitoring-bot) Host rebooted by btullis@cumin1002 with reason: Rebooting to pick up new kernel
[09:04:43] (CR) JMeybohm: [C:-1] "I've opened a task (https://phabricator.wikimedia.org/T365224) regarding the app.job module situation with this chart as it should really " [deployment-charts] - https://gerrit.wikimedia.org/r/1031105 (https://phabricator.wikimedia.org/T346638) (owner: Scott French)
[09:06:43] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1015.eqiad.wmnet
[09:07:18] (CR) Gehel: wdqs.data-reload: support HDFS as a source (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: DCausse)
[09:08:21] (PS1) Gerrit maintenance bot: Add dtp to langlist helper [dns] - https://gerrit.wikimedia.org/r/1032726 (https://phabricator.wikimedia.org/T365220)
[09:09:46] (CR) Arturo Borrero Gonzalez: [C:+1] "I wonder if some tools would lack the logic to reconnect to redis at all, and thus this change will break them."
[puppet] - https://gerrit.wikimedia.org/r/1029158 (https://phabricator.wikimedia.org/T363709) (owner: FNegri) [09:17:15] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host snapshot1016.eqiad.wmnet [09:17:38] SRE, Data-Engineering, Dumps-Generation, Data-Platform-SRE (2024.05.06 - 2024.05.26): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9808043 (ops-monitoring-bot) Host rebooted by btullis@cumin1002 with reason: Rebooting to pick up new kernel [09:18:36] (CR) Hnowlan: [C:+1] Add kubestagemaster100[345] as master_stacked [puppet] - https://gerrit.wikimedia.org/r/1032634 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [09:21:38] (CR) Btullis: [C:+1] "Looks good, thanks." [puppet] - https://gerrit.wikimedia.org/r/1032712 (owner: Muehlenhoff) [09:21:52] (CR) Cathal Mooney: [C:+1] "Good call!!" [homer/public] - https://gerrit.wikimedia.org/r/1032386 (https://phabricator.wikimedia.org/T362523) (owner: Ayounsi) [09:22:15] (CR) Btullis: [C:+1] Switch dumps::generation::worker::testbed to Puppet 7 on role level [puppet] - https://gerrit.wikimedia.org/r/1032717 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [09:22:52] (CR) Ladsgroup: [V:+2 C:+2] Add dtp to langlist helper [dns] - https://gerrit.wikimedia.org/r/1032726 (https://phabricator.wikimedia.org/T365220) (owner: Gerrit maintenance bot) [09:22:58] (CR) Btullis: [C:+1] "Thanks." 
[puppet] - https://gerrit.wikimedia.org/r/1032710 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [09:24:30] (CR) Hnowlan: [C:+1] Add kubestagemaster100[345] to the etcd-server SRV record [dns] - https://gerrit.wikimedia.org/r/1032633 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [09:24:52] (CR) Muehlenhoff: [C:+2] Switch dumps::generation::worker::testbed to Puppet 7 on role level [puppet] - https://gerrit.wikimedia.org/r/1032717 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [09:25:11] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1016.eqiad.wmnet [09:25:13] (PS3) JMeybohm: Add kubestagemaster100[345] to the etcd-server SRV record [dns] - https://gerrit.wikimedia.org/r/1032633 (https://phabricator.wikimedia.org/T363307) [09:25:54] (CR) Hnowlan: Remove kubestagetcd100[123] from etcd SRV records (1 comment) [dns] - https://gerrit.wikimedia.org/r/1032708 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [09:26:24] SRE, Infrastructure-Foundations, netops: magru network setup - https://phabricator.wikimedia.org/T362421#9808055 (cmooney) +1 sounds like a good idea. Nice we have some limited scope to experiment with the DoH ranges before pulling the plug on ns2. FWIW I think these would be the ones to use with E... 
[09:26:53] (CR) JMeybohm: [C:+2] Add kubestagemaster100[345] to the etcd-server SRV record [dns] - https://gerrit.wikimedia.org/r/1032633 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [09:27:06] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9808056 (MoritzMuehlenhoff) [09:27:32] (CR) Brouberol: Move datahub and datahub-staging helfile deployments to dse-k8s (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300) (owner: Stevemunene) [09:28:50] (PS5) Ilias Sarantopoulos: ml-services: increase revscoring-reverted replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1032517 (https://phabricator.wikimedia.org/T362503) [09:29:35] (CR) AikoChou: [C:+1] ml-services: increase revscoring-reverted replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1032517 (https://phabricator.wikimedia.org/T362503) (owner: Ilias Sarantopoulos) [09:32:42] (CR) JMeybohm: [C:+2] Add kubestagemaster100[345] as master_stacked [puppet] - https://gerrit.wikimedia.org/r/1032634 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [09:35:02] (CR) Ilias Sarantopoulos: [C:+2] ml-services: increase revscoring-reverted replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1032517 (https://phabricator.wikimedia.org/T362503) (owner: Ilias Sarantopoulos) [09:35:52] (Merged) jenkins-bot: ml-services: increase revscoring-reverted replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1032517 (https://phabricator.wikimedia.org/T362503) (owner: Ilias Sarantopoulos) [09:39:20] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[09:40:15] (CR) Zabe: "Have deployed the secret to PrivateSettings.php: https://sal.toolforge.org/log/Yaeyg48BhuQtenzvXGqt" [mediawiki-config] - https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) (owner: Zabe) [09:44:37] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [09:44:40] (CR) EoghanGaffney: [C:+1] lists: move definition of primary and standby host to common hiera (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1032032 (owner: Dzahn) [09:46:45] FIRING: [3x] SystemdUnitFailed: docker.service on kubestagemaster1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:47:34] (PS1) Ilias Sarantopoulos: Add new version for amd-pytorch image (torch 2.3.0 - rocm 6.0) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1032725 (https://phabricator.wikimedia.org/T365166) [09:49:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestagemaster1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestagemaster1003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:49:47] (PS1) JMeybohm: Remove kubestagemaster100[45] from server SRV record [dns] - https://gerrit.wikimedia.org/r/1032746 (https://phabricator.wikimedia.org/T363307) [09:51:12] (CR) JMeybohm: [C:+2] Remove kubestagemaster100[45] from server SRV record [dns] - https://gerrit.wikimedia.org/r/1032746 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [09:51:45] FIRING: [2x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:51:50] FIRING: [4x] SystemdUnitFailed: docker.service on kubestagemaster1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:52:25] SRE, SRE-Access-Requests: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053#9808091 (Tchanders) Hi @Dzahn, my SSH key is being rejected with a `Permission denied (publickey)` error, as of the last few weeks. I've checked locally that my key... [09:54:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestagemaster1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestagemaster1003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:54:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestagemaster1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestagemaster1003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:56:45] FIRING: [4x] SystemdUnitFailed: docker.service on kubestagemaster1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:58:55] FIRING: [4x] SystemdUnitFailed: docker.service on kubestagemaster1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:59:40] RESOLVED: KubernetesRsyslogDown: 
rsyslog on kubestagemaster1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestagemaster1003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:00:27] (CR) Cathal Mooney: [C:+1] homer: comments-only change: specify 198.35.27.0/24 as ns2 [homer/public] - https://gerrit.wikimedia.org/r/1032522 (owner: Ssingh) [10:01:45] FIRING: [4x] SystemdUnitFailed: docker.service on kubestagemaster1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:01:45] RESOLVED: [2x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:04:57] FIRING: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:05:07] (PS1) Ayounsi: magru: 3x prepending for Anycast prefixes to Telxius [homer/public] - https://gerrit.wikimedia.org/r/1032747 (https://phabricator.wikimedia.org/T362421) [10:06:45] RESOLVED: [4x] SystemdUnitFailed: docker.service on kubestagemaster1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:06:45] FIRING: [2x] SystemdUnitFailed: 
wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:07:04] (CR) Cathal Mooney: [C:+1] "LGTM!" [homer/public] - https://gerrit.wikimedia.org/r/1032747 (https://phabricator.wikimedia.org/T362421) (owner: Ayounsi) [10:07:28] (CR) Ayounsi: [C:+2] magru: 3x prepending for Anycast prefixes to Telxius [homer/public] - https://gerrit.wikimedia.org/r/1032747 (https://phabricator.wikimedia.org/T362421) (owner: Ayounsi) [10:07:59] (Merged) jenkins-bot: magru: 3x prepending for Anycast prefixes to Telxius [homer/public] - https://gerrit.wikimedia.org/r/1032747 (https://phabricator.wikimedia.org/T362421) (owner: Ayounsi) [10:08:32] SRE, Infrastructure-Foundations, Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9808106 (cmooney) >>! In T359054#9807307, @CDanis wrote: > Adding the 3rd transit link in magru **greatly** improved the latency for m... 
[10:11:04] (PS1) JMeybohm: Add kubestagemaster1004 to the etcd-server SRV record [dns] - https://gerrit.wikimedia.org/r/1032748 (https://phabricator.wikimedia.org/T363307) [10:11:06] (PS1) JMeybohm: Add kubestagemaster1005 to the etcd-server SRV record [dns] - https://gerrit.wikimedia.org/r/1032749 (https://phabricator.wikimedia.org/T363307) [10:11:45] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:18:55] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:24:32] (CR) JMeybohm: [C:+2] Add kubestagemaster1004 to the etcd-server SRV record [dns] - https://gerrit.wikimedia.org/r/1032748 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [10:24:57] RESOLVED: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:31:11] (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2501/console" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: AOkoth) [10:32:51] ops-codfw, SRE, Infrastructure-Foundations, netops: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9808187 
(cmooney) Pcap of DHCP request from contint2002 here: {F53586857} [10:33:55] FIRING: [2x] SystemdUnitFailed: docker.service on kubestagemaster1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:35:19] SRE, Infrastructure-Foundations, netops: magru network setup - https://phabricator.wikimedia.org/T362421#9808194 (ayounsi) Cogent is a bit surprising, from EU or the US they route to magru. `lines=15 Fri May 17 10:29:23.898 UTC BGP routing table entry for 185.71.138.0/24 Versions: Process... [10:36:45] RESOLVED: [2x] SystemdUnitFailed: docker.service on kubestagemaster1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:37:06] (CR) JMeybohm: [C:+2] Add kubestagemaster1005 to the etcd-server SRV record [dns] - https://gerrit.wikimedia.org/r/1032749 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [10:42:19] ops-codfw, SRE, Infrastructure-Foundations, netops: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9808229 (cmooney) One observation is that the NAK's are unique in so far as they are sent from 208.80.153.33 (Switch IRB int IP) to 255.255.25... 
[10:42:27] FIRING: [2x] KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:45:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T352010)', diff saved to https://phabricator.wikimedia.org/P62575 and previous config saved to /var/cache/conftool/dbconfig/20240517-104553-ladsgroup.json [10:45:59] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [10:46:45] FIRING: [4x] SystemdUnitFailed: docker.service on kubestagemaster1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:32] ops-codfw, SRE, Infrastructure-Foundations, netops: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9808246 (cmooney) Also I didn't see in the dhcpd docs and way to constrain the generation of NAKs in response to invalid REQUEST messages. [... 
[10:49:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestagemaster1005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestagemaster1005 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [10:51:45] RESOLVED: [4x] SystemdUnitFailed: docker.service on kubestagemaster1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:52:52] (PS2) JMeybohm: Remove kubestagetcd100[123] from etcd SRV records [dns] - https://gerrit.wikimedia.org/r/1032708 (https://phabricator.wikimedia.org/T363307) [10:53:29] (PS4) JMeybohm: Decom kubestagemaster100[12] [puppet] - https://gerrit.wikimedia.org/r/1032706 (https://phabricator.wikimedia.org/T363307) [10:53:38] (PS2) JMeybohm: Decom kubestagetcd100[123] [puppet] - https://gerrit.wikimedia.org/r/1032709 (https://phabricator.wikimedia.org/T363307) [10:54:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestagemaster1005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestagemaster1005 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:57:27] FIRING: [2x] KubernetesCalicoDown: kubestagemaster1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:58:04] (PS3) TChin: datasets-config: Remove service-runner config and update default config [deployment-charts] - 
https://gerrit.wikimedia.org/r/1032491 (https://phabricator.wikimedia.org/T357434) [10:58:49] (CR) CI reject: [V:-1] datasets-config: Remove service-runner config and update default config [deployment-charts] - https://gerrit.wikimedia.org/r/1032491 (https://phabricator.wikimedia.org/T357434) (owner: TChin) [10:59:21] (PS4) TChin: datasets-config: Remove service-runner config and update default config [deployment-charts] - https://gerrit.wikimedia.org/r/1032491 (https://phabricator.wikimedia.org/T357434) [10:59:46] SRE, Infrastructure-Foundations, netops: magru network setup - https://phabricator.wikimedia.org/T362421#9808254 (cmooney) >>! In T362421#9808194, @ayounsi wrote: > They might prefer going through EdgeUno once we add the prepending to Novvacore, so the same change would be needed there as well. It's... [11:00:00] (CR) CI reject: [V:-1] datasets-config: Remove service-runner config and update default config [deployment-charts] - https://gerrit.wikimedia.org/r/1032491 (https://phabricator.wikimedia.org/T357434) (owner: TChin) [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240517T0700) [11:00:05] eoghan, jelto, arnoldokoth, and mutante: Time to do the GitLab version upgrades deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240517T1100). 
[11:01:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P62576 and previous config saved to /var/cache/conftool/dbconfig/20240517-110101-ladsgroup.json [11:01:45] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:01:45] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:02:27] FIRING: [2x] KubernetesCalicoDown: kubestagemaster1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:07:27] RESOLVED: [2x] KubernetesCalicoDown: kubestagemaster1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:08:46] !log jayme@cumin1002 conftool action : set/pooled=yes:weight=10; selector: name=kubestagemaster100[3-5].eqiad.wmnet [11:10:02] (PS5) TChin: datasets-config: Remove service-runner config and update default config [deployment-charts] - https://gerrit.wikimedia.org/r/1032491 (https://phabricator.wikimedia.org/T357434) [11:16:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P62577 and previous config saved to /var/cache/conftool/dbconfig/20240517-111611-ladsgroup.json [11:18:16] (CR) AOkoth: prometheus: puppetise sql_exporter (4 comments) [puppet] - https://gerrit.wikimedia.org/r/945872 
(https://phabricator.wikimedia.org/T310822) (owner: AOkoth) [11:18:23] (PS4) Stevemunene: Move datahub and datahub-staging helfile deployments to dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300) [11:18:44] (CR) Alexandros Kosiaris: [C:-1] datasets-config: Remove service-runner config and update default config (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1032491 (https://phabricator.wikimedia.org/T357434) (owner: TChin) [11:19:55] (PS2) Muehlenhoff: Apply Puppet 7 for dumper_misc_crons_only on the role level [puppet] - https://gerrit.wikimedia.org/r/1032710 (https://phabricator.wikimedia.org/T349619) [11:20:23] ops-codfw, SRE, DBA: Degraded RAID on es2022 - https://phabricator.wikimedia.org/T365213#9808275 (ABran-WMF) Open→In progress I've checked on Netbox and that server is older than 3yo! @wiki_willy can we still get a disk replacement? [11:21:07] (CR) Muehlenhoff: [C:+2] Apply Puppet 7 for dumper_misc_crons_only on the role level [puppet] - https://gerrit.wikimedia.org/r/1032710 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [11:21:25] ops-codfw, SRE, DBA: Degraded RAID on es2022 - https://phabricator.wikimedia.org/T365213#9808278 (ABran-WMF) p:Triage→Medium [11:21:57] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9808280 (MoritzMuehlenhoff) [11:26:21] (PS16) EoghanGaffney: lists: Add lists role to list2001 [puppet] - https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) [11:26:41] (CR) CI reject: [V:-1] lists: Add lists role to list2001 [puppet] - https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: EoghanGaffney) [11:28:02] (PS17) EoghanGaffney: lists: Add lists role to list2001 [puppet] - 
https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) [11:31:07] (PS5) Stevemunene: Move datahub and datahub-staging helfile deployments to dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300) [11:31:14] (CR) CI reject: [V:-1] lists: Add lists role to list2001 [puppet] - https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: EoghanGaffney) [11:31:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T352010)', diff saved to https://phabricator.wikimedia.org/P62578 and previous config saved to /var/cache/conftool/dbconfig/20240517-113119-ladsgroup.json [11:31:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [11:31:23] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:31:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [11:31:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T352010)', diff saved to https://phabricator.wikimedia.org/P62579 and previous config saved to /var/cache/conftool/dbconfig/20240517-113142-ladsgroup.json [11:33:10] (PS6) TChin: datasets-config: Remove service-runner config and update default config [deployment-charts] - https://gerrit.wikimedia.org/r/1032491 (https://phabricator.wikimedia.org/T357434) [11:33:16] (CR) TChin: datasets-config: Remove service-runner config and update default config (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1032491 (https://phabricator.wikimedia.org/T357434) (owner: TChin) [11:34:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1032730 [11:34:17] (CR) TrainBranchBot: 
[C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1032730 (owner: TrainBranchBot) [11:34:54] (CR) Alexandros Kosiaris: [C:+1] datasets-config: Remove service-runner config and update default config [deployment-charts] - https://gerrit.wikimedia.org/r/1032491 (https://phabricator.wikimedia.org/T357434) (owner: TChin) [11:39:46] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ldap-replica2007.wikimedia.org [11:44:35] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:45:20] (PS18) EoghanGaffney: lists: Add lists role to list2001 [puppet] - https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) [11:47:11] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ldap-replica2007.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:48:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ldap-replica2007.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:48:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:48:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ldap-replica2007.wikimedia.org [11:48:48] SRE, Infrastructure-Foundations, LDAP, Patch-For-Review: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699#9808346 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ldap-replica2007.wikimedia.org` - ldap-... 
[11:51:16] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on kubestagemaster[1001-1002].eqiad.wmnet with reason: decom [11:51:33] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on kubestagemaster[1001-1002].eqiad.wmnet with reason: decom [11:51:53] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ldap-replica2008.wikimedia.org [11:53:11] !log jayme@cumin1002 conftool action : set/pooled=inactive; selector: name=kubestagemaster100[12].eqiad.wmnet [11:55:16] (CR) Hnowlan: [C:+1] Decom kubestagemaster100[12] [puppet] - https://gerrit.wikimedia.org/r/1032706 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [11:55:38] (CR) Hnowlan: [C:+1] Decom kubestagetcd100[123] [puppet] - https://gerrit.wikimedia.org/r/1032709 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [11:56:31] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:56:34] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1032730 (owner: TrainBranchBot) [11:57:55] (CR) TChin: [C:+2] datasets-config: Remove service-runner config and update default config [deployment-charts] - https://gerrit.wikimedia.org/r/1032491 (https://phabricator.wikimedia.org/T357434) (owner: TChin) [11:58:47] (Merged) jenkins-bot: datasets-config: Remove service-runner config and update default config [deployment-charts] - https://gerrit.wikimedia.org/r/1032491 (https://phabricator.wikimedia.org/T357434) (owner: TChin) [12:01:57] (PS3) JMeybohm: Remove kubestagetcd100[123] from etcd SRV records [dns] - https://gerrit.wikimedia.org/r/1032708 (https://phabricator.wikimedia.org/T363307) [12:02:46] !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubestagemaster[1001-1002].eqiad.wmnet [12:05:34] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate 
netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ldap-replica2008.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:07:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ldap-replica2008.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:07:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:07:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ldap-replica2008.wikimedia.org [12:07:11] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply [12:07:17] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config: apply [12:07:24] SRE, Infrastructure-Foundations, LDAP, Patch-For-Review: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699#9808359 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ldap-replica2008.wikimedia.org` - ldap-... 
[12:08:43] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply [12:09:04] (CR) JMeybohm: [C:+2] Decom kubestagemaster100[12] [puppet] - https://gerrit.wikimedia.org/r/1032706 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [12:09:07] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [12:11:08] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply [12:11:41] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ldap-replica1005.wikimedia.org [12:11:45] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:11:46] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagemaster[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [12:12:33] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagemaster[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [12:12:33] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:12:34] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubestagemaster[1001-1002].eqiad.wmnet [12:13:57] (CR) EoghanGaffney: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2502/co" [puppet] - https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: EoghanGaffney) [12:16:45] !log jmm@cumin2002 START - Cookbook 
sre.dns.netbox [12:18:56] (PS3) JMeybohm: Decom kubestagetcd100[123] [puppet] - https://gerrit.wikimedia.org/r/1032709 (https://phabricator.wikimedia.org/T363307) [12:18:56] (PS1) JMeybohm: cumin/aliases: Remove role kubernetes::staging::master [puppet] - https://gerrit.wikimedia.org/r/1032763 (https://phabricator.wikimedia.org/T363307) [12:21:45] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:21:59] (CR) Elukey: [C:+2] profile::amd_gpu: refactor configurations for k8s nodes [puppet] - https://gerrit.wikimedia.org/r/1032506 (https://phabricator.wikimedia.org/T363191) (owner: Elukey) [12:23:08] (PS1) Kamila Součková: recommendation-api: add securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1032764 (https://phabricator.wikimedia.org/T362978) [12:24:27] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on kubestagetcd[1004-1006].eqiad.wmnet with reason: decom [12:24:43] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on kubestagetcd[1004-1006].eqiad.wmnet with reason: decom [12:25:50] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ldap-replica1005.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:27:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ldap-replica1005.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:27:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:27:13] !log jmm@cumin2002 END (PASS) - Cookbook
sre.hosts.decommission (exit_code=0) for hosts ldap-replica1005.wikimedia.org [12:27:23] SRE, Infrastructure-Foundations, LDAP, Patch-For-Review: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699#9808397 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ldap-replica1005.wikimedia.org` - ldap-... [12:28:29] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ldap-replica1006.wikimedia.org [12:28:45] (PS1) Elukey: Skip ROCm packages for ml-staging2001 [puppet] - https://gerrit.wikimedia.org/r/1032765 (https://phabricator.wikimedia.org/T363191) [12:30:27] (CR) Hnowlan: [C:+1] cumin/aliases: Remove role kubernetes::staging::master [puppet] - https://gerrit.wikimedia.org/r/1032763 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [12:30:37] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2503/co" [puppet] - https://gerrit.wikimedia.org/r/1032765 (https://phabricator.wikimedia.org/T363191) (owner: Elukey) [12:32:50] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply [12:33:18] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:33:47] (CR) JMeybohm: [C:+2] Remove kubestagetcd100[123] from etcd SRV records [dns] - https://gerrit.wikimedia.org/r/1032708 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [12:33:58] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config: apply [12:34:03] (CR) JMeybohm: [C:+2] Remove kubestagetcd100[123] from etcd SRV records (1 comment) [dns] - https://gerrit.wikimedia.org/r/1032708 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [12:35:24] !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts
kubestagetcd[1004-1006].eqiad.wmnet [12:36:31] (PS1) Muehlenhoff: zookeeper/test: Switch the Zookeeper firewall settings to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1032771 [12:36:49] (PS1) Brouberol: global_config: register IP/port for the datahubsearch opensearch cluster [puppet] - https://gerrit.wikimedia.org/r/1032772 (https://phabricator.wikimedia.org/T331894) [12:38:01] (PS2) Kevin Bazira: ml-services: update logo-detection image [deployment-charts] - https://gerrit.wikimedia.org/r/1021402 (https://phabricator.wikimedia.org/T362749) [12:38:43] (PS1) Muehlenhoff: zk/flink: Switch the Zookeeper firewall settings to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1032773 [12:38:59] (PS2) Brouberol: global_config: register IP/port for the datahubsearch opensearch cluster [puppet] - https://gerrit.wikimedia.org/r/1032772 (https://phabricator.wikimedia.org/T331894) [12:39:16] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ldap-replica1006.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:39:22] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1032771 (owner: Muehlenhoff) [12:39:51] (CR) Filippo Giunchedi: [V:+1 C:-1] prometheus: puppetise sql_exporter (4 comments) [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: AOkoth) [12:40:15] (CR) Phuedx: [C:+1] Introduce sample overrides to web_ui_actions [mediawiki-config] - https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: Kimberly Sarabia) [12:40:57] (CR) Ilias Sarantopoulos: [C:+1] ml-services: update logo-detection image [deployment-charts] - https://gerrit.wikimedia.org/r/1021402 (https://phabricator.wikimedia.org/T362749) (owner: Kevin Bazira) [12:41:58] (CR) Kevin
Bazira: [C:+2] ml-services: update logo-detection image [deployment-charts] - https://gerrit.wikimedia.org/r/1021402 (https://phabricator.wikimedia.org/T362749) (owner: Kevin Bazira) [12:42:39] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [12:42:59] (Merged) jenkins-bot: ml-services: update logo-detection image [deployment-charts] - https://gerrit.wikimedia.org/r/1021402 (https://phabricator.wikimedia.org/T362749) (owner: Kevin Bazira) [12:43:06] (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2504/co" [puppet] - https://gerrit.wikimedia.org/r/1032772 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol) [12:46:18] (CR) Filippo Giunchedi: [V:+1 C:-1] prometheus: puppetise sql_exporter (2 comments) [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: AOkoth) [12:46:38] !log kevinbazira@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:46:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ldap-replica1006.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:46:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:46:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ldap-replica1006.wikimedia.org [12:46:58] SRE, Infrastructure-Foundations, LDAP, Patch-For-Review: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699#9808451 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ldap-replica1006.wikimedia.org` - ldap-...
[12:51:45] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:53:47] (CR) Brouberol: [C:+1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1032771 (owner: Muehlenhoff) [12:54:43] (CR) Hashar: Allow users to recheck tests in checkers (4 comments) [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: Paladox) [12:54:58] (PS24) Hashar: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: Paladox) [12:55:57] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagetcd[1004-1006].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [12:56:50] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagetcd[1004-1006].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [12:56:50] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:56:51] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubestagetcd[1004-1006].eqiad.wmnet [12:57:03] (CR) Jforrester: "Let's do I98c6df162f20556fb1c31a64f55ab9a47d072cd9 first before the later update.
(Also, I try to spell out exactly what commits are going" [deployment-charts] - https://gerrit.wikimedia.org/r/1032579 (owner: Cory Massaro) [12:58:06] (CR) JMeybohm: [C:+2] Decom kubestagetcd100[123] [puppet] - https://gerrit.wikimedia.org/r/1032709 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [13:03:54] (PS2) JMeybohm: Remove remaining occurrences of kubernetes::staging::master role [puppet] - https://gerrit.wikimedia.org/r/1032763 (https://phabricator.wikimedia.org/T363307) [13:07:16] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1032773 (owner: Muehlenhoff) [13:09:00] (PS1) Muehlenhoff: druid: Switch the Zookeeper firewall settings to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1032775 [13:10:43] (PS1) Muehlenhoff: an-druid: Switch the Zookeeper firewall settings to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1032776 [13:11:49] (PS1) Muehlenhoff: an-conf: Switch the Zookeeper firewall settings to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1032778 [13:11:50] (CR) Hashar: Allow users to recheck tests in checkers (5 comments) [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: Paladox) [13:13:55] (PS1) Alexandros Kosiaris: mobileapps: Use mesh modules version enabling IPv6 [deployment-charts] - https://gerrit.wikimedia.org/r/1032779 (https://phabricator.wikimedia.org/T255568) [13:14:31] (PS2) Alexandros Kosiaris: mobileapps: Use mesh modules version enabling IPv6 [deployment-charts] - https://gerrit.wikimedia.org/r/1032779 (https://phabricator.wikimedia.org/T255568) [13:14:31] (PS1) Muehlenhoff: conf: Switch the Zookeeper firewall settings to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1032780 [13:15:13] (PS3) Alexandros Kosiaris: mobileapps: Use mesh modules version enabling IPv6
[deployment-charts] - https://gerrit.wikimedia.org/r/1032779 (https://phabricator.wikimedia.org/T255568) [13:15:31] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1032775 (owner: Muehlenhoff) [13:17:13] SRE, Infrastructure-Foundations, netops: magru network setup - https://phabricator.wikimedia.org/T362421#9808627 (ayounsi) The Telxius community doesn't seem to be of any effect so far, I'll wait for their reply, maybe they changed or need to be enabled on their side first. I'll look at the other pro... [13:18:14] (PS4) Alexandros Kosiaris: mobileapps: Use mesh modules version enabling IPv6 [deployment-charts] - https://gerrit.wikimedia.org/r/1032779 (https://phabricator.wikimedia.org/T255568) [13:21:01] (PS2) Muehlenhoff: druid: Switch the Zookeeper firewall settings to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1032775 [13:21:17] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1032776 (owner: Muehlenhoff) [13:21:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T352010)', diff saved to https://phabricator.wikimedia.org/P62582 and previous config saved to /var/cache/conftool/dbconfig/20240517-132122-ladsgroup.json [13:21:27] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:21:44] sre-alert-triage, Data-Platform-SRE (2024.05.06 - 2024.05.26): Alert in need of triage: PybalBackendDown (instance elastic2090:0) - https://phabricator.wikimedia.org/T364528#9808639 (bking) a:bking [13:22:15] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [13:22:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [13:23:33] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of
db2114.codfw.wmnet onto db1125.eqiad.wmnet [13:23:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [13:23:59] (PS25) Hashar: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: Paladox) [13:24:28] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [13:24:56] SRE, Infrastructure-Foundations, netops: Support Anycast GW on EVPN switches without unique IP - https://phabricator.wikimedia.org/T350579#9808649 (cmooney) Just a note on this, I only discovered this document after the task: https://www.juniper.net/documentation/us/en/software/nce/nce-216-evpn-... [13:25:49] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [13:26:43] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [13:27:27] (CR) JMeybohm: [C:-1] mobileapps: Use mesh modules version enabling IPv6 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1032779 (https://phabricator.wikimedia.org/T255568) (owner: Alexandros Kosiaris) [13:29:12] SRE, Infrastructure-Foundations, Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9808679 (CDanis) The 3rd transit was also of great help to Chile, and probably Peru (although sample size there is a bit small). {F53...
[13:33:27] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1032778 (owner: Muehlenhoff) [13:36:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P62583 and previous config saved to /var/cache/conftool/dbconfig/20240517-133630-ladsgroup.json [13:37:01] (CR) Brouberol: Move datahub and datahub-staging helfile deployments to dse-k8s (6 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300) (owner: Stevemunene) [13:38:09] ops-codfw, SRE: Degraded RAID on backup2010 - https://phabricator.wikimedia.org/T365217#9808724 (Jhancock.wm) I found this in the idrac log. 2024-05-17 05:07:52 PDR10 Disk 2 on Integrated RAID Controller 1 rebuild has started. 2024-05-17 05:07:52 PDR8 Disk 2 in Backplane 2 of Integrated RAID C... [13:38:11] (CR) JMeybohm: [C:-1] "Interesting twist here:" [deployment-charts] - https://gerrit.wikimedia.org/r/1032524 (https://phabricator.wikimedia.org/T362978) (owner: RLazarus) [13:41:28] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1032780 (owner: Muehlenhoff) [13:42:00] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1032775 (owner: Muehlenhoff) [13:42:33] (PS1) Bking: elasticsearch: add elastic2090 to correct pybal pool [puppet] - https://gerrit.wikimedia.org/r/1032784 (https://phabricator.wikimedia.org/T364528) [13:43:59] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1032784 (https://phabricator.wikimedia.org/T364528) (owner: Bking) [13:44:04] (CR) JMeybohm: [C:-1] recommendation-api: add securityContext (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1032764 (https://phabricator.wikimedia.org/T362978) (owner: Kamila Součková) [13:47:02] (CR) JMeybohm: "> 2.
My understanding of the `subjectAltNames` diffs on the `DestinationRule`s is that these aren't really necessary, as the cert-manager " [deployment-charts] - https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978) (owner: Clément Goubert) [13:49:10] (PS2) Muehlenhoff: Add a new function to return the wiki PHP version currently in use [puppet] - https://gerrit.wikimedia.org/r/1029900 [13:49:19] (CR) Muehlenhoff: Add a new function to return the wiki PHP version currently in use (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1029900 (owner: Muehlenhoff) [13:51:35] (CR) CDanis: [C:+2] Service mesh: rename local_service cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1030221 (https://phabricator.wikimedia.org/T363407) (owner: CDanis) [13:51:38] (CR) CDanis: [C:+2] Service mesh: rename local_service cluster (copy patch) [deployment-charts] - https://gerrit.wikimedia.org/r/1030220 (owner: CDanis) [13:51:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P62584 and previous config saved to /var/cache/conftool/dbconfig/20240517-135138-ladsgroup.json [13:51:51] (PS2) Ilias Sarantopoulos: Add new version for amd-pytorch image (torch 2.3.0 - rocm 6.0) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1032725 (https://phabricator.wikimedia.org/T365166) [13:51:51] (CR) Ilias Sarantopoulos: "New image is 15.9GB (5GB larger than the previous version...)" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1032725 (https://phabricator.wikimedia.org/T365166) (owner: Ilias Sarantopoulos) [13:52:28] (Merged) jenkins-bot: Service mesh: rename local_service cluster (copy patch) [deployment-charts] - https://gerrit.wikimedia.org/r/1030220 (owner: CDanis) [13:52:33] (Merged) jenkins-bot: Service mesh: rename local_service cluster [deployment-charts] -
https://gerrit.wikimedia.org/r/1030221 (https://phabricator.wikimedia.org/T363407) (owner: CDanis) [13:52:52] (CR) Elukey: "Sigh :(" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1032725 (https://phabricator.wikimedia.org/T365166) (owner: Ilias Sarantopoulos) [13:53:32] (PS2) Volans: NEL: add alert by country [alerts] - https://gerrit.wikimedia.org/r/902316 (https://phabricator.wikimedia.org/T328941) [13:53:56] (CR) Elukey: "Also, to triple check - are the ROCm libs in place? No nvidia garbage?" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1032725 (https://phabricator.wikimedia.org/T365166) (owner: Ilias Sarantopoulos) [13:54:06] (PS1) Slyngshede: P:idm Use account login page for monitoring. [puppet] - https://gerrit.wikimedia.org/r/1032789 [13:56:07] (PS3) Muehlenhoff: druid: Switch the Zookeeper firewall settings to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1032775 [13:57:01] (CR) DCausse: [C:+1] elasticsearch: add elastic2090 to correct pybal pool [puppet] - https://gerrit.wikimedia.org/r/1032784 (https://phabricator.wikimedia.org/T364528) (owner: Bking) [13:58:51] (CR) CDanis: [C:+2] NEL: add alert by country [alerts] - https://gerrit.wikimedia.org/r/902316 (https://phabricator.wikimedia.org/T328941) (owner: Volans) [14:00:15] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1032789 (owner: Slyngshede) [14:00:21] (Merged) jenkins-bot: NEL: add alert by country [alerts] - https://gerrit.wikimedia.org/r/902316 (https://phabricator.wikimedia.org/T328941) (owner: Volans) [14:01:19] (CR) Bking: [C:+2] elasticsearch: add elastic2090 to correct pybal pool [puppet] - https://gerrit.wikimedia.org/r/1032784 (https://phabricator.wikimedia.org/T364528) (owner: Bking) [14:03:20] (PS1) Cathal Mooney: Drop NAK outbound from IRB interface with EVPN Anycast IRB [homer/public] -
https://gerrit.wikimedia.org/r/1032791 (https://phabricator.wikimedia.org/T365204) [14:03:29] (CR) JMeybohm: [C:+1] push-notifications: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1032519 (https://phabricator.wikimedia.org/T362978) (owner: Scott French) [14:03:54] (CR) Clément Goubert: [C:+1] Remove remaining occurrences of kubernetes::staging::master role [puppet] - https://gerrit.wikimedia.org/r/1032763 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [14:04:45] ops-codfw, SRE, DBA: Degraded RAID on es2022 - https://phabricator.wikimedia.org/T365213#9808829 (Jhancock.wm) @ABran-WMF I tried to check the warranty status on this server on Dell's site but that function is not working at the moment. they are having technical difficulties. I do not have any 2TB dr... [14:04:50] (CR) JMeybohm: [C:+2] Remove remaining occurrences of kubernetes::staging::master role [puppet] - https://gerrit.wikimedia.org/r/1032763 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [14:05:36] (PS2) Cathal Mooney: Drop NAK outbound from IRB interface with EVPN Anycast IRB [homer/public] - https://gerrit.wikimedia.org/r/1032791 (https://phabricator.wikimedia.org/T365204) [14:06:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T352010)', diff saved to https://phabricator.wikimedia.org/P62585 and previous config saved to /var/cache/conftool/dbconfig/20240517-140648-ladsgroup.json [14:06:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [14:06:53] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:07:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [14:08:02] (PS3) Cathal Mooney: Drop NAK outbound from IRB
interface with EVPN Anycast IRB [homer/public] - https://gerrit.wikimedia.org/r/1032791 (https://phabricator.wikimedia.org/T365204) [14:11:45] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:17:49] (CR) JHathaway: [C:+2] postfix: prometheus ops config for mx-out boxes [puppet] - https://gerrit.wikimedia.org/r/1019116 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway) [14:20:38] (PS6) Stevemunene: Move datahub and datahub-staging helfile deployments to dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300) [14:21:32] (CR) CI reject: [V:-1] Move datahub and datahub-staging helfile deployments to dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300) (owner: Stevemunene) [14:21:45] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:23:51] (CR) Filippo Giunchedi: [C:+1] P:idm Use account login page for monitoring.
[puppet] - https://gerrit.wikimedia.org/r/1032789 (owner: Slyngshede) [14:26:45] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:26:52] (PS1) KartikMistry: Fix the mobile experience for a second group of Wikipedias where CX is in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1032793 (https://phabricator.wikimedia.org/T361597) [14:33:00] (CR) JMeybohm: [C:+1] aqs-http-gateway: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1031497 (https://phabricator.wikimedia.org/T362978) (owner: Scott French) [14:36:45] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:48] SRE, LDAP-Access-Requests: LDAP access to the wmf group for Dennis Mburugu - https://phabricator.wikimedia.org/T364320#9808929 (DMburugu) Thanks.
I can access both of them now [14:45:53] (PS7) Stevemunene: Move datahub and datahub-staging helfile deployments to dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300) [14:46:15] (CR) Dzahn: [C:+2] "thanks for review and compiling it:)" [puppet] - https://gerrit.wikimedia.org/r/1032032 (owner: Dzahn) [14:49:41] (PS3) Dzahn: lists: move definition of primary and standby host to common hiera [puppet] - https://gerrit.wikimedia.org/r/1032032 [14:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [14:51:34] SRE, SRE-Access-Requests: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053#9808975 (Dzahn) Resolved→Open [14:51:54] SRE, LDAP-Access-Requests: LDAP access to the wmf group for Dennis Mburugu - https://phabricator.wikimedia.org/T364320#9808977 (Dzahn) Great! Thanks for confirming it. [14:54:25] (CR) JMeybohm: "Thanks for working on this!"
[puppet] - https://gerrit.wikimedia.org/r/1031602 (https://phabricator.wikimedia.org/T290020) (owner: Cwhite) [14:54:35] (CR) Dzahn: [C:+2] lists: move definition of primary and standby host to common hiera [puppet] - https://gerrit.wikimedia.org/r/1032032 (owner: Dzahn) [14:56:58] (CR) Dzahn: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1029900 (owner: Muehlenhoff) [14:57:16] (CR) Dzahn: [C:+2] "noop in production confirmed" [puppet] - https://gerrit.wikimedia.org/r/1032032 (owner: Dzahn) [14:58:55] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:47] (PS1) C. Scott Ananian: Fix serialization errors in PageBundle extensiondata [core] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1032807 (https://phabricator.wikimedia.org/T365036) [15:06:40] (CR) Herron: [C:+1] profile::kafka::broker: Drop support for non PKI configs [puppet] - https://gerrit.wikimedia.org/r/1031813 (owner: Muehlenhoff) [15:10:30] SRE, Cassandra, serviceops, Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9809043 (Eevans) >>! In T364921#9807379, @Scott_French wrote: > Many thanks for getting the image builds running and settin... [15:12:24] (CR) Cwhite: "Fields not handled are dropped and a record of their removal is placed in `normalized.dropped.no_such_field`.
Caveat: these indicators ar" [puppet] - https://gerrit.wikimedia.org/r/1031602 (https://phabricator.wikimedia.org/T290020) (owner: Cwhite) [15:14:57] (CR) Elukey: [C:+2] revscoring-draftquality: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018993 (https://phabricator.wikimedia.org/T362316) (owner: Clément Goubert) [15:19:31] (PS1) Hashar: Remove broken deploy.sh script [deployment-charts] - https://gerrit.wikimedia.org/r/1032802 (https://phabricator.wikimedia.org/T305033) [15:20:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye [15:21:01] SRE, Continuous-Integration-Infrastructure, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): The python-build images regenerate wheels even when matching ones are already available - https://phabricator.wikimedia.org/T259611#9809117 (hashar) Open→Resolved [15:21:06] ops-codfw, SRE, Infrastructure-Foundations, netops, Patch-For-Review: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9809120 (cmooney) Re-reading the man page for dhcpd.conf it seems that pontentially changing the 'authoritative' stateme... [15:21:07] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9809121 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye [15:21:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [15:22:25] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [15:26:43] (CR) Ayounsi: [C:+1] "LGTM ! I agree it's an improvment."
[homer/public] - https://gerrit.wikimedia.org/r/1032505 (https://phabricator.wikimedia.org/T365169) (owner: Cathal Mooney) [15:27:09] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1032459 (owner: L10n-bot) [15:34:18] (PS1) Cwhite: add user.extra field [software/ecs] - https://gerrit.wikimedia.org/r/1032732 (https://phabricator.wikimedia.org/T290020) [15:34:55] (CR) Scott French: [C:+1] "Thanks, Moritz! I see no diffs other than the loss of the extra (presumably superfluous) parens around the source ranges, so LGTM." [puppet] - https://gerrit.wikimedia.org/r/1032780 (owner: Muehlenhoff) [15:38:42] (PS8) Dzahn: stewards: add rsync server, let lists primary host pull data [puppet] - https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) [15:42:25] (PS1) Hnowlan: appservers: 6 appservers to insetup before reimaging [puppet] - https://gerrit.wikimedia.org/r/1032805 (https://phabricator.wikimedia.org/T353464) [15:43:37] (PS1) Cwhite: logstash: test ecs 1.11.0-8 on beta-logs [puppet] - https://gerrit.wikimedia.org/r/1032733 (https://phabricator.wikimedia.org/T290020) [15:46:44] (CR) Dzahn: appservers: 6 appservers to insetup before reimaging (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1032805 (https://phabricator.wikimedia.org/T353464) (owner: Hnowlan) [15:46:53] ops-codfw, SRE, Infrastructure-Foundations, netops, Patch-For-Review: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9809234 (cmooney) From what I can tell the 'authoritative' statement only controls NAK generation. I think we're hittin...
[15:48:55] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:21] (03CR) 10Scott French: "Thanks, Janis!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031105 (https://phabricator.wikimedia.org/T346638) (owner: 10Scott French) [15:55:10] (03CR) 10Jforrester: "I'd rather the code change go out first (and have a double-wrapped button) than apply this un-shucked image with the wrong-sized buttons f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [15:58:41] (03PS1) 10Hnowlan: trafficserver: move to 15% traffic split for commons [puppet] - 10https://gerrit.wikimedia.org/r/1032828 (https://phabricator.wikimedia.org/T362323) [15:59:10] (03CR) 10Hnowlan: [C:03+1] wmnet: add data-gateway CNAME record for k8s ingress [dns] - 10https://gerrit.wikimedia.org/r/1032590 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [15:59:23] (03CR) 10Hnowlan: [C:03+1] kubernetes: add data-gateway usernames for deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1032591 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [16:00:26] (03CR) 10Hnowlan: [C:03+1] service: add data-gateway service (k8s ingress) [puppet] - 10https://gerrit.wikimedia.org/r/1032592 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [16:00:50] (03CR) 10Hnowlan: [C:03+1] service: move data-gateway service to production [puppet] - 10https://gerrit.wikimedia.org/r/1032593 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [16:01:49] (03CR) 10Hnowlan: [C:03+1] admin_ng: add namespace for data-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032594 (https://phabricator.wikimedia.org/T364921) (owner: 
10Scott French) [16:06:08] (03CR) 10Hnowlan: "lgtm mostly, some minor notes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [16:06:50] (03CR) 10Hnowlan: [C:03+1] envoy: add data-gateway service listener [puppet] - 10https://gerrit.wikimedia.org/r/1032599 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [16:07:09] (03CR) 10Scott French: [C:03+1] appservers: 6 appservers to insetup before reimaging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1032805 (https://phabricator.wikimedia.org/T353464) (owner: 10Hnowlan) [16:08:26] (03PS1) 10Hnowlan: geo-analytics: use replicas consistent with other analytics services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032831 [16:09:30] (03CR) 10DCausse: wdqs.data-reload: support HDFS as a source (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [16:10:22] (03PS10) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) [16:12:33] (03CR) 10Dzahn: [C:04-1] "The srange needs to be passed as an array of hosts or IPs" [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [16:13:15] 06SRE, 10SRE-Access-Requests: Requesting access to crm for cstone - https://phabricator.wikimedia.org/T365214#9809354 (10Cstone) [16:13:35] (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] Fix serialization errors in PageBundle extensiondata [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032807 (https://phabricator.wikimedia.org/T365036) (owner: 10C. 
Scott Ananian) [16:15:50] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:17:35] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2009.codfw.wmnet with OS bullseye [16:17:44] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9809368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed... [16:17:53] (03PS2) 10Kamila Součková: recommendation-api: add securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032764 (https://phabricator.wikimedia.org/T362978) [16:18:07] (03CR) 10Kamila Součková: recommendation-api: add securityContext (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032764 (https://phabricator.wikimedia.org/T362978) (owner: 10Kamila Součková) [16:21:22] (03PS9) 10Dzahn: stewards: add rsync server, let lists primary host pull data [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) [16:21:43] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sretest2002 to codfw - jhancock@cumin2002" [16:21:45] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:22:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sretest2002 to codfw - jhancock@cumin2002" [16:22:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:23:22] (03PS2) 10Scott French: services: add data-gateway service 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) [16:26:49] (03PS1) 10Jdlrobson: Drop responsive behaviour [skins/MinervaNeue] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032808 (https://phabricator.wikimedia.org/T109277) [16:30:09] (03CR) 10Scott French: "Thanks for the review, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [16:30:42] (03PS1) 10Jdlrobson: Enable desktop watchlist HTML on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032833 (https://phabricator.wikimedia.org/T109277) [16:32:21] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1031565/2506/stewards1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [16:32:42] (03CR) 10Scott French: [C:03+1] geo-analytics: use replicas consistent with other analytics services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032831 (owner: 10Hnowlan) [16:33:43] (03PS1) 10JHathaway: postfix: send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/1032834 (https://phabricator.wikimedia.org/T325395) [16:35:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [16:35:50] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1032834 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [16:37:01] (03CR) 10Dzahn: [C:03+2] stewards: add rsync server, let lists primary host pull data [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [16:38:49] (03PS2) 10JHathaway: postfix: send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/1032834 (https://phabricator.wikimedia.org/T325395) [16:38:57] (03CR) 10JHathaway: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/1032834 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [16:42:49] (03CR) 10JHathaway: [C:03+2] postfix: send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/1032834 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [16:47:42] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists, 13Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9809490 (10Dzahn) We now have an rsy... [16:57:19] (03CR) 10Brouberol: [C:03+1] "LGTM! Great job" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300) (owner: 10Stevemunene) [17:06:16] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [17:18:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [17:18:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [17:35:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [17:36:01] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [17:36:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2121 (T352010)', diff saved to https://phabricator.wikimedia.org/P62587 and previous config saved to /var/cache/conftool/dbconfig/20240517-173608-ladsgroup.json [17:36:12] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:00:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 
(T364299)', diff saved to https://phabricator.wikimedia.org/P62588 and previous config saved to /var/cache/conftool/dbconfig/20240517-180006-marostegui.json [18:00:17] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [18:04:34] (03PS2) 10Cwhite: add orchestrator and user.extra fields [software/ecs] - 10https://gerrit.wikimedia.org/r/930597 (https://phabricator.wikimedia.org/T292881) [18:05:37] (03PS3) 10Cwhite: add orchestrator and user.extra fields [software/ecs] - 10https://gerrit.wikimedia.org/r/930597 (https://phabricator.wikimedia.org/T292881) [18:07:02] (03PS5) 10Kimberly Sarabia: Introduce sample overrides to web_ui_actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) [18:07:31] (03PS2) 10Cwhite: logstash: test ecs 1.11.0-7 on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/1032733 (https://phabricator.wikimedia.org/T290020) [18:09:10] (03Abandoned) 10Cwhite: add user.extra field [software/ecs] - 10https://gerrit.wikimedia.org/r/1032732 (https://phabricator.wikimedia.org/T290020) (owner: 10Cwhite) [18:11:45] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:12:11] (03PS1) 10Dzahn: stewards: make rsync server listen on IPv6 as well, not just 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/1032844 (https://phabricator.wikimedia.org/T351202) [18:13:02] (03PS2) 10Dzahn: stewards: make rsync server listen on IPv6 as well, not just 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/1032844 (https://phabricator.wikimedia.org/T351202) [18:15:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P62589 and previous config saved to 
/var/cache/conftool/dbconfig/20240517-181515-marostegui.json [18:19:23] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:20:21] PROBLEM - Host ps1-c2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:22:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [18:25:23] (03CR) 10Dzahn: [C:03+2] "netstat before: tcp 0 0 0.0.0.0:873 0.0.0.0:* LISTEN 0 25185228 4009856/rsync" [puppet] - 10https://gerrit.wikimedia.org/r/1032844 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [18:28:49] (03CR) 10Dzahn: [C:03+2] "netstat after:" [puppet] - 10https://gerrit.wikimedia.org/r/1032844 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [18:29:27] RECOVERY - Host ps1-c2-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.84 ms [18:29:37] RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.70 ms [18:30:22] (03CR) 10Dzahn: [C:03+2] "but regardless it rsync client still can't push to it over IPv6 with auto_firewall = true and firewall provider nftables" [puppet] - 10https://gerrit.wikimedia.org/r/1032844 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [18:30:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P62590 and previous config saved to /var/cache/conftool/dbconfig/20240517-183022-marostegui.json [18:40:32] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9809808 (10Jhancock.wm) a:05Jhancock.wm→03Papaul @Papaul I'm still having trouble with the same spot as noted before. Can you take a look at it? 
[18:45:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T364299)', diff saved to https://phabricator.wikimedia.org/P62591 and previous config saved to /var/cache/conftool/dbconfig/20240517-184530-marostegui.json [18:45:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [18:45:39] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [18:45:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [18:45:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T364299)', diff saved to https://phabricator.wikimedia.org/P62592 and previous config saved to /var/cache/conftool/dbconfig/20240517-184554-marostegui.json [18:50:42] 06SRE, 10SRE-Access-Requests: Requesting access to crm for cstone - https://phabricator.wikimedia.org/T365214#9809849 (10greg) This has my approval. 
[18:50:54] (03PS1) 10Scott French: [WIP] dbctl: extend dbconfig checks to external sections [software/conftool] - 10https://gerrit.wikimedia.org/r/1032849 [18:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [18:56:59] (03PS2) 10Scott French: dbctl: extend dbconfig checks to external sections [software/conftool] - 10https://gerrit.wikimedia.org/r/1032849 (https://phabricator.wikimedia.org/T365123) [19:01:45] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:06:03] PROBLEM - Juniper alarms on lsw1-e5-eqiad.mgmt is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [19:10:28] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9809946 (10Papaul) Thank you will do [19:10:51] State sensor Power Supply 1 @ 0/1/* has changed from online (6) to offline (8) [19:11:30] (03PS4) 10Cwhite: add orchestrator and user.extra fields [software/ecs] - 10https://gerrit.wikimedia.org/r/930597 (https://phabricator.wikimedia.org/T292881) [19:12:49] (03PS5) 10Cwhite: add orchestrator and user.extra fields [software/ecs] - 10https://gerrit.wikimedia.org/r/930597 (https://phabricator.wikimedia.org/T292881) [19:14:33] (03CR) 10Ryan Kemper: [C:03+1] zk/flink: Switch the Zookeeper firewall settings to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1032773 (owner: 10Muehlenhoff) [19:14:37] (03CR) 10Cwhite: [C:03+2] add orchestrator and user.extra fields [software/ecs] - 10https://gerrit.wikimedia.org/r/930597 
(https://phabricator.wikimedia.org/T292881) (owner: 10Cwhite) [19:14:42] 10ops-eqiad, 06Infrastructure-Foundations, 10netops: partial power outage for lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365289 (10CDanis) 03NEW [19:14:50] (03Merged) 10jenkins-bot: add orchestrator and user.extra fields [software/ecs] - 10https://gerrit.wikimedia.org/r/930597 (https://phabricator.wikimedia.org/T292881) (owner: 10Cwhite) [19:14:54] 10ops-eqiad, 06Infrastructure-Foundations, 10netops: partial power outage for lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365289#9810003 (10CDanis) p:05Triage→03High [19:15:40] oh thanks cdanis for filing the task [19:15:53] yep [19:16:01] I'm also downtiming until Monday business hours [19:16:44] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9809948 (10Jdlrobson) [19:16:46] <3 [19:17:43] (03PS1) 10Cwhite: logstash: update ecs patch version to 7 [puppet] - 10https://gerrit.wikimedia.org/r/1032737 (https://phabricator.wikimedia.org/T290020) [19:19:15] (03CR) 10Cwhite: [C:03+2] logstash: test ecs 1.11.0-7 on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/1032733 (https://phabricator.wikimedia.org/T290020) (owner: 10Cwhite) [19:21:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye [19:22:00] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9810014 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye [19:28:53] PROBLEM - Host ml-serve2002 is DOWN: PING CRITICAL - Packet loss = 100% [19:28:59] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:29:31] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:32:50] FIRING: KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:32:56] (03PS1) 10Hashar: Test please ignore [puppet] - 10https://gerrit.wikimedia.org/r/1032852 [19:33:07] ok, looking into this ml-serve thing [19:35:08] Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1. [19:35:11] ok then [19:35:28] it's hard to be hardware on a Friday [19:36:34] should we depool it from https://config-master.wikimedia.org/pybal/codfw/k8s-ingress-ml-serve ? 
[19:37:00] I think we should yeah [19:37:04] where depool can be depooled or "inactive" [19:38:37] !log dzahn@cumin1002 conftool action : set/pooled=no; selector: name=ml-serve2002.codfw.wmnet [19:38:45] thanks, filing task [19:39:13] thanks as well [19:40:14] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [19:40:16] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291 (10ssingh) 03NEW [19:42:37] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-main2009.codfw.wmnet with OS bullseye [19:42:42] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9810061 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed wi... 
[19:43:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye [19:43:21] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9810065 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye [19:44:42] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#9810066 (10ssingh) Host is depooled: ` 19:38:37 <+logmsgbot> !log dzahn@cumin1002 conftool action : set/pooled=no; selector: name=ml-serve2002.codfw.wmnet ` [19:51:45] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:55:59] (03PS1) 10Papaul: The resue option is making re-image to fail so testing without it. [puppet] - 10https://gerrit.wikimedia.org/r/1032857 (https://phabricator.wikimedia.org/T363209) [20:00:22] (03CR) 10Papaul: [C:03+2] The resue option is making re-image to fail so testing without it. 
[puppet] - 10https://gerrit.wikimedia.org/r/1032857 (https://phabricator.wikimedia.org/T363209) (owner: 10Papaul) [20:01:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [20:01:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [20:03:41] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:06:30] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:08:56] (03PS1) 10Cwhite: logstash: bugfix out incorrect index pattern [puppet] - 10https://gerrit.wikimedia.org/r/1032738 [20:09:53] !magically_make_the_bot_work_that_updates_the_topic [20:11:30] RESOLVED: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:12:33] (03CR) 10Cwhite: [C:03+2] logstash: bugfix out incorrect index pattern [puppet] - 10https://gerrit.wikimedia.org/r/1032738 (owner: 10Cwhite) [20:12:50] FIRING: [2x] KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:14:40] given that we have the ticket. 
guess we should silence that [20:15:04] always forget how short lived the "short lived" is when you click on alerts.wm.org [20:21:45] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [20:39:35] (03PS1) 10TChin: datasets-config: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032862 (https://phabricator.wikimedia.org/T357434) [21:01:27] 06SRE, 06Infrastructure-Foundations, 06Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9810241 (10CDanis) Latest results: magru is a clear win for BR, AR, CL, PY, UY, BO This adds BO to the "clear win" set. I am guessing... [21:02:03] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=yes; selector: name=elastic2090\.codfw\.wmnet [21:10:38] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2009.codfw.wmnet with OS bullseye [21:10:50] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9810270 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed wi... 
[21:11:41] (03PS1) 10Ryan Kemper: elastic: only alert on update rate in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1032869 [21:12:00] (03CR) 10CI reject: [V:04-1] elastic: only alert on update rate in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1032869 (owner: 10Ryan Kemper) [21:12:36] (03PS2) 10Ryan Kemper: elastic: only alert on update rate in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1032869 [21:12:54] (03CR) 10CI reject: [V:04-1] elastic: only alert on update rate in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1032869 (owner: 10Ryan Kemper) [21:13:12] (03PS3) 10Ryan Kemper: elastic: only alert on update rate in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1032869 [21:14:05] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1032869 (owner: 10Ryan Kemper) [21:15:56] (03PS1) 10Papaul: Add back reuse option after testing [puppet] - 10https://gerrit.wikimedia.org/r/1032871 (https://phabricator.wikimedia.org/T363209) [21:16:42] (03PS4) 10Ryan Kemper: elastic: only alert on update rate in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1032869 [21:16:51] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1032869 (owner: 10Ryan Kemper) [21:17:01] (03PS5) 10Ryan Kemper: elastic: only alert on update rate in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1032869 [21:17:11] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1032869 (owner: 10Ryan Kemper) [21:18:59] (03PS1) 10Dzahn: lists: add timer to sync data from stewards hosts [puppet] - 10https://gerrit.wikimedia.org/r/1032872 (https://phabricator.wikimedia.org/T351202) [21:20:13] (03CR) 10Bking: [C:03+1] elastic: only alert on update rate in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1032869 (owner: 10Ryan Kemper) [21:20:22] (03PS2) 10Dzahn: lists: add timer to sync data from stewards hosts [puppet] - 10https://gerrit.wikimedia.org/r/1032872 
(https://phabricator.wikimedia.org/T351202) [21:20:25] (03CR) 10Ryan Kemper: [C:03+2] elastic: only alert on update rate in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1032869 (owner: 10Ryan Kemper) [21:31:15] 06SRE, 10SRE-Access-Requests: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053#9810307 (10Dzahn) Hi @Tchanders Since this ticket was resolved in 2020 the deployment server has been replaced. It was `deploy1001` here but now it is `deploy1002` (and d... [21:33:39] 06SRE, 10SRE-Access-Requests: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053#9810309 (10Dzahn) @Tchanders I can see you logged in on bast1003 and your key is accepted there. So it's not the key. I think it's just the wrong host name. By the way y... [21:37:11] (03PS1) 10Alexandros Kosiaris: preseed: kafka-main[12]00[6-9]|kafka-main[12]010 [puppet] - 10https://gerrit.wikimedia.org/r/1032876 (https://phabricator.wikimedia.org/T363212) [21:40:59] (03CR) 10Alexandros Kosiaris: [C:03+2] preseed: kafka-main[12]00[6-9]|kafka-main[12]010 [puppet] - 10https://gerrit.wikimedia.org/r/1032876 (https://phabricator.wikimedia.org/T363212) (owner: 10Alexandros Kosiaris) [21:46:08] (03CR) 10Papaul: [C:03+2] Add back reuse option after testing [puppet] - 10https://gerrit.wikimedia.org/r/1032871 (https://phabricator.wikimedia.org/T363209) (owner: 10Papaul) [21:47:10] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1006.eqiad.wmnet with OS bullseye [21:47:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9810341 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main1006.eqiad.wmn... 
[21:49:15] (03PS2) 10Papaul: Add back reuse option after testing [puppet] - 10https://gerrit.wikimedia.org/r/1032871 (https://phabricator.wikimedia.org/T363209) [21:52:10] (03CR) 10CI reject: [V:04-1] Add back reuse option after testing [puppet] - 10https://gerrit.wikimedia.org/r/1032871 (https://phabricator.wikimedia.org/T363209) (owner: 10Papaul) [21:54:45] (03PS1) 10Alexandros Kosiaris: Brown paperbag fix for kafka-main preseed [puppet] - 10https://gerrit.wikimedia.org/r/1032880 (https://phabricator.wikimedia.org/T363212) [21:55:10] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] Brown paperbag fix for kafka-main preseed [puppet] - 10https://gerrit.wikimedia.org/r/1032880 (https://phabricator.wikimedia.org/T363212) (owner: 10Alexandros Kosiaris) [21:57:21] !log akosiaris@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-main1006.eqiad.wmnet with OS bullseye [21:57:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9810363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main1006.eqiad.wmnet w... [21:57:56] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1006.eqiad.wmnet with OS bullseye [21:58:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9810364 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main1006.eqiad.wmn... 
[22:02:50] (03PS3) 10Papaul: Add back reuse option after testing [puppet] - 10https://gerrit.wikimedia.org/r/1032871 (https://phabricator.wikimedia.org/T363209) [22:03:01] (03CR) 10CI reject: [V:04-1] Add back reuse option after testing [puppet] - 10https://gerrit.wikimedia.org/r/1032871 (https://phabricator.wikimedia.org/T363209) (owner: 10Papaul) [22:06:47] (03Abandoned) 10Papaul: Add back reuse option after testing [puppet] - 10https://gerrit.wikimedia.org/r/1032871 (https://phabricator.wikimedia.org/T363209) (owner: 10Papaul) [22:10:50] (03PS1) 10BCornwall: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1032884 [22:11:24] (03CR) 10CI reject: [V:04-1] Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1032884 (owner: 10BCornwall) [22:11:45] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:19:55] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1007.eqiad.wmnet with OS bullseye [22:20:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye [22:20:24] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9810418 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet wi... 
[22:20:43] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1008.eqiad.wmnet with OS bullseye
[22:21:09] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye
[22:24:00] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9810421 (akosiaris)
[22:26:27] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9810438 (akosiaris) For some reason on kafka1006 software RAID re-syncing is taking forever (moving at 19K/s, which is VERY slow) and...
[22:36:45] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:37:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:38:59] (PS1) Brian Wolff: Allow async (job queue based) chunked upload on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1032888 (https://phabricator.wikimedia.org/T364644)
[22:43:53] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1006.eqiad.wmnet with OS bullseye
[22:44:01] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9810468 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye execut...
[22:45:16] (PS3) Cwhite: logstash: reformat k8s audit logs to ECS [puppet] - https://gerrit.wikimedia.org/r/1031602 (https://phabricator.wikimedia.org/T290020)
[22:48:50] (PS4) Cwhite: logstash: reformat k8s audit logs to ECS [puppet] - https://gerrit.wikimedia.org/r/1031602 (https://phabricator.wikimedia.org/T290020)
[22:49:35] (CR) Cwhite: logstash: reformat k8s audit logs to ECS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1031602 (https://phabricator.wikimedia.org/T290020) (owner: Cwhite)
[22:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[23:01:45] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:02:27] (CR) Cwhite: [C:+1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: Andrea Denisse)
[23:05:53] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1007.eqiad.wmnet with OS bullseye
[23:06:41] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1008.eqiad.wmnet with OS bullseye
[23:08:04] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2009.codfw.wmnet with OS bullseye
[23:08:16] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9810496 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed wi...
[23:09:58] (CR) Cwhite: "Great work!" [puppet] - https://gerrit.wikimedia.org/r/1032608 (https://phabricator.wikimedia.org/T267664) (owner: Andrea Denisse)
[23:11:26] (CR) Cwhite: "" [puppet] - https://gerrit.wikimedia.org/r/1032608 (https://phabricator.wikimedia.org/T267664) (owner: Andrea Denisse)
[23:32:57] (CR) BryanDavis: [C:+1] "+1 indicating that the reviewer has a working mouse" [puppet] - https://gerrit.wikimedia.org/r/976355 (https://phabricator.wikimedia.org/T334512) (owner: Brennen Bearnes)
[23:33:55] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:37:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:38:47] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1032740
[23:38:47] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1032740 (owner: TrainBranchBot)
[23:41:25] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1010.eqiad.wmnet with OS bullseye
[23:41:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye
[23:41:40] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9810548 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye
[23:43:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2009.codfw.wmnet with reason: host reimage
[23:46:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2009.codfw.wmnet with reason: host reimage
[23:51:45] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:59:36] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1032740 (owner: TrainBranchBot)