[00:01:41] (03PS1) 10Dzahn: devtools: update host name for new gerrit test instance [puppet] - 10https://gerrit.wikimedia.org/r/1036767 (https://phabricator.wikimedia.org/T363196) [00:02:13] (03CR) 10Dzahn: [C:03+2] devtools: update host name for new gerrit test instance [puppet] - 10https://gerrit.wikimedia.org/r/1036767 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [00:03:08] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1036596 (owner: 10TrainBranchBot) [00:13:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P63490 and previous config saved to /var/cache/conftool/dbconfig/20240529-001303-marostegui.json [00:18:20] (03PS4) 10Aaron Schulz: Set "s3" as the default section name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909763 [00:18:57] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:20:49] (03PS1) 10Dzahn: gerrit: add parameter to toggle lfs_replica_sync ensure [puppet] - 10https://gerrit.wikimedia.org/r/1036771 [00:21:10] (03CR) 10CI reject: [V:04-1] gerrit: add parameter to toggle lfs_replica_sync ensure [puppet] - 10https://gerrit.wikimedia.org/r/1036771 (owner: 10Dzahn) [00:28:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P63491 and previous config saved to /var/cache/conftool/dbconfig/20240529-002811-marostegui.json [00:43:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T364299)', diff saved to https://phabricator.wikimedia.org/P63492 and previous config saved to 
/var/cache/conftool/dbconfig/20240529-004319-marostegui.json [00:43:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [00:43:27] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [00:43:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [00:43:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T364299)', diff saved to https://phabricator.wikimedia.org/P63493 and previous config saved to /var/cache/conftool/dbconfig/20240529-004343-marostegui.json [01:48:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T364299)', diff saved to https://phabricator.wikimedia.org/P63494 and previous config saved to /var/cache/conftool/dbconfig/20240529-014845-marostegui.json [01:48:53] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [01:58:07] FIRING: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:03:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P63495 and previous config saved to /var/cache/conftool/dbconfig/20240529-020353-marostegui.json [02:19:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P63496 and previous config saved to /var/cache/conftool/dbconfig/20240529-021901-marostegui.json [02:34:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T364299)', diff saved to https://phabricator.wikimedia.org/P63497 and 
previous config saved to /var/cache/conftool/dbconfig/20240529-023409-marostegui.json [02:34:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance [02:34:16] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [02:34:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance [02:34:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T364299)', diff saved to https://phabricator.wikimedia.org/P63498 and previous config saved to /var/cache/conftool/dbconfig/20240529-023432-marostegui.json [02:36:48] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:42:47] 06SRE, 06serviceops, 13Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9840665 (10CDanis) = tldr: * Adding the new control plane workers in eqiad turned what was a CPU saturation issue (causing blackbox probes to be slow but still within timeouts), into a simultaneous... 
[02:56:48] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:27] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:17:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T364069)', diff saved to https://phabricator.wikimedia.org/P63499 and previous config saved to /var/cache/conftool/dbconfig/20240529-031710-marostegui.json [03:17:20] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [03:18:57] RESOLVED: JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:29:56] 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9840685 (10Soda) a:05Soda→03None Sent the information. 
(in an email titled `Re: Information for T366032 (Sohom Datta)`) [03:32:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P63500 and previous config saved to /var/cache/conftool/dbconfig/20240529-033221-marostegui.json [03:38:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T364299)', diff saved to https://phabricator.wikimedia.org/P63501 and previous config saved to /var/cache/conftool/dbconfig/20240529-033814-marostegui.json [03:38:22] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [03:47:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P63502 and previous config saved to /var/cache/conftool/dbconfig/20240529-034728-marostegui.json [03:53:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P63503 and previous config saved to /var/cache/conftool/dbconfig/20240529-035323-marostegui.json [03:55:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [03:55:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [03:55:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2125 (T352010)', diff saved to https://phabricator.wikimedia.org/P63504 and previous config saved to /var/cache/conftool/dbconfig/20240529-035538-ladsgroup.json [03:55:46] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [04:02:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T364069)', diff saved to https://phabricator.wikimedia.org/P63505 and previous config saved to 
/var/cache/conftool/dbconfig/20240529-040236-marostegui.json [04:02:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [04:02:43] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [04:02:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [04:03:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2137 (T364069)', diff saved to https://phabricator.wikimedia.org/P63506 and previous config saved to /var/cache/conftool/dbconfig/20240529-040259-marostegui.json [04:08:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P63507 and previous config saved to /var/cache/conftool/dbconfig/20240529-040831-marostegui.json [04:21:48] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:23:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T364299)', diff saved to https://phabricator.wikimedia.org/P63508 and previous config saved to /var/cache/conftool/dbconfig/20240529-042339-marostegui.json [04:23:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1227.eqiad.wmnet with reason: Maintenance [04:23:44] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [04:23:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1227.eqiad.wmnet with reason: Maintenance [04:24:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T364299)', diff saved to https://phabricator.wikimedia.org/P63509 and previous config saved to 
/var/cache/conftool/dbconfig/20240529-042402-marostegui.json [04:36:24] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T366134 (10phaultfinder) 03NEW [04:41:27] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T366134#9840740 (10phaultfinder) [04:42:56] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:43:26] (03CR) 10AOkoth: [C:03+1] vrts: add missing comma to vrts_aliases.py [puppet] - 10https://gerrit.wikimedia.org/r/1036760 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn) [04:43:30] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:46:29] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T366134#9840741 (10phaultfinder) [05:21:07] (03PS1) 10Marostegui: db1211: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1036782 [05:21:51] (03CR) 10Marostegui: [C:03+2] db1211: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1036782 (owner: 10Marostegui) [05:39:27] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:58:07] FIRING: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown 
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T0600) [06:22:32] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:22:40] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:44:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T364299)', diff saved to https://phabricator.wikimedia.org/P63510 and previous config saved to /var/cache/conftool/dbconfig/20240529-064453-marostegui.json [06:45:00] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [06:47:34] (03PS4) 10Elukey: redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) [06:49:08] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1218.eqiad.wmnet [06:52:52] (03PS1) 10Muehlenhoff: Switch db1218 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036910 (https://phabricator.wikimedia.org/T349619) [06:55:50] (03CR) 10Muehlenhoff: [C:03+2] Switch db1218 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036910 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [06:59:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1218.eqiad.wmnet [07:00:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P63511 and previous config saved to /var/cache/conftool/dbconfig/20240529-070001-marostegui.json [07:00:05] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:15:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P63512 and previous config saved to /var/cache/conftool/dbconfig/20240529-071509-marostegui.json [07:16:06] (03PS1) 10Marostegui: db2170: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1036912 [07:16:48] (03CR) 10Marostegui: [C:03+2] db2170: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1036912 (owner: 10Marostegui) [07:29:52] (03PS1) 10Marostegui: core_test.pp: Add MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1036916 (https://phabricator.wikimedia.org/T365805) [07:30:13] (03CR) 10Marostegui: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036916 (https://phabricator.wikimedia.org/T365805) (owner: 10Marostegui) [07:30:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T364299)', diff saved to https://phabricator.wikimedia.org/P63513 and previous config saved to /var/cache/conftool/dbconfig/20240529-073017-marostegui.json [07:30:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [07:30:24] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [07:30:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [07:31:21] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for backup roles [puppet] - 10https://gerrit.wikimedia.org/r/1032636 (owner: 10Muehlenhoff) [07:32:12] RECOVERY - Categories update lag on wdqs1018 is OK: OK - Categories lag: 2:32:11.288501 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:35:16] RECOVERY - Categories update lag on wdqs2013 is OK: OK - Categories lag: 2:35:15.453618 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:35:16] RECOVERY - Categories update lag on wdqs2025 is OK: OK - Categories lag: 2:35:15.479951 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:35:16] RECOVERY - Categories update lag on wdqs2011 is OK: OK - Categories lag: 2:35:15.489814 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:35:57] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1219.eqiad.wmnet [07:37:10] (03PS1) 10Jelto: gitlab: bump exporter version to v1.0.10 [puppet] - 10https://gerrit.wikimedia.org/r/1036987 (https://phabricator.wikimedia.org/T354656) [07:38:16] RECOVERY - Categories update lag on wdqs2007 is OK: OK - Categories lag: 2:38:14.638224 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:38:38] (03CR) 10Jelto: [C:03+2] gitlab: bump exporter version to v1.0.10 [puppet] - 10https://gerrit.wikimedia.org/r/1036987 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [07:39:05] (03PS1) 10Muehlenhoff: Switch db1219 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036989 (https://phabricator.wikimedia.org/T349619) [07:41:12] (03PS1) 10DCausse: cirrus-streaming-updater: use latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036992 (https://phabricator.wikimedia.org/T365692) [07:41:16] RECOVERY - Categories update lag on wdqs2009 is OK: OK - Categories lag: 2:41:14.574669 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:41:18] jouncebot: nowandnext [07:41:18] For the next 0 hour(s) and 18 minute(s): UTC morning backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T0700) [07:41:18] In 0 hour(s) and 18 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T0800) [07:41:19] (03CR) 10Muehlenhoff: [C:03+2] Switch db1219 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036989 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:47:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1219.eqiad.wmnet [07:47:12] RECOVERY - Categories update lag on wdqs1017 is OK: OK - Categories lag: 2:47:10.428264 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:47:12] RECOVERY - Categories update lag on wdqs1015 is OK: OK - Categories lag: 2:47:10.489025 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:47:12] RECOVERY - Categories update lag on wdqs1019 is OK: OK - Categories lag: 2:47:11.314602 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:47:18] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1228.eqiad.wmnet [07:48:39] (03PS1) 10Muehlenhoff: Switch db1228 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036993 (https://phabricator.wikimedia.org/T349619) [07:49:25] (03PS1) 10Stevemunene: Remove datahub from LVS [puppet] - 10https://gerrit.wikimedia.org/r/1036994 (https://phabricator.wikimedia.org/T366137) [07:49:44] (03CR) 10Muehlenhoff: [C:03+2] Switch db1228 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036993 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:50:12] RECOVERY - Categories update lag on wdqs2008 is OK: OK - Categories lag: 2:50:11.301393 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:50:14] RECOVERY - Categories update lag on wdqs2014 is OK: OK - Categories 
lag: 2:50:12.729355 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:50:14] RECOVERY - Categories update lag on wdqs2010 is OK: OK - Categories lag: 2:50:12.746571 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:50:14] RECOVERY - Categories update lag on wdqs2012 is OK: OK - Categories lag: 2:50:12.759660 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:50:15] RECOVERY - Categories update lag on wdqs2024 is OK: OK - Categories lag: 2:50:13.429869 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:50:15] RECOVERY - Categories update lag on wdqs2022 is OK: OK - Categories lag: 2:50:13.427291 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:51:13] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: use latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036992 (https://phabricator.wikimedia.org/T365692) (owner: 10DCausse) [07:52:12] (03Merged) 10jenkins-bot: cirrus-streaming-updater: use latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036992 (https://phabricator.wikimedia.org/T365692) (owner: 10DCausse) [07:54:10] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:54:37] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:55:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1228.eqiad.wmnet [07:56:20] 07Puppet, 10Wikidata, 06Wikidata Dev Team, 10wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9840949 (10AndrewTavis_WMDE) Moving this to verification given the work in T364965. Thanks for all of this, @Lucas_Werkmeister_WMDE! Maybe we can reso... 
[08:00:05] dancy and andre: Your horoscope predicts another MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T0800). [08:00:10] (03CR) 10Muehlenhoff: vrts: add missing comma to vrts_aliases.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1036760 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn) [08:00:45] !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: sync [08:01:06] (03CR) 10Muehlenhoff: [C:03+2] ml/etcd: remove obsolete certificites [puppet] - 10https://gerrit.wikimedia.org/r/1036619 (owner: 10Muehlenhoff) [08:05:19] (03PS2) 10Dzahn: vrts: add missing comma to vrts_aliases.py [puppet] - 10https://gerrit.wikimedia.org/r/1036760 (https://phabricator.wikimedia.org/T284145) [08:05:19] (03CR) 10Dzahn: vrts: add missing comma to vrts_aliases.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1036760 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn) [08:05:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [08:06:29] (03PS1) 10Effie Mouzeli: memcached: minor fixes in class and profile [puppet] - 10https://gerrit.wikimedia.org/r/1036995 [08:06:57] (03PS2) 10Effie Mouzeli: memcached: minor fixes in class and profile [puppet] - 10https://gerrit.wikimedia.org/r/1036995 [08:07:20] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036995 (owner: 10Effie Mouzeli) [08:09:03] (03PS1) 10Muehlenhoff: Remove obsolete wikikube/staging etcd certificates 
[puppet] - 10https://gerrit.wikimedia.org/r/1036998 (https://phabricator.wikimedia.org/T357750) [08:10:54] !log brouberol@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: sync [08:11:33] (03PS1) 10Hashar: Merge tag 'v3.9.5' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1036999 (https://phabricator.wikimedia.org/T354887) [08:11:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036998 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [08:12:48] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1036760 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn) [08:14:07] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036995 (owner: 10Effie Mouzeli) [08:15:09] (03PS2) 10Mvolz: Update user-agent string in citoid to be like Zot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034860 (https://phabricator.wikimedia.org/T366093) [08:15:19] (03CR) 10Muehlenhoff: [C:03+1] "Thanks!" 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1013156 (https://phabricator.wikimedia.org/T350129) (owner: 10Pppery) [08:15:49] (03CR) 10Slyngshede: [C:03+2] Update links to point to non-wiki privacy policy and bypass redirects [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1013156 (https://phabricator.wikimedia.org/T350129) (owner: 10Pppery) [08:15:57] (03CR) 10Slyngshede: [V:03+2 C:03+2] Update links to point to non-wiki privacy policy and bypass redirects [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1013156 (https://phabricator.wikimedia.org/T350129) (owner: 10Pppery) [08:18:19] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 13Patch-For-Review, 10Release-Engineering-Team (Priority Backlog 📥): Correct IDP login page Privacy Policy - https://phabricator.wikimedia.org/T350129#9840992 (10SLyngshede-WMF) The updated template will be rolled out with the next version bump of CAS. [08:18:36] (03CR) 10Effie Mouzeli: [C:03+1] role::thanos::frontend: move all envoy TLS certs to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036643 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [08:20:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9841008 (10akosiaris) >>! In T363212#9839469, @Dzahn wrote: > @akosiaris re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1035769/1/modules/profile/da... [08:21:48] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:22:59] !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: sync [08:23:14] (03CR) 10Ayounsi: "codfw/eqiad IPs lgtm, I can't vouch for the SPF settings though." 
[dns] - 10https://gerrit.wikimedia.org/r/1036739 (https://phabricator.wikimedia.org/T366113) (owner: 10JHathaway) [08:23:24] (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/1036995/2671/" [puppet] - 10https://gerrit.wikimedia.org/r/1036995 (owner: 10Effie Mouzeli) [08:23:29] (03CR) 10Hashar: [C:03+2] Merge tag 'v3.9.5' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1036999 (https://phabricator.wikimedia.org/T354887) (owner: 10Hashar) [08:24:52] 07Puppet: Repeated Puppet failures for PetScan - https://phabricator.wikimedia.org/T366141 (10Magnus) 03NEW [08:27:05] (03PS1) 10Slyngshede: Bump to version 6.6.15.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140) [08:28:28] (03PS1) 10Muehlenhoff: Remove obsolete wikikube etcd certificates [puppet] - 10https://gerrit.wikimedia.org/r/1037002 (https://phabricator.wikimedia.org/T357750) [08:29:08] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 49666 [08:29:32] (03Merged) 10jenkins-bot: Merge tag 'v3.9.5' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1036999 (https://phabricator.wikimedia.org/T354887) (owner: 10Hashar) [08:31:09] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 49666 [08:33:08] !log brouberol@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: sync [08:35:30] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 8674 [08:36:20] (03CR) 10Marostegui: [C:03+2] core_test.pp: Add MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1036916 (https://phabricator.wikimedia.org/T365805) (owner: 10Marostegui) [08:39:22] !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [08:40:09] !log brouberol@deploy1002 helmfile [staging] DONE 
helmfile.d/services/edit-analytics: apply [08:42:16] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 45 probes of 791 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:47:32] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 14 probes of 791 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:48:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037002 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [08:51:37] (03CR) 10Muehlenhoff: "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1036995 (owner: 10Effie Mouzeli) [08:54:56] (03CR) 10Muehlenhoff: [C:03+2] profile::elasticsearch::cirrus: Remove obsolete http2 parameter [puppet] - 10https://gerrit.wikimedia.org/r/1036556 (owner: 10Muehlenhoff) [08:58:34] (03PS1) 10Aklapper: Remove FIXME comment for waxing and waning moon phases [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1037026 (https://phabricator.wikimedia.org/T365853) [08:58:34] (03PS3) 10Effie Mouzeli: memcached: minor fixes in class and profile [puppet] - 10https://gerrit.wikimedia.org/r/1036995 [08:59:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance [08:59:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance [09:00:12] (03PS4) 10Effie Mouzeli: memcached: minor fixes in class and profile [puppet] - 10https://gerrit.wikimedia.org/r/1036995 [09:05:30] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for 
AS: 8674 [09:05:59] (03CR) 10Volans: "Nice addition! Couple of suggestions inline, looks already good." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [09:06:53] (03PS1) 10Muehlenhoff: tlsproxy::localssl: Remove support for HTTP2 [puppet] - 10https://gerrit.wikimedia.org/r/1037029 [09:07:58] (03PS5) 10Effie Mouzeli: memcached: minor fixes in class and profile [puppet] - 10https://gerrit.wikimedia.org/r/1036995 [09:09:41] (03CR) 10Muehlenhoff: memcached: minor fixes in class and profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1036995 (owner: 10Effie Mouzeli) [09:10:46] 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9841121 (10Marostegui) For the record (in... [09:11:30] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Joely Rooke WMDE - https://phabricator.wikimedia.org/T366145 (10JoelyRooke-WMDE) 03NEW [09:11:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037029 (owner: 10Muehlenhoff) [09:12:05] !log Deploy schema change on s7 eqiad dbmaint T307501 [09:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:11] T307501: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 [09:12:26] 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9841147 (10Marostegui) [09:13:11] (03CR) 10Effie Mouzeli: memcached: minor fixes in class and profile (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1036995 (owner: 10Effie Mouzeli) [09:14:29] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1036995 (owner: 10Effie Mouzeli) [09:15:56] (03PS2) 10Muehlenhoff: tlsproxy::localssl: Remove support for HTTP2 [puppet] - 10https://gerrit.wikimedia.org/r/1037029 [09:16:12] (03PS1) 10Santiago Faci: aqs-http-gateway chart and edit-analytic service k8s configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037033 (https://phabricator.wikimedia.org/T355408) [09:16:30] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1232.eqiad.wmnet [09:17:03] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Joely Rooke WMDE - https://phabricator.wikimedia.org/T366145#9841151 (10WMDECyn) I approve the request on WMDE's behalf [09:17:15] FYI, doing some pod rolling restarts in eqiad trying to reproduce https://phabricator.wikimedia.org/T366094 [09:18:15] (03PS2) 10Santiago Faci: aqs-http-gateway chart and edit-analytic service k8s configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037033 (https://phabricator.wikimedia.org/T355408) [09:19:22] (03PS2) 10Slyngshede: Bump to version 6.6.15.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140) [09:20:40] (03PS6) 10Effie Mouzeli: memcached: minor fixes in class and profile [puppet] - 10https://gerrit.wikimedia.org/r/1036995 [09:22:31] (03PS1) 10Muehlenhoff: Switch db1232 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1037035 (https://phabricator.wikimedia.org/T349619) [09:23:49] (03CR) 10Muehlenhoff: Bump to version 6.6.15.1 (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140) (owner: 10Slyngshede) [09:24:25] (03PS3) 10Santiago Faci: aqs-http-gateway chart and edit-analytic service k8s configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037033 
(https://phabricator.wikimedia.org/T355408) [09:25:03] (03CR) 10Muehlenhoff: Bump to version 6.6.15.1 (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140) (owner: 10Slyngshede) [09:26:41] (03CR) 10Muehlenhoff: [C:03+2] Switch db1232 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1037035 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:27:33] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [09:27:54] (03CR) 10Effie Mouzeli: [C:03+2] memcached: minor fixes in class and profile [puppet] - 10https://gerrit.wikimedia.org/r/1036995 (owner: 10Effie Mouzeli) [09:28:53] (03PS6) 10Santiago Faci: editor-analytics deployment: big refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408) [09:28:54] (03PS3) 10Slyngshede: Bump to version 6.6.15.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140) [09:29:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037029 (owner: 10Muehlenhoff) [09:29:09] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [09:30:38] (03CR) 10Brouberol: [C:03+1] aqs-http-gateway chart and edit-analytic service k8s configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037033 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci) [09:31:22] 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9841191 (10Marostegui) [09:31:40] (03CR) 10Santiago Faci: [C:03+2] aqs-http-gateway chart and edit-analytic service k8s configuration [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1037033 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci) [09:31:52] (03PS4) 10Slyngshede: Bump to version 6.6.15.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140) [09:32:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1232.eqiad.wmnet [09:32:09] 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9841192 (10Marostegui) [09:33:03] 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9841193 (10Marostegui) 05Open→03Res... 
[09:33:15] (03Merged) 10jenkins-bot: aqs-http-gateway chart and edit-analytic service k8s configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037033 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci) [09:33:57] (03CR) 10Slyngshede: Bump to version 6.6.15.1 (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140) (owner: 10Slyngshede) [09:35:31] (03CR) 10Brouberol: [C:03+1] editor-analytics deployment: big refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci) [09:36:26] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync [09:36:26] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [09:37:29] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync [09:38:07] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [09:38:46] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [09:39:33] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [09:39:42] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:39:56] RECOVERY - Memcached on mc2049 is OK: TCP OK - 0.031 second response time on 10.192.32.81 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [09:41:42] (03PS1) 10Effie Mouzeli: memcached::instance: add the actual datafile in the options [puppet] - 10https://gerrit.wikimedia.org/r/1037037 [09:43:11] (03PS1) 10DCausse: cirrus-streaming-updater: use latest image version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1037038 [09:43:49] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140) (owner: 10Slyngshede) [09:44:12] (03PS5) 10Slyngshede: Bump to version 6.6.15.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140) [09:44:27] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:47:33] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1234.eqiad.wmnet [09:48:58] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: use latest image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037038 (owner: 10DCausse) [09:49:55] (03Merged) 10jenkins-bot: cirrus-streaming-updater: use latest image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037038 (owner: 10DCausse) [09:50:40] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:51:02] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:51:28] (03PS2) 10Effie Mouzeli: memcached::instance: add the actual datafile in the options [puppet] - 10https://gerrit.wikimedia.org/r/1037037 [09:52:10] (03PS1) 10Muehlenhoff: Switch db1234 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1037039 (https://phabricator.wikimedia.org/T349619) [09:54:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [09:54:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [09:54:38] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Depooling db2121 (T364299)', diff saved to https://phabricator.wikimedia.org/P63514 and previous config saved to /var/cache/conftool/dbconfig/20240529-095437-marostegui.json [09:54:43] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [09:55:17] (03CR) 10Muehlenhoff: [C:03+2] Switch db1234 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1037039 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:57:34] (03PS1) 10Hashar: Gerrit 3.9.5, rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1037041 (https://phabricator.wikimedia.org/T354887) [09:57:46] !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [09:58:08] FIRING: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:59:03] (03Abandoned) 10Hnowlan: cassandra-http-gateway: use cassandra module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015085 (owner: 10Hnowlan) [09:59:15] !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [09:59:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T1000) [10:00:41] !log sfaci@deploy1002 
helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [10:00:46] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [10:00:52] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:00:53] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:01:06] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:01:18] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:02:18] !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [10:04:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1234.eqiad.wmnet [10:04:43] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl1001.eqiad.wmnet with reason: disable puppet and k8s controlplane [10:04:57] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1001.eqiad.wmnet with reason: disable puppet and k8s controlplane [10:05:07] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl1002.eqiad.wmnet with reason: disable puppet and k8s controlplane [10:05:13] 06SRE, 06serviceops, 13Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9841272 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b65d2df8-871b-4064-b329-026af4d7ec1d) set by akosiaris@cumin1002 for 2:00:00 on 1 host(s) and their services 
with reason:... [10:05:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1002.eqiad.wmnet with reason: disable puppet and k8s controlplane [10:05:32] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync [10:05:32] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [10:05:34] 06SRE, 06serviceops, 13Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9841277 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8fa8366a-d3f2-4a77-8e2b-45de66551026) set by akosiaris@cumin1002 for 2:00:00 on 1 host(s) and their services with reason:... [10:05:36] (03CR) 10Santiago Faci: [C:03+2] editor-analytics deployment: big refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci) [10:05:51] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:05:54] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:06:01] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:06:27] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:06:31] (03Merged) 10jenkins-bot: editor-analytics deployment: big refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci) [10:06:55] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync [10:07:03] !log installing systemd security updates [10:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:07] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [10:07:28] (03CR) 10Effie 
Mouzeli: [C:03+2] memcached::instance: add the actual datafile in the options [puppet] - 10https://gerrit.wikimedia.org/r/1037037 (owner: 10Effie Mouzeli) [10:08:56] (03PS4) 10Klausman: install/partman: Tweak kubelet partition size for ML workers [puppet] - 10https://gerrit.wikimedia.org/r/1036195 (https://phabricator.wikimedia.org/T365971) [10:09:33] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [10:10:19] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [10:10:24] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1235.eqiad.wmnet [10:10:55] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2678/console" [puppet] - 10https://gerrit.wikimedia.org/r/1036195 (https://phabricator.wikimedia.org/T365971) (owner: 10Klausman) [10:12:48] !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [10:13:03] (03CR) 10Klausman: [V:03+1 C:03+2] install/partman: Tweak kubelet partition size for ML workers [puppet] - 10https://gerrit.wikimedia.org/r/1036195 (https://phabricator.wikimedia.org/T365971) (owner: 10Klausman) [10:14:26] !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [10:14:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[10:14:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:15:06] !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [10:16:33] !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [10:16:44] !log installing python-idna security updates [10:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:50] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubemaster1002.eqiad.wmnet with reason: disable puppet and k8s controlplane [10:17:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubemaster1002.eqiad.wmnet with reason: disable puppet and k8s controlplane [10:17:15] 06SRE, 06serviceops, 13Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9841311 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2f1b90d9-2cd4-4705-bbf1-70fdacf169cd) set by akosiaris@cumin1002 for 2:00:00 on 1 host(s) and their services with reason:... 
[10:17:30] (03PS1) 10Klausman: install/partman: Separate out DSE cluster partman recipe from ML [puppet] - 10https://gerrit.wikimedia.org/r/1037042 (https://phabricator.wikimedia.org/T365971) [10:18:03] (03PS2) 10Klausman: install/partman: Separate out DSE cluster partman recipe from ML [puppet] - 10https://gerrit.wikimedia.org/r/1037042 (https://phabricator.wikimedia.org/T365971) [10:19:50] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [10:19:51] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync [10:20:22] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:20:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:21:03] ah dammit [10:21:27] but it shouldn't reply anyway [10:22:39] (03CR) 10Alexandros Kosiaris: [C:03+1] otelcol: add three new k8s ctrl IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036708 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis) [10:24:35] !log akosiaris@cumin1002 conftool action : set/pooled=inactive; selector: service=kubemaster,dc=eqiad,cluster=kubernetes,name=kubemaster1002.eqiad.wmnet [10:24:41] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync [10:24:41] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [10:24:46] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [10:24:57] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync [10:26:01] (03CR) 10Slyngshede: [V:03+2 C:03+2] Bump to version 6.6.15.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 
(https://phabricator.wikimedia.org/T366140) (owner: 10Slyngshede) [10:26:32] !log installing intel-microcode security updates [10:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:41] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync [10:26:43] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [10:28:06] 06SRE, 10Wikimedia-SVG-rendering: Install 'ttf-ubuntu-font-family' on clusters rendering SVG to PNG - https://phabricator.wikimedia.org/T32288#9841338 (10Arthur2e5) Undone by https://phabricator.wikimedia.org/rOPUP33b0f4f1308bd03d1422f34e23c0ac8794ab86bf because Ubuntu is non-free. Welp, there goes my fanc... [10:28:17] (03PS1) 10Jelto: docker_registry_ha: replace deprecated /-/jwks endpoint on gitlab [puppet] - 10https://gerrit.wikimedia.org/r/1037043 (https://phabricator.wikimedia.org/T365675) [10:29:06] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in moss-be1002 - https://phabricator.wikimedia.org/T366153 (10MatthewVernon) 03NEW [10:29:52] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1049.eqiad.wmnet with OS bookworm [10:30:19] (03PS1) 10Slyngshede: P:idp::build remove duplicate rsync restart. [puppet] - 10https://gerrit.wikimedia.org/r/1037044 [10:30:53] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2679/co" [puppet] - 10https://gerrit.wikimedia.org/r/1037043 (https://phabricator.wikimedia.org/T365675) (owner: 10Jelto) [10:30:58] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1037044 (owner: 10Slyngshede) [10:35:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:35:31] (03CR) 10Slyngshede: [C:03+2] P:idp::build remove duplicate rsync restart. 
[puppet] - 10https://gerrit.wikimedia.org/r/1037044 (owner: 10Slyngshede) [10:35:45] !log akosiaris@cumin1002 conftool action : set/pooled=inactive; selector: name=parse1002.eqiad.wmnet [10:36:48] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [10:36:48] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync [10:38:52] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync [10:38:53] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [10:39:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [10:43:20] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: sync [10:43:20] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: sync [10:43:20] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync [10:43:20] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [10:43:24] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1049.eqiad.wmnet with reason: host reimage [10:43:35] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: sync [10:43:40] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: sync [10:44:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[10:44:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:45:02] (03PS1) 10Marostegui: filtered_tables.txt: Remove gu_salt [puppet] - 10https://gerrit.wikimedia.org/r/1037046 (https://phabricator.wikimedia.org/T366123) [10:45:27] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync [10:45:28] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [10:46:25] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1049.eqiad.wmnet with reason: host reimage [10:49:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance [10:49:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance [10:50:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance [10:50:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance [10:51:09] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: sync [10:51:10] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: sync [10:51:10] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [10:51:10] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: sync [10:51:10] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync [10:51:10] !log akosiaris@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mw-web: sync [10:51:44] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: sync [10:52:23] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: sync [10:54:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [10:54:18] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, and 2 others: Degraded RAID on cloudcephosd1031 - https://phabricator.wikimedia.org/T364060#9841481 (10dcaro) 05Resolved→03In progress Thank @Jclark-ctr, I don't see the drive on the host (sda) though: ` root@cloudcephosd1031:~# ls -la /dev/sd? br... [10:54:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [10:54:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:54:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:54:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T366123)', diff saved to https://phabricator.wikimedia.org/P63515 and previous config saved to /var/cache/conftool/dbconfig/20240529-105454-marostegui.json [10:55:01] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [10:55:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 21.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:55:26] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync 
[10:55:29] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [10:55:42] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: sync [10:55:43] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync [10:55:43] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: sync [10:55:43] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [10:55:43] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: sync [10:55:43] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: sync [10:55:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [10:55:45] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: sync [10:56:03] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: sync [10:56:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T366123)', diff saved to https://phabricator.wikimedia.org/P63516 and previous config saved to /var/cache/conftool/dbconfig/20240529-105604-marostegui.json [10:56:08] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: sync [10:56:19] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: sync [10:56:32] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: sync [10:56:43] !incidents [10:56:44] 4709 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [10:56:44] 4708 (RESOLVED) [2x] ProbeDown sre (kubemaster1002:6443 probes/custom eqiad) [10:56:44] 4707 (RESOLVED) [2x] ProbeDown sre (kubemaster1001:6443 probes/custom eqiad) 
[10:56:44] 4706 (RESOLVED) ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad) [10:56:44] 4705 (RESOLVED) ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad) [10:56:44] 4703 (RESOLVED) ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad) [10:56:56] what's cache_text about? [10:57:15] akosiaris: I assume unrelated to the k8s thing, looking [10:57:34] at least the linked metric in https://grafana.wikimedia.org/d/000000479/cdn-frontend-traffic?viewPanel=13&orgId=1&from=now-30m&to=now is recovering again [10:57:54] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync [10:57:54] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [10:58:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.702s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:58:21] I have 1 rollback btw [10:58:23] had* [10:58:32] which explains some of the high latencies etc [10:58:51] oh, okay [10:59:51] (03PS1) 10Santiago Faci: device-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037048 (https://phabricator.wikimedia.org/T360524) [10:59:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:00:04] mvolz: #bothumor When your hammer is 
PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T1100). [11:00:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 21.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:00:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [11:02:50] (03CR) 10Brouberol: [C:03+1] device-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037048 (https://phabricator.wikimedia.org/T360524) (owner: 10Santiago Faci) [11:03:13] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: sync [11:03:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.067s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:03:20] (03CR) 10Santiago Faci: [C:03+2] device-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037048 (https://phabricator.wikimedia.org/T360524) (owner: 10Santiago Faci) [11:03:20] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1049.eqiad.wmnet with OS bookworm [11:03:42] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. 
[11:03:43] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [11:03:56] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [11:03:57] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [11:04:12] !log redeploy opentelemetry collector T366094 [11:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:17] T366094: k8s master capacity issues - https://phabricator.wikimedia.org/T366094 [11:04:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:05:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T364299)', diff saved to https://phabricator.wikimedia.org/P63517 and previous config saved to /var/cache/conftool/dbconfig/20240529-110501-marostegui.json [11:05:07] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [11:05:31] (03Merged) 10jenkins-bot: device-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037048 (https://phabricator.wikimedia.org/T360524) (owner: 10Santiago Faci) [11:06:39] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [11:06:42] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. 
[11:07:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:10:08] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: sync [11:10:08] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync [11:10:08] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [11:10:08] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: sync [11:10:08] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: sync [11:10:09] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: sync [11:10:56] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: sync [11:11:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P63518 and previous config saved to /var/cache/conftool/dbconfig/20240529-111112-marostegui.json [11:11:15] !incidents [11:11:15] 4709 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [11:11:16] 4708 (RESOLVED) [2x] ProbeDown sre (kubemaster1002:6443 probes/custom eqiad) [11:11:16] 4707 (RESOLVED) [2x] ProbeDown sre (kubemaster1001:6443 probes/custom eqiad) [11:11:16] 4706 (RESOLVED) ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad) [11:11:16] 4705 (RESOLVED) ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad) [11:11:16] 4703 (RESOLVED) ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 
eqiad) [11:12:03] (03CR) 10Ladsgroup: Use pt-heartbeat for all non-static external clusters (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893835 (https://phabricator.wikimedia.org/T129093) (owner: 10Aaron Schulz) [11:12:04] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: sync [11:14:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [11:14:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [11:14:52] here we go again [11:15:01] yeah, that one was expected [11:15:08] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync [11:15:10] it was my last test, pinky promise [11:15:13] haha [11:15:14] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [11:15:19] thinking of deploying... should I hold off? 
[11:15:38] mvolz: yeah, wait like 5-10 m [11:15:48] gotcha [11:16:07] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: sync [11:17:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.593s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:18:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:18:50] gerrit isn't related to my tests btw [11:18:56] !incidents [11:18:56] 4710 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [11:18:56] 4709 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [11:18:57] 4708 (RESOLVED) [2x] ProbeDown sre (kubemaster1002:6443 probes/custom eqiad) [11:18:57] 4707 (RESOLVED) [2x] ProbeDown sre (kubemaster1001:6443 probes/custom eqiad) [11:18:57] 4706 (RESOLVED) ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad) [11:18:57] 4705 (RESOLVED) ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad) [11:18:57] 4703 (RESOLVED) ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad) [11:19:29] Hi, is Gerrit working for you? 
[11:19:33] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: sync [11:19:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [11:19:45] FIRING: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [11:20:07] Kizule: we have an active alert for gerrit that fired 1minute ago [11:20:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P63519 and previous config saved to /var/cache/conftool/dbconfig/20240529-112009-marostegui.json [11:20:21] akosiaris: I haven't seen it, sorry for asking then. [11:20:36] no worries, just letting you know we are aware of the problem [11:20:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [11:20:45] (03PS1) 10Muehlenhoff: Remove access for mabualruz [puppet] - 10https://gerrit.wikimedia.org/r/1037051 [11:20:51] I can take a look at gerrit [11:20:57] (03PS1) 10Santiago Faci: device-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037052 (https://phabricator.wikimedia.org/T360524) [11:21:11] thanks jelto! 
[11:21:48] FIRING: JobUnavailable: Reduced availability for job gerrit-metrics in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:22:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.297s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:22:51] (03CR) 10Santiago Faci: [C:03+2] device-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037052 (https://phabricator.wikimedia.org/T360524) (owner: 10Santiago Faci) [11:22:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:23:27] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [11:23:29] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. 
[11:23:57] RESOLVED: JobUnavailable: Reduced availability for job gerrit-metrics in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:24:14] (03Merged) 10jenkins-bot: device-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037052 (https://phabricator.wikimedia.org/T360524) (owner: 10Santiago Faci) [11:24:45] RESOLVED: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [11:25:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [11:25:51] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [11:25:53] Gerrit is back for me. :) [11:25:57] Thanks! 
[11:26:17] !log akosiaris@cumin1002 conftool action : set/pooled=yes; selector: service=kubemaster,dc=eqiad,cluster=kubernetes,name=kubemaster1002.eqiad.wmnet [11:26:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P63520 and previous config saved to /var/cache/conftool/dbconfig/20240529-112621-marostegui.json [11:26:45] !log akosiaris@cumin1002 conftool action : set/pooled=yes; selector: service=kubemaster,dc=eqiad,cluster=kubernetes,name=wikikube-ctrl1001.eqiad.wmnet [11:26:48] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [11:27:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.08s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:28:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:31:56] (03PS1) 10Effie Mouzeli: memcached: test extstore on 10 servers [puppet] - 10https://gerrit.wikimedia.org/r/1037053 (https://phabricator.wikimedia.org/T352885) [11:32:32] (03CR) 10Muehlenhoff: [C:03+2] Remove access for mabualruz [puppet] - 10https://gerrit.wikimedia.org/r/1037051 (owner: 10Muehlenhoff) [11:32:50] (03PS2) 10Effie Mouzeli: memcached: test extstore on 10 servers [puppet] - 10https://gerrit.wikimedia.org/r/1037053 (https://phabricator.wikimedia.org/T352885) [11:35:13] yeah gerrit should be back :) [11:35:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P63521 and previous config saved to 
/var/cache/conftool/dbconfig/20240529-113517-marostegui.json [11:35:44] (03PS1) 10Muehlenhoff: Switch db1235 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1037054 (https://phabricator.wikimedia.org/T349619) [11:38:35] !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [11:38:47] (03Abandoned) 10Zabe: filtered_tables: Remove gu_salt [puppet] - 10https://gerrit.wikimedia.org/r/1031608 (https://phabricator.wikimedia.org/T364435) (owner: 10Zabe) [11:39:55] (03PS3) 10Santiago Faci: geo-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036192 (https://phabricator.wikimedia.org/T360525) [11:40:23] !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [11:40:42] (03PS1) 10Kosta Harlan: alertmanager: route Trust and Safety Product team alerts [puppet] - 10https://gerrit.wikimedia.org/r/1037056 (https://phabricator.wikimedia.org/T366165) [11:41:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T366123)', diff saved to https://phabricator.wikimedia.org/P63522 and previous config saved to /var/cache/conftool/dbconfig/20240529-114129-marostegui.json [11:41:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:41:35] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [11:41:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:41:47] !log homer "cr*eqiad*" commit 'adding bgp state for wikikube-ctrl1002' [11:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T366123)', diff saved to https://phabricator.wikimedia.org/P63523 and previous config saved to 
/var/cache/conftool/dbconfig/20240529-114153-marostegui.json [11:42:18] (03PS4) 10Santiago Faci: media-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036194 (https://phabricator.wikimedia.org/T360526) [11:42:43] !log recreate triggers on s7 eqiad db maint db1155:3317 T366167 [11:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:48] T366167: Update centralauth triggers - https://phabricator.wikimedia.org/T366167 [11:42:58] !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [11:44:45] !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [11:44:53] (03CR) 10Brouberol: [C:03+1] media-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036194 (https://phabricator.wikimedia.org/T360526) (owner: 10Santiago Faci) [11:44:59] (03CR) 10Brouberol: [C:03+1] geo-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036192 (https://phabricator.wikimedia.org/T360525) (owner: 10Santiago Faci) [11:45:48] (03CR) 10Kosta Harlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037056 (https://phabricator.wikimedia.org/T366165) (owner: 10Kosta Harlan) [11:45:52] (03CR) 10Dreamy Jazz: [C:03+1] alertmanager: route Trust and Safety Product team alerts [puppet] - 10https://gerrit.wikimedia.org/r/1037056 (https://phabricator.wikimedia.org/T366165) (owner: 10Kosta Harlan) [11:46:11] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Mabualruz out of all services on: 2198 hosts [11:46:14] (03PS5) 10Santiago Faci: page-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523) [11:46:40] (03CR) 10Marostegui: "Sorry, I forgot you also created the patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1031608 (https://phabricator.wikimedia.org/T364435) (owner: 10Zabe) [11:46:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Mabualruz out of all services on: 2198 hosts [11:47:45] (03CR) 10Brouberol: [C:03+1] page-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523) (owner: 10Santiago Faci) [11:49:23] (03CR) 10Santiago Faci: [C:03+2] page-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523) (owner: 10Santiago Faci) [11:50:22] (03Merged) 10jenkins-bot: page-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523) (owner: 10Santiago Faci) [11:50:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T364299)', diff saved to https://phabricator.wikimedia.org/P63524 and previous config saved to /var/cache/conftool/dbconfig/20240529-115025-marostegui.json [11:50:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance [11:50:32] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [11:50:35] (03CR) 10Muehlenhoff: [C:03+2] Switch db1235 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1037054 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:50:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance [11:50:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T364299)', diff saved to https://phabricator.wikimedia.org/P63525 and previous config saved to /var/cache/conftool/dbconfig/20240529-115051-marostegui.json [11:53:54] !log klausman@cumin2002 START - Cookbook 
sre.hosts.reimage for host ml-staging2002.codfw.wmnet with OS bookworm [11:54:46] (03PS2) 10Hashar: contint: enable zuul-merger daemon on contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/1036762 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [11:55:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1235.eqiad.wmnet [12:02:09] (03CR) 10Effie Mouzeli: [C:03+2] memcached: test extstore on 10 servers [puppet] - 10https://gerrit.wikimedia.org/r/1037053 (https://phabricator.wikimedia.org/T352885) (owner: 10Effie Mouzeli) [12:04:24] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2048.codfw.wmnet with OS bookworm [12:05:29] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1048.eqiad.wmnet with OS bookworm [12:06:56] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/page-analytics: apply [12:07:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T366123)', diff saved to https://phabricator.wikimedia.org/P63526 and previous config saved to /var/cache/conftool/dbconfig/20240529-120730-marostegui.json [12:07:37] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [12:07:39] (03CR) 10Esanders: [C:03+1] "I don't have +2 in this repo, but LGTM." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1034860 (https://phabricator.wikimedia.org/T366093) (owner: 10Mvolz) [12:07:50] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [12:08:52] !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [12:10:20] !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [12:11:04] (03PS1) 10Slyngshede: IDP: Failover for 6.6.15 upgrade [dns] - 10https://gerrit.wikimedia.org/r/1037061 (https://phabricator.wikimedia.org/T366140) [12:11:13] !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [12:12:49] !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [12:13:52] (03CR) 10Santiago Faci: [C:03+2] geo-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036192 (https://phabricator.wikimedia.org/T360525) (owner: 10Santiago Faci) [12:14:37] (03Merged) 10jenkins-bot: geo-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036192 (https://phabricator.wikimedia.org/T360525) (owner: 10Santiago Faci) [12:14:41] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging2002.codfw.wmnet with reason: host reimage [12:15:34] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [12:16:20] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [12:16:41] (03CR) 10Filippo Giunchedi: [C:03+2] alertmanager: route Trust and Safety Product team alerts [puppet] - 10https://gerrit.wikimedia.org/r/1037056 (https://phabricator.wikimedia.org/T366165) (owner: 10Kosta Harlan) [12:17:08] !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply [12:17:12] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging2002.codfw.wmnet 
with reason: host reimage [12:18:15] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1037061 (https://phabricator.wikimedia.org/T366140) (owner: 10Slyngshede) [12:18:28] (03CR) 10Slyngshede: [C:03+2] IDP: Failover for 6.6.15 upgrade [dns] - 10https://gerrit.wikimedia.org/r/1037061 (https://phabricator.wikimedia.org/T366140) (owner: 10Slyngshede) [12:18:57] !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply [12:19:02] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1048.eqiad.wmnet with reason: host reimage [12:19:04] (03CR) 10Filippo Giunchedi: [C:03+1] role::thanos::frontend: move all envoy TLS certs to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036643 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [12:19:40] !log Failover idp.wikimedia.org for CAS upgrade to 6.6.15 [12:19:41] !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [12:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:54] (03PS1) 10Ladsgroup: admin: Remove home files for several departed staff [puppet] - 10https://gerrit.wikimedia.org/r/1037062 [12:21:27] !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [12:22:20] (03CR) 10Santiago Faci: [C:03+2] media-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036194 (https://phabricator.wikimedia.org/T360526) (owner: 10Santiago Faci) [12:22:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P63527 and previous config saved to /var/cache/conftool/dbconfig/20240529-122239-marostegui.json [12:22:45] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2048.codfw.wmnet with reason: host reimage [12:22:49] (03Abandoned) 10Ladsgroup: admin: Remove home files for several departed staff [puppet] - 
10https://gerrit.wikimedia.org/r/1037062 (owner: 10Ladsgroup) [12:23:14] (03Merged) 10jenkins-bot: media-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036194 (https://phabricator.wikimedia.org/T360526) (owner: 10Santiago Faci) [12:23:42] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:24:22] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1048.eqiad.wmnet with reason: host reimage [12:25:07] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply [12:26:03] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [12:28:02] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2048.codfw.wmnet with reason: host reimage [12:29:11] !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/media-analytics: apply [12:30:40] !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [12:34:49] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2002.codfw.wmnet with OS bookworm [12:35:34] !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [12:36:36] (03CR) 10Elukey: [V:03+1 C:03+2] role::thanos::frontend: move all envoy TLS certs to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036643 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [12:37:09] !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [12:37:42] upcoming backport deployers: I have to drop kid off at daycare, may be back slightly after the hour. cc RoanKattouw Lucas_WMDE etc. 
[12:37:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P63528 and previous config saved to /var/cache/conftool/dbconfig/20240529-123746-marostegui.json [12:38:04] (03PS1) 10Muehlenhoff: Remove skel files for former WMF staff members [puppet] - 10https://gerrit.wikimedia.org/r/1037064 [12:39:17] !log move thanos-fe100[3,4] and thanos-fe2* to PKI TLS certs (envoy, backends for thanos-swift.discovery.wmnet) - T344324 [12:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:22] T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324 [12:39:33] (03CR) 10Jelto: [C:03+2] contint: enable zuul-merger daemon on contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/1036762 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [12:40:27] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1048.eqiad.wmnet with OS bookworm [12:42:46] !log recreate triggers on s7 codfw db maint db1155:3317 T366167 [12:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:51] T366167: Update centralauth triggers - https://phabricator.wikimedia.org/T366167 [12:42:54] !log recreate triggers on s7 codfw db maint db2187:3317 T366167 [12:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1196.eqiad.wmnet with reason: reimage [12:43:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1196.eqiad.wmnet with reason: reimage [12:43:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1196 T364290', diff saved to https://phabricator.wikimedia.org/P63529 and previous config saved to /var/cache/conftool/dbconfig/20240529-124352-arnaudb.json [12:43:58] T364290: Upgrade s1 to 
MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [12:45:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db[1154,1196].eqiad.wmnet with reason: reimage db1196 [12:45:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db[1154,1196].eqiad.wmnet with reason: reimage db1196 [12:45:25] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2048.codfw.wmnet with OS bookworm [12:46:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1196.eqiad.wmnet with OS bookworm [12:49:07] (03Abandoned) 10Ssingh: mw-api-ext: Add 20 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1022062 (owner: 10Clément Goubert) [12:49:29] (03Abandoned) 10Ssingh: Disable Enterprise bypassing CDN rate limits [puppet] - 10https://gerrit.wikimedia.org/r/1022092 (owner: 10CDanis) [12:49:56] (03CR) 10Ssingh: [C:03+1] "Keith: just clearing up the backlog, do we still need to merge this? Thanks!" [dns] - 10https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [12:50:33] (03PS1) 10Hashar: gerrit: enable change.diff3ConflictView [puppet] - 10https://gerrit.wikimedia.org/r/1037065 (https://phabricator.wikimedia.org/T359821) [12:50:48] (03CR) 10Filippo Giunchedi: [C:03+1] "Tested and LGTM, thank you!
Adding other o11y folks as heads up" [puppet] - 10https://gerrit.wikimedia.org/r/1036763 (owner: 10JHathaway) [12:52:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T366123)', diff saved to https://phabricator.wikimedia.org/P63530 and previous config saved to /var/cache/conftool/dbconfig/20240529-125255-marostegui.json [12:52:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:53:01] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [12:53:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:53:35] (03CR) 10Brouberol: [C:03+1] "Perfect!" [puppet] - 10https://gerrit.wikimedia.org/r/1034961 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [12:53:42] (03PS2) 10Hashar: gerrit: enable change.diff3ConflictView [puppet] - 10https://gerrit.wikimedia.org/r/1037065 (https://phabricator.wikimedia.org/T359821) [12:53:56] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::core [12:54:06] (03CR) 10Hashar: "I will upgrade Gerrit to 3.9.x on Monday and we can apply that setting ahead of time to have the feature enabled as we upgrade. 
`diff3` is" [puppet] - 10https://gerrit.wikimedia.org/r/1037065 (https://phabricator.wikimedia.org/T359821) (owner: 10Hashar) [12:54:54] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1047.eqiad.wmnet with OS bookworm [12:54:58] (03PS1) 10Filippo Giunchedi: rsyslog: notify receiver on cert change [puppet] - 10https://gerrit.wikimedia.org/r/1037066 [12:55:00] (03PS1) 10Muehlenhoff: Switch mariadb::core to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1037067 (https://phabricator.wikimedia.org/T349619) [12:55:05] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2047.codfw.wmnet with OS bookworm [12:55:19] (03CR) 10CI reject: [V:04-1] rsyslog: notify receiver on cert change [puppet] - 10https://gerrit.wikimedia.org/r/1037066 (owner: 10Filippo Giunchedi) [12:56:04] (03CR) 10Muehlenhoff: [C:03+2] Switch mariadb::core to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1037067 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:57:06] (03PS5) 10Ayounsi: Add SameSite=Lax attribute to NetworkProbeLimit cookie [puppet] - 10https://gerrit.wikimedia.org/r/989457 (https://phabricator.wikimedia.org/T342624) [12:57:41] I'm back! [12:57:46] (03CR) 10CDanis: [C:03+2] Add SameSite=Lax attribute to NetworkProbeLimit cookie [puppet] - 10https://gerrit.wikimedia.org/r/989457 (https://phabricator.wikimedia.org/T342624) (owner: 10Ayounsi) [12:57:46] (03CR) 10Brouberol: [C:04-1] "While these services will need to be removed from the service catalog, thus is too soon. You should follow the instructions at https://wik" [puppet] - 10https://gerrit.wikimedia.org/r/1036994 (https://phabricator.wikimedia.org/T366137) (owner: 10Stevemunene) [12:59:30] (03PS2) 10Filippo Giunchedi: rsyslog: notify receiver on cert change [puppet] - 10https://gerrit.wikimedia.org/r/1037066 [12:59:42] 06SRE, 06serviceops, 13Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9841984 (10CDanis) >>! 
In T366094#9841558, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-sre), href=https://sal.toolforge.org/log/OwwWxI8BGiVuUzOd3n4x} [2024-05-29T11:23:04Z]... [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T1300). [13:00:05] ottomata: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] o/ [13:00:57] Hi, its been a while since I've deployed config, and I only really knew how to do one file at a time. [13:01:02] https://deploy-commands.toolforge.org/bacc/985023 looks new(ish) to me [13:01:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1196.eqiad.wmnet with reason: host reimage [13:01:19] I can do it if it is really that easy :) [13:01:50] ottomata: `scap backport` is really that easy, yes :) [13:02:01] okay, i'm the only one in the window, so I am proceeding [13:02:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::core [13:03:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by otto@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [13:04:30] (03Merged) 10jenkins-bot: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [13:04:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1196.eqiad.wmnet with reason: host reimage [13:05:29] !log otto@deploy1002 Started scap: Backport for [[gerrit:985023|Create 
eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (T353817 T323828)]] [13:05:34] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9841998 (10MoritzMuehlenhoff) [13:05:35] T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 [13:05:35] T323828: Update Pingback to use the Event Platform - https://phabricator.wikimedia.org/T323828 [13:07:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T364299)', diff saved to https://phabricator.wikimedia.org/P63531 and previous config saved to /var/cache/conftool/dbconfig/20240529-130713-marostegui.json [13:07:20] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [13:08:21] !log otto@deploy1002 otto: Backport for [[gerrit:985023|Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (T353817 T323828)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:08:22] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1047.eqiad.wmnet with reason: host reimage [13:10:58] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1047.eqiad.wmnet with reason: host reimage [13:11:26] (03CR) 10Vgutierrez: [C:03+1] benthos:cache: switch to rfc5424 format [puppet] - 10https://gerrit.wikimedia.org/r/1036711 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [13:11:29] !log installing apache2 security updates [13:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:22] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2047.codfw.wmnet with reason: host reimage [13:14:39] !log otto@deploy1002 otto: Continuing with sync [13:15:06] (03PS1) 10Marostegui: es*.yaml: Clean up puppet7 
lines [puppet] - 10https://gerrit.wikimedia.org/r/1037069 [13:16:03] !log temporary disabling puppet on A:cp to rollout https://gerrit.wikimedia.org/r/c/operations/puppet/+/1036711 (T365718) [13:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:08] T365718: Switch HAProxy/Benthos to rfc5424 - https://phabricator.wikimedia.org/T365718 [13:16:41] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2047.codfw.wmnet with reason: host reimage [13:17:26] (03CR) 10Fabfur: [V:03+1 C:03+2] benthos:cache: switch to rfc5424 format [puppet] - 10https://gerrit.wikimedia.org/r/1036711 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [13:21:09] o/ [13:21:20] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9842054 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [13:21:58] ottomata: yeah, `scap backport` should be all you need :) [13:22:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P63532 and previous config saved to /var/cache/conftool/dbconfig/20240529-132221-marostegui.json [13:23:54] !log otto@deploy1002 Finished scap: Backport for [[gerrit:985023|Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (T353817 T323828)]] (duration: 18m 25s) [13:24:04] T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 [13:24:05] T323828: Update Pingback to use the Event Platform - https://phabricator.wikimedia.org/T323828 [13:25:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [13:25:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [13:25:55] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T366123)', diff saved to https://phabricator.wikimedia.org/P63533 and previous config saved to /var/cache/conftool/dbconfig/20240529-132553-marostegui.json [13:26:00] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [13:26:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1196.eqiad.wmnet with OS bookworm [13:27:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63534 and previous config saved to /var/cache/conftool/dbconfig/20240529-132726-arnaudb.json [13:27:38] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1047.eqiad.wmnet with OS bookworm [13:28:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1169 T364290', diff saved to https://phabricator.wikimedia.org/P63535 and previous config saved to /var/cache/conftool/dbconfig/20240529-132818-arnaudb.json [13:28:24] T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [13:28:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1169.eqiad.wmnet with reason: reimage [13:28:53] (03CR) 10Bking: [C:03+2] dse-k8s: add new airflow service to k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1034961 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [13:29:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1169.eqiad.wmnet with reason: reimage [13:30:03] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1169.eqiad.wmnet with OS bookworm [13:33:00] 06SRE, 10SRE-swift-storage: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9842102 (10elukey) 05Stalled→03Resolved a:03elukey Thanos-Swift is running with PKI TLS certs, so now all Swift clusters 
use PKI. The puppet code seems already clean... [13:33:08] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9842107 (10elukey) [13:34:14] (03CR) 10Elukey: [C:03+1] maps: Switch kartotherian on maps2007 to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [13:35:17] (03PS1) 10Bking: Revert "dse-k8s: add new airflow service to k8s cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1037014 [13:35:52] (03CR) 10Brouberol: [C:03+1] Revert "dse-k8s: add new airflow service to k8s cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1037014 (owner: 10Bking) [13:36:18] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2047.codfw.wmnet with OS bookworm [13:37:00] (03PS1) 10Marostegui: redact_sanitarium.sh: Update sanitarium hosts [puppet] - 10https://gerrit.wikimedia.org/r/1037072 [13:37:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P63536 and previous config saved to /var/cache/conftool/dbconfig/20240529-133729-marostegui.json [13:38:14] (03PS1) 10NMW03: Enable wmgUseSandboxLink for Swahili Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037073 (https://phabricator.wikimedia.org/T365970) [13:38:34] (03PS4) 10Marostegui: mariadb: Promote db1192 to master [puppet] - 10https://gerrit.wikimedia.org/r/1035315 (https://phabricator.wikimedia.org/T364541) [13:38:46] (03PS1) 10Muehlenhoff: Remove ms-fe certs [puppet] - 10https://gerrit.wikimedia.org/r/1037074 (https://phabricator.wikimedia.org/T357750) [13:39:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037074 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [13:42:11] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1037066 (owner: 10Filippo Giunchedi) [13:42:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63537 and previous config saved to /var/cache/conftool/dbconfig/20240529-134232-arnaudb.json [13:42:43] (03PS1) 10Jgreen: Add an icinga/nsca collector for Fundraising kafka client cert expire check. [puppet] - 10https://gerrit.wikimedia.org/r/1037075 (https://phabricator.wikimedia.org/T360779) [13:42:56] thanks cdanis that was pretty easy :) [13:43:13] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1036763 (owner: 10JHathaway) [13:43:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1169.eqiad.wmnet with reason: host reimage [13:45:43] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:46:01] (03CR) 10Filippo Giunchedi: [C:03+2] rsyslog: notify receiver on cert change [puppet] - 10https://gerrit.wikimedia.org/r/1037066 (owner: 10Filippo Giunchedi) [13:46:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: host reimage [13:49:20] (03PS1) 10Bking: dse-k8s: add new airflow service to k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1037077 (https://phabricator.wikimedia.org/T363001) [13:51:20] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037077 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [13:52:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T364299)', diff saved to https://phabricator.wikimedia.org/P63538 and previous config saved to 
/var/cache/conftool/dbconfig/20240529-135237-marostegui.json [13:52:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [13:52:43] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [13:52:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [13:53:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T364299)', diff saved to https://phabricator.wikimedia.org/P63539 and previous config saved to /var/cache/conftool/dbconfig/20240529-135300-marostegui.json [13:54:01] (03CR) 10MVernon: [C:03+1] "Looks reasonable to me (assuming PCC doesn't lie!)" [puppet] - 10https://gerrit.wikimedia.org/r/1037074 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [13:55:19] !log label wikikube-ctrl1002 as master [13:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:56] !log jiji@cumin1002 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-ctrl1002.eqiad.wmnet [13:57:01] question for the assembled deployers here. I’m running a maintenance script (T315510, latest comments) which is expected to take about a week longer to finish [13:57:02] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [13:57:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T364069)', diff saved to https://phabricator.wikimedia.org/P63540 and previous config saved to /var/cache/conftool/dbconfig/20240529-135706-marostegui.json [13:57:13] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [13:57:19] but I’m on holiday starting tomorrow, so I won’t be able to report whether the script finished successfully or not [13:57:28] does that sound okay? or should I stop the script now and hand it over to someone else? 
[13:57:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63541 and previous config saved to /var/cache/conftool/dbconfig/20240529-135738-arnaudb.json [13:58:35] (03CR) 10Brouberol: [C:03+1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1037077 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [13:58:42] FIRING: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:59:00] (03CR) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [13:59:02] wat [13:59:49] (03PS2) 10CDanis: otelcol: add three new k8s ctrl IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036708 (https://phabricator.wikimedia.org/T366094) [13:59:49] (03PS1) 10CDanis: otelcol: disable logs & metrics pipelines [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037083 (https://phabricator.wikimedia.org/T366094) [14:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T1400) [14:00:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842253 (10VRiley-WMF) Worked with Dell on kafka-main1009, we were able to replace some of the parts (Power Interface Board, and Right Control Panel) Which go... 
[14:01:35] 06SRE, 06serviceops, 13Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9842257 (10akosiaris) I 've gone ahead and created the following dashboard today [T366094](https://grafana-rw.wikimedia.org/d/d304d897-54ea-4062-a504-6f2567ed7dba/t366094?orgId=1&from=1716974133223... [14:02:50] (03CR) 10Bking: [C:03+2] dse-k8s: add new airflow service to k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1037077 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [14:04:30] (03CR) 10Stevemunene: [C:03+2] provision datahub service records [dns] - 10https://gerrit.wikimedia.org/r/1035734 (https://phabricator.wikimedia.org/T363299) (owner: 10Stevemunene) [14:04:42] (03PS3) 10Stevemunene: provision datahub service records [dns] - 10https://gerrit.wikimedia.org/r/1035734 (https://phabricator.wikimedia.org/T363299) [14:04:57] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, and 2 others: Degraded RAID on cloudcephosd1031 - https://phabricator.wikimedia.org/T364060#9842272 (10Jclark-ctr) @dcaro the drive was listed as ready in idrac Converted to non-raid should be visible now [14:05:13] (03CR) 10Stevemunene: [V:03+2 C:03+2] provision datahub service records [dns] - 10https://gerrit.wikimedia.org/r/1035734 (https://phabricator.wikimedia.org/T363299) (owner: 10Stevemunene) [14:07:36] RECOVERY - Disk space on backup1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup1007&var-datasource=eqiad+prometheus/ops [14:08:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1169.eqiad.wmnet with OS bookworm [14:09:13] (03CR) 10Gehel: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1032772 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:09:15] 06SRE, 06serviceops, 13Patch-For-Review: k8s master capacity issues - 
https://phabricator.wikimedia.org/T366094#9842288 (10akosiaris) >>! In T366094#9840665, @CDanis wrote: Thanks for writing down all of this. > ===== This was a capacity crunch triggered by expensive operations > * For the past few months... [14:09:18] (03CR) 10Bking: [C:03+2] dse-k8s: add airflow-analytics-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035015 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [14:09:56] (03CR) 10Brouberol: [V:03+1 C:03+2] global_config: register IP/port for the datahubsearch opensearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/1032772 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:10:07] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-05-13-145903 to 2024-05-23-164021 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037084 (https://phabricator.wikimedia.org/T337589) [14:10:17] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-05-13-145650 to 2024-05-28-185827 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037085 (https://phabricator.wikimedia.org/T348370) [14:11:09] (03CR) 10Stevemunene: [C:03+2] trafficserver: add datahub redirects to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1035731 (https://phabricator.wikimedia.org/T365668) (owner: 10Stevemunene) [14:11:11] (03CR) 10Hnowlan: [C:03+1] maps: Switch kartotherian on maps2007 to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [14:11:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63542 and previous config saved to /var/cache/conftool/dbconfig/20240529-141114-arnaudb.json [14:11:51] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-05-13-145903 to 2024-05-23-164021 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037084 
(https://phabricator.wikimedia.org/T337589) (owner: 10Jforrester) [14:12:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P63543 and previous config saved to /var/cache/conftool/dbconfig/20240529-141213-marostegui.json [14:12:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63544 and previous config saved to /var/cache/conftool/dbconfig/20240529-141244-arnaudb.json [14:12:47] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-05-13-145903 to 2024-05-23-164021 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037084 (https://phabricator.wikimedia.org/T337589) (owner: 10Jforrester) [14:13:37] 06SRE, 06serviceops, 13Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9842327 (10akosiaris) >>! In T366094#9841984, @CDanis wrote: >>>! In T366094#9841558, @Stashbot wrote: >> {nav icon=file, name=Mentioned in SAL (#wikimedia-sre), href=https://sal.toolforge.org/log/... 
[14:13:54] (03PS10) 10Brennen Bearnes: gitlab-settings: add timer for configure-projects [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) [14:14:22] (03CR) 10Brennen Bearnes: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [14:14:30] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:15:03] I'm going to deploy admin_ng to deploy a small external-services addition [14:15:08] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:15:14] 07Puppet, 10Wikidata, 06Wikidata Dev Team, 10wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9842333 (10Lucas_Werkmeister_WMDE) I think we can resolve both. [14:15:42] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:16:33] !log brouberol@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:16:52] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:16:56] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:16:58] !log brouberol@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:17:21] !log brouberol@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:18:02] (03PS1) 10Fabfur: Revert "benthos:cache: switch to rfc5424 format" [puppet] - 10https://gerrit.wikimedia.org/r/1037015 [14:18:03] !log brouberol@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[14:18:15] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:19:06] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2024-05-13-145650 to 2024-05-28-185827 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037085 (https://phabricator.wikimedia.org/T348370) (owner: 10Jforrester) [14:19:22] !log brouberol@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:19:49] !log brouberol@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:20:36] !log brouberol@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:20:39] (03CR) 10Alexandros Kosiaris: [C:03+1] otelcol: disable logs & metrics pipelines [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037083 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis) [14:21:12] !log brouberol@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:21:20] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-05-13-145650 to 2024-05-28-185827 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037085 (https://phabricator.wikimedia.org/T348370) (owner: 10Jforrester) [14:22:04] !log brouberol@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [14:22:13] (03CR) 10CDanis: [C:03+2] otelcol: add three new k8s ctrl IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036708 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis) [14:22:17] (03CR) 10CDanis: [C:03+2] otelcol: disable logs & metrics pipelines [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037083 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis) [14:22:25] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:22:46] !log brouberol@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[14:22:53] (03CR) 10Fabfur: [C:03+2] Revert "benthos:cache: switch to rfc5424 format" [puppet] - 10https://gerrit.wikimedia.org/r/1037015 (owner: 10Fabfur) [14:23:58] klausman elukey: Hi! There's an istio-related pending admin-ng change on ml-serve-{eqiad,codfw}. Is that safe to deploy? [14:24:03] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:24:28] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:24:32] !log brouberol@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [14:24:55] !log brouberol@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:25:15] (03Merged) 10jenkins-bot: otelcol: add three new k8s ctrl IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036708 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis) [14:25:17] (03Merged) 10jenkins-bot: otelcol: disable logs & metrics pipelines [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037083 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis) [14:25:55] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:26:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T366123)', diff saved to https://phabricator.wikimedia.org/P63545 and previous config saved to /var/cache/conftool/dbconfig/20240529-142619-marostegui.json [14:26:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63546 and previous config saved to /var/cache/conftool/dbconfig/20240529-142627-arnaudb.json [14:26:30] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [14:26:37] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1046.eqiad.wmnet with OS bookworm [14:26:40] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[14:26:41] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2046.codfw.wmnet with OS bookworm [14:26:49] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:26:55] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:27:14] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:27:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P63547 and previous config saved to /var/cache/conftool/dbconfig/20240529-142721-marostegui.json [14:27:30] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye [14:27:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842432 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye [14:27:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63548 and previous config saved to /var/cache/conftool/dbconfig/20240529-142750-arnaudb.json [14:28:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1163 T364290', diff saved to https://phabricator.wikimedia.org/P63549 and previous config saved to /var/cache/conftool/dbconfig/20240529-142830-arnaudb.json [14:28:36] T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [14:28:43] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:28:48] brouberol: in a meeting, will get back to you in a bit [14:28:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1163.eqiad.wmnet with reason: reimage [14:29:02] !log arnaudb@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 3:00:00 on db1163.eqiad.wmnet with reason: reimage [14:30:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1163.eqiad.wmnet with OS bookworm [14:33:00] (03PS1) 10CDobbins: purged: roll out use_pki flag to all of drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1037089 (https://phabricator.wikimedia.org/T360506) [14:33:04] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:33:26] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:33:49] (03PS11) 10Brennen Bearnes: gitlab-settings: add timer for configure-projects [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) [14:35:40] (03CR) 10Brennen Bearnes: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [14:36:32] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2681/console" [puppet] - 10https://gerrit.wikimedia.org/r/1037089 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:37:32] !log enabled puppet on A:cp as https://gerrit.wikimedia.org/r/c/operations/puppet/+/1036711 has been reverted (not applied anywhere but cp4037) (T365718) [14:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:37] T365718: Switch HAProxy/Benthos to rfc5424 - https://phabricator.wikimedia.org/T365718 [14:38:26] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 30 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:38:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:54] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1046.eqiad.wmnet with reason: host reimage [14:40:38] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2682/co" [puppet] - 10https://gerrit.wikimedia.org/r/1037089 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:41:08] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [14:41:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P63550 and previous config saved to /var/cache/conftool/dbconfig/20240529-144129-marostegui.json [14:41:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63551 and previous config saved to /var/cache/conftool/dbconfig/20240529-144140-arnaudb.json [14:42:21] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1046.eqiad.wmnet with reason: host reimage [14:42:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T364069)', diff saved to https://phabricator.wikimedia.org/P63552 and previous config saved to /var/cache/conftool/dbconfig/20240529-144229-marostegui.json [14:42:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:42:36] T364069: Rebuild pagelinks tables - 
https://phabricator.wikimedia.org/T364069 [14:42:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:43:20] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Discovery IPs for apus service - mvernon@cumin2002" [14:43:38] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1163.eqiad.wmnet with reason: host reimage [14:43:45] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [14:43:52] (03CR) 10CDobbins: [V:03+1] "I believe this is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/1037089 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:44:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Discovery IPs for apus service - mvernon@cumin2002" [14:44:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:44:52] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2046.codfw.wmnet with reason: host reimage [14:45:25] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [14:45:38] (03CR) 10Pppery: [C:03+1] "Probably would have made more sense to do this on the translatewiki.net side rather than via Gerrit, and I would hold off merging this for" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1037026 (https://phabricator.wikimedia.org/T365853) (owner: 10Aklapper) [14:45:45] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [14:46:29] (03CR) 10Ladsgroup: redact_sanitarium.sh: Update sanitarium hosts (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037072 (owner: 10Marostegui) [14:47:00] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [14:47:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1163.eqiad.wmnet with reason: host reimage [14:47:21] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [14:47:33] (03CR) 10Ssingh: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1037089 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:48:26] (03CR) 10Marostegui: redact_sanitarium.sh: Update sanitarium hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037072 (owner: 10Marostegui) [14:48:30] (03PS2) 10Marostegui: redact_sanitarium.sh: Update sanitarium hosts [puppet] - 10https://gerrit.wikimedia.org/r/1037072 [14:48:49] (03CR) 10Marostegui: redact_sanitarium.sh: Update sanitarium hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037072 (owner: 10Marostegui) [14:48:59] (03CR) 10Aklapper: "I'm just very clueless about the process so if there's something on the twn side instead I'm cool with that too." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1037026 (https://phabricator.wikimedia.org/T365853) (owner: 10Aklapper) [14:49:11] brouberol: yes, that change can be pushed (or I can do it, if you prefer) [14:49:29] if you could, that'd be great! thanks [14:49:34] on it [14:49:44] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2046.codfw.wmnet with reason: host reimage [14:49:48] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[14:49:59] (03CR) 10Ladsgroup: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1037072 (owner: 10Marostegui) [14:50:10] (03CR) 10Ladsgroup: [C:03+1] redact_sanitarium.sh: Update sanitarium hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037072 (owner: 10Marostegui) [14:50:32] (03CR) 10Marostegui: [C:03+2] redact_sanitarium.sh: Update sanitarium hosts [puppet] - 10https://gerrit.wikimedia.org/r/1037072 (owner: 10Marostegui) [14:50:46] (03PS1) 10MVernon: Add apus svc records in codfw and eqiad [dns] - 10https://gerrit.wikimedia.org/r/1037095 (https://phabricator.wikimedia.org/T279621) [14:50:49] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [14:50:52] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [14:52:05] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [14:52:46] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [14:53:02] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041'] [14:53:41] (03PS1) 10Marostegui: pc1014: Remove puppet7 entries [puppet] - 10https://gerrit.wikimedia.org/r/1037096 [14:53:52] brouberol: all done [14:53:59] appreciated! 
[14:54:36] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [14:54:37] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: sync [14:55:26] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 36 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:55:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:56:13] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: sync [14:56:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P63553 and previous config saved to /var/cache/conftool/dbconfig/20240529-145637-marostegui.json [14:56:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63554 and previous config saved to /var/cache/conftool/dbconfig/20240529-145646-arnaudb.json [14:58:06] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1046.eqiad.wmnet with OS bookworm [15:00:28] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:04:51] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041'] [15:05:01] !log robh@cumin2002 END (ERROR) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudvirt1041'] [15:05:24] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1041'] [15:05:28] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041'] [15:05:50] (03PS12) 10Brennen Bearnes: gitlab-settings: add timer for configure-projects [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) [15:06:00] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudvirt1041'] [15:06:29] (03PS2) 10MVernon: Add apus svc records in codfw and eqiad [dns] - 10https://gerrit.wikimedia.org/r/1037095 (https://phabricator.wikimedia.org/T279621) [15:06:46] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [15:06:57] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 10Release-Engineering-Team (Priority Backlog 📥): Correct IDP login page Privacy Policy - https://phabricator.wikimedia.org/T350129#9842517 (10Pppery) Is there an estimated timeframe for when that will be? 
[15:07:01] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [15:07:22] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1037069 (owner: 10Marostegui) [15:07:22] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2046.codfw.wmnet with OS bookworm [15:07:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [15:07:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T364299)', diff saved to https://phabricator.wikimedia.org/P63555 and previous config saved to /var/cache/conftool/dbconfig/20240529-150757-marostegui.json [15:08:03] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [15:08:05] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [15:08:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1163.eqiad.wmnet with OS bookworm [15:09:08] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1009.eqiad.wmnet with OS bullseye [15:09:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye [15:09:44] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 10Release-Engineering-Team (Priority Backlog 📥): Correct IDP login page Privacy Policy - https://phabricator.wikimedia.org/T350129#9842536 (10MoritzMuehlenhoff) 05Open→03Resolved It's already live, we updated CAS two hours ago. If you log into idp.wik... 
[15:11:02] (03PS5) 10Elukey: redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) [15:11:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T366123)', diff saved to https://phabricator.wikimedia.org/P63556 and previous config saved to /var/cache/conftool/dbconfig/20240529-151145-marostegui.json [15:11:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance [15:11:51] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [15:11:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63557 and previous config saved to /var/cache/conftool/dbconfig/20240529-151152-arnaudb.json [15:12:06] (03CR) 10Elukey: redfish: expand support for Supermicro hosts (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:12:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance [15:12:16] (03PS2) 10JHathaway: rsyslog: include slashes in program names [puppet] - 10https://gerrit.wikimedia.org/r/1036763 [15:12:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T366123)', diff saved to https://phabricator.wikimedia.org/P63558 and previous config saved to /var/cache/conftool/dbconfig/20240529-151219-marostegui.json [15:12:55] (03CR) 10Marostegui: [C:03+2] es*.yaml: Clean up puppet7 lines [puppet] - 10https://gerrit.wikimedia.org/r/1037069 (owner: 10Marostegui) [15:13:24] (03CR) 10Pppery: [C:03+1] "The change to make would be to edit https://translatewiki.net/wiki/Phabricator:arcanist-core-3a7b8e3fb7aa607f/qqq, and ditto for the other" [phabricator/translations] 
(wmf/stable) - 10https://gerrit.wikimedia.org/r/1037026 (https://phabricator.wikimedia.org/T365853) (owner: 10Aklapper) [15:13:28] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 48 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:14:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T366123)', diff saved to https://phabricator.wikimedia.org/P63559 and previous config saved to /var/cache/conftool/dbconfig/20240529-151430-marostegui.json [15:14:33] jouncebot now [15:14:33] No deployments scheduled for the next 1 hour(s) and 45 minute(s) [15:14:45] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036763 (owner: 10JHathaway) [15:14:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63560 and previous config saved to /var/cache/conftool/dbconfig/20240529-151455-arnaudb.json [15:15:01] (03CR) 10Pppery: [C:03+1] "(Translation changes made via Gerrit do work - they cause FuzzyBot to update the page on translatewiki. But it would be cleaner IMO to do " [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1037026 (https://phabricator.wikimedia.org/T365853) (owner: 10Aklapper) [15:15:35] (03CR) 10Ssingh: [C:03+1] Add apus svc records in codfw and eqiad [dns] - 10https://gerrit.wikimedia.org/r/1037095 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:16:01] jan_drewniak: Are you around to test https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1036665 if I deploy it? 
[15:16:02] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 10Release-Engineering-Team (Priority Backlog 📥): Correct IDP login page Privacy Policy - https://phabricator.wikimedia.org/T350129#9842612 (10Pppery) Thanks. [15:16:50] (03CR) 10Elukey: [C:03+1] Remove ms-fe certs [puppet] - 10https://gerrit.wikimedia.org/r/1037074 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [15:16:53] dancy: hi! I'm around, but it turns out there are more issues with approach, we're just debating what to do now. [15:17:07] ok. I'll wait for word from you. [15:17:38] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041'] [15:17:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036750 (https://phabricator.wikimedia.org/T359643) (owner: 10Ahmon Dancy) [15:17:59] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1041'] [15:18:02] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [15:18:02] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: sync [15:18:03] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: sync [15:18:03] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-parsoid: sync [15:18:03] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: sync [15:18:03] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: sync [15:18:11] (03CR) 10CI reject: [V:04-1] redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:18:19] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [15:18:21] !log cdanis@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mw-wikifunctions: sync [15:18:48] (03Merged) 10jenkins-bot: Remove the php symlink (v2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036750 (https://phabricator.wikimedia.org/T359643) (owner: 10Ahmon Dancy) [15:19:18] (03CR) 10JHathaway: [C:03+2] rsyslog: include slashes in program names [puppet] - 10https://gerrit.wikimedia.org/r/1036763 (owner: 10JHathaway) [15:19:20] !log dancy@deploy1002 Started scap: Backport for [[gerrit:1036750|Remove the php symlink (v2) (T359643)]] [15:19:29] T359643: Get rid of the /srv/mediawiki/php symbolic link - https://phabricator.wikimedia.org/T359643 [15:19:32] dancy: Lovely work removing the symlink! [15:19:40] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: sync [15:19:42] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: sync [15:19:43] It was the bane of my deploy-life. [15:19:51] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: sync [15:20:00] James_F: Thanks! It was always confusing/annoying to me. [15:20:13] Back in the day all the mwscript calls would run through it. [15:20:18] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: sync [15:20:48] So every time you synced (this was pre-k8s) you could, but also might not, start running the "wrong" version of the code in some places, and break stuff. Or not! Fun times. 
[15:21:38] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041'] [15:22:06] !log dancy@deploy1002 dancy: Backport for [[gerrit:1036750|Remove the php symlink (v2) (T359643)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:22:10] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1041'] [15:23:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P63561 and previous config saved to /var/cache/conftool/dbconfig/20240529-152305-marostegui.json [15:23:10] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [15:23:20] !log dancy@deploy1002 dancy: Continuing with sync [15:23:31] (03CR) 10Brennen Bearnes: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [15:23:46] 06SRE, 06Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193 (10ssingh) 03NEW [15:23:52] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041'] [15:23:56] James_F: sounds like exactly what you want for things like db schema migrations [15:24:32] cdanis: Or purging cache of corrupted contents, or rotating the logs when they're about to reach the privacy cut-off, or… [15:24:38] mhm [15:24:58] All these deploy-fails pass, like tears in the rain. 
[15:25:00] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [15:25:02] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [15:25:42] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: correct IPs for apus - mvernon@cumin2002" [15:26:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: correct IPs for apus - mvernon@cumin2002" [15:26:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:27:50] (03PS1) 10Jdlrobson: Revert "Wrap tables with JS" [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037018 [15:27:50] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [15:27:54] (03CR) 10MVernon: [C:03+2] Add apus svc records in codfw and eqiad [dns] - 10https://gerrit.wikimedia.org/r/1037095 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:28:22] (03PS2) 10Jdlrobson: Revert "Wrap tables with JS" [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037018 (https://phabricator.wikimedia.org/T330527) [15:29:04] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudvirt1041'] [15:29:36] 06SRE, 06serviceops, 10API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995#9842704 (10Jdforrester-WMF) [15:29:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P63562 and previous config saved to 
/var/cache/conftool/dbconfig/20240529-152937-marostegui.json [15:30:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63563 and previous config saved to /var/cache/conftool/dbconfig/20240529-153001-arnaudb.json [15:30:15] (03CR) 10Elukey: "Need to fix CI's -1 sigh" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:30:17] (03PS1) 10JHathaway: rsyslog: kafka_shipper, use global_entry function [puppet] - 10https://gerrit.wikimedia.org/r/1037098 [15:30:40] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037098 (owner: 10JHathaway) [15:30:41] (03CR) 10Jdlrobson: [C:04-1] "We're just discussing an alternative less risky approach here: https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1037018 after an" [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1036665 (https://phabricator.wikimedia.org/T330527) (owner: 10Jdlrobson) [15:31:43] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041'] [15:31:57] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1041'] [15:32:15] (03PS6) 10Elukey: redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) [15:32:23] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:1036750|Remove the php symlink (v2) (T359643)]] (duration: 13m 03s) [15:32:28] T359643: Get rid of the /srv/mediawiki/php symbolic link - https://phabricator.wikimedia.org/T359643 [15:32:39] 06SRE, 06Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9842720 (10ssingh) p:05Triage→03Medium [15:34:36] (03PS7) 10Elukey: redfish: expand support for Supermicro hosts 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) [15:34:51] (03PS4) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) [15:38:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P63564 and previous config saved to /var/cache/conftool/dbconfig/20240529-153813-marostegui.json [15:38:26] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041'] [15:39:17] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1041'] [15:40:57] (03CR) 10CI reject: [V:04-1] redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:44:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P63565 and previous config saved to /var/cache/conftool/dbconfig/20240529-154446-marostegui.json [15:45:01] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041'] [15:45:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63566 and previous config saved to /var/cache/conftool/dbconfig/20240529-154510-arnaudb.json [15:45:29] (03CR) 10Effie Mouzeli: "@Dduvall, thank you very much! It is sad to see blubberoid go. But it had a good run." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1036716 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [15:45:50] (03CR) 10Ahmon Dancy: [C:03+1] docker_registry_ha: replace deprecated /-/jwks endpoint on gitlab [puppet] - 10https://gerrit.wikimedia.org/r/1037043 (https://phabricator.wikimedia.org/T365675) (owner: 10Jelto) [15:45:56] (03CR) 10Jelto: [C:03+1] "lgtm, let me know when this should be merged" [puppet] - 10https://gerrit.wikimedia.org/r/1037065 (https://phabricator.wikimedia.org/T359821) (owner: 10Hashar) [15:46:07] (03PS1) 10Jcrespo: dbbackups: Migrate s1 backups to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1037107 (https://phabricator.wikimedia.org/T364290) [15:46:16] (03CR) 10Jforrester: "🫡 Farewell, Blubberoid." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036716 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [15:47:21] (03CR) 10Volans: "The last PSes seems to have diverged a bit from the agreed path" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [15:48:26] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [15:48:30] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 34 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:48:32] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on db2141.codfw.wmnet with reason: upgrade to 10.6 [15:48:46] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2141.codfw.wmnet with reason: upgrade to 10.6 [15:48:46] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [15:49:00] !log 
jynus@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on dbprov1003.eqiad.wmnet with reason: upgrade to 10.6 [15:49:02] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbprov1003.eqiad.wmnet with reason: upgrade to 10.6 [15:49:12] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on dbprov2003.codfw.wmnet with reason: upgrade to 10.6 [15:49:25] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbprov2003.codfw.wmnet with reason: upgrade to 10.6 [15:52:10] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudvirt1041'] [15:53:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T364299)', diff saved to https://phabricator.wikimedia.org/P63567 and previous config saved to /var/cache/conftool/dbconfig/20240529-155321-marostegui.json [15:53:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance [15:53:27] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [15:53:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance [15:53:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance [15:53:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance [15:53:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T364299)', diff saved to https://phabricator.wikimedia.org/P63568 and previous config saved to /var/cache/conftool/dbconfig/20240529-155349-marostegui.json [15:55:06] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1009.eqiad.wmnet with OS bullseye [15:55:18] 
10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye executed... [15:55:40] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bookworm [15:55:51] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9842872 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm [15:55:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842874 (10akosiaris) The fail for kafka-main1009 is expected with the current recipe btw. Let me have a quick look. 
[15:55:56] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [15:56:31] (03PS1) 10Ahmon Dancy: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1037108 [15:56:44] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [15:56:56] (03CR) 10Ahmon Dancy: [C:03+2] Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1037108 (owner: 10Ahmon Dancy) [15:56:59] (03CR) 10Tchanders: [C:03+1] [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: 10Dreamy Jazz) [15:57:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842877 (10akosiaris) >>! In T363212#9842874, @akosiaris wrote: > The fail for kafka-main1009 is expected with the current recipe btw. Let me have a quick loo... 
[15:57:38] (03Merged) 10jenkins-bot: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1037108 (owner: 10Ahmon Dancy) [15:59:01] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [15:59:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T366123)', diff saved to https://phabricator.wikimedia.org/P63569 and previous config saved to /var/cache/conftool/dbconfig/20240529-155954-marostegui.json [15:59:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance [16:00:00] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [16:00:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance [16:00:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63570 and previous config saved to /var/cache/conftool/dbconfig/20240529-160016-arnaudb.json [16:00:43] FIRING: [2x] KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:01:09] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2045.codfw.wmnet with OS bookworm [16:01:13] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1045.eqiad.wmnet with OS bookworm [16:04:18] !log sudo cumin 'A:cp and A:drmrs' 'disable-puppet "merging CR 1037089"' [16:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:57] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host 
kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [16:05:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T366123)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240529-160528-marostegui.json [16:05:41] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [16:06:30] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:09:33] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [16:10:27] (03CR) 10CDobbins: [V:03+1 C:03+2] purged: roll out use_pki flag to all of drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1037089 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [16:10:27] (03CR) 10Lucas Werkmeister: [C:03+1] gerrit: enable change.diff3ConflictView [puppet] - 10https://gerrit.wikimedia.org/r/1037065 (https://phabricator.wikimedia.org/T359821) (owner: 10Hashar) [16:10:54] (03PS1) 10CDanis: otelcol: limit collected k8s data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037113 (https://phabricator.wikimedia.org/T366094) [16:11:30] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:12:33] 07Puppet, 10Wikidata, 06Wikidata Dev Team, 10wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9842955 (10AndrewTavis_WMDE) Perfect, @Lucas_Werkmeister_WMDE! 
Glad to have this all cleared up :) [16:13:13] 07Puppet, 10Wikidata, 06Wikidata Dev Team, 10wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9842957 (10AndrewTavis_WMDE) 05Open→03Resolved a:03AndrewTavis_WMDE [16:13:22] (03PS1) 10JHathaway: rsyslog kafka: add postfix programs [puppet] - 10https://gerrit.wikimedia.org/r/1037114 (https://phabricator.wikimedia.org/T325395) [16:14:03] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037114 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [16:14:26] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T366134#9842964 (10Papaul) 05Open→03Resolved a:03Papaul complete [16:14:28] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1045.eqiad.wmnet with reason: host reimage [16:15:04] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [16:15:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63572 and previous config saved to /var/cache/conftool/dbconfig/20240529-161522-arnaudb.json [16:15:26] (03PS2) 10CDanis: otelcol: limit collected k8s data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037113 (https://phabricator.wikimedia.org/T366094) [16:16:27] (03CR) 10Dzahn: [C:03+2] vrts: add missing comma to vrts_aliases.py [puppet] - 10https://gerrit.wikimedia.org/r/1036760 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn) [16:17:13] (03CR) 10Jcrespo: [C:03+2] "https://puppet-compiler.wmflabs.org/output/1037107/2683/" [puppet] - 10https://gerrit.wikimedia.org/r/1037107 (https://phabricator.wikimedia.org/T364290) (owner: 10Jcrespo) [16:17:13] 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for STran, Madalina, Tchanders and JayCano - 
https://phabricator.wikimedia.org/T366198 (10JayCano) 03NEW [16:17:14] (03CR) 10JHathaway: [C:03+2] rsyslog kafka: add postfix programs [puppet] - 10https://gerrit.wikimedia.org/r/1037114 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [16:17:36] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1045.eqiad.wmnet with reason: host reimage [16:18:32] !log sudo cumin -b1 -s60 'A:cp and A:drmrs' 'run-puppet-agent --enable "merging CR 1037089"' [16:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:13] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2045.codfw.wmnet with reason: host reimage [16:19:20] (03PS3) 10CDanis: otelcol: limit collected k8s data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037113 (https://phabricator.wikimedia.org/T366094) [16:19:36] jynus: can I merge in your s1 backup patch? [16:20:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P63573 and previous config saved to /var/cache/conftool/dbconfig/20240529-162040-marostegui.json [16:21:05] jhathaway: I was asking you on the other channel [16:21:08] please do [16:21:16] nod! 
[16:22:12] 10ops-codfw, 06SRE, 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9843039 (10Volans) a:03Volans [16:22:40] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2045.codfw.wmnet with reason: host reimage [16:23:01] (03Abandoned) 10Jdlrobson: Limit responsive tables to .wikitables [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1036665 (https://phabricator.wikimedia.org/T330527) (owner: 10Jdlrobson) [16:23:55] dancy: Hi, regarding the train blocker, we've decided to revert the original change, this is the patch that can be deployed now: https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1037018 [16:24:36] jan_drewniak: OK. I'll start right now. [16:24:39] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 43 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:25:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037018 (https://phabricator.wikimedia.org/T330527) (owner: 10Jdlrobson) [16:25:37] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [16:25:42] huh [16:25:43] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:27:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [16:28:06] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye [16:28:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9843081 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye [16:29:41] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 30 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:29:58] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1009.eqiad.wmnet with OS bullseye [16:30:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9843083 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye [16:31:03] (03PS1) 10Dzahn: mx: stop ignoring VRTS alias errors, email on error [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) [16:32:06] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1009.eqiad.wmnet with reason: host reimage 
[16:32:09] jclark@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [16:32:29] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [16:32:30] !log restart pybal on lvs1019 [16:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:39] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:34:27] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1045.eqiad.wmnet with OS bookworm [16:35:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1009.eqiad.wmnet with reason: host reimage [16:35:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P63574 and previous config saved to /var/cache/conftool/dbconfig/20240529-163549-marostegui.json [16:36:20] (03PS5) 10JHathaway: spf recs update: phabricator, gitlab, wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1036739 (https://phabricator.wikimedia.org/T366113) [16:37:49] (03CR) 10Muehlenhoff: mx: stop ignoring VRTS alias errors, email on error (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn) [16:38:02] (03PS2) 10JHathaway: rsyslog: kafka_shipper, use global_entry function [puppet] - 10https://gerrit.wikimedia.org/r/1037098 [16:38:08] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037098 (owner: 10JHathaway) [16:38:42] FIRING: [2x] KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:39:11] (03PS1) 10Jsn.sherman: 
CommonSettings: correct AutoModerator load order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037118 [16:39:25] (03CR) 10JHathaway: [C:03+2] "Thanks Dallas, Arzhel, & Eoghan for the reviews" [dns] - 10https://gerrit.wikimedia.org/r/1036739 (https://phabricator.wikimedia.org/T366113) (owner: 10JHathaway) [16:40:16] (03PS2) 10Dzahn: mx: stop ignoring VRTS alias errors [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) [16:40:20] (03CR) 10Dzahn: mx: stop ignoring VRTS alias errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn) [16:40:39] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2045.codfw.wmnet with OS bookworm [16:42:15] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [16:42:40] (03PS2) 10Jsn.sherman: CommonSettings: correct AutoModerator load order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037118 (https://phabricator.wikimedia.org/T366203) [16:43:42] FIRING: [2x] KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:47:09] (03PS1) 10JHathaway: Revert "Revert "phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org"" [puppet] - 10https://gerrit.wikimedia.org/r/1037129 [16:49:01] (03CR) 10JHathaway: [C:03+2] Revert "Revert "phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org"" [puppet] - 10https://gerrit.wikimedia.org/r/1037129 (owner: 10JHathaway) [16:49:04] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9843219 (10Papaul) [16:49:29] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for rickijay - 
https://phabricator.wikimedia.org/T365574#9843222 (10Dzahn) >>! In T365574#9829202, @jon_amar-WMDE wrote: > Hi @Dzahn I'm not clear whether I can provide approval (I'm the Product Manager for Wik... [16:49:55] (03Merged) 10jenkins-bot: Revert "Wrap tables with JS" [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037018 (https://phabricator.wikimedia.org/T330527) (owner: 10Jdlrobson) [16:50:25] !log dancy@deploy1002 Started scap: Backport for [[gerrit:1037018|Revert "Wrap tables with JS" (T330527)]] [16:50:30] T330527: Wider tables overlap sticky page tools (Upstream Minerva's responsive table styles to core SkinModule) - https://phabricator.wikimedia.org/T330527 [16:50:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T366123)', diff saved to https://phabricator.wikimedia.org/P63575 and previous config saved to /var/cache/conftool/dbconfig/20240529-165057-marostegui.json [16:51:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance [16:51:03] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [16:51:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance [16:51:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T366123)', diff saved to https://phabricator.wikimedia.org/P63576 and previous config saved to /var/cache/conftool/dbconfig/20240529-165121-marostegui.json [16:52:11] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:52:18] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9843239 (10VRiley-WMF) Investigated this unit with the assistance of Dell. 
After some troubleshooting and pulling logs, they will be sending out a new motherboard as a replacement (tomorrow). Wil... [16:52:53] (03Abandoned) 10Herron: pyrra add service dns entries [dns] - 10https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [16:53:22] (03Abandoned) 10Herron: services: add pyrra conftool-data and service stub entry [puppet] - 10https://gerrit.wikimedia.org/r/961129 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [16:53:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T366123)', diff saved to https://phabricator.wikimedia.org/P63577 and previous config saved to /var/cache/conftool/dbconfig/20240529-165333-marostegui.json [16:53:40] (03Abandoned) 10Herron: pyrra: use load balancing [puppet] - 10https://gerrit.wikimedia.org/r/961130 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [16:53:53] (03PS1) 10JHathaway: rsyslog: fix undef var in global entry [puppet] - 10https://gerrit.wikimedia.org/r/1037121 [16:54:19] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037121 (owner: 10JHathaway) [16:57:05] (03CR) 10JHathaway: [C:03+2] rsyslog: fix undef var in global entry [puppet] - 10https://gerrit.wikimedia.org/r/1037121 (owner: 10JHathaway) [16:58:53] (03PS3) 10JHathaway: rsyslog kafka_shipper: use the new global_entry function [puppet] - 10https://gerrit.wikimedia.org/r/1037098 [16:59:11] !log stevemunene@deploy1002 Started deploy [airflow-dags/analytics@229b278]: (no justification provided) [16:59:19] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037098 (owner: 10JHathaway) [16:59:38] !log stevemunene@deploy1002 Finished deploy [airflow-dags/analytics@229b278]: (no justification provided) (duration: 00m 26s) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T1700) 
[17:01:34] I'm getting testserver check failures during scap mediawiki deployment: [17:01:34] ``` [17:01:34] 17:00:31 Check 'check_testservers_k8s' failed: Sending to mwdebug.discovery.wmnet... [17:01:34] https://foundation.wikimedia.org/wiki/Home (/srv/deployment/httpbb-tests/appserver/test_foundation.yaml:2) [17:01:34] ERROR: HTTPSConnectionPool(host='mwdebug.discovery.wmnet', port=4444): Max retries exceeded with url: /wiki/Home (Caused by ConnectTimeoutError(, 'Connection to mwdebug.discovery.wmnet timed out. (connect timeout=10)')) [17:01:35] ``` [17:01:41] rzl: Any ideas? [17:01:58] The error persists when retrying [17:02:16] curious, taking a look [17:02:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T364299)', diff saved to https://phabricator.wikimedia.org/P63578 and previous config saved to /var/cache/conftool/dbconfig/20240529-170242-marostegui.json [17:02:49] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [17:02:56] port 4444? [17:03:08] (03PS8) 10Elukey: redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) [17:03:28] hashar: nod.. as configured in /etc/scap.cfg: `testservers_check_cmd_k8s: httpbb /srv/deployment/httpbb-tests/appserver/* --hosts=mwdebug.discovery.wmnet --https_port=4444 --retry_on_timeout` [17:04:42] there are a ton of open ports on mwdebug1001. 4444 is not one of them. [17:04:59] mwdebug1001 is bare metal. This is the k8s check failing [17:05:08] ah, nod [17:05:09] interesting, `curl https://foundation.wikimedia.org/wiki/Home --resolve 'foundation.wikimedia.org:443:4444'` works reliably but I'm getting the same timeout from httpbb, still looking [17:05:17] we would have lost the kubernetes debug pod so? 
[17:05:24] 06SRE, 10Cassandra, 06Data Products, 06serviceops, and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9843292 (10Scott_French) [17:05:25] er, because I messed up the --resolve :) hang on [17:05:27] Hopping into my team meating. [17:05:33] meat time! [17:05:37] meating! [17:05:38] haha [17:05:42] that sounds delicious [17:05:49] how do you want your steak today? [17:05:49] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers parse1013.eqiad.wmnet, mw1442.eqiad.wmnet, kubernetes1022.eqiad.wmnet, mw1384.eqiad.wmnet, mw1479.eqiad.wmnet, mw1470.eqiad.wmnet, kubernetes1021.eqiad.wmnet, mw1430.eqiad.wmnet, mw1388.eqiad.wmnet, mw1482.eqiad.wmnet, parse1009.eqiad.wmnet, mw1449.eqiad.wmnet, mw1391.eqiad.wmnet, parse1024.eqiad.wmnet, mw1408.eqiad.wmnet, mw14 [17:05:49] wmnet, mw1357.eqiad.wmnet, kubernetes1017.eqiad.wmnet, mw1425.eqiad.wmnet, mw1397.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1051.eqiad.wmnet, kubernetes1058.eqiad.wmnet, mw1452.eqiad.wmnet, mw1356.eqiad.wmnet, mw1374.eqiad.wmnet, mw1414.eqiad.wmnet, mw1371.eqiad.wmnet, mw1473.eqiad.wmnet, mw1392.eqiad.wmnet, kubernetes1028.eqiad.wmnet, mw1485.eqiad.wmnet, kubernetes1043.eqiad.wmnet, kubernetes1008.eqiad.wmnet, mw1362.eqiad.wmnet [17:05:49] eqiad.wmnet, mw1463.eqiad.wmnet, mw1421.eqiad.wmnet, mw1441.eqiad.wmnet, parse1006.eqiad.wmnet, parse1004.eqiad.wmnet, parse1016.eqiad.wmnet, kubernetes1052.eqiad.wmnet, parse1022.eqiad https://wikitech.wikimedia.org/wiki/PyBal [17:05:54] ah [17:05:58] well, that's probably not unrelated [17:06:00] more stuff exploding with pybal .. 
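Note on the `--resolve` mix-up at 17:05: curl's `--resolve` expects `host:port:addr`, where `addr` is an IP address, so a value ending in `:4444` puts a port where an address belongs. Presumably that is why the first curl still "worked": it may simply have gone to the normal public endpoint. curl's `--connect-to HOST1:PORT1:HOST2:PORT2` is the option that redirects a request to another host and port. A rough Python sketch of building and sanity-checking those arguments follows; the helper names are made up for illustration and `192.0.2.10` is a placeholder documentation address, not a real service IP.

```python
import ipaddress


def connect_to_arg(url_host, url_port, target_host, target_port):
    """Build the value for curl's --connect-to option (HOST1:PORT1:HOST2:PORT2)."""
    return f"{url_host}:{url_port}:{target_host}:{target_port}"


def resolve_arg_is_valid(value):
    """Check that a curl --resolve value looks like host:port:addr[,addr...]."""
    parts = value.lstrip("+").split(":", 2)
    if len(parts) != 3 or not parts[0] or not parts[1].isdigit():
        return False
    for addr in parts[2].split(","):
        try:
            # IPv6 addresses may be written in brackets; strip them first.
            ipaddress.ip_address(addr.strip("[]"))
        except ValueError:
            return False
    return True


def curl_command(url, connect_to):
    """Return an argv list; hand it to subprocess.run() to actually execute it."""
    return ["curl", "--silent", "--show-error", "--max-time", "10",
            "--connect-to", connect_to, url]


# The value typed in the channel has a port where the address should be.
print(resolve_arg_is_valid("foundation.wikimedia.org:443:4444"))
# A placeholder address (hypothetical) passes the format check.
print(resolve_arg_is_valid("foundation.wikimedia.org:443:192.0.2.10"))

arg = connect_to_arg("foundation.wikimedia.org", 443,
                     "mwdebug.discovery.wmnet", 4444)
print(curl_command("https://foundation.wikimedia.org/wiki/Home", arg))
```

Using `--connect-to` keeps the original hostname for TLS and the `Host` header while sending the connection to the service under test, which is what the check on port 4444 needs.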
[17:06:31] (03CR) 10Reedy: [C:03+1] CommonSettings: correct AutoModerator load order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037118 (https://phabricator.wikimedia.org/T366203) (owner: 10Jsn.sherman) [17:06:49] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:08:10] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1041.eqiad.wmnet with OS bookworm [17:08:27] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bookworm [17:08:36] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9843300 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm [17:08:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P63579 and previous config saved to /var/cache/conftool/dbconfig/20240529-170841-marostegui.json [17:09:49] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers parse1011.eqiad.wmnet, mw1462.eqiad.wmnet, kubernetes1025.eqiad.wmnet, mw1457.eqiad.wmnet, mw1442.eqiad.wmnet, mw1478.eqiad.wmnet, kubernetes1037.eqiad.wmnet, mw1384.eqiad.wmnet, mw1479.eqiad.wmnet, kubernetes1044.eqiad.wmnet, mw1449.eqiad.wmnet, mw1399.eqiad.wmnet, mw1424.eqiad.wmnet, parse1024.eqiad.wmnet, mw1454.eqiad.wmnet, [17:09:49] 0.eqiad.wmnet, mw1423.eqiad.wmnet, mw1496.eqiad.wmnet, kubernetes1060.eqiad.wmnet, mw1466.eqiad.wmnet, kubernetes1059.eqiad.wmnet, mw1469.eqiad.wmnet, mw1394.eqiad.wmnet, mw1452.eqiad.wmnet, mw1422.eqiad.wmnet, mw1374.eqiad.wmnet, mw1414.eqiad.wmnet, parse1020.eqiad.wmnet, mw1485.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1009.eqiad.wmnet, mw1448.eqiad.wmnet, 
mw1381.eqiad.wmnet, mw1362.eqiad.wmnet, kubernetes1042.eqiad.wmnet, mw1 [17:09:49] .wmnet, kubernetes1056.eqiad.wmnet, kubernetes1029.eqiad.wmnet, mw1472.eqiad.wmnet, parse1022.eqiad.wmnet, kubernetes1032.eqiad.wmnet, parse1017.eqiad.wmnet, mw1440.eqiad.wmnet, kuberne https://wikitech.wikimedia.org/wiki/PyBal [17:10:40] httpbb was passing but is failing again, so this isn't an httpbb problem but httpbb surfacing a load-balancing problem [17:11:32] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9843301 (10jhathaway) [17:13:25] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, can you please also take care of merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031761 ?" [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn) [17:13:57] (03CR) 10Dduvall: "That works for me. Thanks, @effie!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036716 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [17:14:29] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host kafka-main1010.eqiad.wmnet with OS bullseye [17:16:47] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 36 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:17:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P63580 and previous config saved to /var/cache/conftool/dbconfig/20240529-171750-marostegui.json [17:19:42] (03PS3) 10Muehlenhoff: vrts: Stop ignoring errors from alias sync [puppet] - 10https://gerrit.wikimedia.org/r/1031761 (https://phabricator.wikimedia.org/T284145) [17:20:40] (03CR) 10Dzahn: [C:03+2] vrts: Stop 
ignoring errors from alias sync [puppet] - 10https://gerrit.wikimedia.org/r/1031761 (https://phabricator.wikimedia.org/T284145) (owner: 10Muehlenhoff) [17:23:23] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage [17:23:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P63581 and previous config saved to /var/cache/conftool/dbconfig/20240529-172349-marostegui.json [17:23:56] rzl: I'm back at keys and looking now [17:25:44] thanks, still looking too -- I'm still not 100% sure this isn't just a genuine mw-debug issue caused by the deploy, but it doesn't look like it [17:26:01] This is what I was trying to deploy: https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1037018 [17:26:06] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage [17:26:20] `curl https://foundation.wikimedia.org/wiki/Main_Page --connect-to foundation.wikimedia.org:443:mwdebug.discovery.wmnet:4444` also hangs, so it definitely isn't just httpbb [17:27:31] `curl https://foundation.wikimedia.org/wiki/Main_Page --connect-to foundation.wikimedia.org:443:mw-web.discovery.wmnet:4450` works so it isn't all of mw-on-k8s [17:29:36] (03PS3) 10Dzahn: mx: stop ignoring VRTS alias errors [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) [17:30:10] rzl: https://grafana.wikimedia.org/d/000000422/pybal-service?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-server=All&var-service=mwdebug_4444 [17:30:19] (03CR) 10Dzahn: mx: stop ignoring VRTS alias errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn) [17:31:41] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 34 probes of 790 (alerts on 35) - 
https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:32:24] hrm, and scap started at 16:50 [17:32:35] (03CR) 10Dzahn: mx: stop ignoring VRTS alias errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn) [17:32:41] (03Abandoned) 10Dzahn: mx: stop ignoring VRTS alias errors [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn) [17:32:59] it doesn't necessarily need to have been the actual code getting deployed, but that looks likely to have been the trigger for whatever this is [17:32:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P63582 and previous config saved to /var/cache/conftool/dbconfig/20240529-173258-marostegui.json [17:34:36] the actual mw-debug pods are 35m and 38m old so they're not dying, and I don't immediately see anything in logs on the k8s side [17:35:31] May 29 17:35:03 lvs1019 pybal[273937]: [mwdebug_4444] INFO: Leaving previously pooled but down server mw1439.eqiad.wmnet pooled [17:35:33] May 29 17:35:03 lvs1019 pybal[273937]: [mwdebug_4444] ERROR: Monitoring instance IdleConnection reports server mw1393.eqiad.wmnet (enabled/up/pooled) down: User timeout caused connection failure. [17:35:35] May 29 17:35:03 lvs1019 pybal[273937]: [mwdebug_4444] ERROR: Could not depool server mw1393.eqiad.wmnet because of too many down! [17:35:37] May 29 17:35:03 lvs1019 pybal[273937]: [mwdebug_4444 IdleConnection] WARN: mw1393.eqiad.wmnet (enabled/down/pooled): Connection to 10.64.16.151:4444 failed. [17:35:39] May 29 17:35:03 lvs1019 pybal[273937]: [mwdebug_4444] ERROR: Monitoring instance IdleConnection reports server parse1003.eqiad.wmnet (enabled/up/pooled) down: User timeout caused connection failure. 
[17:35:41] May 29 17:35:03 lvs1019 pybal[273937]: [mwdebug_4444] ERROR: Could not depool server parse1003.eqiad.wmnet because of too many down! [17:35:43] May 29 17:35:03 lvs1019 pybal[273937]: [mwdebug_4444 IdleConnection] WARN: parse1003.eqiad.wmnet (enabled/down/pooled): Connection to 10.64.0.121:4444 failed. [17:36:14] I don't know how to quickly check but I think that presently lvs1019/lvs1020 can connect to 0 of the kubernetes hosts on 4444 [17:38:35] ah wait, that's not true [17:38:43] lvs1019 started doing bunch of disk write https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=lvs1019&var-datasource=thanos&var-cluster=lvs&from=now-1h&to=now&viewPanel=35 possibly writing logs [17:38:54] the situation is much worse on lvs1019, on lvs1020 it is actually okay-ish [17:38:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T366123)', diff saved to https://phabricator.wikimedia.org/P63583 and previous config saved to /var/cache/conftool/dbconfig/20240529-173857-marostegui.json [17:39:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance [17:39:03] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [17:39:09] and it has TCP errors https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=lvs1019&var-datasource=thanos&var-cluster=lvs&from=now-1h&to=now&viewPanel=31 [17:39:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance [17:39:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T366123)', diff saved to https://phabricator.wikimedia.org/P63584 and previous config saved to /var/cache/conftool/dbconfig/20240529-173921-marostegui.json [17:39:40] SYN retransmits are consistent with what I'm seeing yeah [17:40:35] that is all I know :-] [17:41:32] 
!log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T366123)', diff saved to https://phabricator.wikimedia.org/P63585 and previous config saved to /var/cache/conftool/dbconfig/20240529-174132-marostegui.json [17:41:59] (03CR) 10Muehlenhoff: mx: stop ignoring VRTS alias errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn) [17:42:27] (03Restored) 10Dzahn: mx: stop ignoring VRTS alias errors [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn) [17:43:14] rzl: I think this is a Calico issue [17:43:43] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 42 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:43:56] -> #wikimedia-sre [17:45:43] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:48:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T364299)', diff saved to https://phabricator.wikimedia.org/P63586 and previous config saved to /var/cache/conftool/dbconfig/20240529-174806-marostegui.json [17:48:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance [17:48:12] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [17:48:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance [17:48:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T364299)', diff 
saved to https://phabricator.wikimedia.org/P63587 and previous config saved to /var/cache/conftool/dbconfig/20240529-174829-marostegui.json [17:53:43] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:56:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P63588 and previous config saved to /var/cache/conftool/dbconfig/20240529-175640-marostegui.json [17:59:50] (03PS4) 10Dzahn: mx: stop ignoring VRTS alias errors [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) [18:00:05] dancy and andre: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T1800). [18:00:05] dancy and andre: Your horoscope predicts another MediaWiki train - Utc-7+Utc-0 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T1800). [18:00:27] I'm holding the train until k8s issues are worked out. 
[18:01:02] (03CR) 10Dzahn: "both defaults are false, so just removing all 3 lines then" [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn) [18:04:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T352010)', diff saved to https://phabricator.wikimedia.org/P63589 and previous config saved to /var/cache/conftool/dbconfig/20240529-180442-ladsgroup.json [18:04:49] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:04:53] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:07:35] dancy: I think we figured it out, you can unhold the train [18:07:44] Excellent. [18:07:54] Re-doing the backport that I was originally attempting first. [18:08:26] !log dancy@deploy1002 Started scap: Backport for [[gerrit:1037018|Revert "Wrap tables with JS" (T330527)]] [18:08:32] T330527: Wider tables overlap sticky page tools (Upstream Minerva's responsive table styles to core SkinModule) - https://phabricator.wikimedia.org/T330527 [18:10:59] !log dancy@deploy1002 dancy and jdlrobson: Backport for [[gerrit:1037018|Revert "Wrap tables with JS" (T330527)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:11:45] jan_drewniak: Can you verify that the revert fixed the problem on testservers? 
[18:11:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P63590 and previous config saved to /var/cache/conftool/dbconfig/20240529-181148-marostegui.json [18:12:16] dancy: ok taking a look now [18:15:43] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:15:56] Jdlrobson: can you verify the fix? My computer just crashed :/ [18:17:43] Bummer! [18:19:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P63592 and previous config saved to /var/cache/conftool/dbconfig/20240529-181950-ladsgroup.json [18:20:43] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:24:30] dancy: we are good to sync [18:24:40] Excellent. 
Proceeding [18:24:42] !log dancy@deploy1002 dancy and jdlrobson: Continuing with sync [18:26:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T366123)', diff saved to https://phabricator.wikimedia.org/P63593 and previous config saved to /var/cache/conftool/dbconfig/20240529-182656-marostegui.json [18:26:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance [18:27:02] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [18:27:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance [18:27:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T366123)', diff saved to https://phabricator.wikimedia.org/P63594 and previous config saved to /var/cache/conftool/dbconfig/20240529-182719-marostegui.json [18:29:37] (03CR) 10Dzahn: [C:03+2] mx: stop ignoring VRTS alias errors [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn) [18:29:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:31:19] ^ not exactly "expected" but let's call it status quo [18:31:33] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" [18:32:43] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:32:45] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" [18:32:47] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1041.eqiad.wmnet with OS bookworm [18:32:54] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9843563 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm completed: - cloudvirt... [18:33:36] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:1037018|Revert "Wrap tables with JS" (T330527)]] (duration: 25m 10s) [18:33:42] T330527: Wider tables overlap sticky page tools (Upstream Minerva's responsive table styles to core SkinModule) - https://phabricator.wikimedia.org/T330527 [18:34:00] Rolling the train. [18:34:19] (03PS1) 10TrainBranchBot: group1 wikis to 1.43.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037148 (https://phabricator.wikimedia.org/T361401) [18:34:21] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.43.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037148 (https://phabricator.wikimedia.org/T361401) (owner: 10TrainBranchBot) [18:34:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P63595 and previous config saved to /var/cache/conftool/dbconfig/20240529-183458-ladsgroup.json [18:34:59] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users, wmf for Sonja Perry - https://phabricator.wikimedia.org/T365766#9843568 (10SonjaPerry) L3 signed, thank you! 
[18:35:01] (03Merged) 10jenkins-bot: group1 wikis to 1.43.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037148 (https://phabricator.wikimedia.org/T361401) (owner: 10TrainBranchBot) [18:35:07] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1012 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:37:43] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 26 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:38:21] (03PS1) 10Bernard Wang: POC: Wrap tables with JS [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037131 [18:41:54] !log 💙cdanis@lvs1020.eqiad.wmnet ~ πŸ•☕ sudo systemctl restart pybal.service [18:41:57] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:12] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9843591 (10Andrew) a:05Jclark-ctr→03None After a nic firmware upgrade things seem to be working. It took a couple of tries (suspicious!) but now the host is imaged an... 
[18:44:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:47:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9843636 (10wiki_willy) a:03VRiley-WMF [18:48:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9843643 (10wiki_willy) [18:48:45] (03PS1) 10Ahmon Dancy: httpbb-tests: Update https://donate.wikimedia.org redirect Location [puppet] - 10https://gerrit.wikimedia.org/r/1037149 (https://phabricator.wikimedia.org/T351325) [18:49:00] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9843647 (10wiki_willy) [18:49:27] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9843655 (10wiki_willy) [18:49:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9843656 (10wiki_willy) [18:50:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T352010)', diff saved to https://phabricator.wikimedia.org/P63597 and previous config saved to /var/cache/conftool/dbconfig/20240529-185006-ladsgroup.json [18:50:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [18:50:14] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:50:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db2126.codfw.wmnet with reason: Maintenance [18:50:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [18:50:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [18:50:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2126 (T352010)', diff saved to https://phabricator.wikimedia.org/P63598 and previous config saved to /var/cache/conftool/dbconfig/20240529-185035-ladsgroup.json [18:55:38] (03PS2) 10Ahmon Dancy: httpbb-tests: test_foundation.yaml: Update a donate.wikimedia.org expected redirect [puppet] - 10https://gerrit.wikimedia.org/r/1037149 (https://phabricator.wikimedia.org/T351325) [18:55:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T366123)', diff saved to https://phabricator.wikimedia.org/P63599 and previous config saved to /var/cache/conftool/dbconfig/20240529-185541-marostegui.json [18:55:47] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [18:57:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T364299)', diff saved to https://phabricator.wikimedia.org/P63600 and previous config saved to /var/cache/conftool/dbconfig/20240529-185719-marostegui.json [18:57:25] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [18:59:02] (03CR) 10CI reject: [V:04-1] httpbb-tests: test_foundation.yaml: Update a donate.wikimedia.org expected redirect [puppet] - 10https://gerrit.wikimedia.org/r/1037149 (https://phabricator.wikimedia.org/T351325) (owner: 10Ahmon Dancy) [18:59:45] (03PS1) 10Ebernhardson: cirrus: Send weighted tags to known clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037153 [18:59:53] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Joely Rooke WMDE - 
https://phabricator.wikimedia.org/T366145#9843702 (10Dzahn) Hi @KFrancis , @JoelyRooke-WMDE will need the usual NDA for WMDE employees. Thanks Hi @JoelyRooke-WMDE If you could send an email to Katie (https://meta.wikimedia.org/wik... [19:00:04] (03PS3) 10Ahmon Dancy: httpbb-tests: Update a donate.wikimedia.org expected redirect [puppet] - 10https://gerrit.wikimedia.org/r/1037149 (https://phabricator.wikimedia.org/T351325) [19:01:02] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9843720 (10Dzahn) @derenrich From your direct manager by leaving a comment on this ticket, please. [19:03:27] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9843724 (10derenrich) >>! In T365381#9843720, @Dzahn wrote: > @derenrich From your direct manager by leaving a comment on this ticket, please. @Dzahn that already happened.... [19:03:42] (03CR) 10RLazarus: [C:03+2] httpbb-tests: Update a donate.wikimedia.org expected redirect [puppet] - 10https://gerrit.wikimedia.org/r/1037149 (https://phabricator.wikimedia.org/T351325) (owner: 10Ahmon Dancy) [19:04:01] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9843726 (10Dzahn) My bad, see my edit above though. [19:07:31] 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9843733 (10KFrancis) Hi all, the NDA has been sent out for signatures. I'll confirm when it's complete. [19:10:46] 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for STran, Madalina, Tchanders and JayCano - https://phabricator.wikimedia.org/T366198#9843745 (10Dzahn) Hi @JayCano sorry for the hassle but this isn't an LDAP group, so it's not really an LDAP-Access-Request. This is the right form f... 
[19:10:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P63601 and previous config saved to /var/cache/conftool/dbconfig/20240529-191049-marostegui.json [19:10:53] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9843740 (10Andrew) a:03aborrero This host is up and seems stable, but VMs running on it cannot reach the internet. Since this host was being moved from a 2-nic to 1-ni... [19:12:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P63602 and previous config saved to /var/cache/conftool/dbconfig/20240529-191227-marostegui.json [19:17:34] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.7 refs T361401 [19:17:39] T361401: 1.43.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T361401 [19:21:12] (03PS1) 10JHathaway: wikipedia.org dmarc: change to quarantine [dns] - 10https://gerrit.wikimedia.org/r/1037154 (https://phabricator.wikimedia.org/T211403) [19:22:19] 06SRE, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review, 07Security: Domains of most projects do not have DMARC policy - https://phabricator.wikimedia.org/T211403#9843816 (10jhathaway) Patch added to change wikipedia.org's policy to quarantine. 
[19:25:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P63603 and previous config saved to /var/cache/conftool/dbconfig/20240529-192559-marostegui.json [19:27:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P63604 and previous config saved to /var/cache/conftool/dbconfig/20240529-192735-marostegui.json [19:32:11] RECOVERY - Categories update lag on wdqs1016 is OK: OK - Categories lag: 14:32:10.202067 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [19:32:11] RECOVERY - Categories update lag on wdqs1012 is OK: OK - Categories lag: 14:32:10.223435 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [19:32:11] RECOVERY - Categories update lag on wdqs1014 is OK: OK - Categories lag: 14:32:10.256696 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [19:32:13] RECOVERY - Categories update lag on wdqs1020 is OK: OK - Categories lag: 14:32:11.934951 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [19:35:17] RECOVERY - Categories update lag on wdqs2015 is OK: OK - Categories lag: 14:35:15.504675 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [19:35:17] RECOVERY - Categories update lag on wdqs2021 is OK: OK - Categories lag: 14:35:15.519092 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [19:35:17] RECOVERY - Categories update lag on wdqs2019 is OK: OK - Categories lag: 14:35:15.523440 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [19:35:17] RECOVERY - Categories update lag on wdqs2017 is OK: OK - Categories lag: 14:35:15.534935 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [19:36:49] !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@3287de9]: bump discolytics to 0.22.0 [19:37:17] !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@3287de9]: bump discolytics to 0.22.0 (duration: 00m 27s) [19:39:07] (03PS3) 10David Martin: Add a stream for tracking the API of WikiLambda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017962 (https://phabricator.wikimedia.org/T356228) [19:41:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T366123)', diff saved to https://phabricator.wikimedia.org/P63605 and previous config saved to /var/cache/conftool/dbconfig/20240529-194107-marostegui.json [19:41:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [19:41:15] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [19:41:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [19:42:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T364299)', diff saved to https://phabricator.wikimedia.org/P63606 and previous config saved to /var/cache/conftool/dbconfig/20240529-194245-marostegui.json [19:42:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance [19:42:52] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [19:43:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance [19:43:06] (03PS4) 10CDanis: otelcol: limit collected k8s data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037113 (https://phabricator.wikimedia.org/T366094) [19:43:09] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T364299)', diff saved to https://phabricator.wikimedia.org/P63607 and previous config saved to /var/cache/conftool/dbconfig/20240529-194309-marostegui.json [19:45:59] (03PS1) 10Ahmon Dancy: Make header expected/got failure output multiline for easier human viewing [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037156 [19:46:51] 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for STran, Madalina, Tchanders and JayCano - https://phabricator.wikimedia.org/T366198#9843932 (10Aklapper) 05Open→03Invalid Please see / follow the "Analytics" entry under "I need access or permissions to..." on the https://pha... [19:46:54] (03PS1) 10JHathaway: wikipedia.org spf: indicate mail is not sent from this domain. [dns] - 10https://gerrit.wikimedia.org/r/1037157 (https://phabricator.wikimedia.org/T211403) [19:47:11] RECOVERY - Categories update lag on wdqs1013 is OK: OK - Categories lag: 14:47:10.174585 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [19:47:11] RECOVERY - Categories update lag on wdqs1011 is OK: OK - Categories lag: 14:47:10.195421 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [19:47:13] RECOVERY - Categories update lag on wdqs1021 is OK: OK - Categories lag: 14:47:11.903546 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [19:47:29] (03PS3) 10Dzahn: gerrit: add parameter to toggle lfs_replica_sync [puppet] - 10https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196) [19:47:42] (03CR) 10CI reject: [V:04-1] Make header expected/got failure output multiline for easier human viewing [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037156 (owner: 10Ahmon Dancy) [19:47:46] (03CR) 10CI reject: [V:04-1] gerrit: add parameter to toggle lfs_replica_sync [puppet] - 10https://gerrit.wikimedia.org/r/1036771 
(https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [19:48:18] (03CR) 10RLazarus: [C:03+1] otelcol: limit collected k8s data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037113 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis) [19:48:24] (03CR) 10CDanis: [C:03+2] otelcol: limit collected k8s data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037113 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis) [19:49:00] (03PS1) 10Ahmon Dancy: Add more junk to .gitignore [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037158 [19:49:55] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [19:50:13] RECOVERY - Categories update lag on wdqs2018 is OK: OK - Categories lag: 14:50:12.311387 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [19:50:13] RECOVERY - Categories update lag on wdqs2016 is OK: OK - Categories lag: 14:50:12.317364 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [19:50:15] RECOVERY - Categories update lag on wdqs2020 is OK: OK - Categories lag: 14:50:13.815870 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [19:50:59] (03CR) 10CI reject: [V:04-1] Add more junk to .gitignore [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037158 (owner: 10Ahmon Dancy) [19:51:20] (03Merged) 10jenkins-bot: otelcol: limit collected k8s data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037113 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis) [19:51:46] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [19:52:03] (03CR) 10Ahmon Dancy: "Not sure what's up with the tests." 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/1037158 (owner: 10Ahmon Dancy) [19:56:11] (03PS4) 10Dzahn: gerrit: add parameter to toggle lfs_replica_sync [puppet] - 10https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196) [19:58:42] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T2000). [20:00:05] JSherman: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:18] here and happy to self deploy [20:01:15] o/ present but someone in the wrong window again [20:02:05] Jdlrobson: are you here for https://gerrit.wikimedia.org/r/c/1034480/ ? [20:02:23] Fixed: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=2183257&oldid=2183214 [20:02:53] nope for the follow up to that: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Popups/+/1036664 JSherman [20:03:07] thanks [20:04:30] hi, who is the deployer [20:04:50] Jdlrobson: it looks like it's simplifying things. Was it tested on beta already? [20:05:07] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1012 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:05:09] yeh [20:05:13] jouncebot now [20:05:13] For the next 0 hour(s) and 54 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T2000) [20:05:17] we did the config change yesterday [20:05:37] We want to do this now in case we need to revert before the Thursday train.. but it's easy to ficx! 
[20:05:51] s/fix/test [20:06:02] Nemoralis: haven't heard from one of the listed deployers, but I was about to self deploy and then do Jdlrobson's patch too if needed [20:06:17] I have patch too [20:06:36] Okay, I'm about to start mine. [20:06:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037118 (https://phabricator.wikimedia.org/T366203) (owner: 10Jsn.sherman) [20:07:33] JSherman: are merges to deploy branches still taking 30mins + ? [20:08:15] Honestly, I don't know; it's been pretty variable week to week in my experience [20:08:45] (03PS2) 10NMW03: Enable wmgUseSandboxLink for Swahili Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037073 (https://phabricator.wikimedia.org/T365970) [20:09:26] * cjming thanks JSherman for deploying! [20:09:42] (03Merged) 10jenkins-bot: CommonSettings: correct AutoModerator load order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037118 (https://phabricator.wikimedia.org/T366203) (owner: 10Jsn.sherman) [20:10:12] !log jsn@deploy1002 Started scap: Backport for [[gerrit:1037118|CommonSettings: correct AutoModerator load order (T366203)]] [20:10:18] T366203: Check/move/document code in CommonSettings.php after require of CommonSettings-labs.php - https://phabricator.wikimedia.org/T366203 [20:10:18] Jdlrobson: in my experience, it's waiting for the CI to finish on release branches - so backports are time-consuming - sometimes over 20 mins to merge [20:11:30] config takes a few mins to merge -- and deploying to the test servers, then production have taken longer than i recall in recent memory [20:12:46] Jdlrobson: yeah it looks like gate-and-submit jobs are still running 20+ minutes, which lines up with what cjming: is saying [20:12:49] !log jsn@deploy1002 jsn: Backport for [[gerrit:1037118|CommonSettings: correct AutoModerator load order (T366203)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [20:12:53] !log jsn@deploy1002 jsn: Continuing with sync [20:12:58] so if there's a backport, i tend to manually +2 it while deploying a config patch [20:13:03] (03PS1) 10Scott French: function-evaluator: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037162 (https://phabricator.wikimedia.org/T362978) [20:13:17] (03PS1) 10Scott French: function-orchestrator: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037163 (https://phabricator.wikimedia.org/T362978) [20:13:36] (03PS1) 10Scott French: wikifeeds: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037164 (https://phabricator.wikimedia.org/T362978) [20:13:47] (03PS1) 10Scott French: toolhub: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037165 (https://phabricator.wikimedia.org/T362978) [20:14:00] (03PS1) 10Scott French: thumbor: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037166 (https://phabricator.wikimedia.org/T362978) [20:14:01] cjming: yeah, I did that a couple weeks ago and then got scared that it was wrong so I -1ed it. [20:14:32] I'll go ahead with Jdlrobson's patch. 
[20:14:45] (03CR) 10Jsn.sherman: [C:03+2] feature(Popups): Conditional User Defaults Implementation [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1036664 (https://phabricator.wikimedia.org/T364347) (owner: 10Jdlrobson) [20:14:55] (03CR) 10CI reject: [V:04-1] thumbor: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037166 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [20:15:19] (03PS2) 10Bernard Wang: POC: Wrap tables with JS [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037131 [20:16:14] Nemoralis: your patch is a really straightforward config change, so I might be able to do it while we wait on Jdlrobson's backport to run through ci. [20:16:32] (y) [20:17:52] Nemoralis: not that I expect any trouble, but are you set up to test with the debug extension? [20:18:00] yes [20:21:35] !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1037118|CommonSettings: correct AutoModerator load order (T366203)]] (duration: 11m 22s) [20:21:41] T366203: Check/move/document code in CommonSettings.php after require of CommonSettings-labs.php - https://phabricator.wikimedia.org/T366203 [20:21:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037073 (https://phabricator.wikimedia.org/T365970) (owner: 10NMW03) [20:22:51] (03Merged) 10jenkins-bot: Enable wmgUseSandboxLink for Swahili Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037073 (https://phabricator.wikimedia.org/T365970) (owner: 10NMW03) [20:23:22] !log jsn@deploy1002 Started scap: Backport for [[gerrit:1037073|Enable wmgUseSandboxLink for Swahili Wikipedia (T365970)]] [20:23:28] T365970: Add "Sandbox" link to top bar on Swahili Wikipedia - https://phabricator.wikimedia.org/T365970 [20:23:41] (03PS3) 10Bernard Wang: POC: t Wrap tables with JS [skins/Vector] (wmf/1.43.0-wmf.7) - 
10https://gerrit.wikimedia.org/r/1037131 [20:23:51] (03Merged) 10jenkins-bot: feature(Popups): Conditional User Defaults Implementation [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1036664 (https://phabricator.wikimedia.org/T364347) (owner: 10Jdlrobson) [20:25:43] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:25:52] !log jsn@deploy1002 jsn and nmw03: Backport for [[gerrit:1037073|Enable wmgUseSandboxLink for Swahili Wikipedia (T365970)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:26:27] Nemoralis: please test [20:29:22] (03PS4) 10Bernard Wang: POC: Wrap tables with JS [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037131 [20:30:51] (03PS5) 10Bernard Wang: POC: Wrap tables with JS [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037131 [20:31:41] Nemoralis: I went ahead and tested for you since Jdlrobson is waiting. I verified that the sandbox link is enabled for sw wiki on the debug host. [20:31:55] proceeding [20:32:00] !log jsn@deploy1002 jsn and nmw03: Continuing with sync [20:35:04] (03PS2) 10Scott French: thumbor: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037166 (https://phabricator.wikimedia.org/T362978) [20:35:11] Jdlrobson: FWIW, the gate-and-submit-wmf job for your backport only took 9 minutes. I stuck the other config change in front of you because I expected it to take longer. Apologies for the wait. [20:35:44] JSherman: np [20:37:10] JSherman sorry I was disconnected. It looks like my patch has been deployed [20:37:35] Nemoralis: yep, I went ahead and verified that sw had the sandbox link on the debug host [20:37:45] thanks! 
[20:37:52] I can close the phab task now [20:38:06] good deal! [20:38:55] well, I suppose you should wait to verify that it makes to sw wiki on the other hosts as well [20:39:16] we're about halfway through the php-fpm restarts [20:40:27] alright [20:40:31] !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1037073|Enable wmgUseSandboxLink for Swahili Wikipedia (T365970)]] (duration: 17m 08s) [20:40:37] T365970: Add "Sandbox" link to top bar on Swahili Wikipedia - https://phabricator.wikimedia.org/T365970 [20:40:54] Nemoralis: and it's done; you should see the changes live on swwiki [20:41:10] thanks again! [20:41:52] Nemoralis: no prob! [20:41:52] !log jsn@deploy1002 Started scap: Backport for [[gerrit:1036664|feature(Popups): Conditional User Defaults Implementation (T364347)]] [20:41:58] T364347: Popups: Make use of conditional user defaults - https://phabricator.wikimedia.org/T364347 [20:43:43] FIRING: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:44:00] uh [20:44:22] !log jsn@deploy1002 jsn and jdlrobson: Backport for [[gerrit:1036664|feature(Popups): Conditional User Defaults Implementation (T364347)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:44:46] Jdlrobson: please test [20:45:11] I love seeing the self-organization around backports. Nice work folks. [20:45:29] JSherman: on it [20:47:05] Jdlrobson: By the way, yesterday when I was doing a backport I got a warning about a change of yours that had been merged but not deployed. Please make sure to fully scap backport beta-only config changes. Scap is smart enough to not do a full production for beta-only changes. Leaving a merged change undeployed is confusing for whoever deploys after you. 
[20:47:27] *full production deployment. [20:48:00] dancy: which change, sorry? I didn't merged anything yesterday (I don't have deploy rights) [20:48:35] lemme dig itup [20:48:57] JSherman: unfortunately there looks like there is a problem with this patch so it should be cancelled. [20:49:08] Jdlrobson: ack [20:49:12] !log jsn@deploy1002 Sync cancelled. [20:49:54] Jdlrobson: It was https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1036720 [20:50:39] Jdlrobson: reverting [20:50:52] Jdlrobson: Thanks for the info. I'll remind the +2'er [20:52:24] dancy: on the revert, it looks like I need to fix my git config on the deployment host; can I just ctrl-c out of the scap revert to fix it? [20:52:46] yes, it's always safe to control-c scap [20:53:14] excellent; scap has been awesome in my experience (of about 3 weeks) [20:54:06] But, depending on when you control-c, a change may be partially deployed, so there may be some action that needs to be taken to get to a consistent state (such as re-running or, backporting something else with a fix, etc). [20:54:24] Glad you like it! [20:55:56] dancy: ack. [20:56:14] JSherman: sorry about the need for the cancel that was unexpected :( [20:56:26] (03CR) 10CI reject: [V:04-1] POC: Wrap tables with JS [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037131 (owner: 10Bernard Wang) [20:57:31] JSherman: are you done deploying? 
[20:57:51] cdanis: I'm muddling my way through a revert currently [20:57:56] ah okay, npnp [20:58:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T364299)', diff saved to https://phabricator.wikimedia.org/P63608 and previous config saved to /var/cache/conftool/dbconfig/20240529-205813-marostegui.json [20:58:19] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T2100) [21:00:33] dancy: It's asking me to do gerrit https user/password authentication for the revert. Should I be fowarding my gerrit ssh key etc? [21:00:49] hmm.. this is when using `scap backport --revert ..` ? [21:01:27] yep [21:02:07] ``` [21:02:07] jsn@deploy1002:/srv/mediawiki-staging$ scap backport --revert 1036664 [21:02:07] 21:00:54 Checking whether changes are in a branch and version deployed to production... [21:02:07] 21:00:54 Reverting 1 change(s) [21:02:07] Already on 'wmf/1.43.0-wmf.7' [21:02:08] Your branch is ahead of 'origin/wmf/1.43.0-wmf.7' by 1 commit. [21:02:08] ``` [21:02:33] Hmm. That's a bug. Please file a phab ticket w/ the transcript and we'll fix it. In the meantime you'll need to create the revert commit some other way (e.g, using the Gerrit UI). [21:02:58] dancy: wilco; thanks! [21:05:16] dancy: just to verify: I should do a revert after cancelling a sync at the test step, yes? [21:05:32] yes. [21:05:45] good deal; ty [21:06:20] (03PS1) 10Jsn.sherman: Revert "feature(Popups): Conditional User Defaults Implementation" [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037132 [21:06:36] and since the broken change never made it past testservers, you could cancel the deployment of the revert after testservers. [21:06:54] dancy: just what I was about to ask! 
[21:07:03] (if you know that no deployments happened in between) [21:07:19] okay, so I should be able to just scap deploy the revert [21:07:26] nod [21:08:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037132 (owner: 10Jsn.sherman) [21:13:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P63609 and previous config saved to /var/cache/conftool/dbconfig/20240529-211321-marostegui.json [21:14:39] (03PS2) 10Scott French: toolhub: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037165 (https://phabricator.wikimedia.org/T362978) [21:18:29] (03Merged) 10jenkins-bot: Revert "feature(Popups): Conditional User Defaults Implementation" [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037132 (owner: 10Jsn.sherman) [21:19:00] !log jsn@deploy1002 Started scap: Backport for [[gerrit:1037132|Revert "feature(Popups): Conditional User Defaults Implementation"]] [21:21:34] !log jsn@deploy1002 jsn: Backport for [[gerrit:1037132|Revert "feature(Popups): Conditional User Defaults Implementation"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:21:40] !log jsn@deploy1002 Sync cancelled. [21:21:43] Jdlrobson: you should be reverted [21:28:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P63610 and previous config saved to /var/cache/conftool/dbconfig/20240529-212830-marostegui.json [21:31:33] cdanis: you should be good to go btw [21:37:41] dancy: created a phab task at https://phabricator.wikimedia.org/T366217 [21:37:50] Thanks! 
[21:38:29] (03PS1) 10CDanis: freshen hardcoded IDP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037179 (https://phabricator.wikimedia.org/T365855) [21:38:52] (03PS1) 10RLazarus: Fix tests for Python 3.8+ [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037180 [21:40:31] (03CR) 10CDanis: [C:03+2] freshen hardcoded IDP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037179 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [21:41:18] (03Merged) 10jenkins-bot: freshen hardcoded IDP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037179 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [21:41:58] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [21:42:31] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [21:43:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T364299)', diff saved to https://phabricator.wikimedia.org/P63611 and previous config saved to /var/cache/conftool/dbconfig/20240529-214338-marostegui.json [21:43:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2198.codfw.wmnet with reason: Maintenance [21:43:44] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [21:43:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2198.codfw.wmnet with reason: Maintenance [21:44:57] (03PS2) 10CDanis: jaeger: link to Mediawiki debug Logstash [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035829 (https://phabricator.wikimedia.org/T320549) [21:45:43] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:46:17] 
(03CR) 10CDanis: [C:03+2] jaeger: link to Mediawiki debug Logstash (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035829 (https://phabricator.wikimedia.org/T320549) (owner: 10CDanis) [21:47:10] (03Merged) 10jenkins-bot: jaeger: link to Mediawiki debug Logstash [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035829 (https://phabricator.wikimedia.org/T320549) (owner: 10CDanis) [21:47:23] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [21:47:27] (03CR) 10Dzahn: [C:03+1] "As pointed out by elukey on the linked ticket, we don't install systemd-coredump. There is one single system here https://debmonitor.wikim" [puppet] - 10https://gerrit.wikimedia.org/r/1029235 (https://phabricator.wikimedia.org/T236253) (owner: 10Ahmon Dancy) [21:47:59] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [21:54:40] (03CR) 10Ahmon Dancy: [C:03+2] "Thanks!" [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037180 (owner: 10RLazarus) [21:56:20] (03Merged) 10jenkins-bot: Fix tests for Python 3.8+ [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037180 (owner: 10RLazarus) [21:57:23] (03PS2) 10Ahmon Dancy: Add more junk to .gitignore [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037158 [21:57:23] (03PS2) 10Ahmon Dancy: Make header expected/got failure output multiline for easier human viewing [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037156 [22:00:19] (03CR) 10RLazarus: [C:03+2] Add more junk to .gitignore [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037158 (owner: 10Ahmon Dancy) [22:00:29] (03CR) 10RLazarus: [C:03+2] "Thanks for this!" 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/1037156 (owner: 10Ahmon Dancy) [22:00:38] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1009.eqiad.wmnet with OS bullseye [22:00:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9844395 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye [22:01:47] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye [22:01:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9844410 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye [22:02:58] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1009.eqiad.wmnet with reason: host reimage [22:03:41] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1010.eqiad.wmnet with reason: host reimage [22:04:59] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:05:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1009.eqiad.wmnet with OS bullseye [22:05:07] !log jclark@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on kafka-main1010.eqiad.wmnet with reason: host reimage [22:05:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9844420 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host 
kafka-main1009.eqiad.wmnet with OS bullseye completed... [22:06:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1009.eqiad.wmnet with reason: host reimage [22:07:38] (03Merged) 10jenkins-bot: Add more junk to .gitignore [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037158 (owner: 10Ahmon Dancy) [22:07:39] (03Merged) 10jenkins-bot: Make header expected/got failure output multiline for easier human viewing [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037156 (owner: 10Ahmon Dancy) [22:09:40] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:09:40] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:10:02] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:11:02] (03PS1) 10RLazarus: Release v0.0.4. [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037186 [22:13:40] (03CR) 10RLazarus: [C:03+2] Release v0.0.4. [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037186 (owner: 10RLazarus) [22:15:16] (03Merged) 10jenkins-bot: Release v0.0.4. [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037186 (owner: 10RLazarus) [22:16:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1009.eqiad.wmnet with OS bullseye [22:17:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9844458 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye completed... 
[22:18:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9844468 (10Jclark-ctr) [22:27:33] dancy: lolsob, the test is failing in the debian build for bullseye -- I was moving too fast, it depends on the version of the jsonschema package, not the Python version 🙃 [22:27:49] I'll get it untangled and release a new version properly, but if it doesn't happen before I turn into a pumpkin in 33 minutes, it'll be tomorrow [22:28:40] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:29:06] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:29:40] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:30:10] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users, wmf for Sonja Perry - https://phabricator.wikimedia.org/T365766#9844495 (10colewhite) [22:31:19] rzl: good times. :-) [22:31:36] rzl: No rush. [22:31:40] 👍 [22:32:31] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9844511 (10Ahoelzl) Approved.
[22:38:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2200.codfw.wmnet with reason: Maintenance [22:38:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2200.codfw.wmnet with reason: Maintenance [22:41:14] (03Abandoned) 10Jdlrobson: POC: Wrap tables with JS [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037131 (owner: 10Bernard Wang) [22:49:10] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 190480992 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:50:10] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 8344 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:52:51] (03PS1) 10Jdlrobson: Popups setting should be string not integer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037189 (https://phabricator.wikimedia.org/T364347) [22:52:57] (03PS1) 10RLazarus: Really fix tests for jsonschema. [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037190 [22:54:28] (03CR) 10CI reject: [V:04-1] Really fix tests for jsonschema. [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037190 (owner: 10RLazarus) [22:54:50] that commit message was asking for it, I guess [22:55:02] lol [22:55:27] (03PS2) 10RLazarus: Really fix tests for jsonschema. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/1037190 [22:56:03] (03PS1) 10Jdlrobson: Revert "feature(Popups): Conditional User Defaults Implementation" [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037133 (https://phabricator.wikimedia.org/T364347) [22:56:29] (03PS2) 10Jdlrobson: Revert "feature(Popups): Conditional User Defaults Implementation" [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037133 (https://phabricator.wikimedia.org/T364347) [22:56:35] (03PS3) 10Jdlrobson: Revert "feature(Popups): Conditional User Defaults Implementation" [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037133 (https://phabricator.wikimedia.org/T364347) [22:56:50] (03CR) 10Stoyofuku-wmf: [C:03+1] "😭" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037189 (https://phabricator.wikimedia.org/T364347) (owner: 10Jdlrobson) [23:00:46] (03Abandoned) 10Jdlrobson: Revert "feature(Popups): Conditional User Defaults Implementation" [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037133 (https://phabricator.wikimedia.org/T364347) (owner: 10Jdlrobson) [23:02:32] 06SRE, 06serviceops: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9844588 (10CDanis) >>! In T366094#9842327, @akosiaris wrote: > I am gonna disagree on this one. [This](https://grafana-rw.wikimedia.org/d/d304d897-54ea-4062-a504-6f2567ed7dba/t366094?orgId=1&from=1716910376624&to=171691... 
[23:05:54] (03PS1) 10Scott French: termbox: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037193 (https://phabricator.wikimedia.org/T362978) [23:06:09] (03PS1) 10Scott French: similar-users: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037194 (https://phabricator.wikimedia.org/T362978) [23:06:23] (03PS1) 10Scott French: kask: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037195 (https://phabricator.wikimedia.org/T362978) [23:06:38] (03PS1) 10Scott French: chromium-render: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037196 (https://phabricator.wikimedia.org/T362978) [23:15:49] (03CR) 10Jforrester: [C:03+1] Add a stream for tracking the API of WikiLambda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017962 (https://phabricator.wikimedia.org/T356228) (owner: 10David Martin) [23:29:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2208.codfw.wmnet with reason: Maintenance [23:29:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2208.codfw.wmnet with reason: Maintenance [23:29:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T364299)', diff saved to https://phabricator.wikimedia.org/P63612 and previous config saved to /var/cache/conftool/dbconfig/20240529-232924-marostegui.json [23:29:34] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [23:38:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1036600 [23:38:27] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1036600 (owner: 10TrainBranchBot) [23:59:16] (03Merged) 10jenkins-bot: Branch commit for 
wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1036600 (owner: 10TrainBranchBot)