[00:05:01] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:05:05] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:05:44] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:10:58] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:11:02] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:13:06] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:13:09] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:15:02] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:15:05] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:17:49] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:17:53] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:19:36] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:19:39] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:21:23] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:21:26] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:24:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T352010)', diff saved to https://phabricator.wikimedia.org/P63895 and previous config saved to /var/cache/conftool/dbconfig/20240603-002359-ladsgroup.json [00:24:02] T352010: 
Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [00:31:20] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:31:23] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:32:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T364299)', diff saved to https://phabricator.wikimedia.org/P63896 and previous config saved to /var/cache/conftool/dbconfig/20240603-003247-marostegui.json [00:32:52] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [00:33:07] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:33:10] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:35:03] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:35:07] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:37:10] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:37:14] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:38:57] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:39:01] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:39:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P63897 and previous config saved to /var/cache/conftool/dbconfig/20240603-003907-ladsgroup.json [00:41:24] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:41:27] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:43:21] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply 
[00:43:24] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:45:28] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:45:31] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:47:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P63898 and previous config saved to /var/cache/conftool/dbconfig/20240603-004757-marostegui.json [00:48:15] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:48:18] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:50:21] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:50:25] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:52:28] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:52:31] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:54:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P63899 and previous config saved to /var/cache/conftool/dbconfig/20240603-005415-ladsgroup.json [00:54:35] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:54:39] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:56:43] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:56:46] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:58:50] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:58:54] !log @deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [01:00:47] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:00:50] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:02:54] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:02:57] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:03:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P63900 and previous config saved to /var/cache/conftool/dbconfig/20240603-010305-marostegui.json [01:09:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T352010)', diff saved to https://phabricator.wikimedia.org/P63901 and previous config saved to /var/cache/conftool/dbconfig/20240603-010925-ladsgroup.json [01:09:27] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [01:09:28] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [01:09:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [01:11:22] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:11:25] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:18:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T364299)', diff saved to https://phabricator.wikimedia.org/P63902 and previous config saved to /var/cache/conftool/dbconfig/20240603-011813-marostegui.json [01:18:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [01:18:20] T364299: Make rc_id a bigint - 
https://phabricator.wikimedia.org/T364299 [01:18:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [01:18:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T364299)', diff saved to https://phabricator.wikimedia.org/P63903 and previous config saved to /var/cache/conftool/dbconfig/20240603-011839-marostegui.json [01:23:58] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:24:01] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:25:55] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:25:59] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:27:52] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:27:55] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:29:59] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:30:02] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:31:55] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:31:58] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:38:22] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:38:25] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:40:29] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:40:32] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:42:26] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:42:29] !log @deploy1002 
helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:44:17] SRE, Infrastructure-Foundations, netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9853206 (OKJ04) [01:44:25] SRE, Traffic: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360#9853207 (OKJ04) [01:44:47] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, Spicerack: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9853212 (OKJ04) [01:45:21] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Tchanders - https://phabricator.wikimedia.org/T366351#9853216 (OKJ04) [01:45:23] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:45:26] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:45:40] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Include vlans with defined IRB int in device vlans even if no port present - https://phabricator.wikimedia.org/T366348#9853219 (OKJ04) [01:51:42] SRE, Traffic: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360#9853309 (JJMC89) [01:51:48] SRE, Infrastructure-Foundations, netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9853308 (JJMC89) [01:52:09] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, Spicerack: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9853314 (JJMC89) [01:52:43] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Tchanders - https://phabricator.wikimedia.org/T366351#9853318 (JJMC89) [01:53:01] SRE, Infrastructure-Foundations, netops, Patch-For-Review: 
Include vlans with defined IRB int in device vlans even if no port present - https://phabricator.wikimedia.org/T366348#9853321 (JJMC89) [02:03:43] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:10:00] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:10:03] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:11:57] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:12:00] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:14:34] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:14:37] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:16:31] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:16:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:25:17] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:25:20] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:27:14] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:27:17] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:29:00] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:29:04] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:31:08] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:31:11] !log @deploy1002 
helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:33:04] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:33:07] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:35:01] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:35:04] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:36:58] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:37:00] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:38:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:38:54] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:38:58] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:42:22] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:42:25] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:44:19] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:44:22] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:47:28] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:47:31] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:49:25] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:49:28] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply 
[02:52:32] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:52:35] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:56:39] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:56:43] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:58:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:07] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:01:10] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:03:04] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:03:07] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:03:43] RESOLVED: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:05:01] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:05:04] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:06:58] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:07:01] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:08:44] (PS2) Andrew Bogott: cloud-vps: turn off puppet report storage on cloud-vps puppetservers [puppet] - https://gerrit.wikimedia.org/r/1037812 
(https://phabricator.wikimedia.org/T366357) [03:08:54] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:08:58] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:10:41] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:10:44] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:14:08] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:14:11] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:16:05] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:16:08] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:26:42] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:26:45] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:29:48] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:29:52] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:31:56] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:31:59] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:34:03] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:34:06] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:36:10] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:36:13] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:38:07] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply 
[03:38:10] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:40:03] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:40:07] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:41:51] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:41:54] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:43:48] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:43:52] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:46:15] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:46:18] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:09:32] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:09:35] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:25:25] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:29:41] (PS1) Pppery: [pawiki] Enable wgMinervaEnableSiteNotice [mediawiki-config] - https://gerrit.wikimedia.org/r/1037945 (https://phabricator.wikimedia.org/T332813) [04:30:35] (PS2) Pppery: [pawiki] Enable wgMinervaEnableSiteNotice [mediawiki-config] - https://gerrit.wikimedia.org/r/1037945 (https://phabricator.wikimedia.org/T366434) [04:34:49] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:34:52] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:35:04] (PS1) 
Marostegui: db*: Remove puppet7 lines [puppet] - https://gerrit.wikimedia.org/r/1037946 [04:47:20] RECOVERY - MariaDB Replica Lag: s5 #page on db1213 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:49:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 1%: Repooling T366429', diff saved to https://phabricator.wikimedia.org/P63904 and previous config saved to /var/cache/conftool/dbconfig/20240603-044918-root.json [04:49:21] T366429: db1213 replication broken (Index for table dewiki.page_props is corrupt) - https://phabricator.wikimedia.org/T366429 [05:04:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 5%: Repooling T366429', diff saved to https://phabricator.wikimedia.org/P63905 and previous config saved to /var/cache/conftool/dbconfig/20240603-050424-root.json [05:04:29] T366429: db1213 replication broken (Index for table dewiki.page_props is corrupt) - https://phabricator.wikimedia.org/T366429 [05:06:34] (CR) Marostegui: [C:+2] db*: Remove puppet7 lines [puppet] - https://gerrit.wikimedia.org/r/1037946 (owner: Marostegui) [05:19:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 10%: Repooling T366429', diff saved to https://phabricator.wikimedia.org/P63906 and previous config saved to /var/cache/conftool/dbconfig/20240603-051932-root.json [05:19:35] T366429: db1213 replication broken (Index for table dewiki.page_props is corrupt) - https://phabricator.wikimedia.org/T366429 [05:34:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 25%: Repooling T366429', diff saved to https://phabricator.wikimedia.org/P63907 and previous config saved to /var/cache/conftool/dbconfig/20240603-053438-root.json [05:34:43] T366429: db1213 replication broken (Index for table dewiki.page_props is corrupt) - https://phabricator.wikimedia.org/T366429 [05:35:49] (PS1) KartikMistry: 
testwiki: Fix language for nan in Section Translation [mediawiki-config] - https://gerrit.wikimedia.org/r/1037949 [05:39:26] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:39:30] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:49:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 50%: Repooling T366429', diff saved to https://phabricator.wikimedia.org/P63908 and previous config saved to /var/cache/conftool/dbconfig/20240603-054944-root.json [05:49:48] T366429: db1213 replication broken (Index for table dewiki.page_props is corrupt) - https://phabricator.wikimedia.org/T366429 [05:52:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T364299)', diff saved to https://phabricator.wikimedia.org/P63909 and previous config saved to /var/cache/conftool/dbconfig/20240603-055210-marostegui.json [05:52:14] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [05:52:33] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:52:37] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:04:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 75%: Repooling T366429', diff saved to https://phabricator.wikimedia.org/P63910 and previous config saved to /var/cache/conftool/dbconfig/20240603-060450-root.json [06:04:54] T366429: db1213 replication broken (Index for table dewiki.page_props is corrupt) - https://phabricator.wikimedia.org/T366429 [06:05:25] 
FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:07:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P63911 and previous config saved to /var/cache/conftool/dbconfig/20240603-060719-marostegui.json [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:19:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 100%: Repooling T366429', diff saved to https://phabricator.wikimedia.org/P63912 and previous config saved to /var/cache/conftool/dbconfig/20240603-061956-root.json [06:20:03] T366429: db1213 replication broken (Index for table dewiki.page_props is corrupt) - https://phabricator.wikimedia.org/T366429 [06:22:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P63913 and previous config saved to /var/cache/conftool/dbconfig/20240603-062227-marostegui.json [06:37:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T364299)', diff saved to https://phabricator.wikimedia.org/P63914 and previous config saved to /var/cache/conftool/dbconfig/20240603-063735-marostegui.json [06:37:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance [06:37:39] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [06:37:52] !log 
marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance [06:37:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance [06:38:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance [06:38:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T364299)', diff saved to https://phabricator.wikimedia.org/P63915 and previous config saved to /var/cache/conftool/dbconfig/20240603-063814-marostegui.json [06:39:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T364299)', diff saved to https://phabricator.wikimedia.org/P63916 and previous config saved to /var/cache/conftool/dbconfig/20240603-063925-marostegui.json [06:54:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P63917 and previous config saved to /var/cache/conftool/dbconfig/20240603-065434-marostegui.json [06:58:52] (PS1) Muehlenhoff: Remove obsolete Hiera entries [puppet] - https://gerrit.wikimedia.org/r/1038109 [07:00:04] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240603T0700). [07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:26] (CR) Muehlenhoff: [C:+2] Deprecate system::role for toolforge roles [puppet] - https://gerrit.wikimedia.org/r/1037730 (owner: Muehlenhoff) [07:00:53] * kart_ is here and will start deployment.. 
[07:03:03] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1037949 (owner: KartikMistry) [07:03:41] (CR) Muehlenhoff: [C:+2] Deprecate system::role for mariadb roles [puppet] - https://gerrit.wikimedia.org/r/1037729 (owner: Muehlenhoff) [07:03:41] (Merged) jenkins-bot: testwiki: Fix language for nan in Section Translation [mediawiki-config] - https://gerrit.wikimedia.org/r/1037949 (owner: KartikMistry) [07:04:16] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1037949|testwiki: Fix language for nan in Section Translation]] [07:09:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P63918 and previous config saved to /var/cache/conftool/dbconfig/20240603-070942-marostegui.json [07:18:18] !log kartik@deploy1002 kartik: Backport for [[gerrit:1037949|testwiki: Fix language for nan in Section Translation]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:22:29] !log kartik@deploy1002 kartik: Continuing with sync [07:24:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T364299)', diff saved to https://phabricator.wikimedia.org/P63919 and previous config saved to /var/cache/conftool/dbconfig/20240603-072450-marostegui.json [07:24:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2174.codfw.wmnet with reason: Maintenance [07:24:53] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [07:25:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2174.codfw.wmnet with reason: Maintenance [07:25:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T364299)', diff saved to https://phabricator.wikimedia.org/P63920 and previous config saved to 
/var/cache/conftool/dbconfig/20240603-072513-marostegui.json [07:32:54] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1037949|testwiki: Fix language for nan in Section Translation]] (duration: 28m 37s) [07:39:07] (CR) Muehlenhoff: [C:+2] Remove obsolete wikikube/staging etcd certificates [puppet] - https://gerrit.wikimedia.org/r/1036998 (https://phabricator.wikimedia.org/T357750) (owner: Muehlenhoff) [07:42:06] (PS2) Muehlenhoff: Remove obsolete wikikube etcd certificates [puppet] - https://gerrit.wikimedia.org/r/1037002 (https://phabricator.wikimedia.org/T357750) [07:43:30] SRE, Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9853619 (ayounsi) I think the difficult part is where to stop the overengineering, for example it could make sens to use Liberica to healthcheck/advertise one of the NS anycast IP, but it might not be worth using a differ... [07:43:52] SRE, Wikimedia-Mailing-lists: Cross post to multiple mailling lists is only received once by recipient - https://phabricator.wikimedia.org/T345691#9853620 (hashar) The thread https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/RCKRQ2GKRVLGVLFJMOCURY3BYM4GOWYA/ had two repli... 
[07:48:29] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1037002 (https://phabricator.wikimedia.org/T357750) (owner: Muehlenhoff)
[08:00:04] hashar: Deploy window Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240603T0800)
[08:00:48] ^
[08:03:07] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9853652 (MoritzMuehlenhoff)
[08:04:52] !log jiji@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc1039.eqiad.wmnet']
[08:06:24] (CR) Hashar: [C:+2] Gerrit 3.9.5, rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.9) - https://gerrit.wikimedia.org/r/1037041 (https://phabricator.wikimedia.org/T354887) (owner: Hashar)
[08:06:56] (Merged) jenkins-bot: Gerrit 3.9.5, rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.9) - https://gerrit.wikimedia.org/r/1037041 (https://phabricator.wikimedia.org/T354887) (owner: Hashar)
[08:08:13] !log hashar@deploy1002 Started deploy [gerrit/gerrit@7838134]: Gerrit to v3.9.5 on gerrit2002 - T354887
[08:08:15] T354887: Upgrade to Gerrit 3.9 - https://phabricator.wikimedia.org/T354887
[08:08:20] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@7838134]: Gerrit to v3.9.5 on gerrit2002 - T354887 (duration: 00m 08s)
[08:08:35] !log hashar@deploy1002 Started deploy [gerrit/gerrit@7838134]: Gerrit to v3.9.5 on gerrit1003 - T354887
[08:08:40] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@7838134]: Gerrit to v3.9.5 on gerrit1003 - T354887 (duration: 00m 05s)
[08:09:16] SRE, Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9853670 (ayounsi) > - i.e. one in Germany which will pick ns0 rather than lower latency ns2 Seems like the main one is adguard-dns.com, which picks them randomly. 
https://w.wiki/AGmr We can't really afford to email all...
[08:09:43] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc-gp2002.codfw.wmnet with OS bookworm
[08:09:46] OH f**
[08:09:52] that is broken again :)
[08:10:45] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc-gp1003.eqiad.wmnet with OS bookworm
[08:11:02] (PS1) Muehlenhoff: Remove obsolete stub certs [labs/private] - https://gerrit.wikimedia.org/r/1038222 (https://phabricator.wikimedia.org/T364622)
[08:11:45] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[08:12:04] (PS1) Hashar: Revert "Gerrit 3.9.5, rebuild plugins and update TypeScript API" [software/gerrit] (deploy/wmf/stable-3.9) - https://gerrit.wikimedia.org/r/1037922 (https://phabricator.wikimedia.org/T354887)
[08:12:55] (CR) Hashar: [C:+2] Revert "Gerrit 3.9.5, rebuild plugins and update TypeScript API" [software/gerrit] (deploy/wmf/stable-3.9) - https://gerrit.wikimedia.org/r/1037922 (https://phabricator.wikimedia.org/T354887) (owner: Hashar)
[08:13:30] (Merged) jenkins-bot: Revert "Gerrit 3.9.5, rebuild plugins and update TypeScript API" [software/gerrit] (deploy/wmf/stable-3.9) - https://gerrit.wikimedia.org/r/1037922 (https://phabricator.wikimedia.org/T354887) (owner: Hashar)
[08:14:09] (PS1) MVernon: ceph: add mgr module for rgw (upstream packaging bug) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1038223
[08:15:24] !log hashar@deploy1002 Started deploy [gerrit/gerrit@c93e47d]: Revert Gerrit back to 3.8.6 - T354887
[08:15:26] T354887: Upgrade to Gerrit 3.9 - https://phabricator.wikimedia.org/T354887
[08:15:29] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@c93e47d]: Revert Gerrit back to 3.8.6 - T354887 (duration: 00m 
05s)
[08:16:45] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[08:16:46] I have canceled the Gerrit upgrade, the upstream .war requires Java 17 and we run Java 11 :/
[08:16:56] ack
[08:16:57] which I would have caught had I had tested that grrr
[08:17:22] the Gerrit replica got stopped for some minutes since I stopped the service manually to upgrade it
[08:17:41] anyway, all set and we are sticking to Gerrit 3.8.6 for now :)
[08:17:44] jelto: thanks!
[08:19:04] (CR) Muehlenhoff: [V:+2 C:+2] Remove obsolete stub certs [labs/private] - https://gerrit.wikimedia.org/r/1038222 (https://phabricator.wikimedia.org/T364622) (owner: Muehlenhoff)
[08:23:16] (CR) Arnaudb: [C:+1] ceph: add mgr module for rgw (upstream packaging bug) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1038223 (owner: MVernon)
[08:24:10] (CR) MVernon: [V:+2 C:+2] ceph: add mgr module for rgw (upstream packaging bug) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1038223 (owner: MVernon)
[08:25:06] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp1003.eqiad.wmnet with reason: host reimage
[08:27:36] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp2002.codfw.wmnet with reason: host reimage
[08:28:41] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp1003.eqiad.wmnet with reason: host reimage
[08:29:32] (Abandoned) Effie Mouzeli: mediawiki::memcached: switch to running as user memcache [puppet] - https://gerrit.wikimedia.org/r/1034839 (https://phabricator.wikimedia.org/T273950) (owner: Effie Mouzeli)
[08:30:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1229.eqiad.wmnet 
with reason: Maintenance [08:30:58] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance [08:31:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T352010)', diff saved to https://phabricator.wikimedia.org/P63921 and previous config saved to /var/cache/conftool/dbconfig/20240603-083106-ladsgroup.json [08:31:33] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp2002.codfw.wmnet with reason: host reimage [08:32:30] (03PS21) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) [08:32:48] (03CR) 10Urbanecm: [C:04-1] "-1 for visibility, as it did not receive a response." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T359038) (owner: 10Sergio Gimeno) [08:35:36] (03CR) 10Brouberol: [C:03+1] "Looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037870 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French) [08:37:12] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9853787 (10MoritzMuehlenhoff) >>! In T365574#9843222, @Dzahn wrote: >>>! In T365574#9829202, @jon_amar-WMDE wrote: >> Hi @Dzahn I'm not clear whether I c... 
[08:37:57] (03CR) 10Muehlenhoff: [C:03+2] configmaster: Enable profile::auto_restarts::service for apache/Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1023847 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:45:30] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp1003.eqiad.wmnet with OS bookworm [08:49:59] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp2002.codfw.wmnet with OS bookworm [08:57:20] (03CR) 10Clément Goubert: [C:03+1] Remove obsolete wikikube etcd certificates [puppet] - 10https://gerrit.wikimedia.org/r/1037002 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [08:58:11] (03CR) 10Muehlenhoff: [C:03+2] maps: Switch kartotherian on maps2007 to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [09:05:22] (03PS1) 10Arturo Borrero Gonzalez: toolforge: docker-registry: enable HTTP endpoint for svc.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/1038227 (https://phabricator.wikimedia.org/T366453) [09:05:33] (03PS1) 10Gergő Tisza: Remove references to upload.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038230 (https://phabricator.wikimedia.org/T366415) [09:05:49] (03PS2) 10Arturo Borrero Gonzalez: toolforge: docker-registry: enable HTTP endpoint for svc.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/1038227 (https://phabricator.wikimedia.org/T366453) [09:06:23] (03PS2) 10Gergő Tisza: [beta] Remove references to upload.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038230 (https://phabricator.wikimedia.org/T366415) [09:06:27] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038227 (https://phabricator.wikimedia.org/T366453) (owner: 10Arturo Borrero Gonzalez) [09:06:57] (03PS3) 10Arturo Borrero Gonzalez: toolforge: docker-registry: enable HTTP 
endpoint for svc.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/1038227 (https://phabricator.wikimedia.org/T366453) [09:07:03] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038227 (https://phabricator.wikimedia.org/T366453) (owner: 10Arturo Borrero Gonzalez) [09:08:07] !log jiji@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mc1039.eqiad.wmnet'] [09:08:11] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:08:15] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:10:24] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1039.eqiad.wmnet with OS bookworm [09:10:48] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc-gp2001.codfw.wmnet with OS bookworm [09:16:42] (03PS4) 10Arturo Borrero Gonzalez: toolforge: docker-registry: enable HTTP endpoint for svc.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/1038227 (https://phabricator.wikimedia.org/T366453) [09:16:55] (03PS5) 10Arturo Borrero Gonzalez: toolforge: docker-registry: enable HTTP endpoint for svc.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/1038227 (https://phabricator.wikimedia.org/T366453) [09:17:03] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038227 (https://phabricator.wikimedia.org/T366453) (owner: 10Arturo Borrero Gonzalez) [09:22:43] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1039.eqiad.wmnet with reason: host reimage [09:23:59] (03PS1) 10Muehlenhoff: Switch maps/codfw to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1038240 (https://phabricator.wikimedia.org/T360778) [09:25:08] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1039.eqiad.wmnet with reason: host reimage [09:28:45] (03PS2) 10Muehlenhoff: Switch 
maps/codfw to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1038240 (https://phabricator.wikimedia.org/T360778) [09:29:08] (03PS3) 10Gergő Tisza: multiversion: Support beta for upload hostname check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037929 [09:29:08] (03PS3) 10Gergő Tisza: multiversion: Add tests for MWMultiVersion::getMediaWiki() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037930 [09:29:09] (03PS6) 10Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [09:29:20] (03CR) 10Clément Goubert: "All done I think :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978) (owner: 10Clément Goubert) [09:29:30] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp2001.codfw.wmnet with reason: host reimage [09:31:56] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp2001.codfw.wmnet with reason: host reimage [09:32:48] (03CR) 10Kosta Harlan: "Discussed with Jay and Suman. 
There should not be license concerns as the commercial redistribution license relates to including the DB fi" [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [09:33:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038240 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [09:37:51] (03PS1) 10Ladsgroup: Stop writing to the old pagelinks columns in s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038243 (https://phabricator.wikimedia.org/T352010) [09:38:23] jouncebot: nowandnext [09:38:23] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [09:38:23] In 0 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240603T1000) [09:38:45] (03CR) 10Ladsgroup: [C:03+2] Stop writing to the old pagelinks columns in s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038243 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup) [09:39:27] (03Merged) 10jenkins-bot: Stop writing to the old pagelinks columns in s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038243 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup) [09:40:00] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1038243|Stop writing to the old pagelinks columns in s8 (T352010)]] [09:40:03] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:40:52] (03PS1) 10GergesShamon: [trwiki] Reducing count edits ip and newbie per minute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038245 [09:41:27] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1039.eqiad.wmnet with OS bookworm [09:41:42] (03Abandoned) 10GergesShamon: [trwiki] Reducing count edits ip and newbie per minute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038245 (owner: 10GergesShamon) 
[09:42:30] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1038243|Stop writing to the old pagelinks columns in s8 (T352010)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:43:51] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host snapshot1013.eqiad.wmnet with OS bullseye [09:44:10] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.27 - 2024.06.16), 13Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9854057 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cum... [09:44:55] (03PS1) 10GergesShamon: [trwiki] Reducing count edits ip and newbie per minute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038247 (https://phabricator.wikimedia.org/T330811) [09:45:38] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [09:49:22] (03CR) 10Tchanders: [C:03+1] "Looks good from the perspective of IPInfo." [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [09:49:43] !log jiji@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host mc-gp2001.codfw.wmnet with OS bookworm [09:50:22] (03PS2) 10Sergio Gimeno: CommunityConfiguration: set feedback url instead of bug tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036613 (https://phabricator.wikimedia.org/T363801) [09:55:39] > 09:55:05 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'mw1366.eqiad.wmnet', 'mw1407.eqiad.wmnet', 'mw2289.codfw.wmnet', 'mw2259.codfw.wmnet', 'mw1420.eqiad.wmnet', 'deploy2002.codfw.wmnet', 'mw1398.eqiad.wmnet', 'deploy1002.eqiad.wmnet', 'mw1404.eqiad.wmnet', 'mw2300.codfw.wmnet'] (ran as mwdeploy@snapshot1013.eqiad.wmnet) returned [255]: ssh: connect to host [09:55:39] snapshot1013.eqiad.wmnet port 22: Connection timed out [09:55:56] btullis: is that you? 
It sorta broke my mw deployment [09:56:25] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1013.eqiad.wmnet with reason: host reimage [09:56:51] (03PS1) 10Muehlenhoff: Switch gerrit2002 to Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1038249 (https://phabricator.wikimedia.org/T364342) [09:57:50] !log Restarting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [09:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:29] (03PS2) 10Muehlenhoff: Switch gerrit2002 to Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1038249 (https://phabricator.wikimedia.org/T364342) [09:58:40] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1038243|Stop writing to the old pagelinks columns in s8 (T352010)]] (duration: 18m 39s) [09:58:40] (03CR) 10David Caro: [C:03+1] toolforge: docker-registry: enable HTTP endpoint for svc.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/1038227 (https://phabricator.wikimedia.org/T366453) (owner: 10Arturo Borrero Gonzalez) [09:58:43] (03CR) 10Jelto: [V:03+1 C:03+2] docker_registry_ha: replace deprecated /-/jwks endpoint on gitlab [puppet] - 10https://gerrit.wikimedia.org/r/1037043 (https://phabricator.wikimedia.org/T365675) (owner: 10Jelto) [09:58:43] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:59:00] (03CR) 10Michael Große: "A different wiki would also be perfectly fine for me. Maybe beta enwiki? Because they also have it disabled in prod." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T359038) (owner: 10Sergio Gimeno) [09:59:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038249 (https://phabricator.wikimedia.org/T364342) (owner: 10Muehlenhoff) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240603T1000) [10:02:11] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1013.eqiad.wmnet with reason: host reimage [10:03:29] (03Abandoned) 10Arturo Borrero Gonzalez: toolforge: docker-registry: enable HTTP endpoint for svc.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/1038227 (https://phabricator.wikimedia.org/T366453) (owner: 10Arturo Borrero Gonzalez) [10:03:43] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1038.eqiad.wmnet with OS bookworm [10:04:02] (03PS2) 10Sergio Gimeno: [Beta] cswiki: enable CommunityConfiguration for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035726 (https://phabricator.wikimedia.org/T364892) [10:05:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:08:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T364069)', diff saved to https://phabricator.wikimedia.org/P63922 and previous config saved to /var/cache/conftool/dbconfig/20240603-100827-marostegui.json [10:08:31] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [10:10:54] (03PS3) 10Sergio Gimeno: [Beta] cswiki: enable CommunityConfiguration for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035726 
(https://phabricator.wikimedia.org/T364892) [10:10:54] (03PS3) 10Sergio Gimeno: [Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) [10:14:16] (03CR) 10Sergio Gimeno: "Indeed eswiki is not the "best" pick up for this purposes. I amended the change to disable personalized praise only in dewiki with the sam" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: 10Sergio Gimeno) [10:18:10] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1038.eqiad.wmnet with reason: host reimage [10:19:00] (03PS1) 10Clément Goubert: miscweb: Use a random miscweb image for default value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) [10:21:18] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1038.eqiad.wmnet with reason: host reimage [10:23:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P63923 and previous config saved to /var/cache/conftool/dbconfig/20240603-102335-marostegui.json [10:29:11] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host snapshot1013.eqiad.wmnet with OS bullseye [10:29:27] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.27 - 2024.06.16), 13Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9854241 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin10... 
[10:37:47] (03PS1) 10Clément Goubert: Empty commit to trigger rebuild [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1038256 (https://phabricator.wikimedia.org/T362518) [10:38:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P63924 and previous config saved to /var/cache/conftool/dbconfig/20240603-103844-marostegui.json [10:40:02] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1038.eqiad.wmnet with OS bookworm [10:41:24] !log installing linux 5.10.218 security updates [10:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:32] (03PS1) 10Hashar: Rebuild plugins for Java 17 [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038259 (https://phabricator.wikimedia.org/T364342) [10:44:45] (03CR) 10Michael Große: [Beta] Enable CommunityConfiguration extension in all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: 10Sergio Gimeno) [10:44:52] (03CR) 10Sergio Gimeno: "Scheduled for today afternoon window." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035726 (https://phabricator.wikimedia.org/T364892) (owner: 10Sergio Gimeno) [10:45:18] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.27 - 2024.06.16), 13Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9854288 (10BTullis) [10:48:45] (03PS1) 10Hashar: [WMF] rebuild plugins for Java 17 [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038260 (https://phabricator.wikimedia.org/T364342) [10:49:55] (03CR) 10Hashar: [C:04-2] Rebuild plugins for Java 17 [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038259 (https://phabricator.wikimedia.org/T364342) (owner: 10Hashar) [10:50:06] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1037.eqiad.wmnet with OS bookworm [10:53:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T364069)', diff saved to https://phabricator.wikimedia.org/P63925 and previous config saved to /var/cache/conftool/dbconfig/20240603-105352-marostegui.json [10:53:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [10:53:55] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [10:54:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [10:54:13] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host snapshot1013.eqiad.wmnet [10:54:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T364069)', diff saved to https://phabricator.wikimedia.org/P63926 and previous config saved to /var/cache/conftool/dbconfig/20240603-105416-marostegui.json [10:55:43] (03PS1) 10Muehlenhoff: Switch snapshot1013 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1038261 
(https://phabricator.wikimedia.org/T349619) [11:01:42] (03CR) 10Muehlenhoff: [C:03+2] Switch snapshot1013 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1038261 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:03:41] (03CR) 10Volans: "I agree with the general direction, left some possible alternative approaches for some of the blocks. Let me know what do you think." [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [11:04:17] (03PS2) 10Hashar: [WMF] target Bazel to use Java 17 [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038260 (https://phabricator.wikimedia.org/T364342) [11:04:38] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1037.eqiad.wmnet with reason: host reimage [11:07:28] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1037.eqiad.wmnet with reason: host reimage [11:07:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host snapshot1013.eqiad.wmnet [11:09:30] (03CR) 10CI reject: [V:04-1] [WMF] target Bazel to use Java 17 [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038260 (https://phabricator.wikimedia.org/T364342) (owner: 10Hashar) [11:10:54] (03CR) 10Hashar: "recheck Download from https://nodejs.org/dist/v17.9.1/node-v17.9.1-linux-x64.tar.xz failed: class java.io.IOException Read timed out" [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038260 (https://phabricator.wikimedia.org/T364342) (owner: 10Hashar) [11:16:19] (03PS1) 10Effie Mouzeli: memcached: enable extstore on mc1050/mc2050 [puppet] - 10https://gerrit.wikimedia.org/r/1038262 (https://phabricator.wikimedia.org/T352885) [11:19:11] (03CR) 10Cathal Mooney: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney) [11:21:55] (03PS1) 10Kosta 
Harlan: IPReputationHooks: Support disabling edit logging for older accounts [extensions/WikimediaEvents] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037925 (https://phabricator.wikimedia.org/T354597) [11:22:34] (03PS2) 10Effie Mouzeli: memcached: enable extstore on mc1050/mc2050 [puppet] - 10https://gerrit.wikimedia.org/r/1038262 (https://phabricator.wikimedia.org/T352885) [11:22:35] (03CR) 10Dreamy Jazz: [C:03+1] IPReputationHooks: Support disabling edit logging for older accounts [extensions/WikimediaEvents] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037925 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [11:23:44] (03PS4) 10Kosta Harlan: EventLogging: Enable IP reputation logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034882 (https://phabricator.wikimedia.org/T354597) [11:24:32] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1037.eqiad.wmnet with OS bookworm [11:25:34] (03PS1) 10Btullis: Absent the rsync configuration for deprecated misc jobs [puppet] - 10https://gerrit.wikimedia.org/r/1038263 (https://phabricator.wikimedia.org/T353785) [11:26:36] (03PS1) 10Muehlenhoff: Remove obsolete stub certs for labvirt-star [labs/private] - 10https://gerrit.wikimedia.org/r/1038264 [11:26:48] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2711/co" [puppet] - 10https://gerrit.wikimedia.org/r/1038263 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [11:26:50] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on backup2011.codfw.wmnet with reason: remount filesystem [11:27:04] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on backup2011.codfw.wmnet with reason: remount filesystem [11:28:00] (03PS1) 10Muehlenhoff: Remove dummy certs for hosts long gone [labs/private] - 
10https://gerrit.wikimedia.org/r/1038265 [11:31:00] (03CR) 10Giuseppe Lavagetto: [C:03+1] Release 3.0.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/1035596 (https://phabricator.wikimedia.org/T365123) (owner: 10Scott French) [11:32:28] (03CR) 10Effie Mouzeli: [C:03+2] memcached: enable extstore on mc1050/mc2050 [puppet] - 10https://gerrit.wikimedia.org/r/1038262 (https://phabricator.wikimedia.org/T352885) (owner: 10Effie Mouzeli) [11:34:08] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [11:34:22] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [11:34:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:34:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:34:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T352010)', diff saved to https://phabricator.wikimedia.org/P63927 and previous config saved to /var/cache/conftool/dbconfig/20240603-113447-ladsgroup.json [11:34:51] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:35:40] !log restart memcached on mc1050 and mc2050 [11:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:05] (03PS1) 10Btullis: Remove absented rsync configs for deprecated dumps [puppet] - 10https://gerrit.wikimedia.org/r/1038266 (https://phabricator.wikimedia.org/T353785) [11:38:14] (03PS1) 10Muehlenhoff: memcached: Remove enable_16 option [puppet] - 10https://gerrit.wikimedia.org/r/1038267 [11:42:16] jouncebot: nowandnext [11:42:16] No deployments scheduled for the next 1 hour(s) and 17 minute(s) 
[11:42:16] In 1 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240603T1300) [11:42:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038267 (owner: 10Muehlenhoff) [11:42:35] effie: would a mw deployment interfere with your memcached restarts? [11:42:48] not at all, thank you for asking [11:43:41] awesome [11:50:11] (03PS2) 10Ebrahim: Enable numeric sorting for Persian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037942 (https://phabricator.wikimedia.org/T329440) [11:50:15] (03CR) 10Ladsgroup: [C:03+2] Enable numeric sorting for Persian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037942 (https://phabricator.wikimedia.org/T329440) (owner: 10Ebrahim) [11:50:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037942 (https://phabricator.wikimedia.org/T329440) (owner: 10Ebrahim) [11:50:55] (03PS3) 10Hashar: [WMF] change build to use Java 17 [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038260 (https://phabricator.wikimedia.org/T364342) [11:50:56] (03Merged) 10jenkins-bot: Enable numeric sorting for Persian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037942 (https://phabricator.wikimedia.org/T329440) (owner: 10Ebrahim) [11:51:07] (03CR) 10Hashar: [C:03+2] [WMF] change build to use Java 17 [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038260 (https://phabricator.wikimedia.org/T364342) (owner: 10Hashar) [11:51:11] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1037942|Enable numeric sorting for Persian (T329440)]] [11:51:15] T329440: Handling Number-sorted pages in categories for non-English numbers - https://phabricator.wikimedia.org/T329440 [11:53:30] !log ladsgroup@deploy1002 ebrahim and ladsgroup: Backport for [[gerrit:1037942|Enable numeric sorting for Persian 
(T329440)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:53:37] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on backup2011.codfw.wmnet with reason: remount filesystem [11:53:39] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on backup2011.codfw.wmnet with reason: remount filesystem [11:54:20] !log ladsgroup@deploy1002 ebrahim and ladsgroup: Continuing with sync [11:54:47] (03CR) 10Muehlenhoff: "The PCC errors are all false positives" [puppet] - 10https://gerrit.wikimedia.org/r/1038267 (owner: 10Muehlenhoff) [11:55:22] (03PS2) 10Hashar: Rebuild plugins for Java 17 [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038259 (https://phabricator.wikimedia.org/T364342) [11:59:07] (03Merged) 10jenkins-bot: [WMF] change build to use Java 17 [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038260 (https://phabricator.wikimedia.org/T364342) (owner: 10Hashar) [12:00:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T364299)', diff saved to https://phabricator.wikimedia.org/P63928 and previous config saved to /var/cache/conftool/dbconfig/20240603-120041-marostegui.json [12:00:45] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [12:02:11] RECOVERY - Disk space on karapace1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=karapace1002&var-datasource=eqiad+prometheus/ops [12:03:18] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1037942|Enable numeric sorting for Persian (T329440)]] (duration: 12m 07s) [12:03:21] T329440: Handling Number-sorted pages in categories for non-English numbers in Persian Wikis - https://phabricator.wikimedia.org/T329440 [12:03:36] (03CR) 10Hashar: Rebuild plugins for Java 17 [software/gerrit] (deploy/wmf/stable-3.8) - 
10https://gerrit.wikimedia.org/r/1038259 (https://phabricator.wikimedia.org/T364342) (owner: 10Hashar) [12:06:36] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc-gp1002.eqiad.wmnet with OS bookworm [12:08:53] (03PS2) 10Dreamrimmer: maiwiki: Remove 'CA' namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031533 (https://phabricator.wikimedia.org/T363667) [12:10:22] (03CR) 10Effie Mouzeli: [C:03+1] memcached: Remove enable_16 option [puppet] - 10https://gerrit.wikimedia.org/r/1038267 (owner: 10Muehlenhoff) [12:11:22] 06SRE, 06Traffic: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360#9854526 (10ayounsi) Moving the dynamic nature of NTP definition to some automated system instead of human or Puppet is a great idea :) Human as in right now for network devices,... [12:13:31] (03PS1) 10JMeybohm: etcd::v3: Allow all nodes of an etcd cluster to connect to each other [puppet] - 10https://gerrit.wikimedia.org/r/1038283 (https://phabricator.wikimedia.org/T366465) [12:15:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P63929 and previous config saved to /var/cache/conftool/dbconfig/20240603-121549-marostegui.json [12:18:43] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038283 (https://phabricator.wikimedia.org/T366465) (owner: 10JMeybohm) [12:20:05] (03PS2) 10JMeybohm: etcd::v3: Allow all nodes of an etcd cluster to connect to each other [puppet] - 10https://gerrit.wikimedia.org/r/1038283 (https://phabricator.wikimedia.org/T366465) [12:20:15] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038283 (https://phabricator.wikimedia.org/T366465) (owner: 10JMeybohm) [12:20:58] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp1002.eqiad.wmnet with reason: host reimage [12:22:19] (03PS1) 
10Btullis: Remove the discovery-analytics dsh config for stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/1038288 (https://phabricator.wikimedia.org/T353785) [12:23:55] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2712/console" [puppet] - 10https://gerrit.wikimedia.org/r/1038288 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [12:24:09] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp1002.eqiad.wmnet with reason: host reimage [12:24:23] (03CR) 10Jelto: [C:03+1] "lgtm, although there is no latest image for the secruity-landing-page" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [12:25:28] (03PS1) 10Vgutierrez: hiera: Enable IPIP on high-traffic1@ulsfo for text services [puppet] - 10https://gerrit.wikimedia.org/r/1038291 (https://phabricator.wikimedia.org/T366466) [12:27:05] (03PS2) 10Btullis: Remove the discovery-analytics dsh config for stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/1038288 (https://phabricator.wikimedia.org/T353785) [12:27:42] (03CR) 10JMeybohm: "This is maybe still not good enough as it requires a puppet run on the "first" etcd with DNS SRV record already populated and the A/AAAA r" [puppet] - 10https://gerrit.wikimedia.org/r/1038283 (https://phabricator.wikimedia.org/T366465) (owner: 10JMeybohm) [12:28:05] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 44 probes of 785 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:29:16] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 
10https://gerrit.wikimedia.org/r/1038291 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:30:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P63930 and previous config saved to /var/cache/conftool/dbconfig/20240603-123057-marostegui.json [12:31:10] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1038288 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [12:33:03] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 15 probes of 785 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:34:16] (03PS1) 10Vgutierrez: hiera: Enable IPIP on text@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1038294 (https://phabricator.wikimedia.org/T366466) [12:36:04] 10ops-codfw, 06DC-Ops: Relabel kubernetes2023 to wikikube-worker2001 - https://phabricator.wikimedia.org/T366468 (10JMeybohm) 03NEW [12:36:29] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1038294 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:37:31] (03CR) 10Effie Mouzeli: [C:03+2] memcached: Remove enable_16 option [puppet] - 10https://gerrit.wikimedia.org/r/1038267 (owner: 10Muehlenhoff) [12:37:42] (03PS1) 10Vgutierrez: depool text@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1038295 (https://phabricator.wikimedia.org/T366466) [12:40:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [12:41:13] !log 
jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp1002.eqiad.wmnet with OS bookworm [12:43:44] 10ops-codfw, 06DC-Ops: Relabel kubernetes2023 to wikikube-worker2001 - https://phabricator.wikimedia.org/T366468#9854603 (10JMeybohm) →14Duplicate dup:03T365712 [12:44:02] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Relabel codfw Kubernetes hosts - https://phabricator.wikimedia.org/T365712#9854605 (10JMeybohm) [12:45:07] 10ops-codfw, 06SRE, 06DC-Ops: Relabel kubernetes2032 to wikikube-worker2002 - https://phabricator.wikimedia.org/T366085#9854608 (10JMeybohm) →14Duplicate dup:03T365712 [12:45:27] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc-gp1001.eqiad.wmnet with OS bookworm [12:45:28] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Relabel codfw Kubernetes hosts - https://phabricator.wikimedia.org/T365712#9854610 (10JMeybohm) [12:46:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T364299)', diff saved to https://phabricator.wikimedia.org/P63931 and previous config saved to /var/cache/conftool/dbconfig/20240603-124605-marostegui.json [12:46:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2176.codfw.wmnet with reason: Maintenance [12:46:08] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [12:46:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2176.codfw.wmnet with reason: Maintenance [12:46:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T364299)', diff saved to https://phabricator.wikimedia.org/P63932 and previous config saved to /var/cache/conftool/dbconfig/20240603-124628-marostegui.json [12:47:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [12:49:01] (03CR) 10Fabfur: [C:03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1038295 (https://phabricator.wikimedia.org/T366466) 
(owner: 10Vgutierrez) [12:49:21] (03PS1) 10NMW03: Add project namespace alias for Azerbaijani Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037505 (https://phabricator.wikimedia.org/T365966) [12:50:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:52:09] (03CR) 10Elukey: [C:03+2] sre.hardware: add useless-suppression to pylint disable [cookbooks] - 10https://gerrit.wikimedia.org/r/1037572 (owner: 10Elukey) [12:54:52] Hi [12:55:12] I won't be available for half an hour from now [12:55:39] !log depool/restart swift-proxy/repool ms-fe10{09,11,12,14} due to rising connection failures T360913 [12:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:41] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) 
- https://phabricator.wikimedia.org/T360913 [12:55:52] (03Merged) 10jenkins-bot: sre.hardware: add useless-suppression to pylint disable [cookbooks] - 10https://gerrit.wikimedia.org/r/1037572 (owner: 10Elukey) [12:57:30] (03CR) 10Fabfur: [C:03+1] hiera: Enable IPIP on high-traffic1@ulsfo for text services [puppet] - 10https://gerrit.wikimedia.org/r/1038291 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:58:51] jouncebot next [12:58:51] In 0 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240603T1300) [12:59:02] 10SRE-tools, 06Infrastructure-Foundations: Support creating phab tasks in wmflib.phabricator - https://phabricator.wikimedia.org/T366470 (10JMeybohm) 03NEW [12:59:29] (03PS2) 10Vgutierrez: hiera: Enable IPIP on text@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1038294 (https://phabricator.wikimedia.org/T366466) [12:59:42] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038294 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240603T1300). [13:00:05] tgr, _Gerges, sergi0, Dreamy_Jazz, and Nemoralis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:09] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp1001.eqiad.wmnet with reason: host reimage [13:00:11] \o [13:00:18] \o [13:00:21] hi [13:00:35] o/ [13:01:38] (03PS1) 10David Caro: puppetserver: allow configuring the report cleanup frequency [puppet] - 10https://gerrit.wikimedia.org/r/1038296 (https://phabricator.wikimedia.org/T366406) [13:01:53] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038296 (https://phabricator.wikimedia.org/T366406) (owner: 10David Caro) [13:02:32] !log depool moss-fe1001 with a view to returning it to apus T279621 [13:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:35] T279621: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 [13:02:58] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp1001.eqiad.wmnet with reason: host reimage [13:03:25] !log depool moss-fe2001 with a view to returning it to apus T279621 [13:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:51] (03CR) 10David Caro: "Puppetservers seem to have many issues running pcc, @Muehlenhoff any ideas?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1038296 (https://phabricator.wikimedia.org/T366406) (owner: 10David Caro) [13:04:55] (03CR) 10CI reject: [V:04-1] puppetserver: allow configuring the report cleanup frequency [puppet] - 10https://gerrit.wikimedia.org/r/1038296 (https://phabricator.wikimedia.org/T366406) (owner: 10David Caro) [13:05:12] (03CR) 10David Caro: puppetserver: allow configuring the report cleanup frequency (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1038296 (https://phabricator.wikimedia.org/T366406) (owner: 10David Caro) [13:06:27] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete stub certs for labvirt-star [labs/private] - 10https://gerrit.wikimedia.org/r/1038264 (owner: 10Muehlenhoff) [13:06:34] (03PS3) 10Vgutierrez: cache,hiera: Enable IPIP on text@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1038294 (https://phabricator.wikimedia.org/T366466) [13:06:35] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove dummy certs for hosts long gone [labs/private] - 10https://gerrit.wikimedia.org/r/1038265 (owner: 10Muehlenhoff) [13:06:40] (03PS2) 10David Caro: puppetserver: allow configuring the report cleanup frequency [puppet] - 10https://gerrit.wikimedia.org/r/1038296 (https://phabricator.wikimedia.org/T366406) [13:06:44] I suppose I can do the backports. [13:07:00] I don't mind self-deploying, but dealing with something else right now. 
[13:07:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [13:07:51] (03CR) 10Gergő Tisza: [C:03+2] IPReputationHooks: Support disabling edit logging for older accounts [extensions/WikimediaEvents] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037925 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [13:08:06] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038294 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [13:08:26] !log uploaded intel-microcode 3.20240312.1~deb11u1 to apt.wikimedia.org (import from bullseye-proposed-updates, to be coupled with forthcoming reboots) [13:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:36] (03CR) 10Ayounsi: Include vlans with an IRB int in device vlans even if not on L2 port (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney) [13:10:06] (03CR) 10David Caro: puppetserver: allow configuring the report cleanup frequency (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1038296 (https://phabricator.wikimedia.org/T366406) (owner: 10David Caro) [13:10:11] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038296 (https://phabricator.wikimedia.org/T366406) (owner: 10David Caro) [13:10:19] (03CR) 10Elukey: [C:03+1] "Left a comment to add some words related to a parameter, feel free to follow up or not :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1037538 (https://phabricator.wikimedia.org/T358542) (owner: 10Volans) [13:10:25] (03Merged) 10jenkins-bot: IPReputationHooks: Support disabling edit logging for older accounts [extensions/WikimediaEvents] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037925 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [13:10:26] (03CR) 10TrainBranchBot: 
[C:03+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035726 (https://phabricator.wikimedia.org/T364892) (owner: 10Sergio Gimeno) [13:10:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036313 (owner: 10Gergő Tisza) [13:10:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037887 (https://phabricator.wikimedia.org/T366404) (owner: 10GergesShamon) [13:10:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T352010)', diff saved to https://phabricator.wikimedia.org/P63933 and previous config saved to /var/cache/conftool/dbconfig/20240603-131048-ladsgroup.json [13:10:52] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:11:05] (03Merged) 10jenkins-bot: [Beta] cswiki: enable CommunityConfiguration for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035726 (https://phabricator.wikimedia.org/T364892) (owner: 10Sergio Gimeno) [13:11:08] (03Merged) 10jenkins-bot: [multiversion] Add 'manage-dblist init-labs' subcommand [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036313 (owner: 10Gergő Tisza) [13:11:10] (03Merged) 10jenkins-bot: [arwiki] add ipblock-exempt to bot group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037887 (https://phabricator.wikimedia.org/T366404) (owner: 10GergesShamon) [13:12:30] (03CR) 10CDanis: [C:03+1] Release 3.0.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/1035596 (https://phabricator.wikimedia.org/T365123) (owner: 10Scott French) [13:13:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [13:13:31] !log tgr@deploy1002 Started scap: Backport for [[gerrit:1035726|[Beta] cswiki: enable CommunityConfiguration for 
GrowthExperiments (T364892)]], [[gerrit:1036313|[multiversion] Add 'manage-dblist init-labs' subcommand]], [[gerrit:1037887|[arwiki] add ipblock-exempt to bot group (T366404)]] [13:13:35] T364892: Enable CommunityConfiguration on all beta wikis with GrowthExperiments - https://phabricator.wikimedia.org/T364892 [13:13:37] T366404: Add ipblock-exempt to arwiki bot user group - https://phabricator.wikimedia.org/T366404 [13:16:00] !log tgr@deploy1002 sgimeno and gergesshamon and tgr: Backport for [[gerrit:1035726|[Beta] cswiki: enable CommunityConfiguration for GrowthExperiments (T364892)]], [[gerrit:1036313|[multiversion] Add 'manage-dblist init-labs' subcommand]], [[gerrit:1037887|[arwiki] add ipblock-exempt to bot group (T366404)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:17:39] (FYI: arwiki patch LGTM) [13:18:12] Hi [13:18:24] Dreamy_Jazz: do you have time to test the patch on mwdebug? I can do it otherwise [13:18:48] There isn't anything that I can test for this patch as the config to enable it is not set. [13:18:50] AFAIK [13:19:27] The mediawiki config patch to enable this is planned for later today, so it would be indirectly tested by that. [13:20:09] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp1001.eqiad.wmnet with OS bookworm [13:20:32] which condif is that? at a glance it is not gated behind anything [13:20:38] ...config... [13:20:45] wgWikimediaEventsIPoidUrl [13:21:21] If it is unset, no data will ever be logged. [13:21:59] While the route the code follows to not log information will change with this patch, there isn't a difference in any output that could be checked to verify what prevented the data from being collected. [13:22:51] Also, I've dealt with the issue that I had before so am around now. 
[13:22:53] It is scheduled for late backport btw [13:22:54] https://gerrit.wikimedia.org/r/c/1034882/ [13:23:12] ^ [13:23:19] right, I was more thinking of making sure edits are not broken, but the code is pretty trivial [13:23:26] Oh I see. [13:23:27] !log tgr@deploy1002 sgimeno and gergesshamon and tgr: Continuing with sync [13:23:29] I can test that. [13:25:21] (03PS6) 10Elukey: sre.host.provision: no-op refactor to highlight DELL-specific confs [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) [13:25:21] (03PS3) 10Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) [13:25:42] (03PS4) 10Vgutierrez: cache,hiera: Enable IPIP on text@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1038294 (https://phabricator.wikimedia.org/T366466) [13:25:42] (03PS1) 10Vgutierrez: interface::clsact: Fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/1038327 (https://phabricator.wikimedia.org/T366466) [13:25:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P63934 and previous config saved to /var/cache/conftool/dbconfig/20240603-132556-ladsgroup.json [13:26:48] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038294 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [13:26:52] (03PS1) 10Btullis: Remove temporary firewall rule for WDQS graph_split [puppet] - 10https://gerrit.wikimedia.org/r/1038328 (https://phabricator.wikimedia.org/T350106) [13:26:55] (03PS1) 10Btullis: Prepare stat100[4-7] for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1038329 (https://phabricator.wikimedia.org/T353785) [13:27:02] (03CR) 10Elukey: "Thanks for the review!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [13:28:16] (03CR) 10Fabfur: [C:03+1] interface::clsact: Fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/1038327 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [13:29:39] (03Abandoned) 10Ayounsi: Junos: use "json compact" format [cookbooks] - 10https://gerrit.wikimedia.org/r/1032477 (https://phabricator.wikimedia.org/T362523) (owner: 10Ayounsi) [13:29:50] (03Abandoned) 10Ayounsi: Add export-format state-data json compact [homer/public] - 10https://gerrit.wikimedia.org/r/1032386 (https://phabricator.wikimedia.org/T362523) (owner: 10Ayounsi) [13:31:44] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038328 (https://phabricator.wikimedia.org/T350106) (owner: 10Btullis) [13:32:39] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:1035726|[Beta] cswiki: enable CommunityConfiguration for GrowthExperiments (T364892)]], [[gerrit:1036313|[multiversion] Add 'manage-dblist init-labs' subcommand]], [[gerrit:1037887|[arwiki] add ipblock-exempt to bot group (T366404)]] (duration: 19m 07s) [13:32:46] T364892: Enable CommunityConfiguration on all beta wikis with GrowthExperiments - https://phabricator.wikimedia.org/T364892 [13:32:46] T366404: Add ipblock-exempt to arwiki bot user group - https://phabricator.wikimedia.org/T366404 [13:32:57] Dreamy_Jazz: I think it's good practice but as I said the code looks simple enough, so up to you. The patch should be live now (I thought it would be slower to merge but it merged by the time scap started pulling the code for the config changes). [13:33:20] I can test editing now. 
[13:33:22] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1038329 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [13:33:40] Although I guess it would be noticeable if things stopped working now it's live :) [13:34:53] Looks like things are still working :) [13:35:12] I may need to backport another patch shortly, but can do that myself. [13:35:18] thx [13:35:21] unrelated to the one I requested [13:35:27] (03CR) 10Fabfur: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1038294 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [13:36:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037896 (https://phabricator.wikimedia.org/T356440) (owner: 10GergesShamon) [13:36:37] (03CR) 10Ssingh: [C:03+1] interface::clsact: Fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/1038327 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [13:36:42] (03CR) 10Ssingh: [C:03+1] hiera: Enable IPIP on high-traffic1@ulsfo for text services [puppet] - 10https://gerrit.wikimedia.org/r/1038291 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [13:36:49] (03CR) 10Ssingh: [C:03+1] cache,hiera: Enable IPIP on text@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1038294 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [13:36:57] !log depool text@ulsfo before enabling IPIP encapsulation - T366466 [13:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:59] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [13:37:19] (03PS5) 10GergesShamon: [trwiki] Create translator group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037896 
(https://phabricator.wikimedia.org/T356440) [13:37:23] (03CR) 10Vgutierrez: [C:03+2] depool text@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1038295 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [13:37:31] (03CR) 10TrainBranchBot: "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037896 (https://phabricator.wikimedia.org/T356440) (owner: 10GergesShamon) [13:37:59] (03CR) 10Clément Goubert: "None of them do, and I didn't want to put a date tag as default. I can switch to `sre/miscweb/statictendril` which has a main tag though." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [13:38:18] (03Merged) 10jenkins-bot: [trwiki] Create translator group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037896 (https://phabricator.wikimedia.org/T356440) (owner: 10GergesShamon) [13:38:36] !log tgr@deploy1002 Started scap: Backport for [[gerrit:1037896|[trwiki] Create translator group (T356440)]] [13:38:40] T356440: Creation of a translator user group on trwikipedia - https://phabricator.wikimedia.org/T356440 [13:39:16] (03CR) 10Brouberol: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1038329 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [13:39:49] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on high-traffic1@ulsfo for text services [puppet] - 10https://gerrit.wikimedia.org/r/1038291 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [13:40:55] !log tgr@deploy1002 gergesshamon and tgr: Backport for [[gerrit:1037896|[trwiki] Create translator group (T356440)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:41:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P63935 and previous config saved to /var/cache/conftool/dbconfig/20240603-134104-ladsgroup.json [13:41:08] !log disable puppet on A:cp-text before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1038294/ - T366466 [13:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:58] (03CR) 10Vgutierrez: [C:03+2] interface::clsact: Fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/1038327 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [13:42:12] Thank you for the assistance Gergő [13:42:26] (03CR) 10Muehlenhoff: [C:03+2] Remove ms-fe certs [puppet] - 10https://gerrit.wikimedia.org/r/1037074 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [13:42:49] Hi [13:43:22] (03PS7) 10Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [13:44:45] (03CR) 10Vgutierrez: [C:03+2] cache,hiera: Enable IPIP on text@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1038294 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [13:44:51] !log tgr@deploy1002 gergesshamon and tgr: Continuing with sync [13:46:10] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host snapshot1010.eqiad.wmnet with OS 
bullseye [13:46:25] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.27 - 2024.06.16), 13Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9854843 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cum... [13:46:49] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host snapshot1012.eqiad.wmnet with OS bullseye [13:47:07] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.27 - 2024.06.16), 13Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9854847 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cum... [13:48:25] (03CR) 10David Caro: "Oh no, it was my bad, I misread the errors, everything works ok now :)" [puppet] - 10https://gerrit.wikimedia.org/r/1038296 (https://phabricator.wikimedia.org/T366406) (owner: 10David Caro) [13:48:46] PROBLEM - Check whether ferm is active by checking the default input chain on wikikube-ctrl1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:49:08] !log re-enable puppet on "A:cp-text and not A:cp-text_ulsfo" - T366466 [13:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:12] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [13:50:37] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc-wf1001.eqiad.wmnet with OS bookworm [13:50:50] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc-wf2001.codfw.wmnet with OS bookworm [13:54:27] !log re-enable puppet on "A:cp-text_ulsfo" - T366466 [13:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:31] T366466: Use IPIP encapsulation on lvs<-->text cluster - 
https://phabricator.wikimedia.org/T366466 [13:55:54] it seems like scap got stuck [13:56:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T352010)', diff saved to https://phabricator.wikimedia.org/P63936 and previous config saved to /var/cache/conftool/dbconfig/20240603-135612-ladsgroup.json [13:56:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance [13:56:16] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:56:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance [13:56:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T352010)', diff saved to https://phabricator.wikimedia.org/P63937 and previous config saved to /var/cache/conftool/dbconfig/20240603-135634-ladsgroup.json [13:58:11] (03CR) 10Jgreen: [C:03+1] "I am not aware of any case where fundraising mail is sent from the wikipedia.org domain." [dns] - 10https://gerrit.wikimedia.org/r/1037157 (https://phabricator.wikimedia.org/T211403) (owner: 10JHathaway) [13:58:42] !log rolling restart of pybal on lvs4010 and lvs4008 - T366466 [13:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:57] (03CR) 10Jgreen: [C:03+1] "I am not aware of any case where fundraising mail is sent from the wikipedia.org domain." 
[dns] - 10https://gerrit.wikimedia.org/r/1037154 (https://phabricator.wikimedia.org/T211403) (owner: 10JHathaway) [13:59:08] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1010.eqiad.wmnet with reason: host reimage [13:59:20] (03CR) 10Eevans: [C:03+1] Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1038109 (owner: 10Muehlenhoff) [13:59:33] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1012.eqiad.wmnet with reason: host reimage [14:00:35] (03PS1) 10Muehlenhoff: Remove Hiera entry [puppet] - 10https://gerrit.wikimedia.org/r/1038339 [14:00:44] not stuck but super slow [14:01:01] filed T366475 which might or might not be related [14:01:02] T366475: scap says "WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!" on snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T366475 [14:01:35] (03CR) 10Btullis: [V:03+1 C:03+2] Configure snapshot1017 to be the misc cron snapshot runner [puppet] - 10https://gerrit.wikimedia.org/r/1036626 (https://phabricator.wikimedia.org/T364455) (owner: 10Btullis) [14:01:51] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:1037896|[trwiki] Create translator group (T356440)]] (duration: 23m 15s) [14:01:54] T356440: Creation of a translator user group on trwikipedia - https://phabricator.wikimedia.org/T356440 [14:02:08] tgr|away: it's slow because it's rebuilding the cdb imo [14:02:22] The host ident change, let me see if I can't fix that through a puppet run on deploy [14:02:41] Gerges: Nemoralis: I have a meeting coming up so someone needs to take over or the rest of the patches need to be rescheduled, sorry [14:02:52] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1010.eqiad.wmnet with reason: host reimage [14:03:28] I'll mark them on the wiki page as not done for now, if someone does take over, feel free to revert that [14:04:27] I will need to deploy a fix likely soon, so I can 
look at the other patches. [14:04:59] tgr|away: Has the T356440 patch been run? [14:05:01] Dreamy_Jazz: thanks! won't update the page then [14:05:36] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf1001.eqiad.wmnet with reason: host reimage [14:05:42] https://deploy-commands.toolforge.org/bacc/1038247 is the only one left, I think [14:05:53] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1012.eqiad.wmnet with reason: host reimage [14:05:54] Gerges: It looks like that change has been applied. [14:06:14] tgr|away: the snapshot nodes changed host key because they are being reimaged by btullis (see SAL) [14:06:54] Gerges: somehow https://tr.wikipedia.org/wiki/%C3%96zel:GrupHaklar%C4%B1Listesi#translator has the oathauth-enable right, not sure what's up with that [14:07:17] that specific right is harmless, but it's not reassuring that random rights show up [14:07:33] claime: thanks [14:08:15] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf1001.eqiad.wmnet with reason: host reimage [14:08:37] RECOVERY - Check whether ferm is active by checking the default input chain on wikikube-ctrl1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:08:40] I will check this out [14:08:54] Dreamy_Jazz: scap exited with [14:08:55] 14:01:51 Finished scap: Backport for [[gerrit:1037896|[trwiki] Create translator group (T356440)]] (duration: 23m 15s) [14:08:56] T356440: Creation of a translator user group on trwikipedia - https://phabricator.wikimedia.org/T356440 [14:09:00] 14:01:51 backport failed: Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=gergesshamon', 'Backport for [[gerrit:1037896|[trwiki] Create translator group (T356440)]]']' returned non-zero exit status 1.
[14:09:23] I think that's fine (at least in terms of the change getting deployed) but FYI [14:09:39] tgr|away: The translator group having `oathauth-enable` is expected [14:09:50] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf2001.codfw.wmnet with reason: host reimage [14:09:53] Because the group is listed in the `PrivilegedGroups` config [14:10:04] (03CR) 10Xcollazo: [C:03+1] Absent the rsync configuration for deprecated misc jobs [puppet] - 10https://gerrit.wikimedia.org/r/1038263 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [14:10:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:10:28] Which will add the `oathauth-enable` right to these groups on all wikis. [14:10:52] (03CR) 10Xcollazo: [C:03+1] Remove absented rsync configs for deprecated dumps [puppet] - 10https://gerrit.wikimedia.org/r/1038266 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [14:10:56] https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/e77aed381917f42524f5ad873d6a5f6702ec54cc/wmf-config/InitialiseSettings.php#3237 [14:11:32] that is where the `translator` group is listed as a privileged group [14:11:44] Dreamy_Jazz: oh right, I seem to remember now [14:11:46] https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/e77aed381917f42524f5ad873d6a5f6702ec54cc/wmf-config/CommonSettings.php#3788 is the code that adds the right [14:12:16] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf2001.codfw.wmnet with reason: host reimage [14:12:23] it was only previously used on incubatorwiki, but it can change system messages there, or something like that?
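
The behaviour described in the exchange above (every group listed as privileged gets the `oathauth-enable` right on all wikis) can be sketched roughly as follows. This is a minimal Python illustration, not the actual PHP in CommonSettings.php; the function name `grant_2fa_right`, the `translate` right, and the sample group data are assumptions made up for the example.

```python
# Hypothetical stand-ins for the configuration discussed in the log;
# these are NOT the real mediawiki-config values.
OATHAUTH_RIGHT = "oathauth-enable"


def grant_2fa_right(group_permissions, privileged_groups):
    """Return a copy of group_permissions in which every privileged
    group has the two-factor-authentication right enabled."""
    # Copy each group's rights so the caller's data is left untouched.
    result = {group: dict(rights) for group, rights in group_permissions.items()}
    for group in privileged_groups:
        # Groups that have no entry yet get one created on the fly.
        result.setdefault(group, {})[OATHAUTH_RIGHT] = True
    return result


# Made-up sample data: one privileged group and one ordinary group.
permissions = {"translator": {"translate": True}, "user": {"edit": True}}
updated = grant_2fa_right(permissions, ["translator", "sysop"])
print(updated["translator"])
```

Under these assumptions the right is added by a loop over the privileged-group list rather than by the per-wiki group definition, which would explain why it shows up on a newly created group without appearing in the patch that created it.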
[14:13:01] I think scap failing should be okay, as I guess any problems should be fixed by the next scap run. [14:13:47] (03PS2) 10GergesShamon: [trwiki] Reducing count edits ip and newbie per minute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038247 (https://phabricator.wikimedia.org/T330811) [14:15:28] Gerges: Does the above change still have community consensus? [14:15:48] The linked discussion is from 2023 [14:17:10] I don't know, I will contact them [14:18:02] (03CR) 10Clément Goubert: [C:03+2] kubernetes: rename and repurpose 5 api appservers as k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1028840 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [14:18:18] (03PS1) 10Dreamy Jazz: Ensure excluded SHA-1s have numeric keys for scanFilesInScanTable.php [extensions/MediaModeration] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1038310 (https://phabricator.wikimedia.org/T366473) [14:18:53] (03CR) 10David Caro: [C:03+1] "Even better if it's not needed :), will this affect the puppetdb servers?" [puppet] - 10https://gerrit.wikimedia.org/r/1037812 (https://phabricator.wikimedia.org/T366357) (owner: 10Andrew Bogott) [14:18:55] Going to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaModeration/+/1038310 now [14:19:03] (03CR) 10Dreamy Jazz: [C:03+2] Ensure excluded SHA-1s have numeric keys for scanFilesInScanTable.php [extensions/MediaModeration] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1038310 (https://phabricator.wikimedia.org/T366473) (owner: 10Dreamy Jazz) [14:19:36] Dreamy_Jazz: Do you know a solution to remove oathauth-enable from the translator group? [14:19:37] `14:19:19 backport failed: [Errno 2] Location is not a git repo: '/srv/mediawiki-staging'` [14:19:54] There isn't a solution to remove it. This is intended AFAIK. 
[14:19:55] 10ops-codfw, 06SRE, 10SRE-tools, 06DC-Ops, and 2 others: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9855011 (10Volans) p:05Triage→03Medium [14:19:59] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:20:00] 06SRE, 06Traffic: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360#9855013 (10ssingh) >>! In T366360#9854526, @ayounsi wrote: > Moving the dynamic nature of NTP definition to some automated system instead of human or Puppet is a great idea :) >... [14:20:02] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:20:44] (03PS1) 10CDobbins: purged: add use_pki for eqsin (cp5017) [puppet] - 10https://gerrit.wikimedia.org/r/1038340 (https://phabricator.wikimedia.org/T360506) [14:21:13] Those in this translator group are considered privileged users, so are given the option to use two factor authentication. [14:21:30] Ok [14:21:37] (03Abandoned) 10David Caro: puppetserver: allow configuring the report cleanup frequency [puppet] - 10https://gerrit.wikimedia.org/r/1038296 (https://phabricator.wikimedia.org/T366406) (owner: 10David Caro) [14:21:41] (03Merged) 10jenkins-bot: Ensure excluded SHA-1s have numeric keys for scanFilesInScanTable.php [extensions/MediaModeration] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1038310 (https://phabricator.wikimedia.org/T366473) (owner: 10Dreamy Jazz) [14:22:29] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Spicerack: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9855026 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:23:23] Do I schedule patch 1038247 ([trwiki] Reducing count edits ip and newbie per minute), or is there someone who will take care of it? 
[14:24:06] I would like to see a more recent community discussion given the large risk of the change for IP editing, so I would re-schedule it after checking that the community is okay to proceed with this change. [14:24:34] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1358 to wikikube-worker1001 [14:24:53] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from mw1358 to wikikube-worker1001 [14:25:48] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf1001.eqiad.wmnet with OS bookworm [14:26:43] Dreamy_Jazz: https://tr.m.wikipedia.org/w/index.php?oldid=29330922&title=Vikipedi:K%C3%B6y_%C3%A7e%C5%9Fmesi_(teknik)#Anonim_kullan%C4%B1c%C4%B1lar%C4%B1n%C4%B1n_seri_de%C4%9Fi%C5%9Fiklik_yapmalar%C4%B1n%C4%B1n_%C3%B6nlenmesi [14:27:16] Sure, but that discussion is from early 2023 [14:27:28] So it is over a year old [14:27:50] While it may still have consensus, I am unsure about proceeding with it. [14:27:58] It also seems that deployments are currently broken [14:28:30] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host snapshot1010.eqiad.wmnet with OS bullseye [14:28:45] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.27 - 2024.06.16), 13Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9855048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin10... [14:29:02] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1426 to wikikube-worker1002 [14:29:19] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:29:53] Dreamy_Jazz: What should I do now?
[14:30:09] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf2001.codfw.wmnet with OS bookworm [14:30:10] I would ask the community if the change is still wanted [14:30:14] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1358.eqiad.wmnet [14:30:15] (03PS1) 10Vgutierrez: Revert "depool text@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1038311 (https://phabricator.wikimedia.org/T366466) [14:30:22] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1358.eqiad.wmnet [14:30:27] But at this point I cannot proceed with backporting anything until scap backport is working. [14:30:45] (03CR) 10Ssingh: [C:03+1] Revert "depool text@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1038311 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [14:31:09] (03CR) 10Fabfur: [C:03+1] "ok for me!" [dns] - 10https://gerrit.wikimedia.org/r/1038311 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [14:31:22] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1358.eqiad.wmnet [14:31:30] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1358.eqiad.wmnet [14:31:52] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host snapshot1012.eqiad.wmnet with OS bullseye [14:32:02] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.27 - 2024.06.16): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9855072 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot10... 
[14:32:06] I'll ask him the question in the phabricator task [14:33:10] 10SRE-tools, 06Infrastructure-Foundations: Support creating phab tasks in wmflib.phabricator - https://phabricator.wikimedia.org/T366470#9855087 (10Volans) p:05Triage→03Medium [14:33:19] (03CR) 10Vgutierrez: [C:03+2] Revert "depool text@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1038311 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [14:33:28] !log repool text@ulsfo with IPIP encapsulation enabled - T366466 [14:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:32] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [14:33:50] bblack, Emperor: ^^ [14:33:56] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1358.eqiad.wmnet [14:34:03] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1358.eqiad.wmnet [14:34:17] Can I run `git init` in `/srv/mediawiki-staging`? [14:34:25] It seems to be the way to fix the issue [14:34:31] But don't want to break things furhter [14:34:33] *further [14:35:38] Dreamy_Jazz: Are you talking about on the deploy server? [14:35:43] Yes [14:35:49] scap is broken [14:35:52] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1426 to wikikube-worker1002 - cgoubert@cumin1002" [14:36:01] `Location is not a git repo: '/srv/mediawiki-staging'` [14:36:04] Is there a ticket ? [14:36:10] Not that I am aware of [14:36:16] ok.. 
logging in to take a look [14:36:37] Thanks [14:36:57] https://www.irccloud.com/pastebin/Mc4dPjLW/ [14:37:01] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1426 to wikikube-worker1002 - cgoubert@cumin1002" [14:37:01] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:37:01] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1002 [14:37:07] Can you supply the full transcript of whatever you're doing? [14:37:15] Oh. I realised what I've been trying to do. [14:37:21] I'm logged into mwmaint. [14:37:25] That'll do ti. [14:37:27] *facepalm* [14:37:27] *it [14:37:36] (03PS4) 10Bking: blazegraph: Add alert for maxlag [alerts] - 10https://gerrit.wikimedia.org/r/1037850 [14:37:56] Crisis averted. [14:38:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1002 [14:38:18] Maybe a better error for people trying to run scap backport on the mwmaint hosts is needed :D [14:38:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1426 to wikikube-worker1002 [14:38:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:52] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1427 to wikikube-worker1003 [14:38:57] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:39:09] Dreamy_Jazz: I'll see what I can do. [14:39:19] Thanks. Again my mistake. Apologies. 
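
The "better error" suggested above for people running a backport from the wrong host could look something like the sketch below. It is only an illustration in Python and does not reflect how scap is actually implemented; the function name `staging_dir_hint`, the wording of the message, and the example hostnames are all assumptions.

```python
import os
import socket
import tempfile


def staging_dir_hint(staging_dir, hostname=None):
    """Return None if staging_dir looks like a git checkout, otherwise
    a message hinting that the command may be running on the wrong host."""
    hostname = hostname or socket.gethostname()
    # A checkout normally has a .git entry (a directory, or a file for worktrees).
    if os.path.exists(os.path.join(staging_dir, ".git")):
        return None
    return (
        f"Location is not a git repo: '{staging_dir}' (on {hostname}). "
        "Are you logged into a deployment server?"
    )


# Demo with throwaway directories instead of the real staging path.
with tempfile.TemporaryDirectory() as not_a_repo:
    hint = staging_dir_hint(not_a_repo, hostname="mwmaint-example")

with tempfile.TemporaryDirectory() as repo:
    os.mkdir(os.path.join(repo, ".git"))
    ok = staging_dir_hint(repo, hostname="deploy-example")

print(hint)
```

Including the hostname in the message is the design choice that matters here: the original error was accurate, but it gave no hint that the session was on a maintenance host rather than the deploy server.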
[14:39:22] !log dreamyjazz@deploy1002 Started scap: Backport for [[gerrit:1038310|Ensure excluded SHA-1s have numeric keys for scanFilesInScanTable.php (T366473)]] [14:39:26] T366473: InvalidArgumentException from line 46 of /srv/mediawiki/php-1.43.0-wmf.7/includes/libs/rdbms/expression/Expression.php: The array of values must be a list when running MediaModeration scanning script - https://phabricator.wikimedia.org/T366473 [14:41:04] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1427 to wikikube-worker1003 - cgoubert@cumin1002" [14:41:41] Dreamy_Jazz: Please do file a ticket w/ the full transcript. [14:41:43] !log dreamyjazz@deploy1002 dreamyjazz: Backport for [[gerrit:1038310|Ensure excluded SHA-1s have numeric keys for scanFilesInScanTable.php (T366473)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:42:37] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1427 to wikikube-worker1003 - cgoubert@cumin1002" [14:42:37] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:42:38] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1003 [14:43:00] !log dreamyjazz@deploy1002 dreamyjazz: Continuing with sync [14:44:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1003 [14:44:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1427 to wikikube-worker1003 [14:45:02] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1443 to wikikube-worker1004 [14:45:20] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:45:48] (03CR) 10Ssingh: "Looks good, let's run PCC and then we can +1 and merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/1038340 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:46:00] Filed https://phabricator.wikimedia.org/T366480 [14:46:09] Thanks! [14:47:39] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.27 - 2024.06.16): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9855143 (10BTullis) [14:47:52] (03PS1) 10Muehlenhoff: Remove obsolete thanos-swift.discovery.wmnet.crt certificate [puppet] - 10https://gerrit.wikimedia.org/r/1038368 (https://phabricator.wikimedia.org/T356412) [14:48:33] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.27 - 2024.06.16): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9855144 (10BTullis) 05Open→03Resolved [14:48:42] (03PS1) 10Vgutierrez: ipip: Avoid puppet errors disabling rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1038369 (https://phabricator.wikimedia.org/T366466) [14:51:10] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038368 (https://phabricator.wikimedia.org/T356412) (owner: 10Muehlenhoff) [14:51:26] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1038310|Ensure excluded SHA-1s have numeric keys for scanFilesInScanTable.php (T366473)]] (duration: 12m 04s) [14:51:29] T366473: InvalidArgumentException from line 46 of /srv/mediawiki/php-1.43.0-wmf.7/includes/libs/rdbms/expression/Expression.php: The array of values must be a list when running MediaModeration scanning script - https://phabricator.wikimedia.org/T366473 [14:51:34] (03CR) 10Hashar: [C:03+2] Rebuild plugins for Java 17 [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038259 (https://phabricator.wikimedia.org/T364342) (owner: 10Hashar) [14:51:37] (03CR) 10Btullis: [V:03+1 C:03+2] Absent the rsync configuration for deprecated misc jobs [puppet] - 
10https://gerrit.wikimedia.org/r/1038263 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [14:52:01] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1443 to wikikube-worker1004 - cgoubert@cumin1002" [14:52:04] Gerges: I am going to choose to decline to deploy your last configuration change and unless anyone else feels comfortable deploying it, I will close this backport window. [14:52:07] (03Merged) 10jenkins-bot: Rebuild plugins for Java 17 [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038259 (https://phabricator.wikimedia.org/T364342) (owner: 10Hashar) [14:52:12] (03PS3) 10Pppery: Rescue libphutil translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1037902 [14:52:12] PROBLEM - Disk space on karapace1002 is CRITICAL: DISK CRITICAL - free space: / 597 MB (3% inode=94%): /tmp 597 MB (3% inode=94%): /var/tmp 597 MB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=karapace1002&var-datasource=eqiad+prometheus/ops [14:52:19] Oh, looks like they disconnected anyway. 
[14:52:27] !log Afternoon UTC backport window done [14:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:07] !log hashar@deploy1002 Started deploy [gerrit/gerrit@c93e47d]: Rebuild plugins for Java 17 - T364342 [14:53:10] T364342: Switch Gerrit from Java 11 to Java 17 - https://phabricator.wikimedia.org/T364342 [14:53:12] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@c93e47d]: Rebuild plugins for Java 17 - T364342 (duration: 00m 05s) [14:53:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1443 to wikikube-worker1004 - cgoubert@cumin1002" [14:53:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:53:22] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1004 [14:54:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1004 [14:54:27] !log hashar@deploy1002 Started deploy [gerrit/gerrit@6ba3f2e]: Rebuild plugins for Java 17 - T364342 [14:54:28] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9855203 (10VRiley-WMF) @kamila No problem! would 10:00AM EST work for you? Also, something to note with this move. We will have... 
[14:54:31] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2720/co" [puppet] - 10https://gerrit.wikimedia.org/r/1038369 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [14:54:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1443 to wikikube-worker1004 [14:54:35] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@6ba3f2e]: Rebuild plugins for Java 17 - T364342 (duration: 00m 08s) [14:55:03] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1490 to wikikube-worker1007 [14:55:20] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:55:44] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:57:06] !log hashar@deploy1002 Started deploy [gerrit/gerrit@6ba3f2e]: Rebuild plugins for Java 17 - T364342 [14:57:11] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@6ba3f2e]: Rebuild plugins for Java 17 - T364342 (duration: 00m 05s) [14:57:43] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1490 to wikikube-worker1007 - cgoubert@cumin1002" [14:58:43] (03CR) 10Eevans: services: add data-gateway service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [14:59:26] (03CR) 10Ssingh: "looks good, one nit that might be subjective so feel free to ignore." 
[puppet] - 10https://gerrit.wikimedia.org/r/1038369 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [15:00:19] (03PS2) 10Vgutierrez: ipip: Avoid puppet errors disabling rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1038369 (https://phabricator.wikimedia.org/T366466) [15:00:29] (03CR) 10Vgutierrez: ipip: Avoid puppet errors disabling rp_filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1038369 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [15:00:40] (03CR) 10Volans: [C:03+1] "post-merge +1" [cookbooks] - 10https://gerrit.wikimedia.org/r/1037572 (owner: 10Elukey) [15:00:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1490 to wikikube-worker1007 - cgoubert@cumin1002" [15:00:50] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:00:50] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1007 [15:00:53] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-c2-codfw.mgmt.codfw.wmnet [15:00:55] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:01:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1007 [15:01:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1490 to wikikube-worker1007 [15:02:03] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2721/co" [puppet] - 10https://gerrit.wikimedia.org/r/1038369 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [15:02:38] (03CR) 10Ssingh: [C:03+1] ipip: Avoid puppet errors disabling rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1038369 (https://phabricator.wikimedia.org/T366466) (owner: 
10Vgutierrez) [15:03:19] !log dancy@deploy1002 Installing scap version "4.84.0" for 297 hosts [15:03:52] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-c2-codfw - pt1979@cumin2002" [15:04:03] (03CR) 10Volans: reports: accounting, support swapped motherboards (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1037538 (https://phabricator.wikimedia.org/T358542) (owner: 10Volans) [15:04:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-c2-codfw - pt1979@cumin2002" [15:04:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:08:24] (03CR) 10Volans: "replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:09:43] (03CR) 10Vgutierrez: [V:03+1 C:03+2] ipip: Avoid puppet errors disabling rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1038369 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [15:10:57] (03CR) 10Btullis: [C:03+2] Remove absented rsync configs for deprecated dumps [puppet] - 10https://gerrit.wikimedia.org/r/1038266 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [15:11:08] (03PS2) 10Btullis: Remove absented rsync configs for deprecated dumps [puppet] - 10https://gerrit.wikimedia.org/r/1038266 (https://phabricator.wikimedia.org/T353785) [15:11:37] (03PS3) 10Hashar: Switch Gerrit to Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1038249 (https://phabricator.wikimedia.org/T364342) (owner: 10Muehlenhoff) [15:12:11] RECOVERY - Disk space on karapace1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=karapace1002&var-datasource=eqiad+prometheus/ops 
[15:12:30] (03CR) 10Brennen Bearnes: "Yeah, I have created a GitLab user:" [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [15:13:36] (03CR) 10Btullis: [V:03+2 C:03+2] Remove absented rsync configs for deprecated dumps [puppet] - 10https://gerrit.wikimedia.org/r/1038266 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [15:15:20] (03PS1) 10MVernon: ceph: also install systemctl into the image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1038372 [15:16:01] Hi Dreamy_Jazz [15:17:37] (03CR) 10Volans: "the approach looks good, left some questions on some parameters, to be investigated" [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:18:39] (03CR) 10Arnaudb: [C:03+1] ceph: also install systemctl into the image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1038372 (owner: 10MVernon) [15:21:30] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1002.eqiad.wmnet with OS bullseye [15:21:49] (03CR) 10MVernon: [V:03+2 C:03+2] ceph: also install systemctl into the image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1038372 (owner: 10MVernon) [15:22:31] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1003.eqiad.wmnet with OS bullseye [15:23:02] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1004.eqiad.wmnet with OS bullseye [15:23:27] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1007.eqiad.wmnet with OS bullseye [15:27:01] PROBLEM - SSH on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:27:06] !log dancy@mwmaint1002 scap failed: FileNotFoundError [Errno 2] No such file or directory: '/etc/helmfile-defaults/mediawiki-deployments.yaml' 
(duration: 00m 00s) [15:27:06] (03PS7) 10Elukey: sre.host.provision: no-op refactor to highlight DELL-specific confs [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) [15:27:06] (03PS4) 10Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) [15:27:07] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:27:15] PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:27:15] PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:27:21] PROBLEM - grafana.wikimedia.org on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [15:27:21] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:28:55] (03CR) 10Elukey: sre.host.provision: no-op refactor to highlight DELL-specific confs (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:29:25] Gerges: Hi. [15:29:35] mmhh grafana's unhappy, I'll take a look [15:29:39] (they left again it seems). [15:30:04] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240603T1530). 
[15:30:51] RECOVERY - SSH on grafana1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:30:55] !log dancy@deploy1002 Started scap: testing [15:30:56] !log dancy@deploy1002 sync-world aborted: testing (duration: 00m 00s) [15:31:09] RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Sun 16 Jun 2024 04:13:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:31:09] RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Sun 16 Jun 2024 04:13:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:31:13] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:31:13] RECOVERY - grafana.wikimedia.org on grafana1002 is OK: HTTP OK: HTTP/1.1 200 OK - 134221 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [15:31:51] it is back by itself btw [15:32:32] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038249 (https://phabricator.wikimedia.org/T364342) (owner: 10Muehlenhoff) [15:33:06] (03CR) 10JHathaway: [C:03+2] wikipedia.org spf: indicate mail is not sent from this domain. [dns] - 10https://gerrit.wikimedia.org/r/1037157 (https://phabricator.wikimedia.org/T211403) (owner: 10JHathaway) [15:33:17] (03PS3) 10JHathaway: wikipedia.org spf: indicate mail is not sent from this domain. [dns] - 10https://gerrit.wikimedia.org/r/1037157 (https://phabricator.wikimedia.org/T211403) [15:33:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... 
[15:33:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:34:43] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1002.eqiad.wmnet with reason: host reimage [15:35:14] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for Mvolz - https://phabricator.wikimedia.org/T366088#9855432 (10WDoranWMF) Approved [15:35:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:35:26] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1003.eqiad.wmnet with reason: host reimage [15:36:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-c2-codfw.mgmt.codfw.wmnet [15:36:39] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1004.eqiad.wmnet with reason: host reimage [15:37:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1002.eqiad.wmnet with reason: host reimage [15:38:27] (03CR) 10JHathaway: [V:03+2 C:03+2] wikipedia.org spf: indicate mail is not sent from this domain. 
[dns] - 10https://gerrit.wikimedia.org/r/1037157 (https://phabricator.wikimedia.org/T211403) (owner: 10JHathaway) [15:38:41] (03PS3) 10JHathaway: wikipedia.org dmarc: change to quarantine [dns] - 10https://gerrit.wikimedia.org/r/1037154 (https://phabricator.wikimedia.org/T211403) [15:38:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:38:49] (03CR) 10Ryan Kemper: [C:03+1] Remove temporary firewall rule for WDQS graph_split [puppet] - 10https://gerrit.wikimedia.org/r/1038328 (https://phabricator.wikimedia.org/T350106) (owner: 10Btullis) [15:39:05] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:39:51] (03CR) 10JHathaway: [C:03+2] wikipedia.org dmarc: change to quarantine [dns] - 10https://gerrit.wikimedia.org/r/1037154 (https://phabricator.wikimedia.org/T211403) (owner: 10JHathaway) [15:40:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:41:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1004.eqiad.wmnet with reason: host reimage [15:41:54] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9855462 (10elukey) Network config for kubernetes2054 as seen by Redfish (supermicro): ` >>> pprint(a.request("get",... 
[15:42:26] (03CR) 10Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:42:37] !log deploying more restrictive SPF & DMARC settings for wikipedia.org [15:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:23] !log hashar@deploy1002 Started deploy [gerrit/gerrit@c93e47d]: Revert "Rebuild plugins for Java 17" to stick to Java 11 based compiled plugins - T364342 [15:43:25] T364342: Switch Gerrit from Java 11 to Java 17 - https://phabricator.wikimedia.org/T364342 [15:43:28] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@c93e47d]: Revert "Rebuild plugins for Java 17" to stick to Java 11 based compiled plugins - T364342 (duration: 00m 05s) [15:43:43] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9855466 (10Papaul) [15:43:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1003.eqiad.wmnet with reason: host reimage [15:45:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:45:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:45:34] 06SRE, 06Traffic: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360#9855483 (10ayounsi) > Last time we rolled out this change, it was simply updating modules/install_server/files/autoinstall/common.cfg. 
Do you have any other place in mind where... [15:48:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [15:48:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:50:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:50:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2212', diff saved to https://phabricator.wikimedia.org/P63939 and previous config saved to /var/cache/conftool/dbconfig/20240603-155048-root.json [15:55:55] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1002.eqiad.wmnet with OS bullseye [15:56:52] (03PS1) 10Muehlenhoff: Remove obsolete stub cert [labs/private] - 10https://gerrit.wikimedia.org/r/1038380 [15:57:53] 06SRE, 06Traffic: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360#9855543 (10ssingh) >>! In T366360#9855483, @ayounsi wrote: >> Last time we rolled out this change, it was simply updating modules/install_server/files/autoinstall/common.cfg. Do... 
[15:59:11] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1004.eqiad.wmnet with OS bullseye [15:59:13] (03PS2) 10Cwhite: admin: add mvolz to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1036594 (https://phabricator.wikimedia.org/T366088) [15:59:23] (03PS3) 10Cwhite: admin: add mvolz to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1036594 (https://phabricator.wikimedia.org/T366088) [16:00:19] (03CR) 10Andrew Bogott: [C:03+2] "tools-puppetserver-01 (using puppetdb) has:" [puppet] - 10https://gerrit.wikimedia.org/r/1037812 (https://phabricator.wikimedia.org/T366357) (owner: 10Andrew Bogott) [16:01:57] (03CR) 10Cwhite: [C:03+2] admin: add mvolz to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1036594 (https://phabricator.wikimedia.org/T366088) (owner: 10Cwhite) [16:02:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1003.eqiad.wmnet with OS bullseye [16:02:43] (03CR) 10CDobbins: [V:03+1 C:03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2723/console" [puppet] - 10https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [16:04:10] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for Mvolz - https://phabricator.wikimedia.org/T366088#9855557 (10colewhite) 05Stalled→03Resolved a:03colewhite The group membership change has been deployed. Please feel free to reopen if you encounter an... [16:05:50] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, and 2 others: Degraded RAID on cloudcephosd1031 - https://phabricator.wikimedia.org/T364060#9855591 (10Jclark-ctr) 05In progress→03Resolved [16:06:01] (03CR) 10Hashar: "Moritz and I will deploy it on Tuesday morning. 
Once upgraded, I will be able to roll https://gerrit.wikimedia.org/r/c/operations/software" [puppet] - 10https://gerrit.wikimedia.org/r/1038249 (https://phabricator.wikimedia.org/T364342) (owner: 10Muehlenhoff) [16:07:55] (03PS1) 10Andrew Bogott: puppetserver 'report' enum: allow 'none' as a value [puppet] - 10https://gerrit.wikimedia.org/r/1038381 (https://phabricator.wikimedia.org/T366357) [16:11:43] (03CR) 10JHathaway: [C:03+1] puppetserver 'report' enum: allow 'none' as a value [puppet] - 10https://gerrit.wikimedia.org/r/1038381 (https://phabricator.wikimedia.org/T366357) (owner: 10Andrew Bogott) [16:11:57] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2724/console" [puppet] - 10https://gerrit.wikimedia.org/r/1038340 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [16:13:29] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] puppetserver 'report' enum: allow 'none' as a value [puppet] - 10https://gerrit.wikimedia.org/r/1038381 (https://phabricator.wikimedia.org/T366357) (owner: 10Andrew Bogott) [16:14:55] (03PS1) 10Dzahn: gerrit: use Java 17 instead of Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/1038382 (https://phabricator.wikimedia.org/T364342) [16:18:34] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host wikikube-worker1007.eqiad.wmnet with OS bullseye [16:19:09] 06SRE, 06Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9855670 (10BBlack) >>! In T366193#9853619, @ayounsi wrote: > I think the difficult part is where to stop the overengineering, for example it could make sens to use Liberica to healthcheck/advertise one of the NS anycast IP,... 
[16:20:24] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1007.eqiad.wmnet with OS bullseye [16:21:19] (03CR) 10DCausse: [C:03+1] blazegraph: Add alert for maxlag [alerts] - 10https://gerrit.wikimedia.org/r/1037850 (owner: 10Bking) [16:23:42] (03PS1) 10Andrew Bogott: puppetdb: Remove 'none' from puppet reports config when adding 'puppetdb' [puppet] - 10https://gerrit.wikimedia.org/r/1038387 (https://phabricator.wikimedia.org/T366357) [16:23:48] (03PS1) 10Kgraessle: InitaliseSettings-labs: Deploy Automoderator patroller workstream survey to cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038388 (https://phabricator.wikimedia.org/T362969) [16:25:41] (03PS2) 10Andrew Bogott: puppetdb: Remove 'none' from puppet reports config when adding 'puppetdb' [puppet] - 10https://gerrit.wikimedia.org/r/1038387 (https://phabricator.wikimedia.org/T366357) [16:26:17] 06SRE, 06Traffic: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360#9855758 (10BBlack) Re: "same logic" - they're different protocols, different hierarchies, and much different on the client behavior front as well. It doesn't make sense to shar... 
[16:28:40] (03PS1) 10Bartosz Dziewoński: Show experimental login popup links on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038389 (https://phabricator.wikimedia.org/T366486) [16:31:00] (03CR) 10Dreamy Jazz: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: 10Dreamy Jazz) [16:32:53] (03CR) 10JHathaway: puppetdb: Remove 'none' from puppet reports config when adding 'puppetdb' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1038387 (https://phabricator.wikimedia.org/T366357) (owner: 10Andrew Bogott) [16:33:39] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9855808 (10elukey) I checked the BIOS settings of kubernetes2054 (Supermicro nodes already configured by DCops) and... [16:33:39] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1007.eqiad.wmnet with reason: host reimage [16:36:12] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9855821 (10elukey) Next steps: * Refactor the provision cookbook to be less DELL specific and allow other vendors, l... 
[16:36:26] (03PS3) 10Andrew Bogott: puppetdb: Remove 'none' from puppet reports config when adding 'puppetdb' [puppet] - 10https://gerrit.wikimedia.org/r/1038387 (https://phabricator.wikimedia.org/T366357) [16:36:32] (03CR) 10Andrew Bogott: puppetdb: Remove 'none' from puppet reports config when adding 'puppetdb' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1038387 (https://phabricator.wikimedia.org/T366357) (owner: 10Andrew Bogott) [16:37:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1007.eqiad.wmnet with reason: host reimage [16:40:25] (03CR) 10JHathaway: [C:03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1038387 (https://phabricator.wikimedia.org/T366357) (owner: 10Andrew Bogott) [16:40:48] (03CR) 10Andrew Bogott: [C:03+2] puppetdb: Remove 'none' from puppet reports config when adding 'puppetdb' [puppet] - 10https://gerrit.wikimedia.org/r/1038387 (https://phabricator.wikimedia.org/T366357) (owner: 10Andrew Bogott) [16:42:28] (03PS1) 10MVernon: cephadm: allow ssh from all mgrs to all targets [puppet] - 10https://gerrit.wikimedia.org/r/1038391 (https://phabricator.wikimedia.org/T279621) [16:42:50] (03CR) 10CI reject: [V:04-1] cephadm: allow ssh from all mgrs to all targets [puppet] - 10https://gerrit.wikimedia.org/r/1038391 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:43:56] (03PS2) 10MVernon: cephadm: allow ssh from all mgrs to all targets [puppet] - 10https://gerrit.wikimedia.org/r/1038391 (https://phabricator.wikimedia.org/T279621) [16:44:22] (03CR) 10CI reject: [V:04-1] cephadm: allow ssh from all mgrs to all targets [puppet] - 10https://gerrit.wikimedia.org/r/1038391 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:44:32] (03PS3) 10MVernon: cephadm: allow ssh from all mgrs to all targets [puppet] - 10https://gerrit.wikimedia.org/r/1038391 (https://phabricator.wikimedia.org/T279621) [16:44:51] (03CR) 10CI reject: [V:04-1] 
cephadm: allow ssh from all mgrs to all targets [puppet] - 10https://gerrit.wikimedia.org/r/1038391 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:46:31] (03CR) 10Elukey: [C:03+1] Remove obsolete thanos-swift.discovery.wmnet.crt certificate [puppet] - 10https://gerrit.wikimedia.org/r/1038368 (https://phabricator.wikimedia.org/T356412) (owner: 10Muehlenhoff) [16:47:49] (03PS4) 10MVernon: cephadm: allow ssh from all mgrs to all targets [puppet] - 10https://gerrit.wikimedia.org/r/1038391 (https://phabricator.wikimedia.org/T279621) [16:48:08] (03CR) 10CI reject: [V:04-1] cephadm: allow ssh from all mgrs to all targets [puppet] - 10https://gerrit.wikimedia.org/r/1038391 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:48:52] (03PS5) 10MVernon: cephadm: allow ssh from all mgrs to all targets [puppet] - 10https://gerrit.wikimedia.org/r/1038391 (https://phabricator.wikimedia.org/T279621) [16:49:14] (03CR) 10CI reject: [V:04-1] cephadm: allow ssh from all mgrs to all targets [puppet] - 10https://gerrit.wikimedia.org/r/1038391 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:50:53] (03PS1) 10Dr0ptp4kt: Bump XML dump schema to version 0.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038392 (https://phabricator.wikimedia.org/T365155) [16:51:27] (03PS6) 10MVernon: cephadm: allow ssh from all mgrs to all targets [puppet] - 10https://gerrit.wikimedia.org/r/1038391 (https://phabricator.wikimedia.org/T279621) [16:53:06] (03CR) 10Dr0ptp4kt: "Preparing." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038392 (https://phabricator.wikimedia.org/T365155) (owner: 10Dr0ptp4kt) [16:53:33] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038391 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:54:35] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9855905 (10Papaul) [16:54:38] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9855906 (10Papaul) @cmooney all good on lsw1-d4, lsw1-c2 and lsw1-d8 [16:55:28] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1007.eqiad.wmnet with OS bullseye [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240603T1700) [17:00:05] ryankemper: May I have your attention please! Wikidata Query Service weekly deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240603T1700) [17:05:43] (03PS1) 10Clément Goubert: mw1358: Put back insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/1038395 (https://phabricator.wikimedia.org/T365571) [17:07:39] (03CR) 10Clément Goubert: [V:03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2725/" [puppet] - 10https://gerrit.wikimedia.org/r/1038395 (https://phabricator.wikimedia.org/T365571) (owner: 10Clément Goubert) [17:08:18] (03CR) 10Bartosz Dziewoński: [C:03+1] "Seems logical to me as well…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037929 (owner: 10Gergő Tisza) [17:08:36] (03CR) 10JMeybohm: [C:03+1] mw1358: Put back insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/1038395 (https://phabricator.wikimedia.org/T365571) (owner: 10Clément Goubert) [17:08:39] (03CR) 10Kamila Součková: [C:03+1] mw1358: Put back insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/1038395 (https://phabricator.wikimedia.org/T365571) (owner: 10Clément Goubert) [17:08:48] (03CR) 10Clément Goubert: [V:03+1 C:03+2] mw1358: Put back insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/1038395 (https://phabricator.wikimedia.org/T365571) (owner: 10Clément Goubert) [17:09:49] (03PS1) 10BryanDavis: toolhub: Bump container to 2024-06-03-170318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038399 (https://phabricator.wikimedia.org/T366506) [17:10:24] (03CR) 10Bartosz Dziewoński: [C:03+1] multiversion: Add tests for MWMultiVersion::getMediaWiki() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037930 (owner: 10Gergő Tisza) [17:12:04] jouncebot: nowandnext [17:12:04] For the next 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240603T1700) [17:12:04] For the next 0 hour(s) and 17 minute(s): Wikidata Query 
Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240603T1700) [17:12:04] In 2 hour(s) and 47 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240603T2000) [17:12:05] (03CR) 10Bartosz Dziewoński: [C:03+1] [beta] Remove references to upload.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038230 (https://phabricator.wikimedia.org/T366415) (owner: 10Gergő Tisza) [17:12:21] (03CR) 10BryanDavis: [C:03+2] toolhub: Bump container to 2024-06-03-170318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038399 (https://phabricator.wikimedia.org/T366506) (owner: 10BryanDavis) [17:13:34] (03Merged) 10jenkins-bot: toolhub: Bump container to 2024-06-03-170318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038399 (https://phabricator.wikimedia.org/T366506) (owner: 10BryanDavis) [17:14:49] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [17:15:21] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [17:16:36] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [17:17:17] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [17:17:35] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [17:17:51] !log homer 'lsw1-e2-eqiad*' commit 'T35107 [17:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:55] T35107: Double-click should highlight text in VisualEditor in IE8 - https://phabricator.wikimedia.org/T35107 [17:17:56] !log homer 'lsw1-e2-eqiad*' commit 'T351074' [17:17:57] (03CR) 10Ladsgroup: [C:03+1] "I can deploy it at any time people deem fit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038392 (https://phabricator.wikimedia.org/T365155) (owner: 10Dr0ptp4kt) [17:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:17:58] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [17:18:28] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [17:19:24] !log homer 'cr*eqiad*' commit 'T351074' [17:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:20] !log Pooling and uncordoning wikikube-worker1002.eqiad.wmnet,wikikube-worker1003.eqiad.wmnet,wikikube-worker1007.eqiad.wmnet,wikikube-worker1004.eqiad.wmnet - T351074 [17:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:24] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [17:27:36] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker1002.eqiad.wmnet|wikikube-worker1003.eqiad.wmnet|wikikube-worker1007.eqiad.wmnet|wikikube-worker1004.eqiad.wmnet),cluster=kubernetes,service=kubesvc [17:28:22] (03CR) 10Dzahn: [C:03+1] "So.. there are actually 116 hosts using this, the rest don't. It can be seen with cumin:" [puppet] - 10https://gerrit.wikimedia.org/r/1029235 (https://phabricator.wikimedia.org/T236253) (owner: 10Ahmon Dancy) [17:30:48] (03CR) 10Dzahn: [C:03+1] "But how does it makes sense even those 116 have the config file but don't install the package?" [puppet] - 10https://gerrit.wikimedia.org/r/1029235 (https://phabricator.wikimedia.org/T236253) (owner: 10Ahmon Dancy) [17:35:24] 06SRE, 06serviceops, 13Patch-For-Review: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253#9856151 (10Dzahn) >>! In T236253#7361174, @elukey wrote: > To keep archives happy - we currently don't deploy `systemd-coredump` on our hosts (because of the reasons highlighte... 
[17:36:49] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Joely Rooke WMDE - https://phabricator.wikimedia.org/T366145#9856200 (10KFrancis) Hi all, the NDA has been signed. Thanks! [17:42:12] PROBLEM - Disk space on karapace1002 is CRITICAL: DISK CRITICAL - free space: / 568 MB (3% inode=94%): /tmp 568 MB (3% inode=94%): /var/tmp 568 MB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=karapace1002&var-datasource=eqiad+prometheus/ops [17:44:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T364299)', diff saved to https://phabricator.wikimedia.org/P63940 and previous config saved to /var/cache/conftool/dbconfig/20240603-174442-marostegui.json [17:44:46] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [17:46:16] (03CR) 10Tchanders: [C:03+1] geoip: Download GeoLite2 ASN file [puppet] - 10https://gerrit.wikimedia.org/r/1037531 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [17:58:02] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1038340/2727/cp5017.eqsin.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1038340 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [17:58:22] (03CR) 10Ssingh: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1038340 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [17:59:01] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1038340 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [17:59:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P63941 and previous config saved to /var/cache/conftool/dbconfig/20240603-175951-marostegui.json [17:59:53] (03CR) 10Xcollazo: [C:03+1] Bump XML dump schema to version 0.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038392 (https://phabricator.wikimedia.org/T365155) (owner: 10Dr0ptp4kt) [17:59:55] (03CR) 10Ssingh: [C:03+1] "Yep, looks good. Merge it whenever you want :)" [puppet] - 10https://gerrit.wikimedia.org/r/1038340 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [18:00:16] (03CR) 10CDobbins: [V:03+1 C:03+2] purged: add use_pki for eqsin (cp5017) [puppet] - 10https://gerrit.wikimedia.org/r/1038340 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [18:00:19] 06SRE, 06serviceops, 13Patch-For-Review: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253#9856381 (10Dzahn) I talked a bit about this in #systemd IRC channel. Mostly to ask if the config is irrelevant as long as the package isn't installed, but once I explained the... 
[18:00:53] (03PS1) 10Eevans: cassandra: add commons_impact_metrics role to aqs cluster [puppet] - 10https://gerrit.wikimedia.org/r/1038409 (https://phabricator.wikimedia.org/T361835) [18:01:47] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 14), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9856388 (10Eevans) [18:04:02] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Joely Rooke WMDE - https://phabricator.wikimedia.org/T366145#9856396 (10Dzahn) Thanks, Katie. I see it on the spreadsheet. Adding to the groups. [18:06:12] (03CR) 10Dzahn: [C:03+2] "NDA is now complete. re-added to LDAP groups. done." [puppet] - 10https://gerrit.wikimedia.org/r/1037603 (https://phabricator.wikimedia.org/T366145) (owner: 10Dzahn) [18:07:14] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Joely Rooke WMDE - https://phabricator.wikimedia.org/T366145#9856410 (10Dzahn) 05In progress→03Resolved @JoelyRooke-WMDE All done now. You have the groups and things should work just like for other WMDE staff. 
[18:14:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P63942 and previous config saved to /var/cache/conftool/dbconfig/20240603-181459-marostegui.json [18:15:26] (03PS2) 10Jdlrobson: Wrap tables in Vector 2022 for projects where legacy Vector is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037600 (https://phabricator.wikimedia.org/T366314) [18:16:30] (03PS3) 10Jdlrobson: Wrap tables in Vector 2022 for projects where legacy Vector is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037600 (https://phabricator.wikimedia.org/T366314) [18:17:44] (03PS7) 10Jdlrobson: Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) [18:20:26] (03CR) 10Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [18:22:19] (03PS2) 10Eevans: cassandra: create new commons_impact_analytics role [puppet] - 10https://gerrit.wikimedia.org/r/1038409 (https://phabricator.wikimedia.org/T361835) [18:25:39] (03PS1) 10Eevans: Faux commons_impact_analytics Cassandra creds [labs/private] - 10https://gerrit.wikimedia.org/r/1038416 (https://phabricator.wikimedia.org/T361835) [18:28:05] (03CR) 10Eevans: [C:03+2] Faux commons_impact_analytics Cassandra creds [labs/private] - 10https://gerrit.wikimedia.org/r/1038416 (https://phabricator.wikimedia.org/T361835) (owner: 10Eevans) [18:28:07] (03CR) 10Eevans: [V:03+2 C:03+2] Faux commons_impact_analytics Cassandra creds [labs/private] - 10https://gerrit.wikimedia.org/r/1038416 (https://phabricator.wikimedia.org/T361835) (owner: 10Eevans) [18:30:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T364299)', diff saved to https://phabricator.wikimedia.org/P63943 and 
previous config saved to /var/cache/conftool/dbconfig/20240603-183006-marostegui.json [18:30:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2188.codfw.wmnet with reason: Maintenance [18:30:10] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [18:30:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2188.codfw.wmnet with reason: Maintenance [18:30:23] (03CR) 10Dzahn: "@Muehlenhoff If you are doing this tomorrow, merge at will." [puppet] - 10https://gerrit.wikimedia.org/r/1038382 (https://phabricator.wikimedia.org/T364342) (owner: 10Dzahn) [18:30:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T364299)', diff saved to https://phabricator.wikimedia.org/P63944 and previous config saved to /var/cache/conftool/dbconfig/20240603-183029-marostegui.json [18:31:13] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038409 (https://phabricator.wikimedia.org/T361835) (owner: 10Eevans) [18:32:09] (03CR) 10Dzahn: "Not sure anymore - based on some chat about it on #systemd if the hangs indicate I/O issue rather than CPU then turning off compression mi" [puppet] - 10https://gerrit.wikimedia.org/r/1029235 (https://phabricator.wikimedia.org/T236253) (owner: 10Ahmon Dancy) [18:33:19] (03CR) 10Xcollazo: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1038409 (https://phabricator.wikimedia.org/T361835) (owner: 10Eevans) [18:34:47] (03CR) 10Scott French: [C:03+1] "Thanks, Eric! 
Since this is more complete (e.g., picks up the change to include Cassandra staging), I'll abandon my pending patch [0] for " [puppet] - 10https://gerrit.wikimedia.org/r/1038409 (https://phabricator.wikimedia.org/T361835) (owner: 10Eevans) [18:35:20] (03Abandoned) 10Scott French: DNM: cassandra: add commons_impact_analytics user [puppet] - 10https://gerrit.wikimedia.org/r/1023960 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [18:35:29] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1038382/2729/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1038382 (https://phabricator.wikimedia.org/T364342) (owner: 10Dzahn) [18:36:05] (03PS7) 10Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) [18:42:18] (03PS1) 10Kosta Harlan: IPInfo: Set GeoLite2Prefix path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038421 (https://phabricator.wikimedia.org/T361884) [18:42:35] (03CR) 10Bking: [C:03+2] blazegraph: Add alert for maxlag [alerts] - 10https://gerrit.wikimedia.org/r/1037850 (owner: 10Bking) [18:44:34] (03PS2) 10Kosta Harlan: IPInfo: Set GeoLite2Prefix path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038421 (https://phabricator.wikimedia.org/T361884) [18:49:39] (03CR) 10Bking: [C:03+2] Remove the discovery-analytics dsh config for stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/1038288 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [18:50:51] (03CR) 10Bking: [C:03+2] "approved/merged after IRC discussion w/Ebernhardson" [puppet] - 10https://gerrit.wikimedia.org/r/1038288 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [18:52:24] (03CR) 10Bking: [C:03+2] Remove temporary firewall rule for WDQS graph_split [puppet] - 10https://gerrit.wikimedia.org/r/1038328 (https://phabricator.wikimedia.org/T350106) (owner: 10Btullis) [18:56:18] 
(03CR) 10Paladox: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1038249 already exists by Muehlenhoff" [puppet] - 10https://gerrit.wikimedia.org/r/1038382 (https://phabricator.wikimedia.org/T364342) (owner: 10Dzahn) [18:59:23] (03PS1) 10Jdlrobson: Enable night theme on pages which have no color contrast issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038424 (https://phabricator.wikimedia.org/T366370) [18:59:50] (03CR) 10Dzahn: "ah! thanks Paladox, closing as duplicate" [puppet] - 10https://gerrit.wikimedia.org/r/1038382 (https://phabricator.wikimedia.org/T364342) (owner: 10Dzahn) [19:00:00] (03Abandoned) 10Dzahn: gerrit: use Java 17 instead of Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/1038382 (https://phabricator.wikimedia.org/T364342) (owner: 10Dzahn) [19:09:03] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 14), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9856702 (10Scott_French) Added k8s secret for the commons_impact_analytics rol... 
[19:11:53] (03PS2) 10Jdlrobson: Enable night theme on pages which have no color contrast issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038424 (https://phabricator.wikimedia.org/T366370) [19:13:58] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:16:32] (03PS4) 10Scott French: DNM: services: add commons-impact-analytics service helmfile configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835) [19:16:32] (03PS4) 10Scott French: DNM: rest-gateway: route commons-analytics via rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023958 (https://phabricator.wikimedia.org/T361835) [19:16:34] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:18:30] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52067 bytes in 6.526 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:18:52] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.252 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:23:50] (03PS3) 10Jdlrobson: Enable night theme on pages which have no color contrast issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038424 (https://phabricator.wikimedia.org/T366370) [19:28:33] (03PS1) 10Ryan Kemper: wdqs: rip out outdated legacy_updater::journal var [puppet] - 10https://gerrit.wikimedia.org/r/1038427 (https://phabricator.wikimedia.org/T364366) [19:28:50] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038427 (https://phabricator.wikimedia.org/T364366) (owner: 10Ryan Kemper) [19:29:37] (03PS4) 10Jdlrobson: Enable night theme on pages which have no color contrast issues [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1038424 (https://phabricator.wikimedia.org/T366370) [19:32:41] (03CR) 10Bking: [C:03+1] wdqs: rip out outdated legacy_updater::journal var [puppet] - 10https://gerrit.wikimedia.org/r/1038427 (https://phabricator.wikimedia.org/T364366) (owner: 10Ryan Kemper) [19:32:44] (03CR) 10Ryan Kemper: [C:03+2] wdqs: rip out outdated legacy_updater::journal var [puppet] - 10https://gerrit.wikimedia.org/r/1038427 (https://phabricator.wikimedia.org/T364366) (owner: 10Ryan Kemper) [19:42:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P63945 and previous config saved to /var/cache/conftool/dbconfig/20240603-194236-root.json [19:48:07] (03PS1) 10Ryan Kemper: wdqs: rename journal hiera var & plumb thru fully [puppet] - 10https://gerrit.wikimedia.org/r/1038428 (https://phabricator.wikimedia.org/T364366) [19:48:22] (03CR) 10Scott French: "Thanks, @cgoubert@wikimedia.org! One last thing, but otherwise looks good." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978) (owner: 10Clément Goubert) [19:48:25] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038428 (https://phabricator.wikimedia.org/T364366) (owner: 10Ryan Kemper) [19:50:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:51:27] (03CR) 10Bking: [C:03+1] wdqs: rename journal hiera var & plumb thru fully [puppet] - 10https://gerrit.wikimedia.org/r/1038428 (https://phabricator.wikimedia.org/T364366) (owner: 10Ryan Kemper) [19:52:36] (03CR) 10Ryan Kemper: [C:03+2] wdqs: rename journal hiera var & plumb thru fully [puppet] - 10https://gerrit.wikimedia.org/r/1038428 (https://phabricator.wikimedia.org/T364366) (owner: 10Ryan Kemper) [19:57:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P63946 and previous config saved to /var/cache/conftool/dbconfig/20240603-195742-root.json [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240603T2000). [20:00:04] kostajh, _Gerges, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:09] hello [20:00:14] i can deploy today [20:00:17] Hi [20:00:20] thanks urbanecm [20:00:21] hi kostajh and Gerges [20:00:31] * cjming thanks urbanecm [20:00:37] and hello cjming :) [20:00:48] 07sre-alert-triage, 06cloud-services-team: Alert triage: Adjust severity of backup_cinder_volumes from critical to warning - https://phabricator.wikimedia.org/T342764#9856932 (10Andrew) 05Open→03Declined This hasn't been an issue lately; also digging in code suggests that it's already not a critical wa... [20:01:16] o/ [20:01:32] (03PS5) 10Kosta Harlan: EventLogging: Enable IP reputation logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034882 (https://phabricator.wikimedia.org/T354597) [20:01:37] (03CR) 10Urbanecm: [C:03+2] EventLogging: Enable IP reputation logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034882 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [20:02:15] (03Merged) 10jenkins-bot: EventLogging: Enable IP reputation logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034882 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [20:02:57] 06SRE, 10SRE-tools: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9856945 (10Volans) p:05Triage→03Medium yeah, removing the unowned tag, doesn't seem to fit this one IMHO. 
[20:02:59] (03PS3) 10GergesShamon: [trwiki] Allow translator group to publish translation only in Extension:ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037897 (https://phabricator.wikimedia.org/T356440) [20:03:20] (03CR) 10Urbanecm: [C:03+2] [trwiki] Reducing count edits ip and newbie per minute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038247 (https://phabricator.wikimedia.org/T330811) (owner: 10GergesShamon) [20:03:23] (03CR) 10Urbanecm: [C:03+2] [trwiki] Allow translator group to publish translation only in Extension:ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037897 (https://phabricator.wikimedia.org/T356440) (owner: 10GergesShamon) [20:03:56] (03Merged) 10jenkins-bot: [trwiki] Reducing count edits ip and newbie per minute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038247 (https://phabricator.wikimedia.org/T330811) (owner: 10GergesShamon) [20:04:42] (03PS4) 10GergesShamon: [trwiki] Allow translator group to publish translation only in Extension:ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037897 (https://phabricator.wikimedia.org/T356440) [20:04:45] (03CR) 10Urbanecm: [trwiki] Allow translator group to publish translation only in Extension:ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037897 (https://phabricator.wikimedia.org/T356440) (owner: 10GergesShamon) [20:04:49] (03CR) 10Urbanecm: [C:03+2] [trwiki] Allow translator group to publish translation only in Extension:ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037897 (https://phabricator.wikimedia.org/T356440) (owner: 10GergesShamon) [20:04:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037897 (https://phabricator.wikimedia.org/T356440) (owner: 10GergesShamon) [20:05:41] (03Merged) 10jenkins-bot: [trwiki] Allow translator group to publish 
translation only in Extension:ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037897 (https://phabricator.wikimedia.org/T356440) (owner: 10GergesShamon) [20:05:58] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1034882|EventLogging: Enable IP reputation logging (T354597)]], [[gerrit:1037897|[trwiki] Allow translator group to publish translation only in Extension:ContentTranslation]], [[gerrit:1038247|[trwiki] Reducing count edits ip and newbie per minute (T330811)]] [20:06:03] T354597: Record IP reputation data for account creations and edits - https://phabricator.wikimedia.org/T354597 [20:06:03] T330811: Reduce the edit rate limit for trwiki - https://phabricator.wikimedia.org/T330811 [20:10:17] !log urbanecm@deploy1002 kharlan and urbanecm and gergesshamon: Backport for [[gerrit:1034882|EventLogging: Enable IP reputation logging (T354597)]], [[gerrit:1037897|[trwiki] Allow translator group to publish translation only in Extension:ContentTranslation]], [[gerrit:1038247|[trwiki] Reducing count edits ip and newbie per minute (T330811)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:10:34] kostajh: Gerges: your patches are available at mwdebug1001. can you test please? [20:10:40] urbanecm: looking [20:10:43] ty [20:11:03] Ok [20:12:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P63947 and previous config saved to /var/cache/conftool/dbconfig/20240603-201248-root.json [20:15:37] kostajh: how is it looking? [20:15:59] urbanecm: still testing [20:16:06] ok, waiting :) [20:16:49] urbanecm: I tested trwiki's patch, which is about restricting publishing [20:20:40] Gerges: did it work? :) [20:20:50] poor ircservserv :-( [20:21:32] welcome back! 
[20:23:01] urbanecm: im ready when you are both my patches can go out together btw [20:23:18] Jdlrobson: ack, waiting on tests of other patches [20:23:32] Yes [20:24:06] good [20:25:31] urbanecm: there are 30k messages related to trwiki in the mwdebug logstash dashboard, not sure if any of those are things to worry about [20:26:26] urbanecm: I haven't seen positive confirmation that events are flowing the way I want them to, but I also don't see errors in the mwdebug logstash [20:26:32] * kostajh so that is probably good enough to continue [20:26:41] kostajh: to me, it looks like "Verbose logging" was enabled on Gerges's end, which triggered them. but, they're not errors, so not a concern for me i think. [20:26:48] yeah [20:27:26] Gerges: fwiw, if you indeed enabled "Verbose logging" in XWikimediaDebug, please keep that box off when testing something the next time :). it is needed only in certain situations, and in others, it is likely to even confuse people. thanks! :) [20:27:27] kostajh: Sorry [20:27:31] no worries [20:27:36] !log urbanecm@deploy1002 kharlan and urbanecm and gergesshamon: Continuing with sync [20:27:40] going ahead then [20:27:43] nothing to apologize for! 
[20:27:53] (03PS4) 10Jdlrobson: Wrap tables in Vector 2022 for projects where legacy Vector is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037600 (https://phabricator.wikimedia.org/T366314) [20:27:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P63948 and previous config saved to /var/cache/conftool/dbconfig/20240603-202754-root.json [20:27:55] (03PS5) 10Jdlrobson: Enable night theme on pages which have no color contrast issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038424 (https://phabricator.wikimedia.org/T366370) [20:27:58] (03CR) 10Urbanecm: [C:03+2] Wrap tables in Vector 2022 for projects where legacy Vector is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037600 (https://phabricator.wikimedia.org/T366314) (owner: 10Jdlrobson) [20:28:00] (03CR) 10Urbanecm: [C:03+2] Enable night theme on pages which have no color contrast issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038424 (https://phabricator.wikimedia.org/T366370) (owner: 10Jdlrobson) [20:28:26] * urbanecm is easily confused by seeing wikibugs attribute his actions to Jdlrobson [20:28:32] (i know why is that, it is just very much confusing) [20:28:53] (03Merged) 10jenkins-bot: Wrap tables in Vector 2022 for projects where legacy Vector is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037600 (https://phabricator.wikimedia.org/T366314) (owner: 10Jdlrobson) [20:28:55] (03Merged) 10jenkins-bot: Enable night theme on pages which have no color contrast issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038424 (https://phabricator.wikimedia.org/T366370) (owner: 10Jdlrobson) [20:29:25] waiting on scap now... 
[20:29:33] * urbanecm misses <1 min deployments [20:31:54] (03CR) 10Gergő Tisza: [C:03+1] Show experimental login popup links on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038389 (https://phabricator.wikimedia.org/T366486) (owner: 10Bartosz Dziewoński) [20:34:20] urbanecm: I see that events are flowing via Kafka Grafana dashboards [20:34:43] that's good i guess [20:34:48] yes [20:35:11] urbanecm: Is there anything else? I will close the IRC window [20:35:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T352010)', diff saved to https://phabricator.wikimedia.org/P63949 and previous config saved to /var/cache/conftool/dbconfig/20240603-203514-ladsgroup.json [20:35:20] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:35:30] Gerges: shouldn't be [20:36:13] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1034882|EventLogging: Enable IP reputation logging (T354597)]], [[gerrit:1037897|[trwiki] Allow translator group to publish translation only in Extension:ContentTranslation]], [[gerrit:1038247|[trwiki] Reducing count edits ip and newbie per minute (T330811)]] (duration: 30m 14s) [20:36:18] T354597: Record IP reputation data for account creations and edits - https://phabricator.wikimedia.org/T354597 [20:36:19] T330811: Reduce the edit rate limit for trwiki - https://phabricator.wikimedia.org/T330811 [20:36:51] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1037600|Wrap tables in Vector 2022 for projects where legacy Vector is default (T366314)]], [[gerrit:1038424|Enable night theme on pages which have no color contrast issues (T366370)]] [20:36:55] T366314: Deploy and QA responsive tables change - https://phabricator.wikimedia.org/T366314 [20:36:55] T366370: Enable night theme on pages which have no color contrast issues - https://phabricator.wikimedia.org/T366370 [20:37:01] Jdlrobson: working on your patches now [20:38:14]
urbanecm: thanks :) [20:39:12] !log urbanecm@deploy1002 jdlrobson and urbanecm: Backport for [[gerrit:1037600|Wrap tables in Vector 2022 for projects where legacy Vector is default (T366314)]], [[gerrit:1038424|Enable night theme on pages which have no color contrast issues (T366370)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:40:06] Jdlrobson: please take a look :) [20:41:02] urbanecm: on it [20:42:57] urbanecm: LGTM! Please sync! [20:43:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P63950 and previous config saved to /var/cache/conftool/dbconfig/20240603-204300-root.json [20:43:04] proceeding! [20:43:05] !log urbanecm@deploy1002 jdlrobson and urbanecm: Continuing with sync [20:50:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P63951 and previous config saved to /var/cache/conftool/dbconfig/20240603-205024-ladsgroup.json [20:51:23] 06SRE, 06Infrastructure-Foundations, 06Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9857120 (10CDanis) Results after adding BR.ix are in. The set of countries that magru improves hasn't changed: BR, CL, AR, UY, PY, BO... [20:51:48] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1037600|Wrap tables in Vector 2022 for projects where legacy Vector is default (T366314)]], [[gerrit:1038424|Enable night theme on pages which have no color contrast issues (T366370)]] (duration: 14m 57s) [20:51:52] Jdlrobson: and done [20:51:53] T366314: Deploy and QA responsive tables change - https://phabricator.wikimedia.org/T366314 [20:51:53] T366370: Enable night theme on pages which have no color contrast issues - https://phabricator.wikimedia.org/T366370 [20:51:55] anything else? 
[20:54:03] does not appear so [20:58:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P63952 and previous config saved to /var/cache/conftool/dbconfig/20240603-205806-root.json [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240603T2100). [21:00:58] thanks urbanecm for your help today! [21:01:37] Np [21:05:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P63953 and previous config saved to /var/cache/conftool/dbconfig/20240603-210532-ladsgroup.json [21:10:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:11:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:13:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P63954 and previous config saved to /var/cache/conftool/dbconfig/20240603-211312-root.json [21:20:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T352010)', diff saved to https://phabricator.wikimedia.org/P63955 and previous config saved to /var/cache/conftool/dbconfig/20240603-212040-ladsgroup.json [21:20:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance 
[21:20:43] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [21:20:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance [21:26:27] (03CR) 10JHathaway: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2731/co" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [21:32:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [21:32:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [21:37:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[21:37:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [21:39:07] (03PS4) 10JHathaway: phab: query for inbound mail servers [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) [21:39:07] (03PS1) 10JHathaway: role::mail::mx: tag servers with mx_in [puppet] - 10https://gerrit.wikimedia.org/r/1038442 (https://phabricator.wikimedia.org/T365395) [21:39:28] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038442 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [21:39:57] 06SRE, 10Wikimedia-Mailing-lists: Cross post to multiple mailling lists is only received once by recipient - https://phabricator.wikimedia.org/T345691#9857320 (10Dzahn) But "Receive list copies" is just about receiving copies of your own posts. So I'm confused why it's now about replies to your posts and gmai... 
[21:40:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T364069)', diff saved to https://phabricator.wikimedia.org/P63956 and previous config saved to /var/cache/conftool/dbconfig/20240603-214000-marostegui.json [21:40:03] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [21:41:23] (03CR) 10JHathaway: [C:03+2] role::mail::mx: tag servers with mx_in [puppet] - 10https://gerrit.wikimedia.org/r/1038442 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [21:51:44] (03CR) 10JHathaway: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2733/co" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [21:54:23] (03PS1) 10Stoyofuku-wmf: Disable font size options on specified pages for pt, ta, ja [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038444 (https://phabricator.wikimedia.org/T366334) [21:55:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P63957 and previous config saved to /var/cache/conftool/dbconfig/20240603-215508-marostegui.json [22:00:05] (03PS5) 10JHathaway: phab: query for inbound mail servers [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) [22:05:57] (03CR) 10Jdlrobson: "The configuration in wmf-config/InitialiseSettings-labs.php no longer applies anywhere so can be removed as part of this change!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038444 (https://phabricator.wikimedia.org/T366334) (owner: 10Stoyofuku-wmf) [22:10:08] (03CR) 10JHathaway: [V:03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2734/" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [22:10:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P63958 and previous config saved to /var/cache/conftool/dbconfig/20240603-221016-marostegui.json [22:10:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:11:46] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:13:19] (03PS6) 10JHathaway: phab: query for inbound mail servers [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) [22:14:40] (03CR) 10JHathaway: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2735/console" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [22:25:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T364069)', diff saved to https://phabricator.wikimedia.org/P63959 and previous config saved to /var/cache/conftool/dbconfig/20240603-222524-marostegui.json [22:25:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 
day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [22:25:29] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [22:25:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [22:25:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:26:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:26:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T364069)', diff saved to https://phabricator.wikimedia.org/P63960 and previous config saved to /var/cache/conftool/dbconfig/20240603-222607-marostegui.json [22:28:59] (03PS2) 10Stoyofuku-wmf: Disable font size options on specified pages for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038444 (https://phabricator.wikimedia.org/T366334) [22:29:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T364299)', diff saved to https://phabricator.wikimedia.org/P63961 and previous config saved to /var/cache/conftool/dbconfig/20240603-222900-marostegui.json [22:29:03] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [22:30:56] (03CR) 10Stoyofuku-wmf: "Taking a slightly different approach after talking it through with Jon (thank you!)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038444 (https://phabricator.wikimedia.org/T366334) (owner: 10Stoyofuku-wmf) [22:34:22] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 146143040 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:35:22] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: 
POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 25160 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:40:42] (03PS5) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) [22:44:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P63962 and previous config saved to /var/cache/conftool/dbconfig/20240603-224408-marostegui.json [22:49:52] (03CR) 10Jdlrobson: [C:03+1] Disable font size options on specified pages for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038444 (https://phabricator.wikimedia.org/T366334) (owner: 10Stoyofuku-wmf) [22:59:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P63963 and previous config saved to /var/cache/conftool/dbconfig/20240603-225916-marostegui.json [23:14:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T364299)', diff saved to https://phabricator.wikimedia.org/P63965 and previous config saved to /var/cache/conftool/dbconfig/20240603-231424-marostegui.json [23:14:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2202.codfw.wmnet with reason: Maintenance [23:14:27] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [23:14:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2202.codfw.wmnet with reason: Maintenance [23:14:42] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki mediawikiwiki "Extension:DynamicPageList (Wikimedia)" "Extension:DynamicPageList" "Zabe" --reason "per request [[:phab:T366488|T366488]]" [23:14:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:44] T366488: Request to move translatable page: Extension:DynamicPageList (Wikimedia) - https://phabricator.wikimedia.org/T366488 [23:35:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T352010)', diff saved to https://phabricator.wikimedia.org/P63966 and previous config saved to /var/cache/conftool/dbconfig/20240603-233555-ladsgroup.json [23:35:59] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:38:29] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1038347 [23:38:30] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1038347 (owner: 10TrainBranchBot) [23:51:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P63967 and previous config saved to /var/cache/conftool/dbconfig/20240603-235104-ladsgroup.json