[00:00:18] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1018410 (owner: TrainBranchBot) [00:02:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T360332)', diff saved to https://phabricator.wikimedia.org/P60319 and previous config saved to /var/cache/conftool/dbconfig/20240411-000211-arnaudb.json [00:14:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P60320 and previous config saved to /var/cache/conftool/dbconfig/20240411-001458-marostegui.json [00:17:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P60321 and previous config saved to /var/cache/conftool/dbconfig/20240411-001718-arnaudb.json [00:30:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P60322 and previous config saved to /var/cache/conftool/dbconfig/20240411-003005-marostegui.json [00:32:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P60323 and previous config saved to /var/cache/conftool/dbconfig/20240411-003226-arnaudb.json [00:45:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T356166)', diff saved to https://phabricator.wikimedia.org/P60324 and previous config saved to /var/cache/conftool/dbconfig/20240411-004514-marostegui.json [00:45:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1244.eqiad.wmnet with reason: Maintenance [00:45:20] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [00:45:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
12:00:00 on db1244.eqiad.wmnet with reason: Maintenance [00:45:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1244 (T356166)', diff saved to https://phabricator.wikimedia.org/P60325 and previous config saved to /var/cache/conftool/dbconfig/20240411-004536-marostegui.json [00:47:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T360332)', diff saved to https://phabricator.wikimedia.org/P60326 and previous config saved to /var/cache/conftool/dbconfig/20240411-004735-arnaudb.json [00:47:38] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [00:47:41] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [00:47:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [00:47:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T360332)', diff saved to https://phabricator.wikimedia.org/P60327 and previous config saved to /var/cache/conftool/dbconfig/20240411-004758-arnaudb.json [00:50:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T360332)', diff saved to https://phabricator.wikimedia.org/P60328 and previous config saved to /var/cache/conftool/dbconfig/20240411-005054-arnaudb.json [01:06:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P60329 and previous config saved to /var/cache/conftool/dbconfig/20240411-010601-arnaudb.json [01:21:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P60330 and previous config saved to /var/cache/conftool/dbconfig/20240411-012110-arnaudb.json [01:36:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T360332)', diff saved to 
https://phabricator.wikimedia.org/P60331 and previous config saved to /var/cache/conftool/dbconfig/20240411-013618-arnaudb.json [01:36:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance [01:36:25] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [01:36:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance [01:36:37] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [01:36:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [01:36:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T360332)', diff saved to https://phabricator.wikimedia.org/P60332 and previous config saved to /var/cache/conftool/dbconfig/20240411-013657-arnaudb.json [01:38:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T360332)', diff saved to https://phabricator.wikimedia.org/P60333 and previous config saved to /var/cache/conftool/dbconfig/20240411-013848-arnaudb.json [01:46:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T356166)', diff saved to https://phabricator.wikimedia.org/P60334 and previous config saved to /var/cache/conftool/dbconfig/20240411-014602-marostegui.json [01:46:07] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [01:53:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P60335 and previous config saved to /var/cache/conftool/dbconfig/20240411-015355-arnaudb.json [02:01:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', 
diff saved to https://phabricator.wikimedia.org/P60336 and previous config saved to /var/cache/conftool/dbconfig/20240411-020110-marostegui.json [02:09:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P60337 and previous config saved to /var/cache/conftool/dbconfig/20240411-020903-arnaudb.json [02:16:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P60338 and previous config saved to /var/cache/conftool/dbconfig/20240411-021617-marostegui.json [02:20:40] (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:24:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T360332)', diff saved to https://phabricator.wikimedia.org/P60339 and previous config saved to /var/cache/conftool/dbconfig/20240411-022410-arnaudb.json [02:24:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance [02:24:23] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [02:24:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance [02:24:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T360332)', diff saved to https://phabricator.wikimedia.org/P60340 and previous config saved to /var/cache/conftool/dbconfig/20240411-022433-arnaudb.json [02:25:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:27:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T360332)', diff saved to https://phabricator.wikimedia.org/P60341 and previous config saved to /var/cache/conftool/dbconfig/20240411-022725-arnaudb.json [02:31:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T356166)', diff saved to https://phabricator.wikimedia.org/P60342 and previous config saved to /var/cache/conftool/dbconfig/20240411-023125-marostegui.json [02:31:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [02:31:29] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [02:31:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [02:38:28] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:42:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P60343 and previous config saved to /var/cache/conftool/dbconfig/20240411-024232-arnaudb.json [02:57:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P60344 and previous config saved to /var/cache/conftool/dbconfig/20240411-025740-arnaudb.json [03:12:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T360332)', diff saved to https://phabricator.wikimedia.org/P60345 and previous config saved to 
/var/cache/conftool/dbconfig/20240411-031247-arnaudb.json [03:12:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance [03:12:53] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [03:13:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance [03:13:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T360332)', diff saved to https://phabricator.wikimedia.org/P60346 and previous config saved to /var/cache/conftool/dbconfig/20240411-031310-arnaudb.json [03:16:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T360332)', diff saved to https://phabricator.wikimedia.org/P60347 and previous config saved to /var/cache/conftool/dbconfig/20240411-031602-arnaudb.json [03:23:28] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:31:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P60348 and previous config saved to /var/cache/conftool/dbconfig/20240411-033109-arnaudb.json [03:35:27] (SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [03:46:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P60349 and previous config saved to /var/cache/conftool/dbconfig/20240411-034617-arnaudb.json [04:01:25] !log arnaudb@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2176 (T360332)', diff saved to https://phabricator.wikimedia.org/P60350 and previous config saved to /var/cache/conftool/dbconfig/20240411-040124-arnaudb.json [04:01:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2188.codfw.wmnet with reason: Maintenance [04:01:34] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [04:01:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2188.codfw.wmnet with reason: Maintenance [04:01:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T360332)', diff saved to https://phabricator.wikimedia.org/P60351 and previous config saved to /var/cache/conftool/dbconfig/20240411-040147-arnaudb.json [04:04:05] (PS5) Hashar: logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) [04:04:30] :old-man-yells-at-gerrit: [04:04:42] (PS6) Hashar: logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) [04:04:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T360332)', diff saved to https://phabricator.wikimedia.org/P60352 and previous config saved to /var/cache/conftool/dbconfig/20240411-040447-arnaudb.json [04:04:48] (CR) CI reject: [V:-1] logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar) [04:05:28] (CR) CI reject: [V:-1] logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar) [04:06:27] (PS7) Hashar: logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 
(https://phabricator.wikimedia.org/T228838) [04:09:53] (PS8) Hashar: logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) [04:14:40] (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:19:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:19:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P60353 and previous config saved to /var/cache/conftool/dbconfig/20240411-041954-arnaudb.json [04:34:35] (PS9) Hashar: logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) [04:35:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P60354 and previous config saved to /var/cache/conftool/dbconfig/20240411-043502-arnaudb.json [04:44:54] (CR) Hashar: "+ Gergo who filed T228838 and Daniel who was hit by the issue yesterday and had to add a log channel explicitly ( I97714e296c025fa2accb04" [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar) [04:50:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T360332)', diff saved to https://phabricator.wikimedia.org/P60355 and previous config saved to /var/cache/conftool/dbconfig/20240411-045011-arnaudb.json [04:50:14] 
!log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2212.codfw.wmnet with reason: Maintenance [04:50:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2212.codfw.wmnet with reason: Maintenance [04:50:19] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [04:50:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2212 (T360332)', diff saved to https://phabricator.wikimedia.org/P60356 and previous config saved to /var/cache/conftool/dbconfig/20240411-045024-arnaudb.json [04:53:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T360332)', diff saved to https://phabricator.wikimedia.org/P60357 and previous config saved to /var/cache/conftool/dbconfig/20240411-045317-arnaudb.json [05:08:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P60358 and previous config saved to /var/cache/conftool/dbconfig/20240411-050825-arnaudb.json [05:13:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1189', diff saved to https://phabricator.wikimedia.org/P60359 and previous config saved to /var/cache/conftool/dbconfig/20240411-051341-root.json [05:14:06] (PS1) Marostegui: db1189: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1018823 [05:15:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1189.eqiad.wmnet with OS bookworm [05:17:57] (CR) Marostegui: [C:+2] db1189: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1018823 (owner: Marostegui) [05:18:25] (CR) Marostegui: [C:+1] "Remember to drop those users with: drop user if exists 'USERNAME'@'IPS_REMOVED';" [puppet] - https://gerrit.wikimedia.org/r/1018407 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb) [05:18:36] (CR) Marostegui: [C:+1] mariadb: Reenable notifications 
for db2199, db2200 after setup [puppet] - https://gerrit.wikimedia.org/r/1018744 (https://phabricator.wikimedia.org/T355422) (owner: Jcrespo) [05:23:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P60360 and previous config saved to /var/cache/conftool/dbconfig/20240411-052333-arnaudb.json [05:27:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1189.eqiad.wmnet with reason: host reimage [05:31:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1189.eqiad.wmnet with reason: host reimage [05:34:40] (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:37:42] (PS1) Marostegui: Revert "db1189: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1018697 [05:38:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T360332)', diff saved to https://phabricator.wikimedia.org/P60361 and previous config saved to /var/cache/conftool/dbconfig/20240411-053840-arnaudb.json [05:38:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2216.codfw.wmnet with reason: Maintenance [05:38:46] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [05:38:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2216.codfw.wmnet with reason: Maintenance [05:39:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T360332)', diff saved to https://phabricator.wikimedia.org/P60362 and previous config saved to /var/cache/conftool/dbconfig/20240411-053903-arnaudb.json [05:39:40] (KubernetesRsyslogDown) resolved: 
rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:42:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T360332)', diff saved to https://phabricator.wikimedia.org/P60363 and previous config saved to /var/cache/conftool/dbconfig/20240411-054205-arnaudb.json [05:52:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1189.eqiad.wmnet with OS bookworm [05:54:05] (CR) Marostegui: [C:+2] Revert "db1189: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1018697 (owner: Marostegui) [05:54:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60364 and previous config saved to /var/cache/conftool/dbconfig/20240411-055428-root.json [05:57:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P60365 and previous config saved to /var/cache/conftool/dbconfig/20240411-055712-arnaudb.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T0600) [06:00:05] kormat, marostegui, Amir1, and arnaudb: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T0600). 
[06:00:25] (SystemdUnitFailed) firing: prometheus-wmf-elasticsearch-exporter-9600.service on elastic2090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:09:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60366 and previous config saved to /var/cache/conftool/dbconfig/20240411-060934-root.json [06:12:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P60367 and previous config saved to /var/cache/conftool/dbconfig/20240411-061220-arnaudb.json [06:15:25] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:24:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60368 and previous config saved to /var/cache/conftool/dbconfig/20240411-062440-root.json [06:27:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T360332)', diff saved to https://phabricator.wikimedia.org/P60369 and previous config saved to /var/cache/conftool/dbconfig/20240411-062728-arnaudb.json [06:27:33] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [06:39:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60370 and previous config saved to /var/cache/conftool/dbconfig/20240411-063946-root.json [06:54:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 50%: 
Repooling', diff saved to https://phabricator.wikimedia.org/P60371 and previous config saved to /var/cache/conftool/dbconfig/20240411-065452-root.json [06:56:08] !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts idp1002.wikimedia.org [06:57:35] !log slyngshede@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts idp1002.wikimedia.org [06:58:33] (CR) Ayounsi: [C:+1] "Overall lgtm, one comment." [cookbooks] - https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) (owner: Volans) [07:00:05] Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:16] !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts idp1002.wikimedia.org [07:05:01] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox [07:05:21] (CR) Filippo Giunchedi: [C:+2] titan: trim 5m retention to 3y + 2w [puppet] - https://gerrit.wikimedia.org/r/1018644 (https://phabricator.wikimedia.org/T351927) (owner: Filippo Giunchedi) [07:08:10] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002" [07:09:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60372 and previous config saved to /var/cache/conftool/dbconfig/20240411-070958-root.json [07:10:08] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - 
slyngshede@cumin1002" [07:10:08] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:10:08] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp1002.wikimedia.org [07:10:50] (PS1) Slyngshede: R:idp decommision Bullseye IDP hosts. [puppet] - https://gerrit.wikimedia.org/r/1018871 (https://phabricator.wikimedia.org/T357748) [07:13:08] (PS1) Filippo Giunchedi: opensearch: switch dashboards to sso auth [puppet] - https://gerrit.wikimedia.org/r/1018872 (https://phabricator.wikimedia.org/T246998) [07:13:32] (PS1) Muehlenhoff: Blacklist n_gsm kernel module [puppet] - https://gerrit.wikimedia.org/r/1018873 [07:15:25] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:17:07] (CR) Majavah: [C:+2] alertmanager: karma: Set group too [puppet] - https://gerrit.wikimedia.org/r/1018727 (owner: Majavah) [07:18:42] (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1867/co" [puppet] - https://gerrit.wikimedia.org/r/1018872 (https://phabricator.wikimedia.org/T246998) (owner: Filippo Giunchedi) [07:25:04] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1018871 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede) [07:25:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60373 and previous config saved to /var/cache/conftool/dbconfig/20240411-072503-root.json [07:25:14] (CR) Muehlenhoff: [C:+2] Blacklist n_gsm kernel module [puppet] - https://gerrit.wikimedia.org/r/1018873 (owner: 
Muehlenhoff) [07:35:18] (CR) DCausse: cirrus-streaming-updater: swith to "failure-rate" retry strategy (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1018778 (owner: DCausse) [07:35:27] (SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [07:39:13] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:39:26] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:44:23] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3072.esams.wmnet [07:44:54] (CR) Fabfur: [C:+2] cp3072: update hieradata for dual NVMe disks configuration [puppet] - https://gerrit.wikimedia.org/r/1015974 (https://phabricator.wikimedia.org/T360430) (owner: Ssingh) [07:47:21] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp3072.esams.wmnet with OS bullseye [07:47:33] ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9705875 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp3072.esams.wmnet with OS bullseye [07:52:03] the train is blocked on T362297 [07:52:04] T362297: [Bug] Mobile watchlist broken - https://phabricator.wikimedia.org/T362297 [07:52:17] some obscure UI regression on the mobile watchlist [07:55:16] SRE, Infrastructure-Foundations, Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9705894 (Aklapper) Open→Declined Declining request as the requester's account has been disabled. 
[07:56:54] !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts idp2002.wikimedia.org [08:00:04] hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T0800) [08:01:45] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox [08:03:42] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp2002.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002" [08:05:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1198', diff saved to https://phabricator.wikimedia.org/P60374 and previous config saved to /var/cache/conftool/dbconfig/20240411-080502-root.json [08:05:07] (PS1) Marostegui: db1198: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1018934 [08:05:46] (CR) Marostegui: [C:+2] db1198: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1018934 (owner: Marostegui) [08:06:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1198.eqiad.wmnet with OS bookworm [08:06:18] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp2002.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002" [08:06:18] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:06:19] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp2002.wikimedia.org [08:06:54] (CR) Slyngshede: [C:+2] R:idp decommision Bullseye IDP hosts. 
[puppet] - https://gerrit.wikimedia.org/r/1018871 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede) [08:10:28] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3072.esams.wmnet with reason: host reimage [08:13:37] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3072.esams.wmnet with reason: host reimage [08:16:52] (PS2) Volans: sre.hosts.decommission: ask on failure [cookbooks] - https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) [08:17:16] (CR) Volans: "addressed comment" [cookbooks] - https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) (owner: Volans) [08:19:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1198.eqiad.wmnet with reason: host reimage [08:20:34] (CR) Ayounsi: [C:+1] sre.hosts.decommission: ask on failure [cookbooks] - https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) (owner: Volans) [08:20:41] (CR) CI reject: [V:-1] sre.hosts.decommission: ask on failure [cookbooks] - https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) (owner: Volans) [08:20:41] !log MediaWiki train is blocked [08:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:51] (CR) Ayounsi: [C:+2] add_ip6_mapped - don't fail if the host already have a /128 address [puppet] - https://gerrit.wikimedia.org/r/1017047 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi) [08:22:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1198.eqiad.wmnet with reason: host reimage [08:25:32] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet [08:25:33] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [08:26:43] !log fabfur@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage 
(exit_code=99) for host cp3072.esams.wmnet with OS bullseye [08:26:52] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9705962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp3072.esams.wmnet with OS bullseye executed with errors: - cp3072 (... [08:27:39] (03PS3) 10Volans: sre.hosts.decommission: ask on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) [08:27:40] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - ayounsi@cumin1002" [08:27:46] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host db1198.eqiad.wmnet with OS bookworm [08:28:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - ayounsi@cumin1002" [08:28:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:28:31] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [08:28:35] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [08:29:01] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - ayounsi@cumin1002" [08:29:55] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - ayounsi@cumin1002" [08:31:07] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS 
bookworm [08:36:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1198.eqiad.wmnet with OS bookworm [08:36:30] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [08:37:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1198.eqiad.wmnet with reason: host reimage [08:40:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1198.eqiad.wmnet with reason: host reimage [08:40:28] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp3072.esams.wmnet with OS bullseye [08:40:42] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9705992 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp3072.esams.wmnet with OS bullseye [08:42:08] (03CR) 10Slyngshede: [C:03+2] Change ssh key validator from class to function. [software/bitu] - 10https://gerrit.wikimedia.org/r/1018635 (owner: 10Slyngshede) [08:42:45] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage [08:42:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [08:43:17] (03Merged) 10jenkins-bot: Change ssh key validator from class to function. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1018635 (owner: 10Slyngshede) [08:45:11] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on matomo1003.eqiad.wmnet with reason: Adding disk [08:45:25] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on matomo1003.eqiad.wmnet with reason: Adding disk [08:45:36] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [08:45:47] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage [08:47:33] (03PS1) 10Muehlenhoff: Remove obsolete grant [puppet] - 10https://gerrit.wikimedia.org/r/1018941 (https://phabricator.wikimedia.org/T357748) [08:50:31] !log jelto@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [08:50:51] (03PS1) 10Marostegui: Revert "db1198: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1018704 [08:54:22] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2114 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1018411 (https://phabricator.wikimedia.org/T362302) [08:55:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1198.eqiad.wmnet with OS bookworm [08:57:26] (03CR) 10Marostegui: [C:03+2] Revert "db1198: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1018704 (owner: 10Marostegui) [08:57:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60376 and previous config saved to /var/cache/conftool/dbconfig/20240411-085749-root.json [08:58:26] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2006.codfw.wmnet with OS bookworm [08:58:26] !log 
ayounsi@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2006.codfw.wmnet [08:58:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s6 T362302 [08:59:00] T362302: Switchover s6 master (db2129 -> db2114) - https://phabricator.wikimedia.org/T362302 [08:59:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s6 T362302 [08:59:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2114 with weight 0 T362302', diff saved to https://phabricator.wikimedia.org/P60377 and previous config saved to /var/cache/conftool/dbconfig/20240411-085926-arnaudb.json [09:03:09] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3072.esams.wmnet with reason: host reimage [09:06:37] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3072.esams.wmnet with reason: host reimage [09:10:31] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: 14Site: eqiad 1 VM for Matomo - 14https://phabricator.wikimedia.org/T362146#9706068 (10BTullis) 14I'm adding the second disk now. ` btullis@ganeti1027:~$ sudo gnt-instance modify --... 
[09:12:15] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [09:12:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60378 and previous config saved to /var/cache/conftool/dbconfig/20240411-091255-root.json [09:12:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [09:15:51] (03PS4) 10Gmodena: analytics: refinery: add webrequest_frontend timer [puppet] - 10https://gerrit.wikimedia.org/r/1017041 (https://phabricator.wikimedia.org/T314956) [09:16:56] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host testvm2007.codfw.wmnet [09:16:57] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [09:18:26] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2114 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1018411 (https://phabricator.wikimedia.org/T362302) (owner: 10Gerrit maintenance bot) [09:19:45] !log Starting s6 codfw failover from db2129 to db2114 - T362302 [09:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:55] T362302: Switchover s6 master (db2129 -> db2114) - https://phabricator.wikimedia.org/T362302 [09:20:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2114 to s6 primary T362302', diff saved to https://phabricator.wikimedia.org/P60379 and previous config saved to /var/cache/conftool/dbconfig/20240411-092012-arnaudb.json [09:20:24] (03PS1) 10Jelto: gitlab_runner: add dockerfile support for test runner in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1018944 (https://phabricator.wikimedia.org/T357612) [09:20:26] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM 
testvm2007.codfw.wmnet - ayounsi@cumin1002" [09:23:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 weight bump T362302', diff saved to https://phabricator.wikimedia.org/P60380 and previous config saved to /var/cache/conftool/dbconfig/20240411-092318-arnaudb.json [09:24:02] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2007.codfw.wmnet - ayounsi@cumin1002" [09:24:02] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:24:02] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache testvm2007.codfw.wmnet on all recursors [09:24:05] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2007.codfw.wmnet on all recursors [09:24:31] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2007.codfw.wmnet - ayounsi@cumin1002" [09:25:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 depool', diff saved to https://phabricator.wikimedia.org/P60381 and previous config saved to /var/cache/conftool/dbconfig/20240411-092501-arnaudb.json [09:25:23] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2007.codfw.wmnet - ayounsi@cumin1002" [09:25:34] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2007.codfw.wmnet with OS bookworm [09:26:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [09:26:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [09:26:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:26:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:26:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T360332)', diff saved to https://phabricator.wikimedia.org/P60382 and previous config saved to /var/cache/conftool/dbconfig/20240411-092622-arnaudb.json [09:26:27] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [09:27:09] !log arnaudb@cumin1002 dbctl restore of MediaWiki config (dc=all) from /var/cache/conftool/dbconfig/20240411-092622-arnaudb.json [09:27:58] (03CR) 10Jelto: [C:03+2] gitlab_runner: add dockerfile support for test runner in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1018944 (https://phabricator.wikimedia.org/T357612) (owner: 10Jelto) [09:28:10] (03CR) 10Effie Mouzeli: [C:03+1] kubernetes: Move 7 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1018719 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [09:28:34] (03CR) 10Clément Goubert: [C:03+2] kubernetes: Move 7 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1018719 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [09:29:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60383 and previous config saved to /var/cache/conftool/dbconfig/20240411-092942-root.json [09:30:40] (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:31:01] !log arnaudb@cumin1002 START - Cookbook 
sre.hosts.downtime for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [09:31:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [09:31:06] (03CR) 10Majavah: [C:03+2] hieradata: Add CDN config for toolsadmin-toolsbeta.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1018665 (https://phabricator.wikimedia.org/T360025) (owner: 10Majavah) [09:32:08] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3072.esams.wmnet with OS bullseye [09:32:21] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9706148 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp3072.esams.wmnet with OS bullseye completed: - cp3072 (**WARN**)... [09:32:24] (03CR) 10Daniel Kinzler: [C:03+1] "I love this change. I have fallen into this trap too many times!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [09:34:58] 06SRE, 10Maps, 06serviceops: Move maps/karthoterian to PKI/cfssl - https://phabricator.wikimedia.org/T360778#9706160 (10jijiki) [09:35:00] 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9706161 (10jijiki) [09:35:08] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9706162 (10jijiki) [09:35:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: post schema update', diff saved to https://phabricator.wikimedia.org/P60384 and previous config saved to /var/cache/conftool/dbconfig/20240411-093513-arnaudb.json [09:35:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:36:30] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9706184 (10jijiki) [09:37:07] 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9706185 (10jijiki) [09:37:16] 06SRE, 10Maps, 06serviceops: Move maps/karthoterian to PKI/cfssl - https://phabricator.wikimedia.org/T360778#9706189 (10jijiki) a:03jijiki [09:37:59] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3072.esams.wmnet [09:38:00] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2007.codfw.wmnet with reason: host reimage [09:38:07] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host 
mw2412.codfw.wmnet with OS bullseye [09:38:31] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2413.codfw.wmnet with OS bullseye [09:38:45] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9706201 (10Fabfur) [09:38:58] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2414.codfw.wmnet with OS bullseye [09:39:30] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2415.codfw.wmnet with OS bullseye [09:39:55] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2416.codfw.wmnet with OS bullseye [09:40:23] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2417.codfw.wmnet with OS bullseye [09:40:39] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2007.codfw.wmnet with reason: host reimage [09:40:50] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2418.codfw.wmnet with OS bullseye [09:42:03] (03PS1) 10Fabfur: Revert "benthos: temporary disable haproxy metrics" [puppet] - 10https://gerrit.wikimedia.org/r/1018705 [09:44:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60386 and previous config saved to /var/cache/conftool/dbconfig/20240411-094448-root.json [09:47:35] (03CR) 10Jcrespo: [C:03+2] mariadb: Reenable notifications for db2199, db2200 after setup [puppet] - 10https://gerrit.wikimedia.org/r/1018744 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo) [09:47:44] (03PS2) 10Jcrespo: mariadb: Reenable notifications for db2199, db2200 after setup [puppet] - 10https://gerrit.wikimedia.org/r/1018744 (https://phabricator.wikimedia.org/T355422) [09:47:59] (03PS2) 10Esanders: Set wgMFFallbackEditor to visual for most VE wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015086 
(https://phabricator.wikimedia.org/T361134) [09:50:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 50%: post schema update', diff saved to https://phabricator.wikimedia.org/P60387 and previous config saved to /var/cache/conftool/dbconfig/20240411-095019-arnaudb.json [09:51:12] (03CR) 10Jcrespo: [V:03+2 C:03+2] mariadb: Reenable notifications for db2199, db2200 after setup [puppet] - 10https://gerrit.wikimedia.org/r/1018744 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo) [09:51:50] (03PS1) 10Btullis: Use a more WMF standard mariadb configuration for Matomo [puppet] - 10https://gerrit.wikimedia.org/r/1018948 (https://phabricator.wikimedia.org/T349397) [09:53:25] (SystemdUnitFailed) firing: ferm.service on mw2320:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:53:25] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1868/co" [puppet] - 10https://gerrit.wikimedia.org/r/1018948 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [09:54:24] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2414.codfw.wmnet with reason: host reimage [09:54:37] (03CR) 10Hashar: logging: default to log any error (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [09:55:05] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2007.codfw.wmnet with OS bookworm [09:55:05] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2007.codfw.wmnet [09:55:11] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2412.codfw.wmnet with reason: host reimage [09:55:14] !log 
cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2413.codfw.wmnet with reason: host reimage [09:56:04] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2415.codfw.wmnet with reason: host reimage [09:56:08] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "testvm2007 - ayounsi@cumin1002" [09:56:37] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2416.codfw.wmnet with reason: host reimage [09:57:11] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2417.codfw.wmnet with reason: host reimage [09:57:12] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "testvm2007 - ayounsi@cumin1002" [09:57:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2414.codfw.wmnet with reason: host reimage [09:57:24] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2418.codfw.wmnet with reason: host reimage [09:57:44] (03CR) 10Fabfur: [C:03+2] Revert "benthos: temporary disable haproxy metrics" [puppet] - 10https://gerrit.wikimedia.org/r/1018705 (owner: 10Fabfur) [09:59:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60388 and previous config saved to /var/cache/conftool/dbconfig/20240411-095954-root.json [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1000) [10:00:08] (03PS1) 10JMeybohm: admin_ng: Refactor fetching pspClusterRole for namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018950 (https://phabricator.wikimedia.org/T273507) [10:00:09] (03PS1) 10JMeybohm: admin_ng: Stop adding kubernetes.io/metadata.name namespace label [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018951 
(https://phabricator.wikimedia.org/T273507) [10:00:11] (03PS1) 10JMeybohm: admin_ng: Enable restriced PSS profile in audit mode in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018952 (https://phabricator.wikimedia.org/T273507) [10:00:17] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2417.codfw.wmnet with reason: host reimage [10:00:25] (SystemdUnitFailed) firing: prometheus-wmf-elasticsearch-exporter-9600.service on elastic2090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:03:25] (SystemdUnitFailed) resolved: ferm.service on mw2320:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:03:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2416.codfw.wmnet with reason: host reimage [10:05:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: post schema update', diff saved to https://phabricator.wikimedia.org/P60389 and previous config saved to /var/cache/conftool/dbconfig/20240411-100525-arnaudb.json [10:06:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2415.codfw.wmnet with reason: host reimage [10:09:43] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2412.codfw.wmnet with reason: host reimage [10:13:16] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2418.codfw.wmnet with reason: host reimage [10:15:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60390 and previous config saved to 
/var/cache/conftool/dbconfig/20240411-101500-root.json [10:15:17] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2414.codfw.wmnet with OS bullseye [10:15:40] (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:15:50] (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1018309 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [10:16:12] (03CR) 10Btullis: [V:03+1 C:03+2] Use a more WMF standard mariadb configuration for Matomo [puppet] - 10https://gerrit.wikimedia.org/r/1018948 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [10:17:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2413.codfw.wmnet with reason: host reimage [10:19:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2417.codfw.wmnet with OS bullseye [10:20:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: post schema update', diff saved to https://phabricator.wikimedia.org/P60391 and previous config saved to /var/cache/conftool/dbconfig/20240411-102031-arnaudb.json [10:20:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:21:31] (03CR) 10Arnaudb: [C:03+1] Remove obsolete grant [puppet] - 10https://gerrit.wikimedia.org/r/1018941 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [10:21:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1247.eqiad.wmnet with reason: 
Maintenance [10:21:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1247.eqiad.wmnet with reason: Maintenance [10:21:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T356166)', diff saved to https://phabricator.wikimedia.org/P60392 and previous config saved to /var/cache/conftool/dbconfig/20240411-102153-marostegui.json [10:22:01] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [10:22:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2416.codfw.wmnet with OS bullseye [10:25:37] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2415.codfw.wmnet with OS bullseye [10:27:53] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:28:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2412.codfw.wmnet with OS bullseye [10:30:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60393 and previous config saved to /var/cache/conftool/dbconfig/20240411-103005-root.json [10:30:30] !log installing xerces-c security updates [10:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2418.codfw.wmnet with OS bullseye [10:32:53] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:36:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2413.codfw.wmnet with OS bullseye [10:36:37] 06SRE, 06Machine-Learning-Team, 10MW-on-K8s, 06serviceops: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316 (10Clement_Goubert) 03NEW [10:36:51] 06SRE, 06Machine-Learning-Team, 10MW-on-K8s, 06serviceops: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9706397 (10Clement_Goubert) p:05Triage→03Medium [10:37:25] !log Running homer 'cr*codfw*' commit 'T351074' [10:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:32] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [10:43:03] !log installing modsecurity-apache security updates [10:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:37] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Steph Toyofuku - https://phabricator.wikimedia.org/T362113#9706439 (10Milimetric) Approved [10:45:21] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to shell access to analytics client servers for AndyRussG - https://phabricator.wikimedia.org/T361742#9706455 (10Milimetric) Approved, welcome back Andy :) [10:48:52] (03CR) 10Muehlenhoff: [C:03+2] Add stoyofuku to analytics-privatedata-access [puppet] - 10https://gerrit.wikimedia.org/r/1018634 (https://phabricator.wikimedia.org/T362113) (owner: 10Muehlenhoff) [10:52:52] !log Pooling and uncordoning mw2412.codfw.wmnet,mw2413.codfw.wmnet,mw2414.codfw.wmnet,mw2415.codfw.wmnet,mw2416.codfw.wmnet,mw2417.codfw.wmnet,mw2418.codfw.wmnet - T351074 [10:52:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:57] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [10:53:02] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw2412.codfw.wmnet|mw2413.codfw.wmnet|mw2414.codfw.wmnet|mw2415.codfw.wmnet|mw2416.codfw.wmnet|mw2417.codfw.wmnet|mw2418.codfw.wmnet),cluster=kubernetes,service=kubesvc [10:53:49] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: 14Requesting access to analytics-privatedata-users for Steph Toyofuku - 14https://phabricator.wikimedia.org/T362113#9706516 (10MoritzMuehlenhoff) 05Open→03Resolved 14@SToyofuku-WMF : I've enabled your access. You should already be able to log into st... [10:54:46] (03PS3) 10Muehlenhoff: Add andyrussg to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1018641 (https://phabricator.wikimedia.org/T361742) [10:59:50] (03CR) 10Muehlenhoff: [C:03+2] Add andyrussg to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1018641 (https://phabricator.wikimedia.org/T361742) (owner: 10Muehlenhoff) [11:05:25] (03PS1) 10Btullis: Add a partman recipe for the new matomo server [puppet] - 10https://gerrit.wikimedia.org/r/1018955 (https://phabricator.wikimedia.org/T349397) [11:09:32] (03PS1) 10Marostegui: db2177: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1018956 [11:09:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2177', diff saved to https://phabricator.wikimedia.org/P60394 and previous config saved to /var/cache/conftool/dbconfig/20240411-110938-root.json [11:09:40] (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:09:51] (03CR) 10Btullis: [C:03+2] Add a partman recipe for the 
new matomo server [puppet] - 10https://gerrit.wikimedia.org/r/1018955 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [11:10:15] (03CR) 10Marostegui: [C:03+2] db2177: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1018956 (owner: 10Marostegui) [11:10:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2177.codfw.wmnet with OS bookworm [11:11:02] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: 14Requesting access to shell access to analytics client servers for AndyRussG - 14https://phabricator.wikimedia.org/T361742#9706554 (10MoritzMuehlenhoff) 05Open→03Resolved 14@AndyRussG: I've enabled your access. You should already be able to log into... [11:11:57] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9706562 (10MoritzMuehlenhoff) [11:14:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:21:20] (03PS1) 10Marostegui: Revert "db2177: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1018966 [11:22:02] !log upload memkeys 20181031-2-s1 to bookworm-wikimedia main [11:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:21] !log upload memkeys 20181031-2-s1 to bookworm-wikimedia main - T362160 [11:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:33] T362160: Repackage memkeys for debian bookworm - https://phabricator.wikimedia.org/T362160 [11:24:41] !log upload prometheus-memcached-exporter 0.14.2-1~wmf1 to bookworm-wikimedia main - T350807 [11:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:55] T350807: Package latest version of prometheus-memcached-exporter (v0.14.2) - 
https://phabricator.wikimedia.org/T350807 [11:26:18] (PS1) Clément Goubert: article-description: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018959 (https://phabricator.wikimedia.org/T362316) [11:26:20] (PS1) Clément Goubert: article-description: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018960 (https://phabricator.wikimedia.org/T362316) [11:27:00] (PS1) Clément Goubert: articletopic-outlink: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018961 (https://phabricator.wikimedia.org/T362316) [11:27:01] (PS1) Clément Goubert: articletopic-outlink: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018962 (https://phabricator.wikimedia.org/T362316) [11:27:30] (PS1) Clément Goubert: experimental: Switch to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018963 (https://phabricator.wikimedia.org/T362316) [11:27:58] (PS1) Clément Goubert: readability: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018964 (https://phabricator.wikimedia.org/T362316) [11:27:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2177.codfw.wmnet with reason: host reimage [11:28:00] (PS1) Clément Goubert: readability: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018965 (https://phabricator.wikimedia.org/T362316) [11:28:39] (PS1) Clément Goubert: revertrisk: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018986 (https://phabricator.wikimedia.org/T362316) [11:28:40] (PS1) Clément Goubert: revertrisk: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018987 (https://phabricator.wikimedia.org/T362316) [11:29:12] (PS1) Clément Goubert: revscoring-articlequality: Switch staging to mw-api-int-ro 
[deployment-charts] - https://gerrit.wikimedia.org/r/1018988 (https://phabricator.wikimedia.org/T362316) [11:29:13] (PS1) Clément Goubert: revscoring-articlequality: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018989 (https://phabricator.wikimedia.org/T362316) [11:29:39] (PS1) Clément Goubert: revscoring-articletopic: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018990 (https://phabricator.wikimedia.org/T362316) [11:29:40] (PS1) Clément Goubert: revscoring-articletopic: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018991 (https://phabricator.wikimedia.org/T362316) [11:30:21] (PS1) Clément Goubert: revscoring-draftquality: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018992 (https://phabricator.wikimedia.org/T362316) [11:30:24] (PS1) Clément Goubert: revscoring-draftquality: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018993 (https://phabricator.wikimedia.org/T362316) [11:30:28] SRE, Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9706629 (MoritzMuehlenhoff) [11:30:59] (PS1) Clément Goubert: revscoring-drafttopic: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018994 (https://phabricator.wikimedia.org/T362316) [11:31:01] (PS1) Clément Goubert: revscoring-drafttopic: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018995 (https://phabricator.wikimedia.org/T362316) [11:31:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2177.codfw.wmnet with reason: host reimage [11:31:37] (PS1) Clément Goubert: revscoring-editquality-damaging: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018996 
(https://phabricator.wikimedia.org/T362316) [11:31:39] (PS1) Clément Goubert: revscoring-editquality-damaging: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018997 (https://phabricator.wikimedia.org/T362316) [11:31:40] (CR) Jelto: [C:+1] "lgtm, diff also looks fine (noop)" [deployment-charts] - https://gerrit.wikimedia.org/r/1018950 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm) [11:31:47] !log installing postgresql-15 security updates [11:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:15] (PS1) Clément Goubert: revscoring-editquality-goodfaith: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018998 (https://phabricator.wikimedia.org/T362316) [11:32:16] (PS1) Clément Goubert: revscoring-editquality-goodfaith: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018999 (https://phabricator.wikimedia.org/T362316) [11:32:49] (PS1) Clément Goubert: revscoring-editquality-reverted: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1019000 (https://phabricator.wikimedia.org/T362316) [11:32:50] (PS1) Clément Goubert: revscoring-editquality-reverted: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1019001 (https://phabricator.wikimedia.org/T362316) [11:33:40] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bullseye [11:33:50] (PS1) Ayounsi: Add public Ganeti IP ranges [homer/public] - https://gerrit.wikimedia.org/r/1019002 (https://phabricator.wikimedia.org/T300152) [11:34:54] (PS1) Muehlenhoff: Add library hint for psql 15 [puppet] - https://gerrit.wikimedia.org/r/1019003 [11:35:06] (PS2) Slyngshede: IP blocking [software/bitu] - https://gerrit.wikimedia.org/r/1018256 (https://phabricator.wikimedia.org/T361066) [11:35:27] 
(SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [11:36:25] (PS1) Ayounsi: Add public testvm200x support [puppet] - https://gerrit.wikimedia.org/r/1019005 (https://phabricator.wikimedia.org/T300152) [11:36:51] (PS1) Effie Mouzeli: Repo has been migrated to Gitlab [debs/memkeys] - https://gerrit.wikimedia.org/r/1019006 [11:37:05] (PS2) Ayounsi: Add public Ganeti IP ranges [homer/public] - https://gerrit.wikimedia.org/r/1019002 (https://phabricator.wikimedia.org/T300152) [11:37:10] (CR) Effie Mouzeli: [V:+2 C:+2] Repo has been migrated to Gitlab [debs/memkeys] - https://gerrit.wikimedia.org/r/1019006 (owner: Effie Mouzeli) [11:37:34] (CR) Jelto: [C:+1] "lgtm, label should be set by the Kubernetes API server (https://kubernetes.io/docs/reference/labels-annotations-taints/#kubernetes-io-meta" [deployment-charts] - https://gerrit.wikimedia.org/r/1018951 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm) [11:37:50] (CR) Btullis: [C:+2] Migrate editor-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014656 (https://phabricator.wikimedia.org/T360531) (owner: Btullis) [11:38:23] (CR) Ayounsi: [C:+2] Add public Ganeti IP ranges [homer/public] - https://gerrit.wikimedia.org/r/1019002 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi) [11:39:06] (Merged) jenkins-bot: Migrate editor-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014656 (https://phabricator.wikimedia.org/T360531) (owner: Btullis) [11:40:43] (PS1) JMeybohm: eventgate: Update mesh modules [deployment-charts] - https://gerrit.wikimedia.org/r/1019007 (https://phabricator.wikimedia.org/T359423) [11:40:45] (CR) Btullis: 
[C:+1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/1019005 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi) [11:41:07] (PS2) JMeybohm: eventgate: Update mesh modules [deployment-charts] - https://gerrit.wikimedia.org/r/1019007 (https://phabricator.wikimedia.org/T359423) [11:41:08] (CR) Muehlenhoff: [C:+2] Add library hint for psql 15 [puppet] - https://gerrit.wikimedia.org/r/1019003 (owner: Muehlenhoff) [11:42:09] (CR) Ayounsi: [C:+2] Add public testvm200x support [puppet] - https://gerrit.wikimedia.org/r/1019005 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi) [11:42:15] (PS1) Muehlenhoff: Fix typo [puppet] - https://gerrit.wikimedia.org/r/1019008 [11:44:00] SRE, Machine-Learning-Team, MW-on-K8s, serviceops, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9706666 (Clement_Goubert) [11:45:40] (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:45:42] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host testvm2008.wikimedia.org [11:45:44] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [11:47:38] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on matomo1003.eqiad.wmnet with reason: host reimage [11:47:46] (CR) Muehlenhoff: [C:+2] Fix typo [puppet] - https://gerrit.wikimedia.org/r/1019008 (owner: Muehlenhoff) [11:47:50] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2008.wikimedia.org - ayounsi@cumin1002" [11:48:28] (CR) Jelto: [C:+1] "lgtm, diff also looks good: staging-eqiad and staging-codfw namespaces have a additional 
label pod-security.kubernetes.io/audit: restricte" [deployment-charts] - https://gerrit.wikimedia.org/r/1018952 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm) [11:49:13] SRE, Machine-Learning-Team, MW-on-K8s, serviceops, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9706673 (Clement_Goubert) Aaaand I just realized they all use http and not https, so now I can change them all. [11:49:14] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2008.wikimedia.org - ayounsi@cumin1002" [11:49:14] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:49:14] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache testvm2008.wikimedia.org on all recursors [11:49:18] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2008.wikimedia.org on all recursors [11:49:44] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2008.wikimedia.org - ayounsi@cumin1002" [11:50:20] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on matomo1003.eqiad.wmnet with reason: host reimage [11:50:34] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2008.wikimedia.org - ayounsi@cumin1002" [11:50:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:51:34] (PS1) Hnowlan: jobqueue: increase concurrency for videoscalers 
[deployment-charts] - https://gerrit.wikimedia.org/r/1019010 [11:52:25] (CR) Clément Goubert: [C:+1] jobqueue: increase concurrency for videoscalers [deployment-charts] - https://gerrit.wikimedia.org/r/1019010 (owner: Hnowlan) [11:52:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2177.codfw.wmnet with OS bookworm [11:52:50] (CR) Hnowlan: [C:+2] jobqueue: increase concurrency for videoscalers [deployment-charts] - https://gerrit.wikimedia.org/r/1019010 (owner: Hnowlan) [11:53:45] (Merged) jenkins-bot: jobqueue: increase concurrency for videoscalers [deployment-charts] - https://gerrit.wikimedia.org/r/1019010 (owner: Hnowlan) [11:54:26] (CR) Hnowlan: [C:+1] mw-web, mw-api-ext: Raise replicas for 70% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1018721 (https://phabricator.wikimedia.org/T360763) (owner: Clément Goubert) [11:54:35] (CR) Hnowlan: [C:+1] trafficserver: move 70% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/1018723 (https://phabricator.wikimedia.org/T360763) (owner: Clément Goubert) [11:55:09] (CR) Clément Goubert: [C:+2] mw-web, mw-api-ext: Raise replicas for 70% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1018721 (https://phabricator.wikimedia.org/T360763) (owner: Clément Goubert) [11:55:48] (PS1) Slyngshede: P:idm allow security key backended SSH keys [puppet] - https://gerrit.wikimedia.org/r/1019013 (https://phabricator.wikimedia.org/T361714) [11:56:01] (Merged) jenkins-bot: mw-web, mw-api-ext: Raise replicas for 70% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1018721 (https://phabricator.wikimedia.org/T360763) (owner: Clément Goubert) [11:57:15] (PS1) Majavah: O:mariadb::grants: drop unused clouddb.sql.erb [puppet] - https://gerrit.wikimedia.org/r/1019014 [11:57:20] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply 
[11:57:40] (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:57:48] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [11:58:07] (CR) Slyngshede: "These two key types are already supported by Striker, so Bitu needs to support them as well." [puppet] - https://gerrit.wikimedia.org/r/1019013 (https://phabricator.wikimedia.org/T361714) (owner: Slyngshede) [11:58:20] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2008.wikimedia.org with OS bookworm [11:58:37] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [11:58:43] (PS2) Slyngshede: P:idm allow security key backended SSH keys [puppet] - https://gerrit.wikimedia.org/r/1019013 (https://phabricator.wikimedia.org/T361714) [11:59:38] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [11:59:48] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1200) [12:01:01] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [12:01:07] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [12:02:00] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [12:02:19] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1019013 (https://phabricator.wikimedia.org/T361714) (owner: Slyngshede) [12:02:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:02:48] (PS1) JMeybohm: eventgate-*: Migrate to base.external-services-networkpolicy [deployment-charts] - https://gerrit.wikimedia.org/r/1019018 (https://phabricator.wikimedia.org/T359423) [12:02:51] (CR) Clément Goubert: [C:+2] trafficserver: move 70% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/1018723 (https://phabricator.wikimedia.org/T360763) (owner: Clément Goubert) [12:02:56] (CR) Slyngshede: [C:+2] P:idm allow security key backended SSH keys [puppet] - https://gerrit.wikimedia.org/r/1019013 (https://phabricator.wikimedia.org/T361714) (owner: Slyngshede) [12:03:23] (ProbeDown) firing: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:05:57] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host testvm2008.wikimedia.org with OS bookworm [12:05:57] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host testvm2008.wikimedia.org [12:06:42] !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts testvm2008.wikimedia.org [12:08:05] SRE, MW-on-K8s, serviceops, Traffic, and 2 others: Move 70% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T360763#9706729 (Clement_Goubert) In progress→Resolved [12:08:23] (ProbeDown) resolved: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[12:10:41] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [12:12:46] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2008.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [12:13:19] (PS1) Dreamy Jazz: Ignore misisng title/page in CheckUserLookupUtils::getManualLogEntryFromRow [extensions/CheckUser] (wmf/1.42.0-wmf.26) - https://gerrit.wikimedia.org/r/1018967 (https://phabricator.wikimedia.org/T362284) [12:13:32] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host matomo1003.eqiad.wmnet with OS bullseye [12:13:37] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2008.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [12:13:37] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:13:37] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2008.wikimedia.org [12:13:38] (PS5) S8321414: zhwikivoyage: Make RelatedArticles extension usable on zhwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1015551 (https://phabricator.wikimedia.org/T361427) [12:13:45] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [12:13:47] SRE, Ganeti, Infrastructure-Foundations, netops, Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152#9706756 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `testvm2008.wikimedia.org` - testv... 
[12:13:50] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host testvm2008.wikimedia.org [12:13:51] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [12:14:21] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [12:14:28] SRE, MW-on-K8s, serviceops, Traffic: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323 (Clement_Goubert) NEW [12:14:53] SRE, MW-on-K8s, serviceops, Traffic: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9706771 (Clement_Goubert) p:Triage→High [12:15:43] !log installing gnutls28 security updates [12:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:58] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [12:16:05] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2008.wikimedia.org - ayounsi@cumin1002" [12:16:07] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=97) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2008.wikimedia.org - ayounsi@cumin1002" [12:16:07] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=97) [12:16:21] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host testvm2008.wikimedia.org [12:16:44] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bookworm [12:16:52] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host testvm2008.wikimedia.org [12:16:54] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [12:16:55] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: 
apply [12:18:59] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2008.wikimedia.org - ayounsi@cumin1002" [12:19:50] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2008.wikimedia.org - ayounsi@cumin1002" [12:19:51] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:19:51] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache testvm2008.wikimedia.org on all recursors [12:19:54] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [12:19:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2008.wikimedia.org on all recursors [12:20:20] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2008.wikimedia.org - ayounsi@cumin1002" [12:20:41] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [12:21:08] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2008.wikimedia.org - ayounsi@cumin1002" [12:21:34] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2008.wikimedia.org with OS bookworm [12:22:41] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [12:23:56] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [12:24:02] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [12:24:17] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [12:26:17] (CR) JMeybohm: [C:+2] 
admin_ng: Enable restriced PSS profile in audit mode in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1018952 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm) [12:26:20] (CR) JMeybohm: [C:+2] admin_ng: Stop adding kubernetes.io/metadata.name namespace label [deployment-charts] - https://gerrit.wikimedia.org/r/1018951 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm) [12:26:23] (CR) JMeybohm: [C:+2] admin_ng: Refactor fetching pspClusterRole for namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1018950 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm) [12:27:56] (CR) Marostegui: [C:+2] Revert "db2177: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1018966 (owner: Marostegui) [12:28:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60396 and previous config saved to /var/cache/conftool/dbconfig/20240411-122810-root.json [12:28:37] (CR) Btullis: [C:+2] Migrate edit-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014657 (https://phabricator.wikimedia.org/T360531) (owner: Btullis) [12:29:31] (Merged) jenkins-bot: admin_ng: Refactor fetching pspClusterRole for namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1018950 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm) [12:29:34] (Merged) jenkins-bot: admin_ng: Stop adding kubernetes.io/metadata.name namespace label [deployment-charts] - https://gerrit.wikimedia.org/r/1018951 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm) [12:29:36] (Merged) jenkins-bot: admin_ng: Enable restriced PSS profile in audit mode in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1018952 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm) [12:30:25] !log jayme@deploy1002 helmfile 
[staging-codfw] START helmfile.d/admin 'apply'. [12:31:40] (Merged) jenkins-bot: Migrate edit-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014657 (https://phabricator.wikimedia.org/T360531) (owner: Btullis) [12:32:25] (PS1) Slyngshede: Keymanagement, improve error message for key validation. [software/bitu] - https://gerrit.wikimedia.org/r/1019036 (https://phabricator.wikimedia.org/T361714) [12:32:53] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:32:54] SRE, MW-on-K8s, serviceops, Traffic, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9706818 (Clement_Goubert) [12:33:38] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw [12:33:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 depool for reboot T356240', diff saved to https://phabricator.wikimedia.org/P60397 and previous config saved to /var/cache/conftool/dbconfig/20240411-123350-arnaudb.json [12:34:01] (CR) Slyngshede: "We could also expand this to check a list of key types which we explicitly mark as insecure." [software/bitu] - https://gerrit.wikimedia.org/r/1019036 (https://phabricator.wikimedia.org/T361714) (owner: Slyngshede) [12:34:07] ops-codfw, SRE, serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9706836 (Papaul) [12:34:10] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:34:38] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [12:34:56] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [12:35:01] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[12:35:08] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [12:35:24] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2129.codfw.wmnet [12:35:25] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [12:35:31] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [12:35:47] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [12:36:32] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:36:56] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:37:06] !log jayme@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:38:13] !log jayme@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:38:25] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:38:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [12:38:49] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-eqsin and not P{cp[5030,5032].eqsin.wmnet} and A:cp [12:39:20] (PS4) Elukey: Force PKI TLS certs for cassandra instances on aqs1010 [puppet] - https://gerrit.wikimedia.org/r/1018309 (https://phabricator.wikimedia.org/T352647) [12:39:52] SRE, serviceops: Memcached, mcrouter in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711#9706842 (jijiki) [12:39:53] SRE, MW-on-K8s, serviceops: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327#9706844 (jijiki) [12:39:54] SRE, MW-on-K8s, serviceops, Traffic, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9706843 (jijiki) [12:40:32] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[12:40:43] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:40:45] SRE, MW-on-K8s, serviceops, Traffic, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9706846 (jijiki) [12:40:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2129.codfw.wmnet [12:41:28] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:41:34] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host testvm2008.wikimedia.org with OS bookworm [12:41:34] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host testvm2008.wikimedia.org [12:41:53] !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts testvm2008.wikimedia.org [12:42:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 1%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60398 and previous config saved to /var/cache/conftool/dbconfig/20240411-124248-arnaudb.json [12:43:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60399 and previous config saved to /var/cache/conftool/dbconfig/20240411-124315-root.json [12:45:47] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [12:48:14] SRE, MW-on-K8s, serviceops, Traffic: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9706856 (Clement_Goubert) [12:49:18] !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [12:49:51] !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [12:50:03] !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[12:51:02] SRE, MW-on-K8s, serviceops, Traffic: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9706874 (Clement_Goubert) [12:51:26] !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [12:52:10] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:52:29] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:53:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db[2132,2160].codfw.wmnet with reason: reboot [12:53:28] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:53:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db[2132,2160].codfw.wmnet with reason: reboot [12:53:39] !log akosiaris@cumin1002 conftool action : set/weight=10; selector: name=mw1437.*.wmnet,dc=eqiad [12:53:50] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:53:57] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2132.codfw.wmnet [12:54:40] !log lower weight of mw1437 back to 10 from the 30 I had upped it to yesterday. 
The backlog of videoscaling is apparently now served and CPU usage has reached "normal" levels [12:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 2%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60400 and previous config saved to /var/cache/conftool/dbconfig/20240411-125755-arnaudb.json [12:58:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60401 and previous config saved to /var/cache/conftool/dbconfig/20240411-125821-root.json [12:58:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2132.codfw.wmnet [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1300) [13:00:05] esanders and Dreamy Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:30] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host matomo1003.eqiad.wmnet with OS bookworm [13:07:07] I can’t deploy, sorry [13:08:41] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9706934 (10Papaul) [13:11:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db[2134,2160].codfw.wmnet with reason: reboot [13:12:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[2134,2160].codfw.wmnet with reason: reboot [13:12:15] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2134.codfw.wmnet [13:12:40] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bookworm [13:13:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 4%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60402 and previous config saved to /var/cache/conftool/dbconfig/20240411-131301-arnaudb.json [13:13:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60403 and previous config saved to /var/cache/conftool/dbconfig/20240411-131327-root.json [13:16:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2134.codfw.wmnet [13:17:31] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db[2135,2160].codfw.wmnet with reason: reboot [13:17:45] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[2135,2160].codfw.wmnet with reason: reboot [13:18:28] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2135.codfw.wmnet [13:18:44] \o [13:18:48] I can deploy my patch [13:19:05] (03PS1) 10Jelto: miscweb/service::catalog: move blackbox checks to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1019039 
(https://phabricator.wikimedia.org/T361090) [13:20:06] esanders: Are you around? [13:20:23] edsanders: [13:21:41] (03PS1) 10Btullis: Correct the device names for matomo disks [puppet] - 10https://gerrit.wikimedia.org/r/1019040 (https://phabricator.wikimedia.org/T349397) [13:23:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2135.codfw.wmnet [13:25:02] I'm going to go ahead with mine now as edsanders does not seem around for this window. [13:25:30] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1019039 (https://phabricator.wikimedia.org/T361090) (owner: 10Jelto) [13:25:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018967 (https://phabricator.wikimedia.org/T362284) (owner: 10Dreamy Jazz) [13:25:58] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db[2133,2160].codfw.wmnet with reason: reboot [13:26:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[2133,2160].codfw.wmnet with reason: reboot [13:26:19] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2133.codfw.wmnet [13:26:53] 06SRE, 10SRE-Access-Requests, 06Fundraising-Backlog: Can we please add our vendor to Google Postmaster Tools - https://phabricator.wikimedia.org/T360907#9706990 (10akosiaris) I don't think #SRE has ever administrated Google Postmaster Tools at all. In fact, a quick cross check in the team showcases almost ut... 
[13:27:39] (03CR) 10Btullis: [C:03+2] Correct the device names for matomo disks [puppet] - 10https://gerrit.wikimedia.org/r/1019040 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [13:27:58] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-eqsin and not P{cp[5030,5032].eqsin.wmnet} and A:cp [13:28:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 5%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60404 and previous config saved to /var/cache/conftool/dbconfig/20240411-132807-arnaudb.json [13:28:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60405 and previous config saved to /var/cache/conftool/dbconfig/20240411-132834-root.json [13:29:57] (03PS1) 10Hnowlan: jobqueue: Increase videoscaler job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019045 [13:30:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2133.codfw.wmnet [13:30:47] (03PS2) 10Dreamy Jazz: Ignore missing title/page in CheckUserLookupUtils::getManualLogEntryFromRow [extensions/CheckUser] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018967 (https://phabricator.wikimedia.org/T362284) [13:30:52] (03CR) 10Dreamy Jazz: [C:03+2] Ignore missing title/page in CheckUserLookupUtils::getManualLogEntryFromRow [extensions/CheckUser] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018967 (https://phabricator.wikimedia.org/T362284) (owner: 10Dreamy Jazz) [13:30:56] (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018967 (https://phabricator.wikimedia.org/T362284) (owner: 10Dreamy Jazz) [13:31:01] (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy1002 using scap backport" [extensions/CheckUser] 
(wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018967 (https://phabricator.wikimedia.org/T362284) (owner: 10Dreamy Jazz) [13:31:40] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host matomo1003.eqiad.wmnet with OS bookworm [13:32:03] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bookworm [13:32:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2160.codfw.wmnet with reason: reboot multiinstance replica [13:32:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2160.codfw.wmnet with reason: reboot multiinstance replica [13:32:58] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3073.esams.wmnet,service=(cdn|ats-be) [13:33:11] (03CR) 10Ssingh: [C:03+2] cp3073: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1015975 (https://phabricator.wikimedia.org/T360430) (owner: 10Ssingh) [13:34:59] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp3073.esams.wmnet with OS bullseye [13:35:09] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9707012 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp3073.esams.wmnet with OS bullseye [13:35:32] (03PS26) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [13:36:38] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad [13:37:06] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [13:39:38] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] 
- 10https://gerrit.wikimedia.org/r/1019036 (https://phabricator.wikimedia.org/T361714) (owner: 10Slyngshede) [13:40:46] 10ops-codfw, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9707025 (10Jhancock.wm) Update: Dell finally agreed to replace the HBA card. I sent the shipping address confirmation just now. Hopefully it'll be here tomo... [13:41:11] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9707026 (10ssingh) Traffic reimaged 8 text nodes in esams and all of them PXE-booted the first time, without any issues. I think looking... [13:41:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [13:43:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 10%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60406 and previous config saved to /var/cache/conftool/dbconfig/20240411-134312-arnaudb.json [13:43:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60407 and previous config saved to /var/cache/conftool/dbconfig/20240411-134341-root.json [13:45:00] (03PS1) 10Btullis: Ensure that matomo install grub to /dev/vda [puppet] - 10https://gerrit.wikimedia.org/r/1019048 (https://phabricator.wikimedia.org/T349397) [13:45:24] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=(cdn|ats-be) [13:45:50] (03CR) 10Btullis: [C:03+2] Ensure that matomo install grub to /dev/vda [puppet] - 10https://gerrit.wikimedia.org/r/1019048 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [13:46:48] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp2042.codfw.wmnet with OS bullseye [13:46:58] !log 
btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host matomo1003.eqiad.wmnet with OS bookworm [13:47:00] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9707044 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp2042.codfw.wmnet with OS b... [13:47:05] (03PS1) 10JMeybohm: kubernetes::master: Forward audit logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1019049 (https://phabricator.wikimedia.org/T290020) [13:48:29] (03CR) 10Slyngshede: [C:03+2] Keymanagement, improve error message for key validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1019036 (https://phabricator.wikimedia.org/T361714) (owner: 10Slyngshede) [13:49:06] (03CR) 10Alexandros Kosiaris: [C:03+1] jobqueue: Increase videoscaler job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019045 (owner: 10Hnowlan) [13:49:21] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bookworm [13:49:22] (03PS1) 10Marostegui: db2149: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1019051 [13:49:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2149', diff saved to https://phabricator.wikimedia.org/P60408 and previous config saved to /var/cache/conftool/dbconfig/20240411-134932-root.json [13:49:36] (03Merged) 10jenkins-bot: Keymanagement, improve error message for key validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1019036 (https://phabricator.wikimedia.org/T361714) (owner: 10Slyngshede) [13:49:40] I’m here now, anything left to deploy? 
^^ [13:50:06] (03Merged) 10jenkins-bot: Ignore missing title/page in CheckUserLookupUtils::getManualLogEntryFromRow [extensions/CheckUser] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018967 (https://phabricator.wikimedia.org/T362284) (owner: 10Dreamy Jazz) [13:50:08] (03CR) 10Marostegui: [C:03+2] db2149: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1019051 (owner: 10Marostegui) [13:50:11] I'm currently deploying [13:50:14] ok [13:50:49] !log dreamyjazz@deploy1002 Started scap: Backport for [[gerrit:1018967|Ignore missing title/page in CheckUserLookupUtils::getManualLogEntryFromRow (T362284)]] [13:50:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2149.codfw.wmnet with OS bookworm [13:51:06] T362284: Logs without a defined title or page_id cause an exception in CheckUser - https://phabricator.wikimedia.org/T362284 [13:52:01] (03CR) 10Herron: [C:03+1] opensearch: switch dashboards to sso auth [puppet] - 10https://gerrit.wikimedia.org/r/1018872 (https://phabricator.wikimedia.org/T246998) (owner: 10Filippo Giunchedi) [13:53:39] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2008.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [13:54:10] (03PS27) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [13:54:30] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2008.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [13:54:30] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:54:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2008.wikimedia.org 
[13:54:41] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: 14Investigate Ganeti in routed mode - 14https://phabricator.wikimedia.org/T300152#9707155 (10ops-monitoring-bot) 14cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `testvm2008.wikimedia.org` - testv... [13:55:07] (03PS2) 10JMeybohm: kubernetes::master: Forward audit logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1019049 (https://phabricator.wikimedia.org/T290020) [13:55:33] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [13:55:37] (03CR) 10Elukey: [C:03+2] Force PKI TLS certs for cassandra instances on aqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/1018309 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [13:55:47] !log dreamyjazz@deploy1002 dreamyjazz: Backport for [[gerrit:1018967|Ignore missing title/page in CheckUserLookupUtils::getManualLogEntryFromRow (T362284)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:55:59] !log dreamyjazz@deploy1002 dreamyjazz: Continuing with sync [13:56:26] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1871/co" [puppet] - 10https://gerrit.wikimedia.org/r/1019049 (https://phabricator.wikimedia.org/T290020) (owner: 10JMeybohm) [13:56:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 10%: Repool', diff saved to https://phabricator.wikimedia.org/P60409 and previous config saved to /var/cache/conftool/dbconfig/20240411-135634-arnaudb.json [13:56:45] (03CR) 10Hnowlan: [C:03+2] jobqueue: Increase videoscaler job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019045 (owner: 10Hnowlan) [13:57:42] (03Merged) 10jenkins-bot: jobqueue: Increase videoscaler job concurrency 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1019045 (owner: 10Hnowlan) [13:57:48] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-codfw and not P{cp2042.codfw.wmnet} and A:cp [13:57:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 25%: repool', diff saved to https://phabricator.wikimedia.org/P60410 and previous config saved to /var/cache/conftool/dbconfig/20240411-135754-arnaudb.json [13:58:11] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3073.esams.wmnet with reason: host reimage [13:58:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 20%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60411 and previous config saved to /var/cache/conftool/dbconfig/20240411-135819-arnaudb.json [13:58:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60412 and previous config saved to /var/cache/conftool/dbconfig/20240411-135846-root.json [13:58:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 25%: reool', diff saved to https://phabricator.wikimedia.org/P60413 and previous config saved to /var/cache/conftool/dbconfig/20240411-135858-arnaudb.json [13:59:33] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on aqs1010.eqiad.wmnet with reason: Upgrade to PKI [13:59:47] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aqs1010.eqiad.wmnet with reason: Upgrade to PKI [14:00:25] (SystemdUnitFailed) firing: prometheus-wmf-elasticsearch-exporter-9600.service on elastic2090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:01:19] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on cp3073.esams.wmnet with reason: host reimage [14:03:32] 10ops-eqiad, 06SRE, 06DC-Ops: 14Inconsistent data in Netbox for some msw device - 14https://phabricator.wikimedia.org/T359326#9707236 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr 14Corrected netbox errors  [14:03:42] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2042.codfw.wmnet with reason: host reimage [14:04:02] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9707240 (10ssingh) @Papaul suggested to try a host in codfw and `cp2042` PXE booted successfully. In one of the above messages, @cmooney... [14:06:05] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host matomo1003.eqiad.wmnet with OS bookworm [14:06:48] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2042.codfw.wmnet with reason: host reimage [14:06:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2149.codfw.wmnet with reason: host reimage [14:08:31] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1018967|Ignore missing title/page in CheckUserLookupUtils::getManualLogEntryFromRow (T362284)]] (duration: 17m 42s) [14:08:39] T362284: Logs without a defined title or page_id cause an exception in CheckUser - https://phabricator.wikimedia.org/T362284 [14:09:02] !log installing NSS security updates [14:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:13] !log Afternoon UTC backport window finished [14:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:54] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [14:10:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2149.codfw.wmnet with reason: host 
reimage [14:10:19] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [14:10:44] !log move cassandra instances on aqs1010 to PKI TLS certs - T352647 [14:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:48] T352647: Move Cassandra clusters to PKI - https://phabricator.wikimedia.org/T352647 [14:11:23] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [14:11:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 25%: Repool', diff saved to https://phabricator.wikimedia.org/P60414 and previous config saved to /var/cache/conftool/dbconfig/20240411-141139-arnaudb.json [14:12:02] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:13:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 50%: repool', diff saved to https://phabricator.wikimedia.org/P60415 and previous config saved to /var/cache/conftool/dbconfig/20240411-141300-arnaudb.json [14:13:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60416 and previous config saved to /var/cache/conftool/dbconfig/20240411-141324-arnaudb.json [14:13:59] Dreamy_Jazz: sorry, you finished? [14:14:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 50%: reool', diff saved to https://phabricator.wikimedia.org/P60417 and previous config saved to /var/cache/conftool/dbconfig/20240411-141404-arnaudb.json [14:14:37] Yeah. I have, but can extend the window if necessary. [14:15:11] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw [14:15:30] As it seems nothing else is on the calendar for at least the next hour. 
[14:15:49] (03PS1) 10Hnowlan: jobqueue: increase videoscaler job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019053 [14:17:33] edsanders: Do you have deployment rights? If not, do you want me to deploy? [14:18:31] Considering the window is done, I'd probably defer this change, but there isn't anything after this on the calendar. [14:18:31] !log drain and restart cassandra-b on aqs2007 - didn't pick up the new truststore during the past roll restart - T352647 [14:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:36] T352647: Move Cassandra clusters to PKI - https://phabricator.wikimedia.org/T352647 [14:19:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [14:20:13] (03PS1) 10Marostegui: Revert "db2149: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1018972 [14:21:31] (03PS28) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [14:22:44] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [14:23:50] (03PS1) 10Muehlenhoff: Remove global root for four engineering managers [puppet] - 10https://gerrit.wikimedia.org/r/1019054 [14:24:27] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3073.esams.wmnet with OS bullseye [14:24:51] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9707334 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp3073.esams.wmnet with OS bullseye completed: - cp3073 (**PASS**)... [14:25:23] Dreamy_Jazz: whichever you prefer [14:25:41] As long as you can be around to test, I can deploy. 
[14:26:26] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2042.codfw.wmnet with OS bullseye [14:26:33] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9707355 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp2042.codfw.wmnet with OS bulls... [14:26:39] Dreamy_Jazz: I can test [14:26:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015086 (https://phabricator.wikimedia.org/T361134) (owner: 10Esanders) [14:26:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 50%: Repool', diff saved to https://phabricator.wikimedia.org/P60418 and previous config saved to /var/cache/conftool/dbconfig/20240411-142645-arnaudb.json [14:26:50] Thanks [14:26:55] No problem [14:27:08] !log Extending UTC Afternoon backport window [14:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:29] (03Merged) 10jenkins-bot: Set wgMFFallbackEditor to visual for most VE wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015086 (https://phabricator.wikimedia.org/T361134) (owner: 10Esanders) [14:27:36] (03CR) 10Marostegui: [C:03+2] Revert "db2149: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1018972 (owner: 10Marostegui) [14:27:56] !log dreamyjazz@deploy1002 Started scap: Backport for [[gerrit:1015086|Set wgMFFallbackEditor to visual for most VE wikis (T361134)]] [14:28:01] T361134: Set wgMFFallbackEditor to 'visual' for all other wikis - https://phabricator.wikimedia.org/T361134 [14:28:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60419 and previous config saved to /var/cache/conftool/dbconfig/20240411-142801-root.json 
[14:28:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 75%: repool', diff saved to https://phabricator.wikimedia.org/P60420 and previous config saved to /var/cache/conftool/dbconfig/20240411-142806-arnaudb.json [14:28:22] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3073.esams.wmnet,service=(cdn|ats-be) [14:28:27] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=(cdn|ats-be) [14:28:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 50%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60421 and previous config saved to /var/cache/conftool/dbconfig/20240411-142830-arnaudb.json [14:29:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 75%: reool', diff saved to https://phabricator.wikimedia.org/P60422 and previous config saved to /var/cache/conftool/dbconfig/20240411-142910-arnaudb.json [14:29:47] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9707409 (10ssingh) [14:29:52] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1019014 (owner: 10Majavah) [14:30:15] (03CR) 10Majavah: [C:03+2] O:mariadb::grants: drop unused clouddb.sql.erb [puppet] - 10https://gerrit.wikimedia.org/r/1019014 (owner: 10Majavah) [14:30:19] (03Abandoned) 10Muehlenhoff: Remove obsolete grant [puppet] - 10https://gerrit.wikimedia.org/r/1018941 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [14:30:44] !log dreamyjazz@deploy1002 dreamyjazz and esanders: Backport for [[gerrit:1015086|Set wgMFFallbackEditor to visual for most VE wikis (T361134)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:30:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2149.codfw.wmnet with OS bookworm [14:31:00] (03PS6) 10Muehlenhoff: netbox: 
Set deployment method to avoid creating scap target [puppet] - 10https://gerrit.wikimedia.org/r/1002392 [14:31:29] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad [14:33:31] (03CR) 10Filippo Giunchedi: [C:04-1] "IIRC support for multiple probes of type http and tcp hasn't been implemented, so I'm afraid this won't work as-is." [puppet] - 10https://gerrit.wikimedia.org/r/1019039 (https://phabricator.wikimedia.org/T361090) (owner: 10Jelto) [14:33:46] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9707428 (10MoritzMuehlenhoff) [14:34:12] !log installing distro-info-data updates from Bullseye point release [14:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:19] (03PS3) 10Ssingh: hiera: unify trafficserver storage elements for esams [puppet] - 10https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) (owner: 10Fabfur) [14:34:30] (03PS4) 10Ssingh: hiera: unify trafficserver storage elements for esams [puppet] - 10https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) (owner: 10Fabfur) [14:35:20] (03CR) 10LSobanski: [C:03+1] Remove global root for four engineering managers [puppet] - 10https://gerrit.wikimedia.org/r/1019054 (owner: 10Muehlenhoff) [14:36:05] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1872/console" [puppet] - 10https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) (owner: 10Fabfur) [14:36:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [14:38:18] edsanders: Can you test? 
[14:38:25] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9707455 (10MoritzMuehlenhoff) [14:38:26] testing [14:38:26] (03CR) 10Clément Goubert: [C:03+1] jobqueue: increase videoscaler job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019053 (owner: 10Hnowlan) [14:38:28] (JobUnavailable) firing: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:16] (03PS5) 10Ssingh: hiera: unify trafficserver storage elements for esams [puppet] - 10https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) (owner: 10Fabfur) [14:39:20] (03CR) 10Muehlenhoff: [C:03+2] Remove global root for four engineering managers [puppet] - 10https://gerrit.wikimedia.org/r/1019054 (owner: 10Muehlenhoff) [14:39:29] Dreamy_Jazz: Looks good - thanks [14:39:36] Great. [14:39:52] !log dreamyjazz@deploy1002 dreamyjazz and esanders: Continuing with sync [14:40:19] * Dreamy_Jazz is thankful I ran scap backport on tmux as my client shell crashed. [14:40:36] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1873/console" [puppet] - 10https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) (owner: 10Fabfur) [14:41:32] (03CR) 10Fabfur: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) (owner: 10Fabfur) [14:41:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 75%: Repool', diff saved to https://phabricator.wikimedia.org/P60423 and previous config saved to /var/cache/conftool/dbconfig/20240411-144152-arnaudb.json [14:43:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60424 and previous config saved to /var/cache/conftool/dbconfig/20240411-144307-root.json [14:43:09] !log sudo cumin "A:cp and A:esams" "disable-puppet 'merging CR 1014571'" [14:43:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 100%: repool', diff saved to https://phabricator.wikimedia.org/P60425 and previous config saved to /var/cache/conftool/dbconfig/20240411-144311-arnaudb.json [14:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:16] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot-master rolling restart_daemons on A:maps-master [14:43:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60426 and previous config saved to /var/cache/conftool/dbconfig/20240411-144336-arnaudb.json [14:43:52] (03CR) 10Hnowlan: [C:03+2] jobqueue: increase videoscaler job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019053 (owner: 10Hnowlan) [14:44:13] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: unify trafficserver storage elements for esams [puppet] - 10https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) (owner: 10Fabfur) [14:44:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 100%: reool', diff saved to https://phabricator.wikimedia.org/P60427 and previous config saved to /var/cache/conftool/dbconfig/20240411-144416-arnaudb.json [14:44:36] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot-master (exit_code=0) rolling restart_daemons on A:maps-master [14:44:38] (03Merged) 10jenkins-bot: jobqueue: increase videoscaler job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019053 (owner: 10Hnowlan) [14:44:46] (03PS4) 10Ahmon Dancy: static.php: Handle mediawiki.org/ontology/ontology.owl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018354 (https://phabricator.wikimedia.org/T171807) [14:45:14] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [14:46:25] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: 14esams text cp nvme upgrade - 14https://phabricator.wikimedia.org/T360430#9707488 (10Fabfur) 05Open→03Resolved [14:47:20] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [14:47:21] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [14:47:23] (03CR) 10Alexandros Kosiaris: [C:03+1] kubernetes::master: Forward audit logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1019049 (https://phabricator.wikimedia.org/T290020) (owner: 10JMeybohm) [14:47:36] jouncebot nowandnext [14:47:36] No deployments scheduled for the next 1 hour(s) and 12 minute(s) [14:47:36] In 1 hour(s) and 12 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1600) [14:47:57] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:47:58] Dreamy_Jazz: Ping me when you're done please. [14:48:09] Sure. 
[14:50:35] (03PS1) 10Clément Goubert: ml-serve: Add istio config for mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019061 (https://phabricator.wikimedia.org/T362316) [14:52:07] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1015086|Set wgMFFallbackEditor to visual for most VE wikis (T361134)]] (duration: 24m 11s) [14:52:08] dancy: Done. [14:52:13] thx [14:52:14] T361134: Set wgMFFallbackEditor to 'visual' for all other wikis - https://phabricator.wikimedia.org/T361134 [14:52:26] !log sudo cumin "A:cp and A:esams" "run-puppet-agent --enable 'merging CR 1014571'" [14:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018354 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [14:54:11] (03Merged) 10jenkins-bot: static.php: Handle mediawiki.org/ontology/ontology.owl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018354 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [14:54:17] (03PS3) 10JMeybohm: kubernetes::master: Forward audit logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1019049 (https://phabricator.wikimedia.org/T290020) [14:54:39] !log dancy@deploy1002 Started scap: Backport for [[gerrit:1018354|static.php: Handle mediawiki.org/ontology/ontology.owl (T171807 T359643)]] [14:54:43] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-codfw and not P{cp2042.codfw.wmnet} and A:cp [14:54:45] T171807: Create ontology URL for mediawiki - https://phabricator.wikimedia.org/T171807 [14:54:47] T359643: Get rid of the /srv/mediawiki/php symbolic link - https://phabricator.wikimedia.org/T359643 [14:56:57] (03PS1) 10Hnowlan: jobqueue: restore videoscaling concurrency to pre-outage level [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1019062 [14:56:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 100%: Repool', diff saved to https://phabricator.wikimedia.org/P60428 and previous config saved to /var/cache/conftool/dbconfig/20240411-145658-arnaudb.json [14:57:21] !log dancy@deploy1002 dancy: Backport for [[gerrit:1018354|static.php: Handle mediawiki.org/ontology/ontology.owl (T171807 T359643)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:57:39] (03CR) 10Alexandros Kosiaris: [C:03+1] jobqueue: restore videoscaling concurrency to pre-outage level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019062 (owner: 10Hnowlan) [14:57:45] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-drmrs and A:cp [14:58:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60429 and previous config saved to /var/cache/conftool/dbconfig/20240411-145813-root.json [14:58:24] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bookworm [14:58:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60430 and previous config saved to /var/cache/conftool/dbconfig/20240411-145841-arnaudb.json [15:00:20] !log dancy@deploy1002 dancy: Continuing with sync [15:02:30] 10ops-eqiad, 06SRE: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9707550 (10Eevans) >>! In T362033#9700949, @Jclark-ctr wrote: > @Eevans Hey looks like same drive as T354499 is failed again let me know if i can replace it again Sure, go ahead. P.S. I think this is the 4th time, ar... 
[15:03:05] (03PS2) 10Clément Goubert: article-description: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018959 (https://phabricator.wikimedia.org/T362316) [15:06:06] (03PS3) 10Clément Goubert: article-description: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018959 (https://phabricator.wikimedia.org/T362316) [15:09:44] (03PS1) 10Muehlenhoff: Pass the Ceph cluster address as an array [puppet] - 10https://gerrit.wikimedia.org/r/1019063 [15:11:02] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on matomo1003.eqiad.wmnet with reason: host reimage [15:11:52] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1019065 [15:12:10] (03CR) 10Ahmon Dancy: [V:03+2 C:03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1019065 (owner: 10Ahmon Dancy) [15:12:21] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:1018354|static.php: Handle mediawiki.org/ontology/ontology.owl (T171807 T359643)]] (duration: 17m 41s) [15:12:26] T171807: Create ontology URL for mediawiki - https://phabricator.wikimedia.org/T171807 [15:12:27] T359643: Get rid of the /srv/mediawiki/php symbolic link - https://phabricator.wikimedia.org/T359643 [15:13:05] (03CR) 10Elukey: "Left a small change request, after that +1!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1019061 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:13:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60431 and previous config saved to /var/cache/conftool/dbconfig/20240411-151319-root.json [15:14:32] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on matomo1003.eqiad.wmnet with reason: host reimage [15:14:35] (03PS2) 10Clément Goubert: ml-serve: Add istio config for mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019061 (https://phabricator.wikimedia.org/T362316) [15:15:07] (03PS2) 10Clément Goubert: article-description: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018960 (https://phabricator.wikimedia.org/T362316) [15:18:22] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host moss-fe1002.eqiad.wmnet with OS bookworm [15:18:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1019063 (owner: 10Muehlenhoff) [15:20:11] (03PS2) 10Clément Goubert: articletopic-outlink: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018961 (https://phabricator.wikimedia.org/T362316) [15:20:22] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2002.codfw.wmnet with OS bookworm [15:20:32] (03CR) 10Hnowlan: [C:03+2] jobqueue: restore videoscaling concurrency to pre-outage level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019062 (owner: 10Hnowlan) [15:21:30] (03Merged) 10jenkins-bot: jobqueue: restore videoscaling concurrency to pre-outage level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019062 (owner: 10Hnowlan) [15:23:51] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [15:24:20] !log hnowlan@deploy1002 helmfile [eqiad] 
DONE helmfile.d/services/changeprop-jobqueue: apply [15:24:21] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [15:24:44] jan_drewniak: hi, will you backport the mobilefrontend revert ? :) [15:24:52] if we can land it and resume the train, that would be great [15:24:57] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [15:26:20] (03PS2) 10Clément Goubert: articletopic-outlink: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018962 (https://phabricator.wikimedia.org/T362316) [15:26:38] (03CR) 10JMeybohm: "PCC is at https://puppet-compiler.wmflabs.org/output/1019049/1874/" [puppet] - 10https://gerrit.wikimedia.org/r/1019049 (https://phabricator.wikimedia.org/T290020) (owner: 10JMeybohm) [15:27:09] (03PS3) 10Clément Goubert: articletopic-outlink: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018962 (https://phabricator.wikimedia.org/T362316) [15:27:31] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1019066 [15:28:18] (03PS2) 10Clément Goubert: experimental: Switch to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018963 (https://phabricator.wikimedia.org/T362316) [15:28:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60432 and previous config saved to /var/cache/conftool/dbconfig/20240411-152825-root.json [15:28:31] (03PS3) 10Ahmon Dancy: Serve mw.org/ontology/ontology.owl via /w/static.php (v2) [puppet] - 10https://gerrit.wikimedia.org/r/1018335 (https://phabricator.wikimedia.org/T171807) [15:28:31] (03PS1) 10Ahmon Dancy: Revert "Route /w/docs/ to /w/static.php" [puppet] - 10https://gerrit.wikimedia.org/r/1019067 (https://phabricator.wikimedia.org/T171807) [15:29:06] (03CR) 10BCornwall: [C:03+1] trafficserver: Add SPDX headers [puppet] - 
10https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:29:57] (03PS2) 10Clément Goubert: readability: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018964 (https://phabricator.wikimedia.org/T362316) [15:30:05] (03PS2) 10Clément Goubert: readability: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018965 (https://phabricator.wikimedia.org/T362316) [15:30:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2110 (re)pooling @ 25%: repool', diff saved to https://phabricator.wikimedia.org/P60433 and previous config saved to /var/cache/conftool/dbconfig/20240411-153003-arnaudb.json [15:30:07] (03PS4) 10Ahmon Dancy: Serve mw.org/ontology/ontology.owl via /w/static.php (v2) [puppet] - 10https://gerrit.wikimedia.org/r/1018335 (https://phabricator.wikimedia.org/T171807) [15:30:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 (re)pooling @ 25%: repool', diff saved to https://phabricator.wikimedia.org/P60434 and previous config saved to /var/cache/conftool/dbconfig/20240411-153019-arnaudb.json [15:31:01] (03PS3) 10Clément Goubert: readability: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018965 (https://phabricator.wikimedia.org/T362316) [15:31:26] (03CR) 10BCornwall: [C:03+1] cache: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868709 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:31:29] (03PS2) 10Clément Goubert: revertrisk: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018986 (https://phabricator.wikimedia.org/T362316) [15:31:36] (03PS2) 10Clément Goubert: revertrisk: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018987 (https://phabricator.wikimedia.org/T362316) [15:31:37] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe1002.eqiad.wmnet with 
reason: host reimage [15:31:44] (03CR) 10BCornwall: [V:03+1 C:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1877/console" [puppet] - 10https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:32:01] (03PS3) 10Clément Goubert: revertrisk: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018987 (https://phabricator.wikimedia.org/T362316) [15:32:20] (03PS2) 10Clément Goubert: revscoring-articlequality: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018988 (https://phabricator.wikimedia.org/T362316) [15:32:25] (03PS3) 10Clément Goubert: revscoring-articlequality: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018988 (https://phabricator.wikimedia.org/T362316) [15:32:51] (03PS2) 10Clément Goubert: revscoring-articlequality: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018989 (https://phabricator.wikimedia.org/T362316) [15:33:05] (03PS3) 10Clément Goubert: revscoring-articlequality: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018989 (https://phabricator.wikimedia.org/T362316) [15:33:14] !log btullis@cumin1002 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. 
[15:33:34] (03PS2) 10Clément Goubert: revscoring-articletopic: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018990 (https://phabricator.wikimedia.org/T362316) [15:33:40] (03CR) 10BCornwall: [V:03+1 C:03+2] trafficserver: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:33:45] (03PS2) 10Clément Goubert: revscoring-articletopic: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018991 (https://phabricator.wikimedia.org/T362316) [15:34:07] (03PS3) 10Clément Goubert: revscoring-articletopic: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018991 (https://phabricator.wikimedia.org/T362316) [15:34:28] (03CR) 10Btullis: [C:03+2] Migrate device-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014658 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [15:34:33] (03PS2) 10Clément Goubert: revscoring-draftquality: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018992 (https://phabricator.wikimedia.org/T362316) [15:34:39] (03PS2) 10Clément Goubert: revscoring-draftquality: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018993 (https://phabricator.wikimedia.org/T362316) [15:34:57] (03PS3) 10Clément Goubert: revscoring-draftquality: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018993 (https://phabricator.wikimedia.org/T362316) [15:35:04] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe1002.eqiad.wmnet with reason: host reimage [15:35:17] (03Merged) 10jenkins-bot: Migrate device-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014658 (https://phabricator.wikimedia.org/T360531) (owner: 
10Btullis) [15:35:18] (03PS2) 10Clément Goubert: revscoring-drafttopic: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018994 (https://phabricator.wikimedia.org/T362316) [15:35:27] (SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:35:27] (03PS2) 10Clément Goubert: revscoring-drafttopic: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018995 (https://phabricator.wikimedia.org/T362316) [15:35:47] (03PS3) 10Clément Goubert: revscoring-drafttopic: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018995 (https://phabricator.wikimedia.org/T362316) [15:35:53] (03PS1) 10Ahmon Dancy: Revert "mediawiki: Route /w/docs/ to /w/static.php" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019068 [15:36:06] (03PS2) 10Clément Goubert: revscoring-editquality-damaging: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018996 (https://phabricator.wikimedia.org/T362316) [15:36:14] (03PS2) 10Clément Goubert: revscoring-editquality-damaging: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018997 (https://phabricator.wikimedia.org/T362316) [15:36:40] (03PS3) 10Clément Goubert: revscoring-editquality-damaging: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018997 (https://phabricator.wikimedia.org/T362316) [15:36:46] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2002.codfw.wmnet with reason: host reimage [15:37:00] (03PS2) 10Clément Goubert: revscoring-editquality-goodfaith: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018998 (https://phabricator.wikimedia.org/T362316) [15:37:08] (03PS2) 
10Clément Goubert: revscoring-editquality-goodfaith: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018999 (https://phabricator.wikimedia.org/T362316) [15:37:27] (03PS3) 10Clément Goubert: revscoring-editquality-goodfaith: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018999 (https://phabricator.wikimedia.org/T362316) [15:37:46] (03PS2) 10Clément Goubert: revscoring-editquality-reverted: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019000 (https://phabricator.wikimedia.org/T362316) [15:37:54] (03PS2) 10Clément Goubert: revscoring-editquality-reverted: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019001 (https://phabricator.wikimedia.org/T362316) [15:38:12] (03PS3) 10Clément Goubert: revscoring-editquality-reverted: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019001 (https://phabricator.wikimedia.org/T362316) [15:38:19] (03PS1) 10Jcrespo: dbbackups: Add striker_toolsbeta to the list of m5 backups [puppet] - 10https://gerrit.wikimedia.org/r/1019069 (https://phabricator.wikimedia.org/T360149) [15:38:20] (03PS3) 10BCornwall: cache: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868709 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:39:43] (03PS10) 10Jgreen: community-crm: Add dyna and discovery records [dns] - 10https://gerrit.wikimedia.org/r/983950 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [15:39:44] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2002.codfw.wmnet with reason: host reimage [15:40:27] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1881/co" [puppet] - 10https://gerrit.wikimedia.org/r/868709 (https://phabricator.wikimedia.org/T308013) 
(owner: 10Muehlenhoff) [15:41:08] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: 14Requesting access to shell access to analytics client servers for AndyRussG - 14https://phabricator.wikimedia.org/T361742#9707721 (10AndyRussG) 14>>! In T361742#9706455, @Milimetric wrote: > Approved, welcome back Andy :) Woohoo, thanks! :) :) >>! In... [15:41:31] (03CR) 10Jgreen: [C:03+2] community-crm: Add dyna and discovery records [dns] - 10https://gerrit.wikimedia.org/r/983950 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [15:43:16] (03CR) 10BCornwall: [V:03+1 C:03+2] cache: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868709 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:43:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60435 and previous config saved to /var/cache/conftool/dbconfig/20240411-154330-root.json [15:44:30] (03CR) 10Clément Goubert: ml-serve: Add istio config for mw-api-int-ro (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019061 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:45:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2110 (re)pooling @ 50%: repool', diff saved to https://phabricator.wikimedia.org/P60436 and previous config saved to /var/cache/conftool/dbconfig/20240411-154510-arnaudb.json [15:45:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 (re)pooling @ 50%: repool', diff saved to https://phabricator.wikimedia.org/P60437 and previous config saved to /var/cache/conftool/dbconfig/20240411-154524-arnaudb.json [15:45:26] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. 
[15:47:04] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-drmrs and A:cp [15:51:21] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe1002.eqiad.wmnet with OS bookworm [15:51:59] (03PS1) 10Cwhite: opensearch: bump curator version to wmf4 [puppet] - 10https://gerrit.wikimedia.org/r/1018417 (https://phabricator.wikimedia.org/T348508) [15:56:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe2002.codfw.wmnet with OS bookworm [15:57:18] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1019063 (owner: 10Muehlenhoff) [15:58:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60438 and previous config saved to /var/cache/conftool/dbconfig/20240411-155836-root.json [15:59:26] (03CR) 10Scott French: [C:03+1] "LGTM. If you can have this merged and verified during the puppet window today, let me know and I can help you get this out to k8s during t" [puppet] - 10https://gerrit.wikimedia.org/r/1018335 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [16:00:05] jhathaway: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1600). [16:00:05] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2110 (re)pooling @ 75%: repool', diff saved to https://phabricator.wikimedia.org/P60439 and previous config saved to /var/cache/conftool/dbconfig/20240411-160016-arnaudb.json [16:00:25] o/ [16:00:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 (re)pooling @ 75%: repool', diff saved to https://phabricator.wikimedia.org/P60440 and previous config saved to /var/cache/conftool/dbconfig/20240411-160030-arnaudb.json [16:01:16] (03PS3) 10Clément Goubert: ml-serve: Add istio config for mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019061 (https://phabricator.wikimedia.org/T362316) [16:02:04] o/ [16:03:41] !log beginning rolling hardware upgrades for titan100[12] T361251 [16:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:54] dancy: just merge both patches? [16:03:55] T361251: titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251 [16:04:25] jhathaway: Yes please. [16:04:35] (03CR) 10JHathaway: [C:03+2] Serve mw.org/ontology/ontology.owl via /w/static.php (v2) [puppet] - 10https://gerrit.wikimedia.org/r/1018335 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [16:04:43] (03CR) 10JHathaway: [C:03+2] Revert "Route /w/docs/ to /w/static.php" [puppet] - 10https://gerrit.wikimedia.org/r/1019067 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [16:04:49] (03PS4) 10Clément Goubert: ml-serve: Add istio config for mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019061 (https://phabricator.wikimedia.org/T362316) [16:05:23] (03CR) 10Ahmon Dancy: "Thanks! Jesse Hathaway is handling this one for me during the puppet window (right now). 
I did add https://gerrit.wikimedia.org/r/c/oper" [puppet] - 10https://gerrit.wikimedia.org/r/1018335 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [16:05:47] (03PS1) 10Hashar: Revert "Update mobile search for dark mode, remove unused functions in MobilePage.php" [extensions/MobileFrontend] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018974 (https://phabricator.wikimedia.org/T362297) [16:06:28] dancy: merged [16:06:54] Thanks! I'll do some testing in 10 minutes. [16:07:22] jouncebot: nowandnext [16:07:23] For the next 0 hour(s) and 52 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1600) [16:07:23] In 0 hour(s) and 52 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1700) [16:07:23] In 0 hour(s) and 52 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1700) [16:07:33] (ProbeDown) firing: (2) Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:07:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1002 using scap backport" [extensions/MobileFrontend] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018974 (https://phabricator.wikimedia.org/T362297) (owner: 10Hashar) [16:08:50] (03PS1) 10Clément Goubert: ml-staging-codfw: Override mediawiki-app-vs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019074 (https://phabricator.wikimedia.org/T362316) [16:10:57] jhathaway: Can you run-puppet-agent on mwdebug1001.eqiad.wmnet ? 
[16:11:06] nod [16:12:50] dancy: done [16:15:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2110 (re)pooling @ 100%: repool', diff saved to https://phabricator.wikimedia.org/P60441 and previous config saved to /var/cache/conftool/dbconfig/20240411-161522-arnaudb.json [16:15:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 (re)pooling @ 100%: repool', diff saved to https://phabricator.wikimedia.org/P60442 and previous config saved to /var/cache/conftool/dbconfig/20240411-161536-arnaudb.json [16:19:21] (03PS3) 10Jcrespo: mariadb: Migrate db2098 backups to db2198 and upgrade dbprov2002 to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1018276 (https://phabricator.wikimedia.org/T360751) [16:19:21] (03PS1) 10Jcrespo: mariadb: Reenable notifications for db2201 & db2202 [puppet] - 10https://gerrit.wikimedia.org/r/1019077 (https://phabricator.wikimedia.org/T355422) [16:20:12] (03CR) 10Jcrespo: [C:04-1] mariadb: Reenable notifications for db2201 & db2202 [puppet] - 10https://gerrit.wikimedia.org/r/1019077 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo) [16:26:05] (03PS7) 10TChin: Add datasets-config helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) [16:27:10] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host matomo1003.eqiad.wmnet with OS bookworm [16:27:17] (03CR) 10CI reject: [V:04-1] Revert "Update mobile search for dark mode, remove unused functions in MobilePage.php" [extensions/MobileFrontend] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018974 (https://phabricator.wikimedia.org/T362297) (owner: 10Hashar) [16:27:33] (ProbeDown) resolved: (2) Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:28:01] (03CR) 10Hashar: [V:03+2] "I am submitting this change directly, the sole failure comes from a Selenium test for GrowthExperiments:" [extensions/MobileFrontend] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018974 (https://phabricator.wikimedia.org/T362297) (owner: 10Hashar) [16:28:28] (JobUnavailable) resolved: (2) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:29:00] !log hashar@deploy1002 Started scap: Backport for [[gerrit:1018974|Revert "Update mobile search for dark mode, remove unused functions in MobilePage.php" (T362297)]] [16:29:05] T362297: [Bug] Mobile watchlist broken - https://phabricator.wikimedia.org/T362297 [16:30:41] (03PS1) 10BryanDavis: developer-portal: Bump container to 2024-04-11-122429-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019080 [16:31:41] !log hashar@deploy1002 hashar: Backport for [[gerrit:1018974|Revert "Update mobile search for dark mode, remove unused functions in MobilePage.php" (T362297)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:33:17] !log hashar@deploy1002 hashar: Continuing with sync [16:33:28] (JobUnavailable) firing: (2) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:33:50] ^ verified via the debug server and using https://m.mediawiki.org/wiki/Special:Watchlist?debug=1 to nuke the resourceloader cache [16:34:51] (03CR) 10Jcrespo: [C:04-1] "Was the host reimaged?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1019077 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo) [16:39:16] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2024-04-11-122429-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019080 (owner: 10BryanDavis) [16:40:22] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2024-04-11-122429-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019080 (owner: 10BryanDavis) [16:44:18] 10ops-eqiad, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251#9708003 (10VRiley-WMF) [16:45:38] 10ops-eqiad, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): 14titan100[12] ram/ssd upgrade coordination - 14https://phabricator.wikimedia.org/T361251#9708006 (10VRiley-WMF) 05Open→03Resolved 14Worked with @herron and upgraded these servers. They came back properly and making this ti... [16:45:48] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:1018974|Revert "Update mobile search for dark mode, remove unused functions in MobilePage.php" (T362297)]] (duration: 16m 47s) [16:45:53] T362297: [Bug] Mobile watchlist broken - https://phabricator.wikimedia.org/T362297 [16:46:27] (03PS1) 10TChin: Add datasets-config helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) [16:48:28] (JobUnavailable) resolved: (2) Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:49:43] (03PS1) 10Dzahn: delete cas-logtash.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1019086 [16:50:29] (03CR) 10Dzahn: "de" [dns] - 10https://gerrit.wikimedia.org/r/1019086 (owner: 10Dzahn) [16:52:06] (03CR) 10TChin: 
Add datasets-config helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [16:53:50] (03CR) 10TChin: Add datasets-config helm chart (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [16:55:39] (03PS1) 10Dzahn: delete kibana-next.svc.[eqiad|codfw].wmnet records [dns] - 10https://gerrit.wikimedia.org/r/1019087 [17:00:05] bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1700) [17:00:05] dancy: A patch you scheduled for MediaWiki infrastructure (UTC late) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:11] o/ [17:02:58] I am done with the backport [17:06:05] dancy: I don't know who run those helm deployment :) [17:06:15] but once you are done, I will proceed with the train [17:06:16] (03PS2) 10Ahmon Dancy: Revert "mediawiki: Route /w/docs/ to /w/static.php" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019068 (https://phabricator.wikimedia.org/T171807) [17:06:51] Thanks hashar. I'll ping you. [17:07:08] Or you can hand it off to me if you wish. [17:07:31] (03CR) 10Scott French: [C:03+1] "LGTM. Thanks for cleaning this up since it won't be used. I'll work with you to get this deployed." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019068 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [17:07:37] if you don't mind, that would let me have dinner with kids :-] [17:07:47] I don't mind. Enjoy the fam! 
[17:07:57] there was not much showing up, but the train got blocked due to some off by one issue in the mobile Special:Watchlist [17:08:15] (03CR) 10Scott French: [C:03+2] Revert "mediawiki: Route /w/docs/ to /w/static.php" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019068 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [17:08:16] and we did the train log triage a couple hour ago, it is all quiet [17:08:24] Excellent. [17:08:24] cool! thank you Ahmon! [17:09:33] * hashar heads to dinner [17:09:51] (03Merged) 10jenkins-bot: Revert "mediawiki: Route /w/docs/ to /w/static.php" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019068 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [17:10:15] * bd808 has a developer-portal version bump to roll out [17:10:47] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:10:53] (03CR) 10Jcrespo: [C:04-1] "Data sources:" [puppet] - 10https://gerrit.wikimedia.org/r/1019077 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo) [17:11:05] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:11:18] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:11:53] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:12:03] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:12:30] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:16:14] * bd808 is done [17:17:16] dancy: https://gerrit.wikimedia.org/r/1019068 has been picked up on deploy1002 - I'll get that moving shortly [17:17:31] 👍🏾 [17:20:00] !log swfrench@deploy1002 Started scap: (no justification provided) [17:27:58] !log swfrench@deploy1002 Finished scap: (no justification provided) (duration: 07m 57s) [17:32:38] Rolling the train! 
[17:35:12] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019097 (https://phabricator.wikimedia.org/T360158) [17:35:14] (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.42.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019097 (https://phabricator.wikimedia.org/T360158) (owner: 10TrainBranchBot) [17:36:00] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019097 (https://phabricator.wikimedia.org/T360158) (owner: 10TrainBranchBot) [17:40:42] (03PS29) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [17:41:48] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [17:50:47] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.26 refs T360158 [17:50:54] T360158: 1.42.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T360158 [18:00:25] (SystemdUnitFailed) firing: prometheus-wmf-elasticsearch-exporter-9600.service on elastic2090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:04:47] 10ops-eqiad, 06SRE, 10decommission-hardware: decommission wdqs1025.eqiad.wmnet - https://phabricator.wikimedia.org/T362122#9708246 (10VRiley-WMF) a:03VRiley-WMF [18:09:15] 10ops-eqiad, 06SRE, 10decommission-hardware: decommission wdqs1025.eqiad.wmnet - https://phabricator.wikimedia.org/T362122#9708248 (10VRiley-WMF) [18:10:18] 10ops-eqiad, 06SRE, 10decommission-hardware: 14decommission wdqs1025.eqiad.wmnet - 14https://phabricator.wikimedia.org/T362122#9708249 (10VRiley-WMF) 05Open→03Resolved 14Removed server and 
ran decommission script [18:10:40] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2010 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:20:40] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2010 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:26:54] (03PS11) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 [18:30:46] (03CR) 10CDobbins: [C:03+2] purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 (owner: 10CDobbins) [18:36:40] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2010 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:39:15] (JobrunnerPHPBusyWorkers) firing: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers [18:41:40] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2010 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:49:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T356166)', diff 
saved to https://phabricator.wikimedia.org/P60443 and previous config saved to /var/cache/conftool/dbconfig/20240411-184951-marostegui.json [18:49:53] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: 14Requesting access to analytics-privatedata-users for Steph Toyofuku - 14https://phabricator.wikimedia.org/T362113#9708314 (10SToyofuku-WMF) 14Thank you so much!!! [18:49:57] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [18:50:05] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [dns] - 10https://gerrit.wikimedia.org/r/1019086 (owner: 10Dzahn) [18:50:32] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [dns] - 10https://gerrit.wikimedia.org/r/1019087 (owner: 10Dzahn) [18:51:08] 06SRE, 06serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9708317 (10jijiki) [18:54:43] (03CR) 10Dwisehaupt: [V:03+1] "Yes, I have verified that 443 is not in use. I'm ok with doing two puppet runs in this case as we are still spinning up the service. I bel" [puppet] - 10https://gerrit.wikimedia.org/r/1018362 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [18:55:40] (03CR) 10Dwisehaupt: "Thanks. I think we should hold on this until the last step, once everything else is verified as working." 
[puppet] - 10https://gerrit.wikimedia.org/r/983951 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [19:03:57] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:04:48] \o [19:04:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P60445 and previous config saved to /var/cache/conftool/dbconfig/20240411-190459-marostegui.json [19:07:19] urandom: same as yesterday it seems... [19:07:33] what is http_jobrunner_ip4 precisely? [19:08:13] i.e. what is the relationship there? [19:08:17] 1m loadavg ~ 250 on 4 machines each with 24 physical cores ... apparently that [19:08:26] 's the tipping point or thereabouts [19:08:27] it's probably a large video being scaled [19:08:31] it's a blackbox probe job. prometheus<->blackbox exporter<->jobrunner host [19:08:51] think something like icinga check_http [19:08:57] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:09:00] since videoscaler and jobrunner share machines [19:09:03] urandom: jobrunners are just a special class of appservers, if that's what you're asking [19:10:14] then how does it relate/compare with http_vidoscaler_ip4? 
[19:10:58] urandom: https://config-master.wikimedia.org/pybal/eqiad/videoscaler https://config-master.wikimedia.org/pybal/eqiad/jobrunner [19:11:06] ^ same mw machines hosting both services [19:11:34] I'm articulating my question wrong, I think [19:11:46] they are separate services but run on the same backends [19:12:08] at one point we had different weight settings though [19:12:28] and some machines were made dedicated only-one-or-the-other, to prevent that [19:12:57] https://phabricator.wikimedia.org/T279100 [19:13:02] https://phabricator.wikimedia.org/T306860 [19:13:25] (03PS1) 10CDobbins: Revert "purged: add PKI cert handling" [puppet] - 10https://gerrit.wikimedia.org/r/1018977 [19:14:49] Ok, so near-term it would seem we need to bring the concurrency down again, yes? [19:14:58] yes [19:15:03] !incidents [19:15:03] 4584 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad) [19:15:23] I'll work on that... [19:15:27] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:15:31] urandom: another thing that has been done in the past is to have some jobrunners that aren't also videoscalers [19:15:56] i think that would be reasonable as well, and it would likely make the probedown for jobrunner stop [19:16:22] do the mw-on-k8s jobrunners use a different endpoint? [19:17:10] jobrunner.discovery.wmnet vs. mw-jobrunner.discovery.wmnet [19:17:55] actually, given that changeprop-jobqueue only uses the latter now, are there _any_ uses of jobrunner.discovery.wmnet?
(as in non-videoscaling hitting these machines) [19:17:56] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1882/console" [puppet] - 10https://gerrit.wikimedia.org/r/1018977 (owner: 10CDobbins) [19:18:05] swfrench-wmf: great question [19:18:13] mw1445.eqiad.wmnet: [apache2, nginx] # Only pooled as videoscaler [19:18:16] mw1446.eqiad.wmnet: [apache2, nginx] # Only pooled as videoscaler [19:18:19] (03CR) 10CDobbins: [V:03+1 C:03+2] Revert "purged: add PKI cert handling" [puppet] - 10https://gerrit.wikimedia.org/r/1018977 (owner: 10CDobbins) [19:18:28] ^ this is from conftool-data/node/eqiad.yaml [19:18:54] mutante: ok but the weights aren't set like that in etcd [19:19:17] was just about to say - the comments are a lie :) [19:19:29] it seems like serviceops decided to change that back [19:20:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P60446 and previous config saved to /var/cache/conftool/dbconfig/20240411-192006-marostegui.json [19:20:11] I just remember this from before k8s [19:20:27] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:21:08] swfrench-wmf: this is making me hopeful -- https://codesearch.wmcloud.org/search/?q=%5B%5E-%5Djobrunner%5C.discovery%5C.wmnet&files=&excludeFiles=&repos= [19:22:33] cdanis: so we could depool a node or two from videoscaler and hopefully restore some headroom for other things? [19:22:50] cdanis: that looks promising, yeah :) [19:22:56] so maybe we can just disable pages for the baremetal jobrunner service? 
[19:23:00] yeah I think that would be very reasonable urandom [19:23:10] i also think what taavi said is reasonable, although it's got more risk imo [19:23:27] cdanis: but we should lower concurrency, no? [19:23:50] (since no one has expressed certainty that the baremetal jobrunner is vestigial) [19:23:52] err...also [19:24:08] urandom: yes but also jobs will get retried, so that's not as critical, afaik [19:24:39] https://phabricator.wikimedia.org/T349796#9562813 <- "All (non-videoscaler) jobs migrated to Kubernetes jobrunners. " [19:24:55] ok [19:24:59] silence that probedown :D [19:25:29] win 14 [19:25:46] :) [19:25:48] The suggestion to disable paging could maybe be part of the next "alert review" (i think that's quarterly?) [19:26:01] mutante: i mean, the endpoint should just probably be removed entirely at some point [19:26:07] *nod* [19:26:09] wouldn't hurt to start with the prober definition though [19:27:34] did the videoscaler alert get silenced or something? that was alerting yesterday, was it not? [19:27:57] it would probably be worth still dropping the concurrency though, in addition (since it's questionable whether these will recover on their own) [19:29:47] (03PS1) 10Eevans: changeprop-jobqueue: temporarily reduce video transcode concurrency (again) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019106 [19:31:00] !incidents [19:31:01] 4585 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad) [19:31:01] 4584 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad) [19:32:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:32:34] (03CR) 10Scott French: [C:03+1] "LGTM. This gets us to roughly where we were right before dropping all the way to 1 / 1 yesterday."
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1019106 (owner: 10Eevans) [19:34:30] Is there a way to create an indefinite silence in the alertmanager interface? [19:35:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T356166)', diff saved to https://phabricator.wikimedia.org/P60447 and previous config saved to /var/cache/conftool/dbconfig/20240411-193514-marostegui.json [19:35:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1248.eqiad.wmnet with reason: Maintenance [19:35:27] (SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:35:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1248.eqiad.wmnet with reason: Maintenance [19:35:31] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [19:35:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T356166)', diff saved to https://phabricator.wikimedia.org/P60448 and previous config saved to /var/cache/conftool/dbconfig/20240411-193537-marostegui.json [19:35:53] urandom: there is a way to do them via command line with this: https://wikitech.wikimedia.org/wiki/Alertmanager#Add_a_silence_via_CLI [19:36:19] I don't think that's the right approach. If it's indeed not worth paging on, then we should disable its paging. 
[19:37:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:38:40] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2010 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:39:02] (03CR) 10Eevans: [C:03+2] changeprop-jobqueue: temporarily reduce video transcode concurrency (again) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019106 (owner: 10Eevans) [19:40:50] !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [19:41:28] !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [19:42:28] (03PS1) 10Andrea Denisse: Revert "ssl: Delete dummy TLS key for the Prometheus hosts" [labs/private] - 10https://gerrit.wikimedia.org/r/1018978 [19:42:44] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] Revert "ssl: Delete dummy TLS key for the Prometheus hosts" [labs/private] - 10https://gerrit.wikimedia.org/r/1018978 (owner: 10Andrea Denisse) [19:43:00] (03PS30) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [19:43:40] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2010 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:43:54] (03PS1) 10CDanis: homedir prompt update [puppet] - 10https://gerrit.wikimedia.org/r/1019108 [19:44:09] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater 
[alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [19:44:11] (03CR) 10CDanis: [C:03+2] homedir prompt update [puppet] - 10https://gerrit.wikimedia.org/r/1019108 (owner: 10CDanis) [19:45:57] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:46:18] (03PS1) 10Dzahn: cloud/devtools: switch default puppetmaster from 1001 to 1003 [puppet] - 10https://gerrit.wikimedia.org/r/1019109 (https://phabricator.wikimedia.org/T360470) [19:46:20] (03PS2) 10David Martin: ext-EventLogging: Add mediawiki.product_metrics.wikifunctions_ui to $wgEventLoggingStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018317 (owner: 10Phuedx) [19:46:33] (03PS1) 10Cwhite: service catalog: disable paging on jobrunner and videoscaler services [puppet] - 10https://gerrit.wikimedia.org/r/1018420 (https://phabricator.wikimedia.org/T349796) [19:47:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:48:52] (03CR) 10Gergő Tisza: [C:04-1] logging: default to log any error (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [19:49:00] if it's not paging and not emailing or creating tickets, does it have value to monitor at all [19:49:43] (03PS1) 10JHathaway: vrts: create a profile for alias generation [puppet] - 10https://gerrit.wikimedia.org/r/1019110 (https://phabricator.wikimedia.org/T325395) [19:49:44] (03PS1) 10JHathaway: otrs.conf: rename and tighten up epp type definitions [puppet] - 
10https://gerrit.wikimedia.org/r/1019111 (https://phabricator.wikimedia.org/T325395) [19:50:17] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1019110 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [19:50:43] (03PS6) 10Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) [19:50:44] (03CR) 10Andrea Denisse: "PCC results now show the certs are generated by CFSSL: https://puppet-compiler.wmflabs.org/output/1018749/1883/" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [19:50:57] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:51:02] (03CR) 10Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/1018420/1884/" [puppet] - 10https://gerrit.wikimedia.org/r/1018420 (https://phabricator.wikimedia.org/T349796) (owner: 10Cwhite) [19:51:24] (03CR) 10CDanis: [C:03+1] service catalog: disable paging on jobrunner and videoscaler services [puppet] - 10https://gerrit.wikimedia.org/r/1018420 (https://phabricator.wikimedia.org/T349796) (owner: 10Cwhite) [19:51:26] cwhite: are we disabling paging for jobrunner *and* videoscaler? [19:52:14] I'd understood the former to be noise, because it was collateral damage, and strictly in service to, the latter [19:52:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:52:38] urandom: I offer that up for discussion. 
The two are both paging simultaneously. [19:52:58] Just disabling the jobrunner one will still render videoscaler pages. [19:53:19] (03CR) 10CI reject: [V:04-1] vrts: create a profile for alias generation [puppet] - 10https://gerrit.wikimedia.org/r/1019110 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [19:54:07] I haven't seen any videoscaler pages since https://portal.victorops.com/ui/wikimedia/incident/4578/details (yesterday) [19:54:28] (03CR) 10Dzahn: [C:03+2] cloud/devtools: switch default puppetmaster from 1001 to 1003 [puppet] - 10https://gerrit.wikimedia.org/r/1019109 (https://phabricator.wikimedia.org/T360470) (owner: 10Dzahn) [19:54:43] would be worth silencing for now and giving h.nowlan a chance to review https://gerrit.wikimedia.org/r/1018420 before merging? [19:54:49] though they were precipitated by/accompanied with plenty of these http_jobrunner_ip4 pages [19:55:19] basically, while the right call right now is to silence, I would say that we also took an action (attempt to mitigate by dropping concurrency) [19:55:34] duration=7d ? [19:56:03] urandom: Pretty sure that's what the `firing (2)` means. 
It means 2xProbeDown alerts are firing, but only one description gets added to the IRC message [19:56:42] cwhite: oh, right you are [19:56:51] c.f.: https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fservice&var-module=All&orgId=1 [19:56:59] whether that action is required in order for the videoscalers to come good is a question I don't have the expertise to answer (but judging by h.nowlan's actions in the first part of 4/10, I suspect so) [19:58:15] (03PS2) 10JHathaway: vrts: create a profile for alias generation [puppet] - 10https://gerrit.wikimedia.org/r/1019110 (https://phabricator.wikimedia.org/T325395) [19:58:15] (03PS2) 10JHathaway: otrs.conf: rename and tighten up epp type definitions [puppet] - 10https://gerrit.wikimedia.org/r/1019111 (https://phabricator.wikimedia.org/T325395) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T2000). [20:00:05] dmartin-WMF: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:35] Hello folks, I'm here [20:00:47] I can deploy today [20:00:50] How are you? [20:01:01] Great. I am fine very fine. How are you Martin? [20:01:33] Good as well! [20:01:58] (03CR) 10Urbanecm: [C:03+2] ext-EventLogging: Add mediawiki.product_metrics.wikifunctions_ui to $wgEventLoggingStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018317 (owner: 10Phuedx) [20:02:03] What's your location, if I may ask. I'm in the Bay area, Northern California [20:02:26] Prague, Czech Republic. Quite far from the Bay area :) [20:02:40] Wow. I visited Prague once briefly. 
I loved it [20:02:46] (03Merged) 10jenkins-bot: ext-EventLogging: Add mediawiki.product_metrics.wikifunctions_ui to $wgEventLoggingStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018317 (owner: 10Phuedx) [20:02:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018317 (owner: 10Phuedx) [20:03:04] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1018317|ext-EventLogging: Add mediawiki.product_metrics.wikifunctions_ui to $wgEventLoggingStreamNames]] [20:03:07] location: Cross Club [20:03:43] (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [20:03:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [20:03:51] (03CR) 10Scott French: "Thanks, Cole. Would it be ok to silence for now, and wait a review cycle to get Hnowlan's thoughts on this too?" [puppet] - 10https://gerrit.wikimedia.org/r/1018420 (https://phabricator.wikimedia.org/T349796) (owner: 10Cwhite) [20:03:57] awesome. [20:04:09] not the best message to see during a window [20:05:33] !log urbanecm@deploy1002 urbanecm and phuedx: Backport for [[gerrit:1018317|ext-EventLogging: Add mediawiki.product_metrics.wikifunctions_ui to $wgEventLoggingStreamNames]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:05:52] Okay I should do my verification step now right? 
Give me a minute please [20:06:20] urbanecm: it looks like there was a spike of traffic right as the window started, that since stopped, I think the alerts will clear in a minute or two and it's fine to proceed [20:06:25] 10ops-eqiad, 06SRE, 10decommission-hardware: 14decommission dumpsdata1002.eqiad.wmnet - 14https://phabricator.wikimedia.org/T362065#9708448 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF 14Unracked server and ran the script for decommission  [20:06:29] dmartin-WMF: indeed, please check. [20:06:47] cdanis: ack, thanks for the info. do you want me to wait for the clear just in case? [20:07:20] Okay, the verification for my patch is completed [20:07:23] urbanecm: you're good to go, the unavailable metrics track error rate, and it's returned to 0 https://grafana.wikimedia.org/d/O_OXJyTVk/home-w-wiki-status?orgId=1&refresh=5m&from=now-30m&to=now&viewPanel=8 [20:07:33] awesome, thanks cdanis [20:07:41] dmartin-WMF: i take it that the patch works as expected? :) [20:07:59] Yes it does. This is in reference to gerrit:1018317 [20:08:13] thanks! 
proceeding [20:08:14] !log urbanecm@deploy1002 urbanecm and phuedx: Continuing with sync [20:08:44] (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [20:08:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [20:09:25] (03PS1) 10JHathaway: postfix: prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1019115 (https://phabricator.wikimedia.org/T325395) [20:09:28] (03PS1) 10JHathaway: postfix: prometheus ops config [puppet] - 10https://gerrit.wikimedia.org/r/1019116 (https://phabricator.wikimedia.org/T325395) [20:09:44] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1019110 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [20:10:30] cwhite: what happens with `page: false` here? Does it still show in alertmanager? Do we still get alerts here in IRC (i.e. w/o the #p.age tag)? [20:11:27] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1019111 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [20:12:21] urandom: It still checks and fires alerts, but doesn't get sent to the pager anymore [20:13:11] and the #p.age tag gets dropped from the IRC message [20:13:18] (03CR) 10Eevans: [C:03+1] "I'd be fine with either approach." 
[puppet] - 10https://gerrit.wikimedia.org/r/1018420 (https://phabricator.wikimedia.org/T349796) (owner: 10Cwhite) [20:14:04] * cwhite submitted a silence on the jobrunner module in alertmanager that expires on Monday at 08:00Z [20:14:16] that works too [20:15:20] !incidents [20:15:20] 4588 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [20:15:21] 4587 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [20:15:21] 4586 (RESOLVED) ProbeDown sre (10.2.2.26 ip4 jobrunner:443 probes/service http_jobrunner_ip4 eqiad) [20:15:21] 4585 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad) [20:15:21] 4584 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad) [20:16:01] Since we reduced the concurrency, I've updated the doc status back to monitoring [20:17:11] it's worth watching for a while longer, but it doesn't seem like we moved the needle much [20:20:42] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1018317|ext-EventLogging: Add mediawiki.product_metrics.wikifunctions_ui to $wgEventLoggingStreamNames]] (duration: 17m 38s) [20:22:37] dmartin-WMF: should be deployed by now :) [20:24:01] urbanecm: That's great. Yes in fact I was able to verify on deployment just now. Thank you! 
[20:24:20] sounds good [20:24:25] I mean, verify on production [20:26:12] i understood :) [20:26:53] Right :) [20:32:33] (03PS1) 10Ahmon Dancy: values-traindev.yaml: Update train-dev repo URL in comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019122 [20:34:23] (ProbeDown) firing: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:36:06] (03CR) 10Brennen Bearnes: [C:03+1] cloud/devtools: switch default puppetmaster from 1001 to 1003 [puppet] - 10https://gerrit.wikimedia.org/r/1019109 (https://phabricator.wikimedia.org/T360470) (owner: 10Dzahn) [20:37:36] (03CR) 10Herron: "LGTM please see one minor comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [20:39:23] (ProbeDown) resolved: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:42:49] (03PS31) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [20:43:56] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [20:51:57] (03CR) 10Scott French: [C:03+2] values-traindev.yaml: Update train-dev repo URL in comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019122 (owner: 10Ahmon Dancy) [20:52:51] (03Merged) 10jenkins-bot: values-traindev.yaml: Update train-dev repo URL in comment [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1019122 (owner: 10Ahmon Dancy) [20:55:05] (03PS3) 10JHathaway: vrts: create a profile for alias generation [puppet] - 10https://gerrit.wikimedia.org/r/1019110 (https://phabricator.wikimedia.org/T325395) [20:55:05] (03PS3) 10JHathaway: otrs.conf: rename and tighten up epp type definitions [puppet] - 10https://gerrit.wikimedia.org/r/1019111 (https://phabricator.wikimedia.org/T325395) [20:55:41] (03PS32) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [20:56:21] (03CR) 10Dzahn: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [20:56:55] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [20:57:09] (03CR) 10Dzahn: "A fail isn't a NOOP though?" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [20:58:45] (03PS7) 10Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) [20:59:19] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1019111 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [20:59:46] (03CR) 10Herron: "> A fail isn't a NOOP though?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [21:00:22] (03CR) 10Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [21:01:06] (03CR) 10Dzahn: "Ack" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [21:03:13] (03CR) 10Herron: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [21:06:44] (03PS33) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [21:07:08] (03CR) 10Andrea Denisse: "Here are the PCC results with the latest patchset, certificates are now correctly generated by CFSSL. Keep in mind that the Hosts that hav" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [21:07:50] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [21:56:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [22:00:25] (SystemdUnitFailed) firing: prometheus-wmf-elasticsearch-exporter-9600.service on elastic2090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[22:01:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [22:01:16] Hey all! I would like to update a DB row in production for a record that got soft-deleted accidentally. Can I use mysql.php to do so, and are there specific precautions I should take? Context is T362365 [22:01:16] T362365: Event registration should not be disabled after marking the event page for translation - https://phabricator.wikimedia.org/T362365 [22:03:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T362366 (10phaultfinder) 03NEW [22:04:57] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:09:57] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:12:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [22:13:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 820.2ms - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:14:45] looks like we had a silence in place for the jobrunner, but not the videoscaler, so I went ahead and created a matching one [22:16:34] (03PS1) 10JHathaway: Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) [22:17:04] (03CR) 10CI reject: [V:04-1] Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [22:18:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 820.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:21:48] (03PS2) 10JHathaway: Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) [22:22:17] (03CR) 10CI reject: [V:04-1] Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [22:27:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [22:39:15] (JobrunnerPHPBusyWorkers) firing: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat -
https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers [22:55:38] hashar: jan_drewniak: hi, will you backport the mobilefrontend revert ? :) if we can land it and resume the train, that would be great. -- Sorry I just saw this message, thank you for taking care of that earlier today! [23:06:43] (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [23:06:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [23:11:20] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [23:11:43] (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [23:11:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - 
https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [23:17:04] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [23:25:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [23:35:27] (SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [23:38:13] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1018421 [23:38:13] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1018421 (owner: 10TrainBranchBot) [23:40:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [23:46:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - 
https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [23:53:28] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable