[00:00:18] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1018410 (owner: TrainBranchBot)
[00:02:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T360332)', diff saved to https://phabricator.wikimedia.org/P60319 and previous config saved to /var/cache/conftool/dbconfig/20240411-000211-arnaudb.json
[00:14:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P60320 and previous config saved to /var/cache/conftool/dbconfig/20240411-001458-marostegui.json
[00:17:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P60321 and previous config saved to /var/cache/conftool/dbconfig/20240411-001718-arnaudb.json
[00:30:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P60322 and previous config saved to /var/cache/conftool/dbconfig/20240411-003005-marostegui.json
[00:32:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P60323 and previous config saved to /var/cache/conftool/dbconfig/20240411-003226-arnaudb.json
[00:45:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T356166)', diff saved to https://phabricator.wikimedia.org/P60324 and previous config saved to /var/cache/conftool/dbconfig/20240411-004514-marostegui.json
[00:45:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1244.eqiad.wmnet with reason: Maintenance
[00:45:20] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[00:45:29] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1244.eqiad.wmnet with reason: Maintenance
[00:45:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1244 (T356166)', diff saved to https://phabricator.wikimedia.org/P60325 and previous config saved to /var/cache/conftool/dbconfig/20240411-004536-marostegui.json
[00:47:36] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T360332)', diff saved to https://phabricator.wikimedia.org/P60326 and previous config saved to /var/cache/conftool/dbconfig/20240411-004735-arnaudb.json
[00:47:38] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance
[00:47:41] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[00:47:51] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance
[00:47:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T360332)', diff saved to https://phabricator.wikimedia.org/P60327 and previous config saved to /var/cache/conftool/dbconfig/20240411-004758-arnaudb.json
[00:50:54] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T360332)', diff saved to https://phabricator.wikimedia.org/P60328 and previous config saved to /var/cache/conftool/dbconfig/20240411-005054-arnaudb.json
[01:06:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P60329 and previous config saved to /var/cache/conftool/dbconfig/20240411-010601-arnaudb.json
[01:21:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P60330 and previous config saved to /var/cache/conftool/dbconfig/20240411-012110-arnaudb.json
[01:36:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T360332)', diff saved to https://phabricator.wikimedia.org/P60331 and previous config saved to /var/cache/conftool/dbconfig/20240411-013618-arnaudb.json
[01:36:22] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance
[01:36:25] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[01:36:35] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance
[01:36:37] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[01:36:50] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[01:36:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T360332)', diff saved to https://phabricator.wikimedia.org/P60332 and previous config saved to /var/cache/conftool/dbconfig/20240411-013657-arnaudb.json
[01:38:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T360332)', diff saved to https://phabricator.wikimedia.org/P60333 and previous config saved to /var/cache/conftool/dbconfig/20240411-013848-arnaudb.json
[01:46:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T356166)', diff saved to https://phabricator.wikimedia.org/P60334 and previous config saved to /var/cache/conftool/dbconfig/20240411-014602-marostegui.json
[01:46:07] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[01:53:56] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P60335 and previous config saved to /var/cache/conftool/dbconfig/20240411-015355-arnaudb.json
[02:01:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P60336 and previous config saved to /var/cache/conftool/dbconfig/20240411-020110-marostegui.json
[02:09:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P60337 and previous config saved to /var/cache/conftool/dbconfig/20240411-020903-arnaudb.json
[02:16:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P60338 and previous config saved to /var/cache/conftool/dbconfig/20240411-021617-marostegui.json
[02:20:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:24:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T360332)', diff saved to https://phabricator.wikimedia.org/P60339 and previous config saved to /var/cache/conftool/dbconfig/20240411-022410-arnaudb.json
[02:24:13] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance
[02:24:23] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[02:24:26] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance
[02:24:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T360332)', diff saved to https://phabricator.wikimedia.org/P60340 and previous config saved to /var/cache/conftool/dbconfig/20240411-022433-arnaudb.json
[02:25:40] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:27:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T360332)', diff saved to https://phabricator.wikimedia.org/P60341 and previous config saved to /var/cache/conftool/dbconfig/20240411-022725-arnaudb.json
[02:31:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T356166)', diff saved to https://phabricator.wikimedia.org/P60342 and previous config saved to /var/cache/conftool/dbconfig/20240411-023125-marostegui.json
[02:31:27] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[02:31:29] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[02:31:40] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[02:38:28] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:42:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P60343 and previous config saved to /var/cache/conftool/dbconfig/20240411-024232-arnaudb.json
[02:57:40] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P60344 and previous config saved to /var/cache/conftool/dbconfig/20240411-025740-arnaudb.json
[03:12:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T360332)', diff saved to https://phabricator.wikimedia.org/P60345 and previous config saved to /var/cache/conftool/dbconfig/20240411-031247-arnaudb.json
[03:12:50] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance
[03:12:53] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[03:13:04] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance
[03:13:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T360332)', diff saved to https://phabricator.wikimedia.org/P60346 and previous config saved to /var/cache/conftool/dbconfig/20240411-031310-arnaudb.json
[03:16:02] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T360332)', diff saved to https://phabricator.wikimedia.org/P60347 and previous config saved to /var/cache/conftool/dbconfig/20240411-031602-arnaudb.json
[03:23:28] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:31:10] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P60348 and previous config saved to /var/cache/conftool/dbconfig/20240411-033109-arnaudb.json
[03:35:27] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:46:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P60349 and previous config saved to /var/cache/conftool/dbconfig/20240411-034617-arnaudb.json
[04:01:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T360332)', diff saved to https://phabricator.wikimedia.org/P60350 and previous config saved to /var/cache/conftool/dbconfig/20240411-040124-arnaudb.json
[04:01:27] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2188.codfw.wmnet with reason: Maintenance
[04:01:34] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[04:01:40] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2188.codfw.wmnet with reason: Maintenance
[04:01:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T360332)', diff saved to https://phabricator.wikimedia.org/P60351 and previous config saved to /var/cache/conftool/dbconfig/20240411-040147-arnaudb.json
[04:04:05] <wikibugs>	 (PS5) Hashar: logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838)
[04:04:30] <hashar>	 :old-man-yells-at-gerrit:
[04:04:42] <wikibugs>	 (PS6) Hashar: logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838)
[04:04:47] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T360332)', diff saved to https://phabricator.wikimedia.org/P60352 and previous config saved to /var/cache/conftool/dbconfig/20240411-040447-arnaudb.json
[04:04:48] <wikibugs>	 (CR) CI reject: [V:-1] logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar)
[04:05:28] <wikibugs>	 (CR) CI reject: [V:-1] logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar)
[04:06:27] <wikibugs>	 (PS7) Hashar: logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838)
[04:09:53] <wikibugs>	 (PS8) Hashar: logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838)
[04:14:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:19:40] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:19:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P60353 and previous config saved to /var/cache/conftool/dbconfig/20240411-041954-arnaudb.json
[04:34:35] <wikibugs>	 (PS9) Hashar: logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838)
[04:35:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P60354 and previous config saved to /var/cache/conftool/dbconfig/20240411-043502-arnaudb.json
[04:44:54] <wikibugs>	 (CR) Hashar: "+ Gergo who filed T228838  and Daniel who was hit by the issue yesterday and had to add a log channel explicitly ( I97714e296c025fa2accb04" [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar)
[04:50:12] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T360332)', diff saved to https://phabricator.wikimedia.org/P60355 and previous config saved to /var/cache/conftool/dbconfig/20240411-045011-arnaudb.json
[04:50:14] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2212.codfw.wmnet with reason: Maintenance
[04:50:17] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2212.codfw.wmnet with reason: Maintenance
[04:50:19] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[04:50:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2212 (T360332)', diff saved to https://phabricator.wikimedia.org/P60356 and previous config saved to /var/cache/conftool/dbconfig/20240411-045024-arnaudb.json
[04:53:18] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T360332)', diff saved to https://phabricator.wikimedia.org/P60357 and previous config saved to /var/cache/conftool/dbconfig/20240411-045317-arnaudb.json
[05:08:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P60358 and previous config saved to /var/cache/conftool/dbconfig/20240411-050825-arnaudb.json
[05:13:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1189', diff saved to https://phabricator.wikimedia.org/P60359 and previous config saved to /var/cache/conftool/dbconfig/20240411-051341-root.json
[05:14:06] <wikibugs>	 (PS1) Marostegui: db1189: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1018823
[05:15:00] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1189.eqiad.wmnet with OS bookworm
[05:17:57] <wikibugs>	 (CR) Marostegui: [C:+2] db1189: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1018823 (owner: Marostegui)
[05:18:25] <wikibugs>	 (CR) Marostegui: [C:+1] "Remember to drop those users with: drop user if exists 'USERNAME'@'IPS_REMOVED';" [puppet] - https://gerrit.wikimedia.org/r/1018407 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb)
[05:18:36] <wikibugs>	 (CR) Marostegui: [C:+1] mariadb: Reenable notifications for db2199, db2200 after setup [puppet] - https://gerrit.wikimedia.org/r/1018744 (https://phabricator.wikimedia.org/T355422) (owner: Jcrespo)
[05:23:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P60360 and previous config saved to /var/cache/conftool/dbconfig/20240411-052333-arnaudb.json
[05:27:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1189.eqiad.wmnet with reason: host reimage
[05:31:57] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1189.eqiad.wmnet with reason: host reimage
[05:34:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:37:42] <wikibugs>	 (PS1) Marostegui: Revert "db1189: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1018697
[05:38:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T360332)', diff saved to https://phabricator.wikimedia.org/P60361 and previous config saved to /var/cache/conftool/dbconfig/20240411-053840-arnaudb.json
[05:38:43] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2216.codfw.wmnet with reason: Maintenance
[05:38:46] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[05:38:56] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2216.codfw.wmnet with reason: Maintenance
[05:39:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T360332)', diff saved to https://phabricator.wikimedia.org/P60362 and previous config saved to /var/cache/conftool/dbconfig/20240411-053903-arnaudb.json
[05:39:40] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:42:05] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T360332)', diff saved to https://phabricator.wikimedia.org/P60363 and previous config saved to /var/cache/conftool/dbconfig/20240411-054205-arnaudb.json
[05:52:37] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1189.eqiad.wmnet with OS bookworm
[05:54:05] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1189: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1018697 (owner: Marostegui)
[05:54:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60364 and previous config saved to /var/cache/conftool/dbconfig/20240411-055428-root.json
[05:57:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P60365 and previous config saved to /var/cache/conftool/dbconfig/20240411-055712-arnaudb.json
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, Amir1, and arnaudb: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T0600).
[06:00:25] <jinxer-wm>	 (SystemdUnitFailed) firing: prometheus-wmf-elasticsearch-exporter-9600.service on elastic2090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:09:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60366 and previous config saved to /var/cache/conftool/dbconfig/20240411-060934-root.json
[06:12:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P60367 and previous config saved to /var/cache/conftool/dbconfig/20240411-061220-arnaudb.json
[06:15:25] <jinxer-wm>	 (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:24:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60368 and previous config saved to /var/cache/conftool/dbconfig/20240411-062440-root.json
[06:27:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T360332)', diff saved to https://phabricator.wikimedia.org/P60369 and previous config saved to /var/cache/conftool/dbconfig/20240411-062728-arnaudb.json
[06:27:33] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[06:39:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60370 and previous config saved to /var/cache/conftool/dbconfig/20240411-063946-root.json
[06:54:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60371 and previous config saved to /var/cache/conftool/dbconfig/20240411-065452-root.json
[06:56:08] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts idp1002.wikimedia.org
[06:57:35] <logmsgbot>	 !log slyngshede@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts idp1002.wikimedia.org
[06:58:33] <wikibugs>	 (CR) Ayounsi: [C:+1] "Overall lgtm, one comment." [cookbooks] - https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) (owner: Volans)
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:16] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts idp1002.wikimedia.org
[07:05:01] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox
[07:05:21] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] titan: trim 5m retention to 3y + 2w [puppet] - https://gerrit.wikimedia.org/r/1018644 (https://phabricator.wikimedia.org/T351927) (owner: Filippo Giunchedi)
[07:08:10] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[07:09:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60372 and previous config saved to /var/cache/conftool/dbconfig/20240411-070958-root.json
[07:10:08] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[07:10:08] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:10:08] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp1002.wikimedia.org
[07:10:50] <wikibugs>	 (PS1) Slyngshede: R:idp decommision Bullseye IDP hosts. [puppet] - https://gerrit.wikimedia.org/r/1018871 (https://phabricator.wikimedia.org/T357748)
[07:13:08] <wikibugs>	 (PS1) Filippo Giunchedi: opensearch: switch dashboards to sso auth [puppet] - https://gerrit.wikimedia.org/r/1018872 (https://phabricator.wikimedia.org/T246998)
[07:13:32] <wikibugs>	 (PS1) Muehlenhoff: Blacklist n_gsm kernel module [puppet] - https://gerrit.wikimedia.org/r/1018873
[07:15:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:17:07] <wikibugs>	 (CR) Majavah: [C:+2] alertmanager: karma: Set group too [puppet] - https://gerrit.wikimedia.org/r/1018727 (owner: Majavah)
[07:18:42] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1867/co" [puppet] - https://gerrit.wikimedia.org/r/1018872 (https://phabricator.wikimedia.org/T246998) (owner: Filippo Giunchedi)
[07:25:04] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1018871 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[07:25:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60373 and previous config saved to /var/cache/conftool/dbconfig/20240411-072503-root.json
[07:25:14] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Blacklist n_gsm kernel module [puppet] - https://gerrit.wikimedia.org/r/1018873 (owner: Muehlenhoff)
[07:35:18] <wikibugs>	 (CR) DCausse: cirrus-streaming-updater: swith to "failure-rate" retry strategy (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1018778 (owner: DCausse)
[07:35:27] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[07:39:13] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:39:26] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:44:23] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3072.esams.wmnet
[07:44:54] <wikibugs>	 (CR) Fabfur: [C:+2] cp3072: update hieradata for dual NVMe disks configuration [puppet] - https://gerrit.wikimedia.org/r/1015974 (https://phabricator.wikimedia.org/T360430) (owner: Ssingh)
[07:47:21] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp3072.esams.wmnet with OS bullseye
[07:47:33] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9705875 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp3072.esams.wmnet with OS bullseye
[07:52:03] <hashar>	 the train is blocked on T362297
[07:52:04] <stashbot>	 T362297: [Bug] Mobile watchlist broken - https://phabricator.wikimedia.org/T362297
[07:52:17] <hashar>	 some obscure UI regression on the mobile watchlist 
[07:55:16] <wikibugs>	 SRE, Infrastructure-Foundations, Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9705894 (Aklapper) Open→Declined Declining request as the requester's account has been disabled.
[07:56:54] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts idp2002.wikimedia.org
[08:00:04] <jouncebot>	 hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T0800)
[08:01:45] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox
[08:03:42] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp2002.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[08:05:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1198', diff saved to https://phabricator.wikimedia.org/P60374 and previous config saved to /var/cache/conftool/dbconfig/20240411-080502-root.json
[08:05:07] <wikibugs>	 (PS1) Marostegui: db1198: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1018934
[08:05:46] <wikibugs>	 (CR) Marostegui: [C:+2] db1198: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1018934 (owner: Marostegui)
[08:06:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1198.eqiad.wmnet with OS bookworm
[08:06:18] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp2002.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[08:06:18] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:06:19] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp2002.wikimedia.org
[08:06:54] <wikibugs>	 (CR) Slyngshede: [C:+2] R:idp decommision Bullseye IDP hosts. [puppet] - https://gerrit.wikimedia.org/r/1018871 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[08:10:28] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3072.esams.wmnet with reason: host reimage
[08:13:37] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3072.esams.wmnet with reason: host reimage
[08:16:52] <wikibugs>	 (PS2) Volans: sre.hosts.decommission: ask on failure [cookbooks] - https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306)
[08:17:16] <wikibugs>	 (CR) Volans: "addressed comment" [cookbooks] - https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) (owner: Volans)
[08:19:02] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1198.eqiad.wmnet with reason: host reimage
[08:20:34] <wikibugs>	 (CR) Ayounsi: [C:+1] sre.hosts.decommission: ask on failure [cookbooks] - https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) (owner: Volans)
[08:20:41] <wikibugs>	 (CR) CI reject: [V:-1] sre.hosts.decommission: ask on failure [cookbooks] - https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) (owner: Volans)
[08:20:41] <hashar>	 !log MediaWiki train is blocked
[08:20:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:51] <wikibugs>	 (CR) Ayounsi: [C:+2] add_ip6_mapped - don't fail if the host already have a /128 address [puppet] - https://gerrit.wikimedia.org/r/1017047 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[08:22:03] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1198.eqiad.wmnet with reason: host reimage
[08:25:32] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet
[08:25:33] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[08:26:43] <logmsgbot>	 !log fabfur@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp3072.esams.wmnet with OS bullseye
[08:26:52] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9705962 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp3072.esams.wmnet with OS bullseye executed with errors: - cp3072 (...
[08:27:39] <wikibugs>	 (PS3) Volans: sre.hosts.decommission: ask on failure [cookbooks] - https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306)
[08:27:40] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - ayounsi@cumin1002"
[08:27:46] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host db1198.eqiad.wmnet with OS bookworm
[08:28:31] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - ayounsi@cumin1002"
[08:28:31] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:28:31] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors
[08:28:35] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors
[08:29:01] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - ayounsi@cumin1002"
[08:29:55] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - ayounsi@cumin1002"
[08:31:07] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS bookworm
[08:36:11] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1198.eqiad.wmnet with OS bookworm
[08:36:30] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:37:11] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1198.eqiad.wmnet with reason: host reimage
[08:40:06] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1198.eqiad.wmnet with reason: host reimage
[08:40:28] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp3072.esams.wmnet with OS bullseye
[08:40:42] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9705992 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp3072.esams.wmnet with OS bullseye
[08:42:08] <wikibugs>	 (CR) Slyngshede: [C:+2] Change ssh key validator from class to function. [software/bitu] - https://gerrit.wikimedia.org/r/1018635 (owner: Slyngshede)
[08:42:45] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage
[08:42:46] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:43:17] <wikibugs>	 (Merged) jenkins-bot: Change ssh key validator from class to function. [software/bitu] - https://gerrit.wikimedia.org/r/1018635 (owner: Slyngshede)
[08:45:11] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on matomo1003.eqiad.wmnet with reason: Adding disk
[08:45:25] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on matomo1003.eqiad.wmnet with reason: Adding disk
[08:45:36] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:45:47] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage
[08:47:33] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete grant [puppet] - https://gerrit.wikimedia.org/r/1018941 (https://phabricator.wikimedia.org/T357748)
[08:50:31] <logmsgbot>	 !log jelto@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:50:51] <wikibugs>	 (PS1) Marostegui: Revert "db1198: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1018704
[08:54:22] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2114 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1018411 (https://phabricator.wikimedia.org/T362302)
[08:55:33] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1198.eqiad.wmnet with OS bookworm
[08:57:26] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1198: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1018704 (owner: Marostegui)
[08:57:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60376 and previous config saved to /var/cache/conftool/dbconfig/20240411-085749-root.json
[08:58:26] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2006.codfw.wmnet with OS bookworm
[08:58:26] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2006.codfw.wmnet
[08:58:56] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s6 T362302
[08:59:00] <stashbot>	 T362302: Switchover s6 master (db2129 -> db2114) - https://phabricator.wikimedia.org/T362302
[08:59:19] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s6 T362302
[08:59:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2114 with weight 0 T362302', diff saved to https://phabricator.wikimedia.org/P60377 and previous config saved to /var/cache/conftool/dbconfig/20240411-085926-arnaudb.json
[09:03:09] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3072.esams.wmnet with reason: host reimage
[09:06:37] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3072.esams.wmnet with reason: host reimage
[09:10:31] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Site: eqiad 1 VM for Matomo - https://phabricator.wikimedia.org/T362146#9706068 (BTullis) I'm adding the second disk now. ` btullis@ganeti1027:~$ sudo gnt-instance modify --...
[09:12:15] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[09:12:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60378 and previous config saved to /var/cache/conftool/dbconfig/20240411-091255-root.json
[09:12:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[09:15:51] <wikibugs>	 (PS4) Gmodena: analytics: refinery: add webrequest_frontend timer [puppet] - https://gerrit.wikimedia.org/r/1017041 (https://phabricator.wikimedia.org/T314956)
[09:16:56] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host testvm2007.codfw.wmnet
[09:16:57] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[09:18:26] <wikibugs>	 (CR) Arnaudb: [C:+2] mariadb: Promote db2114 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1018411 (https://phabricator.wikimedia.org/T362302) (owner: Gerrit maintenance bot)
[09:19:45] <arnaudb>	 !log Starting s6 codfw failover from db2129 to db2114 - T362302
[09:19:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:55] <stashbot>	 T362302: Switchover s6 master (db2129 -> db2114) - https://phabricator.wikimedia.org/T362302
[09:20:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2114 to s6 primary T362302', diff saved to https://phabricator.wikimedia.org/P60379 and previous config saved to /var/cache/conftool/dbconfig/20240411-092012-arnaudb.json
[09:20:24] <wikibugs>	 (PS1) Jelto: gitlab_runner: add dockerfile support for test runner in WMCS [puppet] - https://gerrit.wikimedia.org/r/1018944 (https://phabricator.wikimedia.org/T357612)
[09:20:26] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2007.codfw.wmnet - ayounsi@cumin1002"
[09:23:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 weight bump T362302', diff saved to https://phabricator.wikimedia.org/P60380 and previous config saved to /var/cache/conftool/dbconfig/20240411-092318-arnaudb.json
[09:24:02] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2007.codfw.wmnet - ayounsi@cumin1002"
[09:24:02] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:24:02] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache testvm2007.codfw.wmnet on all recursors
[09:24:05] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2007.codfw.wmnet on all recursors
[09:24:31] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2007.codfw.wmnet - ayounsi@cumin1002"
[09:25:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 depool', diff saved to https://phabricator.wikimedia.org/P60381 and previous config saved to /var/cache/conftool/dbconfig/20240411-092501-arnaudb.json
[09:25:23] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2007.codfw.wmnet - ayounsi@cumin1002"
[09:25:34] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2007.codfw.wmnet with OS bookworm
[09:26:06] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[09:26:08] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[09:26:10] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[09:26:15] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[09:26:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T360332)', diff saved to https://phabricator.wikimedia.org/P60382 and previous config saved to /var/cache/conftool/dbconfig/20240411-092622-arnaudb.json
[09:26:27] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[09:27:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl restore of MediaWiki config (dc=all) from /var/cache/conftool/dbconfig/20240411-092622-arnaudb.json
[09:27:58] <wikibugs>	 (CR) Jelto: [C:+2] gitlab_runner: add dockerfile support for test runner in WMCS [puppet] - https://gerrit.wikimedia.org/r/1018944 (https://phabricator.wikimedia.org/T357612) (owner: Jelto)
[09:28:10] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] kubernetes: Move 7 codfw appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1018719 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert)
[09:28:34] <wikibugs>	 (CR) Clément Goubert: [C:+2] kubernetes: Move 7 codfw appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1018719 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert)
[09:29:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60383 and previous config saved to /var/cache/conftool/dbconfig/20240411-092942-root.json
[09:30:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:31:01] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance
[09:31:03] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance
[09:31:06] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: Add CDN config for toolsadmin-toolsbeta.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1018665 (https://phabricator.wikimedia.org/T360025) (owner: Majavah)
[09:32:08] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3072.esams.wmnet with OS bullseye
[09:32:21] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9706148 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp3072.esams.wmnet with OS bullseye completed: - cp3072 (**WARN**)...
[09:32:24] <wikibugs>	 (CR) Daniel Kinzler: [C:+1] "I love this change. I have fallen into this trap too many times!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar)
[09:34:58] <wikibugs>	 SRE, Maps, serviceops: Move maps/karthoterian to PKI/cfssl - https://phabricator.wikimedia.org/T360778#9706160 (jijiki)
[09:35:00] <wikibugs>	 SRE, serviceops, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9706161 (jijiki)
[09:35:08] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9706162 (jijiki)
[09:35:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: post schema update', diff saved to https://phabricator.wikimedia.org/P60384 and previous config saved to /var/cache/conftool/dbconfig/20240411-093513-arnaudb.json
[09:35:40] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:36:30] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9706184 (jijiki)
[09:37:07] <wikibugs>	 SRE, serviceops, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9706185 (jijiki)
[09:37:16] <wikibugs>	 SRE, Maps, serviceops: Move maps/karthoterian to PKI/cfssl - https://phabricator.wikimedia.org/T360778#9706189 (jijiki) a:jijiki
[09:37:59] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3072.esams.wmnet
[09:38:00] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2007.codfw.wmnet with reason: host reimage
[09:38:07] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2412.codfw.wmnet with OS bullseye
[09:38:31] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2413.codfw.wmnet with OS bullseye
[09:38:45] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9706201 (Fabfur)
[09:38:58] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2414.codfw.wmnet with OS bullseye
[09:39:30] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2415.codfw.wmnet with OS bullseye
[09:39:55] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2416.codfw.wmnet with OS bullseye
[09:40:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2417.codfw.wmnet with OS bullseye
[09:40:39] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2007.codfw.wmnet with reason: host reimage
[09:40:50] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2418.codfw.wmnet with OS bullseye
[09:42:03] <wikibugs>	 (PS1) Fabfur: Revert "benthos: temporary disable haproxy metrics" [puppet] - https://gerrit.wikimedia.org/r/1018705
[09:44:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60386 and previous config saved to /var/cache/conftool/dbconfig/20240411-094448-root.json
[09:47:35] <wikibugs>	 (CR) Jcrespo: [C:+2] mariadb: Reenable notifications for db2199, db2200 after setup [puppet] - https://gerrit.wikimedia.org/r/1018744 (https://phabricator.wikimedia.org/T355422) (owner: Jcrespo)
[09:47:44] <wikibugs>	 (PS2) Jcrespo: mariadb: Reenable notifications for db2199, db2200 after setup [puppet] - https://gerrit.wikimedia.org/r/1018744 (https://phabricator.wikimedia.org/T355422)
[09:47:59] <wikibugs>	 (PS2) Esanders: Set wgMFFallbackEditor to visual for most VE wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1015086 (https://phabricator.wikimedia.org/T361134)
[09:50:20] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 50%: post schema update', diff saved to https://phabricator.wikimedia.org/P60387 and previous config saved to /var/cache/conftool/dbconfig/20240411-095019-arnaudb.json
[09:51:12] <wikibugs>	 (CR) Jcrespo: [V:+2 C:+2] mariadb: Reenable notifications for db2199, db2200 after setup [puppet] - https://gerrit.wikimedia.org/r/1018744 (https://phabricator.wikimedia.org/T355422) (owner: Jcrespo)
[09:51:50] <wikibugs>	 (PS1) Btullis: Use a more WMF standard mariadb configuration for Matomo [puppet] - https://gerrit.wikimedia.org/r/1018948 (https://phabricator.wikimedia.org/T349397)
[09:53:25] <jinxer-wm>	 (SystemdUnitFailed) firing: ferm.service on mw2320:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:53:25] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1868/co" [puppet] - https://gerrit.wikimedia.org/r/1018948 (https://phabricator.wikimedia.org/T349397) (owner: Btullis)
[09:54:24] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2414.codfw.wmnet with reason: host reimage
[09:54:37] <wikibugs>	 (CR) Hashar: logging: default to log any error (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar)
[09:55:05] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2007.codfw.wmnet with OS bookworm
[09:55:05] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2007.codfw.wmnet
[09:55:11] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2412.codfw.wmnet with reason: host reimage
[09:55:14] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2413.codfw.wmnet with reason: host reimage
[09:56:04] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2415.codfw.wmnet with reason: host reimage
[09:56:08] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "testvm2007 - ayounsi@cumin1002"
[09:56:37] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2416.codfw.wmnet with reason: host reimage
[09:57:11] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2417.codfw.wmnet with reason: host reimage
[09:57:12] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "testvm2007 - ayounsi@cumin1002"
[09:57:14] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2414.codfw.wmnet with reason: host reimage
[09:57:24] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2418.codfw.wmnet with reason: host reimage
[09:57:44] <wikibugs>	 (CR) Fabfur: [C:+2] Revert "benthos: temporary disable haproxy metrics" [puppet] - https://gerrit.wikimedia.org/r/1018705 (owner: Fabfur)
[09:59:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60388 and previous config saved to /var/cache/conftool/dbconfig/20240411-095954-root.json
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1000)
[10:00:08] <wikibugs>	 (PS1) JMeybohm: admin_ng: Refactor fetching pspClusterRole for namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1018950 (https://phabricator.wikimedia.org/T273507)
[10:00:09] <wikibugs>	 (PS1) JMeybohm: admin_ng: Stop adding kubernetes.io/metadata.name namespace label [deployment-charts] - https://gerrit.wikimedia.org/r/1018951 (https://phabricator.wikimedia.org/T273507)
[10:00:11] <wikibugs>	 (PS1) JMeybohm: admin_ng: Enable restriced PSS profile in audit mode in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1018952 (https://phabricator.wikimedia.org/T273507)
[10:00:17] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2417.codfw.wmnet with reason: host reimage
[10:00:25] <jinxer-wm>	 (SystemdUnitFailed) firing: prometheus-wmf-elasticsearch-exporter-9600.service on elastic2090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:03:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: ferm.service on mw2320:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:03:40] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2416.codfw.wmnet with reason: host reimage
[10:05:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: post schema update', diff saved to https://phabricator.wikimedia.org/P60389 and previous config saved to /var/cache/conftool/dbconfig/20240411-100525-arnaudb.json
[10:06:36] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2415.codfw.wmnet with reason: host reimage
[10:09:43] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2412.codfw.wmnet with reason: host reimage
[10:13:16] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2418.codfw.wmnet with reason: host reimage
[10:15:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60390 and previous config saved to /var/cache/conftool/dbconfig/20240411-101500-root.json
[10:15:17] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2414.codfw.wmnet with OS bullseye
[10:15:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:15:50] <wikibugs>	 (CR) Btullis: [C:+1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/1018309 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[10:16:12] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] Use a more WMF standard mariadb configuration for Matomo [puppet] - https://gerrit.wikimedia.org/r/1018948 (https://phabricator.wikimedia.org/T349397) (owner: Btullis)
[10:17:13] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2413.codfw.wmnet with reason: host reimage
[10:19:20] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2417.codfw.wmnet with OS bullseye
[10:20:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: post schema update', diff saved to https://phabricator.wikimedia.org/P60391 and previous config saved to /var/cache/conftool/dbconfig/20240411-102031-arnaudb.json
[10:20:40] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:21:31] <wikibugs>	 (CR) Arnaudb: [C:+1] Remove obsolete grant [puppet] - https://gerrit.wikimedia.org/r/1018941 (https://phabricator.wikimedia.org/T357748) (owner: Muehlenhoff)
[10:21:33] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1247.eqiad.wmnet with reason: Maintenance
[10:21:46] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1247.eqiad.wmnet with reason: Maintenance
[10:21:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T356166)', diff saved to https://phabricator.wikimedia.org/P60392 and previous config saved to /var/cache/conftool/dbconfig/20240411-102153-marostegui.json
[10:22:01] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[10:22:22] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2416.codfw.wmnet with OS bullseye
[10:25:37] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2415.codfw.wmnet with OS bullseye
[10:27:53] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:28:48] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2412.codfw.wmnet with OS bullseye
[10:30:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60393 and previous config saved to /var/cache/conftool/dbconfig/20240411-103005-root.json
[10:30:30] <moritzm>	 !log installing xerces-c security updates
[10:30:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:53] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2418.codfw.wmnet with OS bullseye
[10:32:53] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:36:12] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2413.codfw.wmnet with OS bullseye
[10:36:37] <wikibugs>	 SRE, Machine-Learning-Team, MW-on-K8s, serviceops: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316 (Clement_Goubert) NEW
[10:36:51] <wikibugs>	 SRE, Machine-Learning-Team, MW-on-K8s, serviceops: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9706397 (Clement_Goubert) p:Triage→Medium
[10:37:25] <claime>	 !log Running homer 'cr*codfw*' commit 'T351074'
[10:37:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:32] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[10:43:03] <moritzm>	 !log installing modsecurity-apache security updates
[10:43:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:37] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Steph Toyofuku - https://phabricator.wikimedia.org/T362113#9706439 (Milimetric) Approved
[10:45:21] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to shell access to analytics client servers for AndyRussG - https://phabricator.wikimedia.org/T361742#9706455 (Milimetric) Approved, welcome back Andy :)
[10:48:52] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add stoyofuku to analytics-privatedata-access [puppet] - https://gerrit.wikimedia.org/r/1018634 (https://phabricator.wikimedia.org/T362113) (owner: Muehlenhoff)
[10:52:52] <claime>	 !log Pooling and uncordoning mw2412.codfw.wmnet,mw2413.codfw.wmnet,mw2414.codfw.wmnet,mw2415.codfw.wmnet,mw2416.codfw.wmnet,mw2417.codfw.wmnet,mw2418.codfw.wmnet - T351074
[10:52:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:57] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[10:53:02] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw2412.codfw.wmnet|mw2413.codfw.wmnet|mw2414.codfw.wmnet|mw2415.codfw.wmnet|mw2416.codfw.wmnet|mw2417.codfw.wmnet|mw2418.codfw.wmnet),cluster=kubernetes,service=kubesvc
[10:53:49] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Steph Toyofuku - https://phabricator.wikimedia.org/T362113#9706516 (MoritzMuehlenhoff) Open→Resolved @SToyofuku-WMF : I've enabled your access. You should already be able to log into st...
[10:54:46] <wikibugs>	 (PS3) Muehlenhoff: Add andyrussg to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1018641 (https://phabricator.wikimedia.org/T361742)
[10:59:50] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add andyrussg to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1018641 (https://phabricator.wikimedia.org/T361742) (owner: Muehlenhoff)
[11:05:25] <wikibugs>	 (PS1) Btullis: Add a partman recipe for the new matomo server [puppet] - https://gerrit.wikimedia.org/r/1018955 (https://phabricator.wikimedia.org/T349397)
[11:09:32] <wikibugs>	 (PS1) Marostegui: db2177: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1018956
[11:09:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2177', diff saved to https://phabricator.wikimedia.org/P60394 and previous config saved to /var/cache/conftool/dbconfig/20240411-110938-root.json
[11:09:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:09:51] <wikibugs>	 (CR) Btullis: [C:+2] Add a partman recipe for the new matomo server [puppet] - https://gerrit.wikimedia.org/r/1018955 (https://phabricator.wikimedia.org/T349397) (owner: Btullis)
[11:10:15] <wikibugs>	 (CR) Marostegui: [C:+2] db2177: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1018956 (owner: Marostegui)
[11:10:57] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2177.codfw.wmnet with OS bookworm
[11:11:02] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to shell access to analytics client servers for AndyRussG - https://phabricator.wikimedia.org/T361742#9706554 (MoritzMuehlenhoff) Open→Resolved @AndyRussG: I've enabled your access. You should already be able to log into...
[11:11:57] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9706562 (MoritzMuehlenhoff)
[11:14:40] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:21:20] <wikibugs>	 (PS1) Marostegui: Revert "db2177: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1018966
[11:22:02] <effie>	 !log upload memkeys  20181031-2-s1 to bookworm-wikimedia main
[11:22:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:21] <effie>	 !log upload memkeys  20181031-2-s1 to bookworm-wikimedia main - T362160
[11:22:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:33] <stashbot>	 T362160: Repackage memkeys for debian bookworm - https://phabricator.wikimedia.org/T362160
[11:24:41] <effie>	 !log upload prometheus-memcached-exporter 0.14.2-1~wmf1 to bookworm-wikimedia main - T350807
[11:24:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:55] <stashbot>	 T350807: Package latest version of prometheus-memcached-exporter (v0.14.2) - https://phabricator.wikimedia.org/T350807
[11:26:18] <wikibugs>	 (PS1) Clément Goubert: article-description: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018959 (https://phabricator.wikimedia.org/T362316)
[11:26:20] <wikibugs>	 (PS1) Clément Goubert: article-description: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018960 (https://phabricator.wikimedia.org/T362316)
[11:27:00] <wikibugs>	 (PS1) Clément Goubert: articletopic-outlink: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018961 (https://phabricator.wikimedia.org/T362316)
[11:27:01] <wikibugs>	 (PS1) Clément Goubert: articletopic-outlink: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018962 (https://phabricator.wikimedia.org/T362316)
[11:27:30] <wikibugs>	 (PS1) Clément Goubert: experimental: Switch to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018963 (https://phabricator.wikimedia.org/T362316)
[11:27:58] <wikibugs>	 (PS1) Clément Goubert: readability: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018964 (https://phabricator.wikimedia.org/T362316)
[11:27:59] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2177.codfw.wmnet with reason: host reimage
[11:28:00] <wikibugs>	 (PS1) Clément Goubert: readability: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018965 (https://phabricator.wikimedia.org/T362316)
[11:28:39] <wikibugs>	 (PS1) Clément Goubert: revertrisk: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018986 (https://phabricator.wikimedia.org/T362316)
[11:28:40] <wikibugs>	 (PS1) Clément Goubert: revertrisk: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018987 (https://phabricator.wikimedia.org/T362316)
[11:29:12] <wikibugs>	 (PS1) Clément Goubert: revscoring-articlequality: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018988 (https://phabricator.wikimedia.org/T362316)
[11:29:13] <wikibugs>	 (PS1) Clément Goubert: revscoring-articlequality: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018989 (https://phabricator.wikimedia.org/T362316)
[11:29:39] <wikibugs>	 (PS1) Clément Goubert: revscoring-articletopic: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018990 (https://phabricator.wikimedia.org/T362316)
[11:29:40] <wikibugs>	 (PS1) Clément Goubert: revscoring-articletopic: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018991 (https://phabricator.wikimedia.org/T362316)
[11:30:21] <wikibugs>	 (PS1) Clément Goubert: revscoring-draftquality: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018992 (https://phabricator.wikimedia.org/T362316)
[11:30:24] <wikibugs>	 (PS1) Clément Goubert: revscoring-draftquality: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018993 (https://phabricator.wikimedia.org/T362316)
[11:30:28] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9706629 (MoritzMuehlenhoff)
[11:30:59] <wikibugs>	 (PS1) Clément Goubert: revscoring-drafttopic: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018994 (https://phabricator.wikimedia.org/T362316)
[11:31:01] <wikibugs>	 (PS1) Clément Goubert: revscoring-drafttopic: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018995 (https://phabricator.wikimedia.org/T362316)
[11:31:36] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2177.codfw.wmnet with reason: host reimage
[11:31:37] <wikibugs>	 (PS1) Clément Goubert: revscoring-editquality-damaging: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018996 (https://phabricator.wikimedia.org/T362316)
[11:31:39] <wikibugs>	 (PS1) Clément Goubert: revscoring-editquality-damaging: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018997 (https://phabricator.wikimedia.org/T362316)
[11:31:40] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm, diff also looks fine (noop)" [deployment-charts] - https://gerrit.wikimedia.org/r/1018950 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[11:31:47] <moritzm>	 !log installing postgresql-15 security updates
[11:31:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:15] <wikibugs>	 (PS1) Clément Goubert: revscoring-editquality-goodfaith: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018998 (https://phabricator.wikimedia.org/T362316)
[11:32:16] <wikibugs>	 (PS1) Clément Goubert: revscoring-editquality-goodfaith: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018999 (https://phabricator.wikimedia.org/T362316)
[11:32:49] <wikibugs>	 (PS1) Clément Goubert: revscoring-editquality-reverted: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1019000 (https://phabricator.wikimedia.org/T362316)
[11:32:50] <wikibugs>	 (PS1) Clément Goubert: revscoring-editquality-reverted: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1019001 (https://phabricator.wikimedia.org/T362316)
[11:33:40] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bullseye
[11:33:50] <wikibugs>	 (PS1) Ayounsi: Add public Ganeti IP ranges [homer/public] - https://gerrit.wikimedia.org/r/1019002 (https://phabricator.wikimedia.org/T300152)
[11:34:54] <wikibugs>	 (PS1) Muehlenhoff: Add library hint for psql 15 [puppet] - https://gerrit.wikimedia.org/r/1019003
[11:35:06] <wikibugs>	 (PS2) Slyngshede: IP blocking [software/bitu] - https://gerrit.wikimedia.org/r/1018256 (https://phabricator.wikimedia.org/T361066)
[11:35:27] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:36:25] <wikibugs>	 (PS1) Ayounsi: Add public testvm200x support [puppet] - https://gerrit.wikimedia.org/r/1019005 (https://phabricator.wikimedia.org/T300152)
[11:36:51] <wikibugs>	 (PS1) Effie Mouzeli: Repo has been migrated to Gitlab [debs/memkeys] - https://gerrit.wikimedia.org/r/1019006
[11:37:05] <wikibugs>	 (PS2) Ayounsi: Add public Ganeti IP ranges [homer/public] - https://gerrit.wikimedia.org/r/1019002 (https://phabricator.wikimedia.org/T300152)
[11:37:10] <wikibugs>	 (CR) Effie Mouzeli: [V:+2 C:+2] Repo has been migrated to Gitlab [debs/memkeys] - https://gerrit.wikimedia.org/r/1019006 (owner: Effie Mouzeli)
[11:37:34] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm, label should be set by the Kubernetes API server (https://kubernetes.io/docs/reference/labels-annotations-taints/#kubernetes-io-meta" [deployment-charts] - https://gerrit.wikimedia.org/r/1018951 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[11:37:50] <wikibugs>	 (CR) Btullis: [C:+2] Migrate editor-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014656 (https://phabricator.wikimedia.org/T360531) (owner: Btullis)
[11:38:23] <wikibugs>	 (CR) Ayounsi: [C:+2] Add public Ganeti IP ranges [homer/public] - https://gerrit.wikimedia.org/r/1019002 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[11:39:06] <wikibugs>	 (Merged) jenkins-bot: Migrate editor-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014656 (https://phabricator.wikimedia.org/T360531) (owner: Btullis)
[11:40:43] <wikibugs>	 (PS1) JMeybohm: eventgate: Update mesh modules [deployment-charts] - https://gerrit.wikimedia.org/r/1019007 (https://phabricator.wikimedia.org/T359423)
[11:40:45] <wikibugs>	 (CR) Btullis: [C:+1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/1019005 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[11:41:07] <wikibugs>	 (PS2) JMeybohm: eventgate: Update mesh modules [deployment-charts] - https://gerrit.wikimedia.org/r/1019007 (https://phabricator.wikimedia.org/T359423)
[11:41:08] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add library hint for psql 15 [puppet] - https://gerrit.wikimedia.org/r/1019003 (owner: Muehlenhoff)
[11:42:09] <wikibugs>	 (CR) Ayounsi: [C:+2] Add public testvm200x support [puppet] - https://gerrit.wikimedia.org/r/1019005 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[11:42:15] <wikibugs>	 (PS1) Muehlenhoff: Fix typo [puppet] - https://gerrit.wikimedia.org/r/1019008
[11:44:00] <wikibugs>	 SRE, Machine-Learning-Team, MW-on-K8s, serviceops, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9706666 (Clement_Goubert)
[11:45:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:45:42] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host testvm2008.wikimedia.org
[11:45:44] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[11:47:38] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on matomo1003.eqiad.wmnet with reason: host reimage
[11:47:46] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Fix typo [puppet] - https://gerrit.wikimedia.org/r/1019008 (owner: Muehlenhoff)
[11:47:50] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2008.wikimedia.org - ayounsi@cumin1002"
[11:48:28] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm, diff also looks good: staging-eqiad and staging-codfw namespaces have a additional label pod-security.kubernetes.io/audit: restricte" [deployment-charts] - https://gerrit.wikimedia.org/r/1018952 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[11:49:13] <wikibugs>	 SRE, Machine-Learning-Team, MW-on-K8s, serviceops, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9706673 (Clement_Goubert) Aaaand I just realized they all use http and not https, so now I can change them all.
[11:49:14] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2008.wikimedia.org - ayounsi@cumin1002"
[11:49:14] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:49:14] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache testvm2008.wikimedia.org on all recursors
[11:49:18] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2008.wikimedia.org on all recursors
[11:49:44] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2008.wikimedia.org - ayounsi@cumin1002"
[11:50:20] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on matomo1003.eqiad.wmnet with reason: host reimage
[11:50:34] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2008.wikimedia.org - ayounsi@cumin1002"
[11:50:40] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:51:34] <wikibugs>	 (PS1) Hnowlan: jobqueue: increase concurrency for videoscalers [deployment-charts] - https://gerrit.wikimedia.org/r/1019010
[11:52:25] <wikibugs>	 (CR) Clément Goubert: [C:+1] jobqueue: increase concurrency for videoscalers [deployment-charts] - https://gerrit.wikimedia.org/r/1019010 (owner: Hnowlan)
[11:52:42] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2177.codfw.wmnet with OS bookworm
[11:52:50] <wikibugs>	 (CR) Hnowlan: [C:+2] jobqueue: increase concurrency for videoscalers [deployment-charts] - https://gerrit.wikimedia.org/r/1019010 (owner: Hnowlan)
[11:53:45] <wikibugs>	 (Merged) jenkins-bot: jobqueue: increase concurrency for videoscalers [deployment-charts] - https://gerrit.wikimedia.org/r/1019010 (owner: Hnowlan)
[11:54:26] <wikibugs>	 (CR) Hnowlan: [C:+1] mw-web, mw-api-ext: Raise replicas for 70% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1018721 (https://phabricator.wikimedia.org/T360763) (owner: Clément Goubert)
[11:54:35] <wikibugs>	 (CR) Hnowlan: [C:+1] trafficserver: move 70% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/1018723 (https://phabricator.wikimedia.org/T360763) (owner: Clément Goubert)
[11:55:09] <wikibugs>	 (CR) Clément Goubert: [C:+2] mw-web, mw-api-ext: Raise replicas for 70% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1018721 (https://phabricator.wikimedia.org/T360763) (owner: Clément Goubert)
[11:55:48] <wikibugs>	 (PS1) Slyngshede: P:idm allow security key backended SSH keys [puppet] - https://gerrit.wikimedia.org/r/1019013 (https://phabricator.wikimedia.org/T361714)
[11:56:01] <wikibugs>	 (Merged) jenkins-bot: mw-web, mw-api-ext: Raise replicas for 70% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1018721 (https://phabricator.wikimedia.org/T360763) (owner: Clément Goubert)
[11:57:15] <wikibugs>	 (PS1) Majavah: O:mariadb::grants: drop unused clouddb.sql.erb [puppet] - https://gerrit.wikimedia.org/r/1019014
[11:57:20] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[11:57:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:57:48] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[11:58:07] <wikibugs>	 (CR) Slyngshede: "These two key types are already supported by Striker, so Bitu needs to support them as well." [puppet] - https://gerrit.wikimedia.org/r/1019013 (https://phabricator.wikimedia.org/T361714) (owner: Slyngshede)
[11:58:20] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2008.wikimedia.org with OS bookworm
[11:58:37] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[11:58:43] <wikibugs>	 (PS2) Slyngshede: P:idm allow security key backended SSH keys [puppet] - https://gerrit.wikimedia.org/r/1019013 (https://phabricator.wikimedia.org/T361714)
[11:59:38] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[11:59:48] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1200)
[12:01:01] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[12:01:07] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[12:02:00] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[12:02:19] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1019013 (https://phabricator.wikimedia.org/T361714) (owner: Slyngshede)
[12:02:40] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:02:48] <wikibugs>	 (PS1) JMeybohm: eventgate-*: Migrate to base.external-services-networkpolicy [deployment-charts] - https://gerrit.wikimedia.org/r/1019018 (https://phabricator.wikimedia.org/T359423)
[12:02:51] <wikibugs>	 (CR) Clément Goubert: [C:+2] trafficserver: move 70% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/1018723 (https://phabricator.wikimedia.org/T360763) (owner: Clément Goubert)
[12:02:56] <wikibugs>	 (CR) Slyngshede: [C:+2] P:idm allow security key backended SSH keys [puppet] - https://gerrit.wikimedia.org/r/1019013 (https://phabricator.wikimedia.org/T361714) (owner: Slyngshede)
[12:03:23] <jinxer-wm>	 (ProbeDown) firing: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:05:57] <logmsgbot>	 !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host testvm2008.wikimedia.org with OS bookworm
[12:05:57] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host testvm2008.wikimedia.org
[12:06:42] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts testvm2008.wikimedia.org
[12:08:05] <wikibugs>	 SRE, MW-on-K8s, serviceops, Traffic, and 2 others: Move 70% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T360763#9706729 (Clement_Goubert) In progress→Resolved
[12:08:23] <jinxer-wm>	 (ProbeDown) resolved: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:10:41] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[12:12:46] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2008.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002"
[12:13:19] <wikibugs>	 (PS1) Dreamy Jazz: Ignore misisng title/page in CheckUserLookupUtils::getManualLogEntryFromRow [extensions/CheckUser] (wmf/1.42.0-wmf.26) - https://gerrit.wikimedia.org/r/1018967 (https://phabricator.wikimedia.org/T362284)
[12:13:32] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host matomo1003.eqiad.wmnet with OS bullseye
[12:13:37] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2008.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002"
[12:13:37] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:13:37] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2008.wikimedia.org
[12:13:38] <wikibugs>	 (PS5) S8321414: zhwikivoyage: Make RelatedArticles extension usable on zhwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1015551 (https://phabricator.wikimedia.org/T361427)
[12:13:45] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[12:13:47] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, netops, Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152#9706756 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `testvm2008.wikimedia.org` - testv...
[12:13:50] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host testvm2008.wikimedia.org
[12:13:51] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[12:14:21] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[12:14:28] <wikibugs>	 SRE, MW-on-K8s, serviceops, Traffic: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323 (Clement_Goubert) NEW
[12:14:53] <wikibugs>	 SRE, MW-on-K8s, serviceops, Traffic: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9706771 (Clement_Goubert) p:Triage→High
[12:15:43] <moritzm>	 !log installing gnutls28 security updates
[12:15:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:58] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[12:16:05] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2008.wikimedia.org - ayounsi@cumin1002"
[12:16:07] <logmsgbot>	 !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=97) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2008.wikimedia.org - ayounsi@cumin1002"
[12:16:07] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=97)
[12:16:21] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host testvm2008.wikimedia.org
[12:16:44] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bookworm
[12:16:52] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host testvm2008.wikimedia.org
[12:16:54] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[12:16:55] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[12:18:59] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2008.wikimedia.org - ayounsi@cumin1002"
[12:19:50] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2008.wikimedia.org - ayounsi@cumin1002"
[12:19:51] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:19:51] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache testvm2008.wikimedia.org on all recursors
[12:19:54] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[12:19:54] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2008.wikimedia.org on all recursors
[12:20:20] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2008.wikimedia.org - ayounsi@cumin1002"
[12:20:41] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[12:21:08] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2008.wikimedia.org - ayounsi@cumin1002"
[12:21:34] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2008.wikimedia.org with OS bookworm
[12:22:41] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply
[12:23:56] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply
[12:24:02] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply
[12:24:17] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply
[12:26:17] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] admin_ng: Enable restriced PSS profile in audit mode in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018952 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[12:26:20] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] admin_ng: Stop adding kubernetes.io/metadata.name namespace label [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018951 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[12:26:23] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] admin_ng: Refactor fetching pspClusterRole for namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018950 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[12:27:56] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db2177: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1018966 (owner: 10Marostegui)
[12:28:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60396 and previous config saved to /var/cache/conftool/dbconfig/20240411-122810-root.json
[12:28:37] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Migrate edit-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014657 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis)
[12:29:31] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: Refactor fetching pspClusterRole for namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018950 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[12:29:34] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: Stop adding kubernetes.io/metadata.name namespace label [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018951 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[12:29:36] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: Enable restriced PSS profile in audit mode in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018952 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[12:30:25] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[12:31:40] <wikibugs>	 (03Merged) 10jenkins-bot: Migrate edit-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014657 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis)
[12:32:25] <wikibugs>	 (03PS1) 10Slyngshede: Keymanagement, improve error message for key validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1019036 (https://phabricator.wikimedia.org/T361714)
[12:32:53] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[12:32:54] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9706818 (10Clement_Goubert)
[12:33:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw
[12:33:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 depool for reboot T356240', diff saved to https://phabricator.wikimedia.org/P60397 and previous config saved to /var/cache/conftool/dbconfig/20240411-123350-arnaudb.json
[12:34:01] <wikibugs>	 (03CR) 10Slyngshede: "We could also expand this to check a list of key types which we explicitly mark as insecure." [software/bitu] - 10https://gerrit.wikimedia.org/r/1019036 (https://phabricator.wikimedia.org/T361714) (owner: 10Slyngshede)
[12:34:07] <wikibugs>	 10ops-codfw, 06SRE, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9706836 (10Papaul)
[12:34:10] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[12:34:38] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[12:34:56] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[12:35:01] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[12:35:08] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply
[12:35:24] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2129.codfw.wmnet
[12:35:25] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply
[12:35:31] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply
[12:35:47] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply
[12:36:32] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[12:36:56] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[12:37:06] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[12:38:13] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[12:38:25] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[12:38:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw
[12:38:49] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-eqsin and not P{cp[5030,5032].eqsin.wmnet} and A:cp
[12:39:20] <wikibugs>	 (03PS4) 10Elukey: Force PKI TLS certs for cassandra instances on aqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/1018309 (https://phabricator.wikimedia.org/T352647)
[12:39:52] <wikibugs>	 06SRE, 06serviceops: Memcached, mcrouter in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711#9706842 (10jijiki)
[12:39:53] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops: 14Create a basic helm chart to test MediaWiki on kubernetes - 14https://phabricator.wikimedia.org/T265327#9706844 (10jijiki)
[12:39:54] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9706843 (10jijiki)
[12:40:32] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[12:40:43] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[12:40:45] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9706846 (10jijiki)
[12:40:53] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2129.codfw.wmnet
[12:41:28] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[12:41:34] <logmsgbot>	 !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host testvm2008.wikimedia.org with OS bookworm
[12:41:34] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host testvm2008.wikimedia.org
[12:41:53] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts testvm2008.wikimedia.org
[12:42:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 1%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60398 and previous config saved to /var/cache/conftool/dbconfig/20240411-124248-arnaudb.json
[12:43:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60399 and previous config saved to /var/cache/conftool/dbconfig/20240411-124315-root.json
[12:45:47] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[12:48:14] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9706856 (10Clement_Goubert)
[12:49:18] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[12:49:51] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[12:50:03] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[12:51:02] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9706874 (10Clement_Goubert)
[12:51:26] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[12:52:10] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[12:52:29] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[12:53:25] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db[2132,2160].codfw.wmnet with reason: reboot
[12:53:28] <logmsgbot>	 !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[12:53:38] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db[2132,2160].codfw.wmnet with reason: reboot
[12:53:39] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/weight=10; selector: name=mw1437.*.wmnet,dc=eqiad
[12:53:50] <logmsgbot>	 !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[12:53:57] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2132.codfw.wmnet
[12:54:40] <akosiaris>	 !log lower weight of mw1437 back to 10 from the 30 I had upped it to yesterday. The backlog of videoscaling is apparently now served and CPU usage has reached "normal" levels
[12:54:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:56] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 2%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60400 and previous config saved to /var/cache/conftool/dbconfig/20240411-125755-arnaudb.json
[12:58:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60401 and previous config saved to /var/cache/conftool/dbconfig/20240411-125821-root.json
[12:58:40] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2132.codfw.wmnet
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1300)
[13:00:05] <jouncebot>	 esanders and Dreamy Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:30] <logmsgbot>	 !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host matomo1003.eqiad.wmnet with OS bookworm
[13:07:07] <Lucas_WMDE>	 I can’t deploy, sorry
[13:08:41] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9706934 (10Papaul)
[13:11:50] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db[2134,2160].codfw.wmnet with reason: reboot
[13:12:04] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[2134,2160].codfw.wmnet with reason: reboot
[13:12:15] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2134.codfw.wmnet
[13:12:40] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bookworm
[13:13:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 4%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60402 and previous config saved to /var/cache/conftool/dbconfig/20240411-131301-arnaudb.json
[13:13:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60403 and previous config saved to /var/cache/conftool/dbconfig/20240411-131327-root.json
[13:16:50] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2134.codfw.wmnet
[13:17:31] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db[2135,2160].codfw.wmnet with reason: reboot
[13:17:45] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[2135,2160].codfw.wmnet with reason: reboot
[13:18:28] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2135.codfw.wmnet
[13:18:44] <Dreamy_Jazz>	 \o
[13:18:48] <Dreamy_Jazz>	 I can deploy my patch
[13:19:05] <wikibugs>	 (03PS1) 10Jelto: miscweb/service::catalog: move blackbox checks to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1019039 (https://phabricator.wikimedia.org/T361090)
[13:20:06] <Dreamy_Jazz>	 esanders: Are you around?
[13:20:23] <Dreamy_Jazz>	 edsanders:
[13:21:41] <wikibugs>	 (03PS1) 10Btullis: Correct the device names for matomo disks [puppet] - 10https://gerrit.wikimedia.org/r/1019040 (https://phabricator.wikimedia.org/T349397)
[13:23:15] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2135.codfw.wmnet
[13:25:02] <Dreamy_Jazz>	 I'm going to go ahead with mine now as edsanders does not seem around for this window.
[13:25:30] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1019039 (https://phabricator.wikimedia.org/T361090) (owner: 10Jelto)
[13:25:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018967 (https://phabricator.wikimedia.org/T362284) (owner: 10Dreamy Jazz)
[13:25:58] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db[2133,2160].codfw.wmnet with reason: reboot
[13:26:12] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[2133,2160].codfw.wmnet with reason: reboot
[13:26:19] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2133.codfw.wmnet
[13:26:53] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Fundraising-Backlog: Can we please add our vendor to Google Postmaster Tools - https://phabricator.wikimedia.org/T360907#9706990 (10akosiaris) I don't think #SRE has ever administrated Google Postmaster Tools at all. In fact, a quick cross check in the team showcases almost ut...
[13:27:39] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Correct the device names for matomo disks [puppet] - 10https://gerrit.wikimedia.org/r/1019040 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis)
[13:27:58] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-eqsin and not P{cp[5030,5032].eqsin.wmnet} and A:cp
[13:28:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 5%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60404 and previous config saved to /var/cache/conftool/dbconfig/20240411-132807-arnaudb.json
[13:28:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60405 and previous config saved to /var/cache/conftool/dbconfig/20240411-132834-root.json
[13:29:57] <wikibugs>	 (03PS1) 10Hnowlan: jobqueue: Increase videoscaler job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019045
[13:30:38] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2133.codfw.wmnet
[13:30:47] <wikibugs>	 (03PS2) 10Dreamy Jazz: Ignore missing title/page in CheckUserLookupUtils::getManualLogEntryFromRow [extensions/CheckUser] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018967 (https://phabricator.wikimedia.org/T362284)
[13:30:52] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Ignore missing title/page in CheckUserLookupUtils::getManualLogEntryFromRow [extensions/CheckUser] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018967 (https://phabricator.wikimedia.org/T362284) (owner: 10Dreamy Jazz)
[13:30:56] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018967 (https://phabricator.wikimedia.org/T362284) (owner: 10Dreamy Jazz)
[13:31:01] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018967 (https://phabricator.wikimedia.org/T362284) (owner: 10Dreamy Jazz)
[13:31:40] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host matomo1003.eqiad.wmnet with OS bookworm
[13:32:03] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bookworm
[13:32:25] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2160.codfw.wmnet with reason: reboot multiinstance replica
[13:32:27] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2160.codfw.wmnet with reason: reboot multiinstance replica
[13:32:58] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3073.esams.wmnet,service=(cdn|ats-be)
[13:33:11] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] cp3073: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1015975 (https://phabricator.wikimedia.org/T360430) (owner: 10Ssingh)
[13:34:59] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp3073.esams.wmnet with OS bullseye
[13:35:09] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9707012 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp3073.esams.wmnet with OS bullseye
[13:35:32] <wikibugs>	 (03PS26) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[13:36:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad
[13:37:06] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking)
[13:39:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1019036 (https://phabricator.wikimedia.org/T361714) (owner: 10Slyngshede)
[13:40:46] <wikibugs>	 10ops-codfw, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9707025 (10Jhancock.wm) Update: Dell finally agreed to replace the HBA card. I sent the shipping address confirmation just now. Hopefully it'll be here tomo...
[13:41:11] <wikibugs>	 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9707026 (10ssingh) Traffic reimaged 8 text nodes in esams and all of them PXE-booted the first time, without any issues. I think looking...
[13:41:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad
[13:43:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 10%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60406 and previous config saved to /var/cache/conftool/dbconfig/20240411-134312-arnaudb.json
[13:43:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60407 and previous config saved to /var/cache/conftool/dbconfig/20240411-134341-root.json
[13:45:00] <wikibugs>	 (03PS1) 10Btullis: Ensure that matomo install grub to /dev/vda [puppet] - 10https://gerrit.wikimedia.org/r/1019048 (https://phabricator.wikimedia.org/T349397)
[13:45:24] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=(cdn|ats-be)
[13:45:50] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Ensure that matomo install grub to /dev/vda [puppet] - 10https://gerrit.wikimedia.org/r/1019048 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis)
[13:46:48] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp2042.codfw.wmnet with OS bullseye
[13:46:58] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host matomo1003.eqiad.wmnet with OS bookworm
[13:47:00] <wikibugs>	 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9707044 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp2042.codfw.wmnet with OS b...
[13:47:05] <wikibugs>	 (03PS1) 10JMeybohm: kubernetes::master: Forward audit logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1019049 (https://phabricator.wikimedia.org/T290020)
[13:48:29] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Keymanagement, improve error message for key validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1019036 (https://phabricator.wikimedia.org/T361714) (owner: 10Slyngshede)
[13:49:06] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] jobqueue: Increase videoscaler job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019045 (owner: 10Hnowlan)
[13:49:21] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bookworm
[13:49:22] <wikibugs>	 (03PS1) 10Marostegui: db2149: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1019051
[13:49:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2149', diff saved to https://phabricator.wikimedia.org/P60408 and previous config saved to /var/cache/conftool/dbconfig/20240411-134932-root.json
[13:49:36] <wikibugs>	 (03Merged) 10jenkins-bot: Keymanagement, improve error message for key validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1019036 (https://phabricator.wikimedia.org/T361714) (owner: 10Slyngshede)
[13:49:40] <Lucas_WMDE>	 I’m here now, anything left to deploy? ^^
[13:50:06] <wikibugs>	 (03Merged) 10jenkins-bot: Ignore missing title/page in CheckUserLookupUtils::getManualLogEntryFromRow [extensions/CheckUser] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018967 (https://phabricator.wikimedia.org/T362284) (owner: 10Dreamy Jazz)
[13:50:08] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2149: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1019051 (owner: 10Marostegui)
[13:50:11] <Dreamy_Jazz>	 I'm currently deploying
[13:50:14] <Lucas_WMDE>	 ok
[13:50:49] <logmsgbot>	 !log dreamyjazz@deploy1002 Started scap: Backport for [[gerrit:1018967|Ignore missing title/page in CheckUserLookupUtils::getManualLogEntryFromRow (T362284)]]
[13:50:53] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2149.codfw.wmnet with OS bookworm
[13:51:06] <stashbot>	 T362284: Logs without a defined title or page_id cause an exception in CheckUser - https://phabricator.wikimedia.org/T362284
[13:52:01] <wikibugs>	 (03CR) 10Herron: [C:03+1] opensearch: switch dashboards to sso auth [puppet] - 10https://gerrit.wikimedia.org/r/1018872 (https://phabricator.wikimedia.org/T246998) (owner: 10Filippo Giunchedi)
[13:53:39] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2008.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002"
[13:54:10] <wikibugs>	 (03PS27) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[13:54:30] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2008.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002"
[13:54:30] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:54:31] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2008.wikimedia.org
[13:54:41] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: 14Investigate Ganeti in routed mode - 14https://phabricator.wikimedia.org/T300152#9707155 (10ops-monitoring-bot) 14cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `testvm2008.wikimedia.org` - testv...
[13:55:07] <wikibugs>	 (03PS2) 10JMeybohm: kubernetes::master: Forward audit logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1019049 (https://phabricator.wikimedia.org/T290020)
[13:55:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking)
[13:55:37] <wikibugs>	 (03CR) 10Elukey: [C:03+2] Force PKI TLS certs for cassandra instances on aqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/1018309 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey)
[13:55:47] <logmsgbot>	 !log dreamyjazz@deploy1002 dreamyjazz: Backport for [[gerrit:1018967|Ignore missing title/page in CheckUserLookupUtils::getManualLogEntryFromRow (T362284)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:55:59] <logmsgbot>	 !log dreamyjazz@deploy1002 dreamyjazz: Continuing with sync
[13:56:26] <wikibugs>	 (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1871/co" [puppet] - 10https://gerrit.wikimedia.org/r/1019049 (https://phabricator.wikimedia.org/T290020) (owner: 10JMeybohm)
[13:56:34] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 10%: Repool', diff saved to https://phabricator.wikimedia.org/P60409 and previous config saved to /var/cache/conftool/dbconfig/20240411-135634-arnaudb.json
[13:56:45] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] jobqueue: Increase videoscaler job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019045 (owner: 10Hnowlan)
[13:57:42] <wikibugs>	 (03Merged) 10jenkins-bot: jobqueue: Increase videoscaler job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019045 (owner: 10Hnowlan)
[13:57:48] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-codfw and not P{cp2042.codfw.wmnet} and A:cp
[13:57:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 25%: repool', diff saved to https://phabricator.wikimedia.org/P60410 and previous config saved to /var/cache/conftool/dbconfig/20240411-135754-arnaudb.json
[13:58:11] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3073.esams.wmnet with reason: host reimage
[13:58:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 20%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60411 and previous config saved to /var/cache/conftool/dbconfig/20240411-135819-arnaudb.json
[13:58:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60412 and previous config saved to /var/cache/conftool/dbconfig/20240411-135846-root.json
[13:58:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 25%: reool', diff saved to https://phabricator.wikimedia.org/P60413 and previous config saved to /var/cache/conftool/dbconfig/20240411-135858-arnaudb.json
[13:59:33] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on aqs1010.eqiad.wmnet with reason: Upgrade to PKI
[13:59:47] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aqs1010.eqiad.wmnet with reason: Upgrade to PKI
[14:00:25] <jinxer-wm>	 (SystemdUnitFailed) firing: prometheus-wmf-elasticsearch-exporter-9600.service on elastic2090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:01:19] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3073.esams.wmnet with reason: host reimage
[14:03:32] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: 14Inconsistent data in Netbox for some msw device - 14https://phabricator.wikimedia.org/T359326#9707236 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr 14Corrected netbox errors 
[14:03:42] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2042.codfw.wmnet with reason: host reimage
[14:04:02] <wikibugs>	 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9707240 (10ssingh) @Papaul suggested to try a host in codfw and `cp2042` PXE booted successfully. In one of the above messages, @cmooney...
[14:06:05] <logmsgbot>	 !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host matomo1003.eqiad.wmnet with OS bookworm
[14:06:48] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2042.codfw.wmnet with reason: host reimage
[14:06:56] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2149.codfw.wmnet with reason: host reimage
[14:08:31] <logmsgbot>	 !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1018967|Ignore missing title/page in CheckUserLookupUtils::getManualLogEntryFromRow (T362284)]] (duration: 17m 42s)
[14:08:39] <stashbot>	 T362284: Logs without a defined title or page_id cause an exception in CheckUser - https://phabricator.wikimedia.org/T362284
[14:09:02] <moritzm>	 !log installing NSS security updates
[14:09:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:13] <Dreamy_Jazz>	 !log Afternoon UTC backport window finished
[14:09:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:54] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[14:10:09] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2149.codfw.wmnet with reason: host reimage
[14:10:19] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:10:44] <elukey>	 !log move cassandra instances on aqs1010 to PKI TLS certs - T352647
[14:10:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:48] <stashbot>	 T352647: Move Cassandra clusters to PKI - https://phabricator.wikimedia.org/T352647
[14:11:23] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[14:11:40] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 25%: Repool', diff saved to https://phabricator.wikimedia.org/P60414 and previous config saved to /var/cache/conftool/dbconfig/20240411-141139-arnaudb.json
[14:12:02] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:13:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 50%: repool', diff saved to https://phabricator.wikimedia.org/P60415 and previous config saved to /var/cache/conftool/dbconfig/20240411-141300-arnaudb.json
[14:13:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60416 and previous config saved to /var/cache/conftool/dbconfig/20240411-141324-arnaudb.json
[14:13:59] <edsanders>	 Dreamy_Jazz: sorry, you finished?
[14:14:05] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 50%: reool', diff saved to https://phabricator.wikimedia.org/P60417 and previous config saved to /var/cache/conftool/dbconfig/20240411-141404-arnaudb.json
[14:14:37] <Dreamy_Jazz>	 Yeah. I have, but can extend the window if necessary.
[14:15:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw
[14:15:30] <Dreamy_Jazz>	 As it seems nothing else is on the calendar for at least the next hour.
[14:15:49] <wikibugs>	 (PS1) Hnowlan: jobqueue: increase videoscaler job concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1019053
[14:17:33] <Dreamy_Jazz>	 edsanders: Do you have deployment rights? If not, do you want me to deploy?
[14:18:31] <Dreamy_Jazz>	 Considering the window is done, I'd probably defer this change, but there isn't anything after this on the calendar.
[14:18:31] <elukey>	 !log drain and restart cassandra-b on aqs2007 - didn't pick up the new truststore during the past roll restart - T352647
[14:18:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:36] <stashbot>	 T352647: Move Cassandra clusters to PKI - https://phabricator.wikimedia.org/T352647
[14:19:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw
[14:20:13] <wikibugs>	 (PS1) Marostegui: Revert "db2149: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1018972
[14:21:31] <wikibugs>	 (PS28) Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[14:22:44] <wikibugs>	 (CR) CI reject: [V:-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: Bking)
[14:23:50] <wikibugs>	 (PS1) Muehlenhoff: Remove global root for four engineering managers [puppet] - https://gerrit.wikimedia.org/r/1019054
[14:24:27] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3073.esams.wmnet with OS bullseye
[14:24:51] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9707334 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp3073.esams.wmnet with OS bullseye completed: - cp3073 (**PASS**)...
[14:25:23] <edsanders>	 Dreamy_Jazz: whichever you prefer
[14:25:41] <Dreamy_Jazz>	 As long as you can be around to test, I can deploy.
[14:26:26] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2042.codfw.wmnet with OS bullseye
[14:26:33] <wikibugs>	 ops-codfw, ops-eqiad, SRE-swift-storage, DC-Ops, Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9707355 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp2042.codfw.wmnet with OS bulls...
[14:26:39] <edsanders>	 Dreamy_Jazz: I can test
[14:26:44] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1015086 (https://phabricator.wikimedia.org/T361134) (owner: Esanders)
[14:26:47] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 50%: Repool', diff saved to https://phabricator.wikimedia.org/P60418 and previous config saved to /var/cache/conftool/dbconfig/20240411-142645-arnaudb.json
[14:26:50] <edsanders>	 Thanks
[14:26:55] <Dreamy_Jazz>	 No problem
[14:27:08] <Dreamy_Jazz>	 !log Extending UTC Afternoon backport window
[14:27:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:29] <wikibugs>	 (Merged) jenkins-bot: Set wgMFFallbackEditor to visual for most VE wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1015086 (https://phabricator.wikimedia.org/T361134) (owner: Esanders)
[14:27:36] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db2149: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1018972 (owner: Marostegui)
[14:27:56] <logmsgbot>	 !log dreamyjazz@deploy1002 Started scap: Backport for [[gerrit:1015086|Set wgMFFallbackEditor to visual for most VE wikis (T361134)]]
[14:28:01] <stashbot>	 T361134: Set wgMFFallbackEditor to 'visual' for all other wikis - https://phabricator.wikimedia.org/T361134
[14:28:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60419 and previous config saved to /var/cache/conftool/dbconfig/20240411-142801-root.json
[14:28:06] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 75%: repool', diff saved to https://phabricator.wikimedia.org/P60420 and previous config saved to /var/cache/conftool/dbconfig/20240411-142806-arnaudb.json
[14:28:22] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3073.esams.wmnet,service=(cdn|ats-be)
[14:28:27] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=(cdn|ats-be)
[14:28:31] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 50%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60421 and previous config saved to /var/cache/conftool/dbconfig/20240411-142830-arnaudb.json
[14:29:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 75%: reool', diff saved to https://phabricator.wikimedia.org/P60422 and previous config saved to /var/cache/conftool/dbconfig/20240411-142910-arnaudb.json
[14:29:47] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9707409 (ssingh)
[14:29:52] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1019014 (owner: Majavah)
[14:30:15] <wikibugs>	 (CR) Majavah: [C:+2] O:mariadb::grants: drop unused clouddb.sql.erb [puppet] - https://gerrit.wikimedia.org/r/1019014 (owner: Majavah)
[14:30:19] <wikibugs>	 (Abandoned) Muehlenhoff: Remove obsolete grant [puppet] - https://gerrit.wikimedia.org/r/1018941 (https://phabricator.wikimedia.org/T357748) (owner: Muehlenhoff)
[14:30:44] <logmsgbot>	 !log dreamyjazz@deploy1002 dreamyjazz and esanders: Backport for [[gerrit:1015086|Set wgMFFallbackEditor to visual for most VE wikis (T361134)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:30:57] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2149.codfw.wmnet with OS bookworm
[14:31:00] <wikibugs>	 (PS6) Muehlenhoff: netbox: Set deployment method to avoid creating scap target [puppet] - https://gerrit.wikimedia.org/r/1002392
[14:31:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad
[14:33:31] <wikibugs>	 (CR) Filippo Giunchedi: [C:-1] "IIRC support for multiple probes of type http and tcp hasn't been implemented, so I'm afraid this won't work as-is." [puppet] - https://gerrit.wikimedia.org/r/1019039 (https://phabricator.wikimedia.org/T361090) (owner: Jelto)
[14:33:46] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9707428 (MoritzMuehlenhoff)
[14:34:12] <moritzm>	 !log installing distro-info-data updates from Bullseye point release
[14:34:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:19] <wikibugs>	 (PS3) Ssingh: hiera: unify trafficserver storage elements for esams [puppet] - https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) (owner: Fabfur)
[14:34:30] <wikibugs>	 (PS4) Ssingh: hiera: unify trafficserver storage elements for esams [puppet] - https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) (owner: Fabfur)
[14:35:20] <wikibugs>	 (CR) LSobanski: [C:+1] Remove global root for four engineering managers [puppet] - https://gerrit.wikimedia.org/r/1019054 (owner: Muehlenhoff)
[14:36:05] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1872/console" [puppet] - https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) (owner: Fabfur)
[14:36:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad
[14:38:18] <Dreamy_Jazz>	 edsanders: Can you test?
[14:38:25] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9707455 (MoritzMuehlenhoff)
[14:38:26] <edsanders>	 testing
[14:38:26] <wikibugs>	 (CR) Clément Goubert: [C:+1] jobqueue: increase videoscaler job concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1019053 (owner: Hnowlan)
[14:38:28] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:16] <wikibugs>	 (PS5) Ssingh: hiera: unify trafficserver storage elements for esams [puppet] - https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) (owner: Fabfur)
[14:39:20] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove global root for four engineering managers [puppet] - https://gerrit.wikimedia.org/r/1019054 (owner: Muehlenhoff)
[14:39:29] <edsanders>	 Dreamy_Jazz: Looks good - thanks
[14:39:36] <Dreamy_Jazz>	 Great.
[14:39:52] <logmsgbot>	 !log dreamyjazz@deploy1002 dreamyjazz and esanders: Continuing with sync
[14:40:19] * Dreamy_Jazz is thankful I ran scap backport on tmux as my client shell crashed.
[14:40:36] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1873/console" [puppet] - https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) (owner: Fabfur)
[14:41:32] <wikibugs>	 (CR) Fabfur: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) (owner: Fabfur)
[14:41:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 75%: Repool', diff saved to https://phabricator.wikimedia.org/P60423 and previous config saved to /var/cache/conftool/dbconfig/20240411-144152-arnaudb.json
[14:43:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60424 and previous config saved to /var/cache/conftool/dbconfig/20240411-144307-root.json
[14:43:09] <sukhe>	 !log sudo cumin "A:cp and A:esams" "disable-puppet 'merging CR 1014571'"
[14:43:12] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 100%: repool', diff saved to https://phabricator.wikimedia.org/P60425 and previous config saved to /var/cache/conftool/dbconfig/20240411-144311-arnaudb.json
[14:43:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot-master rolling restart_daemons on A:maps-master
[14:43:37] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60426 and previous config saved to /var/cache/conftool/dbconfig/20240411-144336-arnaudb.json
[14:43:52] <wikibugs>	 (CR) Hnowlan: [C:+2] jobqueue: increase videoscaler job concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1019053 (owner: Hnowlan)
[14:44:13] <wikibugs>	 (CR) Ssingh: [V:+1 C:+2] hiera: unify trafficserver storage elements for esams [puppet] - https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) (owner: Fabfur)
[14:44:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 100%: reool', diff saved to https://phabricator.wikimedia.org/P60427 and previous config saved to /var/cache/conftool/dbconfig/20240411-144416-arnaudb.json
[14:44:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot-master (exit_code=0) rolling restart_daemons on A:maps-master
[14:44:38] <wikibugs>	 (Merged) jenkins-bot: jobqueue: increase videoscaler job concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1019053 (owner: Hnowlan)
[14:44:46] <wikibugs>	 (PS4) Ahmon Dancy: static.php: Handle mediawiki.org/ontology/ontology.owl [mediawiki-config] - https://gerrit.wikimedia.org/r/1018354 (https://phabricator.wikimedia.org/T171807)
[14:45:14] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[14:46:25] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9707488 (Fabfur) Open→Resolved
[14:47:20] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:47:21] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[14:47:23] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] kubernetes::master: Forward audit logs to kafka [puppet] - https://gerrit.wikimedia.org/r/1019049 (https://phabricator.wikimedia.org/T290020) (owner: JMeybohm)
[14:47:36] <dancy>	 jouncebot nowandnext
[14:47:36] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 12 minute(s)
[14:47:36] <jouncebot>	 In 1 hour(s) and 12 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1600)
[14:47:57] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:47:58] <dancy>	 Dreamy_Jazz: Ping me when you're done please.
[14:48:09] <Dreamy_Jazz>	 Sure.
[14:50:35] <wikibugs>	 (PS1) Clément Goubert: ml-serve: Add istio config for mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1019061 (https://phabricator.wikimedia.org/T362316)
[14:52:07] <logmsgbot>	 !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1015086|Set wgMFFallbackEditor to visual for most VE wikis (T361134)]] (duration: 24m 11s)
[14:52:08] <Dreamy_Jazz>	 dancy: Done.
[14:52:13] <dancy>	 thx
[14:52:14] <stashbot>	 T361134: Set wgMFFallbackEditor to 'visual' for all other wikis - https://phabricator.wikimedia.org/T361134
[14:52:26] <sukhe>	 !log sudo cumin "A:cp and A:esams" "run-puppet-agent --enable 'merging CR 1014571'"
[14:52:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:37] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1018354 (https://phabricator.wikimedia.org/T171807) (owner: Ahmon Dancy)
[14:54:11] <wikibugs>	 (Merged) jenkins-bot: static.php: Handle mediawiki.org/ontology/ontology.owl [mediawiki-config] - https://gerrit.wikimedia.org/r/1018354 (https://phabricator.wikimedia.org/T171807) (owner: Ahmon Dancy)
[14:54:17] <wikibugs>	 (PS3) JMeybohm: kubernetes::master: Forward audit logs to kafka [puppet] - https://gerrit.wikimedia.org/r/1019049 (https://phabricator.wikimedia.org/T290020)
[14:54:39] <logmsgbot>	 !log dancy@deploy1002 Started scap: Backport for [[gerrit:1018354|static.php: Handle mediawiki.org/ontology/ontology.owl (T171807 T359643)]]
[14:54:43] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-codfw and not P{cp2042.codfw.wmnet} and A:cp
[14:54:45] <stashbot>	 T171807: Create ontology URL for mediawiki - https://phabricator.wikimedia.org/T171807
[14:54:47] <stashbot>	 T359643: Get rid of the /srv/mediawiki/php symbolic link - https://phabricator.wikimedia.org/T359643
[14:56:57] <wikibugs>	 (PS1) Hnowlan: jobqueue: restore videoscaling concurrency to pre-outage level [deployment-charts] - https://gerrit.wikimedia.org/r/1019062
[14:56:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 100%: Repool', diff saved to https://phabricator.wikimedia.org/P60428 and previous config saved to /var/cache/conftool/dbconfig/20240411-145658-arnaudb.json
[14:57:21] <logmsgbot>	 !log dancy@deploy1002 dancy: Backport for [[gerrit:1018354|static.php: Handle mediawiki.org/ontology/ontology.owl (T171807 T359643)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:57:39] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] jobqueue: restore videoscaling concurrency to pre-outage level [deployment-charts] - https://gerrit.wikimedia.org/r/1019062 (owner: Hnowlan)
[14:57:45] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-drmrs and A:cp
[14:58:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60429 and previous config saved to /var/cache/conftool/dbconfig/20240411-145813-root.json
[14:58:24] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bookworm
[14:58:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P60430 and previous config saved to /var/cache/conftool/dbconfig/20240411-145841-arnaudb.json
[15:00:20] <logmsgbot>	 !log dancy@deploy1002 dancy: Continuing with sync
[15:02:30] <wikibugs>	 ops-eqiad, SRE: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9707550 (Eevans) >>! In T362033#9700949, @Jclark-ctr wrote: > @Eevans  Hey looks like same drive  as T354499 is failed again  let me know if i can replace it again   Sure, go ahead.  P.S. I think this is the 4th time, ar...
[15:03:05] <wikibugs>	 (PS2) Clément Goubert: article-description: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018959 (https://phabricator.wikimedia.org/T362316)
[15:06:06] <wikibugs>	 (PS3) Clément Goubert: article-description: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018959 (https://phabricator.wikimedia.org/T362316)
[15:09:44] <wikibugs>	 (PS1) Muehlenhoff: Pass the Ceph cluster address as an array [puppet] - https://gerrit.wikimedia.org/r/1019063
[15:11:02] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on matomo1003.eqiad.wmnet with reason: host reimage
[15:11:52] <wikibugs>	 (PS1) Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/1019065
[15:12:10] <wikibugs>	 (CR) Ahmon Dancy: [V:+2 C:+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/1019065 (owner: Ahmon Dancy)
[15:12:21] <logmsgbot>	 !log dancy@deploy1002 Finished scap: Backport for [[gerrit:1018354|static.php: Handle mediawiki.org/ontology/ontology.owl (T171807 T359643)]] (duration: 17m 41s)
[15:12:26] <stashbot>	 T171807: Create ontology URL for mediawiki - https://phabricator.wikimedia.org/T171807
[15:12:27] <stashbot>	 T359643: Get rid of the /srv/mediawiki/php symbolic link - https://phabricator.wikimedia.org/T359643
[15:13:05] <wikibugs>	 (CR) Elukey: "Left a small change request, after that +1!" [deployment-charts] - https://gerrit.wikimedia.org/r/1019061 (https://phabricator.wikimedia.org/T362316) (owner: Clément Goubert)
[15:13:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60431 and previous config saved to /var/cache/conftool/dbconfig/20240411-151319-root.json
[15:14:32] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on matomo1003.eqiad.wmnet with reason: host reimage
[15:14:35] <wikibugs>	 (PS2) Clément Goubert: ml-serve: Add istio config for mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1019061 (https://phabricator.wikimedia.org/T362316)
[15:15:07] <wikibugs>	 (PS2) Clément Goubert: article-description: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018960 (https://phabricator.wikimedia.org/T362316)
[15:18:22] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host moss-fe1002.eqiad.wmnet with OS bookworm
[15:18:25] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1019063 (owner: Muehlenhoff)
[15:20:11] <wikibugs>	 (PS2) Clément Goubert: articletopic-outlink: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018961 (https://phabricator.wikimedia.org/T362316)
[15:20:22] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2002.codfw.wmnet with OS bookworm
[15:20:32] <wikibugs>	 (CR) Hnowlan: [C:+2] jobqueue: restore videoscaling concurrency to pre-outage level [deployment-charts] - https://gerrit.wikimedia.org/r/1019062 (owner: Hnowlan)
[15:21:30] <wikibugs>	 (Merged) jenkins-bot: jobqueue: restore videoscaling concurrency to pre-outage level [deployment-charts] - https://gerrit.wikimedia.org/r/1019062 (owner: Hnowlan)
[15:23:51] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[15:24:20] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:24:21] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[15:24:44] <hashar>	 jan_drewniak: hi, will you backport the mobilefrontend revert ? :)
[15:24:52] <hashar>	 if we can land it and resume the train, that would be great
[15:24:57] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:26:20] <wikibugs>	 (PS2) Clément Goubert: articletopic-outlink: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018962 (https://phabricator.wikimedia.org/T362316)
[15:26:38] <wikibugs>	 (CR) JMeybohm: "PCC is at https://puppet-compiler.wmflabs.org/output/1019049/1874/" [puppet] - https://gerrit.wikimedia.org/r/1019049 (https://phabricator.wikimedia.org/T290020) (owner: JMeybohm)
[15:27:09] <wikibugs>	 (PS3) Clément Goubert: articletopic-outlink: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018962 (https://phabricator.wikimedia.org/T362316)
[15:27:31] <wikibugs>	 (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/1019066
[15:28:18] <wikibugs>	 (PS2) Clément Goubert: experimental: Switch to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018963 (https://phabricator.wikimedia.org/T362316)
[15:28:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60432 and previous config saved to /var/cache/conftool/dbconfig/20240411-152825-root.json
[15:28:31] <wikibugs>	 (PS3) Ahmon Dancy: Serve mw.org/ontology/ontology.owl via /w/static.php (v2) [puppet] - https://gerrit.wikimedia.org/r/1018335 (https://phabricator.wikimedia.org/T171807)
[15:28:31] <wikibugs>	 (PS1) Ahmon Dancy: Revert "Route /w/docs/ to /w/static.php" [puppet] - https://gerrit.wikimedia.org/r/1019067 (https://phabricator.wikimedia.org/T171807)
[15:29:06] <wikibugs>	 (CR) BCornwall: [C:+1] trafficserver: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[15:29:57] <wikibugs>	 (PS2) Clément Goubert: readability: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018964 (https://phabricator.wikimedia.org/T362316)
[15:30:05] <wikibugs>	 (PS2) Clément Goubert: readability: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018965 (https://phabricator.wikimedia.org/T362316)
[15:30:05] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2110 (re)pooling @ 25%: repool', diff saved to https://phabricator.wikimedia.org/P60433 and previous config saved to /var/cache/conftool/dbconfig/20240411-153003-arnaudb.json
[15:30:07] <wikibugs>	 (PS4) Ahmon Dancy: Serve mw.org/ontology/ontology.owl via /w/static.php (v2) [puppet] - https://gerrit.wikimedia.org/r/1018335 (https://phabricator.wikimedia.org/T171807)
[15:30:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 (re)pooling @ 25%: repool', diff saved to https://phabricator.wikimedia.org/P60434 and previous config saved to /var/cache/conftool/dbconfig/20240411-153019-arnaudb.json
[15:31:01] <wikibugs>	 (PS3) Clément Goubert: readability: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018965 (https://phabricator.wikimedia.org/T362316)
[15:31:26] <wikibugs>	 (CR) BCornwall: [C:+1] cache: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/868709 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[15:31:29] <wikibugs>	 (PS2) Clément Goubert: revertrisk: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018986 (https://phabricator.wikimedia.org/T362316)
[15:31:36] <wikibugs>	 (PS2) Clément Goubert: revertrisk: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018987 (https://phabricator.wikimedia.org/T362316)
[15:31:37] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe1002.eqiad.wmnet with reason: host reimage
[15:31:44] <wikibugs>	 (CR) BCornwall: [V:+1 C:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1877/console" [puppet] - https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[15:32:01] <wikibugs>	 (PS3) Clément Goubert: revertrisk: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018987 (https://phabricator.wikimedia.org/T362316)
[15:32:20] <wikibugs>	 (PS2) Clément Goubert: revscoring-articlequality: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018988 (https://phabricator.wikimedia.org/T362316)
[15:32:25] <wikibugs>	 (PS3) Clément Goubert: revscoring-articlequality: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018988 (https://phabricator.wikimedia.org/T362316)
[15:32:51] <wikibugs>	 (PS2) Clément Goubert: revscoring-articlequality: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018989 (https://phabricator.wikimedia.org/T362316)
[15:33:05] <wikibugs>	 (PS3) Clément Goubert: revscoring-articlequality: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018989 (https://phabricator.wikimedia.org/T362316)
[15:33:14] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade.
[15:33:34] <wikibugs>	 (PS2) Clément Goubert: revscoring-articletopic: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018990 (https://phabricator.wikimedia.org/T362316)
[15:33:40] <wikibugs>	 (CR) BCornwall: [V:+1 C:+2] trafficserver: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[15:33:45] <wikibugs>	 (PS2) Clément Goubert: revscoring-articletopic: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018991 (https://phabricator.wikimedia.org/T362316)
[15:34:07] <wikibugs>	 (PS3) Clément Goubert: revscoring-articletopic: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018991 (https://phabricator.wikimedia.org/T362316)
[15:34:28] <wikibugs>	 (CR) Btullis: [C:+2] Migrate device-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014658 (https://phabricator.wikimedia.org/T360531) (owner: Btullis)
[15:34:33] <wikibugs>	 (PS2) Clément Goubert: revscoring-draftquality: Switch staging to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018992 (https://phabricator.wikimedia.org/T362316)
[15:34:39] <wikibugs>	 (PS2) Clément Goubert: revscoring-draftquality: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018993 (https://phabricator.wikimedia.org/T362316)
[15:34:57] <wikibugs>	 (PS3) Clément Goubert: revscoring-draftquality: Switch prod to mw-api-int-ro [deployment-charts] - https://gerrit.wikimedia.org/r/1018993 (https://phabricator.wikimedia.org/T362316)
[15:35:04] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe1002.eqiad.wmnet with reason: host reimage
[15:35:17] <wikibugs>	 (03Merged) 10jenkins-bot: Migrate device-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014658 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis)
[15:35:18] <wikibugs>	 (03PS2) 10Clément Goubert: revscoring-drafttopic: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018994 (https://phabricator.wikimedia.org/T362316)
[15:35:27] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:35:27] <wikibugs>	 (03PS2) 10Clément Goubert: revscoring-drafttopic: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018995 (https://phabricator.wikimedia.org/T362316)
[15:35:47] <wikibugs>	 (03PS3) 10Clément Goubert: revscoring-drafttopic: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018995 (https://phabricator.wikimedia.org/T362316)
[15:35:53] <wikibugs>	 (03PS1) 10Ahmon Dancy: Revert "mediawiki: Route /w/docs/ to /w/static.php" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019068
[15:36:06] <wikibugs>	 (03PS2) 10Clément Goubert: revscoring-editquality-damaging: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018996 (https://phabricator.wikimedia.org/T362316)
[15:36:14] <wikibugs>	 (03PS2) 10Clément Goubert: revscoring-editquality-damaging: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018997 (https://phabricator.wikimedia.org/T362316)
[15:36:40] <wikibugs>	 (03PS3) 10Clément Goubert: revscoring-editquality-damaging: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018997 (https://phabricator.wikimedia.org/T362316)
[15:36:46] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2002.codfw.wmnet with reason: host reimage
[15:37:00] <wikibugs>	 (03PS2) 10Clément Goubert: revscoring-editquality-goodfaith: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018998 (https://phabricator.wikimedia.org/T362316)
[15:37:08] <wikibugs>	 (03PS2) 10Clément Goubert: revscoring-editquality-goodfaith: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018999 (https://phabricator.wikimedia.org/T362316)
[15:37:27] <wikibugs>	 (03PS3) 10Clément Goubert: revscoring-editquality-goodfaith: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018999 (https://phabricator.wikimedia.org/T362316)
[15:37:46] <wikibugs>	 (03PS2) 10Clément Goubert: revscoring-editquality-reverted: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019000 (https://phabricator.wikimedia.org/T362316)
[15:37:54] <wikibugs>	 (03PS2) 10Clément Goubert: revscoring-editquality-reverted: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019001 (https://phabricator.wikimedia.org/T362316)
[15:38:12] <wikibugs>	 (03PS3) 10Clément Goubert: revscoring-editquality-reverted: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019001 (https://phabricator.wikimedia.org/T362316)
[15:38:19] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Add striker_toolsbeta to the list of m5 backups [puppet] - 10https://gerrit.wikimedia.org/r/1019069 (https://phabricator.wikimedia.org/T360149)
[15:38:20] <wikibugs>	 (03PS3) 10BCornwall: cache: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868709 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[15:39:43] <wikibugs>	 (03PS10) 10Jgreen: community-crm: Add dyna and discovery records [dns] - 10https://gerrit.wikimedia.org/r/983950 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[15:39:44] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2002.codfw.wmnet with reason: host reimage
[15:40:27] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1881/co" [puppet] - 10https://gerrit.wikimedia.org/r/868709 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[15:41:08] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: 14Requesting access to shell access to analytics client servers for AndyRussG - 14https://phabricator.wikimedia.org/T361742#9707721 (10AndyRussG) 14>>! In T361742#9706455, @Milimetric wrote: > Approved, welcome back Andy :)  Woohoo, thanks! :) :)  >>! In...
[15:41:31] <wikibugs>	 (03CR) 10Jgreen: [C:03+2] community-crm: Add dyna and discovery records [dns] - 10https://gerrit.wikimedia.org/r/983950 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[15:43:16] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] cache: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868709 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[15:43:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60435 and previous config saved to /var/cache/conftool/dbconfig/20240411-154330-root.json
[15:44:30] <wikibugs>	 (03CR) 10Clément Goubert: ml-serve: Add istio config for mw-api-int-ro (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019061 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert)
[15:45:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2110 (re)pooling @ 50%: repool', diff saved to https://phabricator.wikimedia.org/P60436 and previous config saved to /var/cache/conftool/dbconfig/20240411-154510-arnaudb.json
[15:45:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 (re)pooling @ 50%: repool', diff saved to https://phabricator.wikimedia.org/P60437 and previous config saved to /var/cache/conftool/dbconfig/20240411-154524-arnaudb.json
[15:45:26] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade.
[15:47:04] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-drmrs and A:cp
[15:51:21] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe1002.eqiad.wmnet with OS bookworm
[15:51:59] <wikibugs>	 (03PS1) 10Cwhite: opensearch: bump curator version to wmf4 [puppet] - 10https://gerrit.wikimedia.org/r/1018417 (https://phabricator.wikimedia.org/T348508)
[15:56:17] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe2002.codfw.wmnet with OS bookworm
[15:57:18] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1019063 (owner: 10Muehlenhoff)
[15:58:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60438 and previous config saved to /var/cache/conftool/dbconfig/20240411-155836-root.json
[15:59:26] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "LGTM. If you can have this merged and verified during the puppet window today, let me know and I can help you get this out to k8s during t" [puppet] - 10https://gerrit.wikimedia.org/r/1018335 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy)
[16:00:05] <jouncebot>	 jhathaway: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1600).
[16:00:05] <jouncebot>	 dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2110 (re)pooling @ 75%: repool', diff saved to https://phabricator.wikimedia.org/P60439 and previous config saved to /var/cache/conftool/dbconfig/20240411-160016-arnaudb.json
[16:00:25] <dancy>	 o/
[16:00:31] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 (re)pooling @ 75%: repool', diff saved to https://phabricator.wikimedia.org/P60440 and previous config saved to /var/cache/conftool/dbconfig/20240411-160030-arnaudb.json
[16:01:16] <wikibugs>	 (03PS3) 10Clément Goubert: ml-serve: Add istio config for mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019061 (https://phabricator.wikimedia.org/T362316)
[16:02:04] <jhathaway>	 o/
[16:03:41] <herron>	 !log beginning rolling hardware upgrades for titan100[12] T361251
[16:03:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:54] <jhathaway>	 dancy: just merge both patches?
[16:03:55] <stashbot>	 T361251: titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251
[16:04:25] <dancy>	 jhathaway: Yes please.
[16:04:35] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] Serve mw.org/ontology/ontology.owl via /w/static.php (v2) [puppet] - 10https://gerrit.wikimedia.org/r/1018335 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy)
[16:04:43] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] Revert "Route /w/docs/ to /w/static.php" [puppet] - 10https://gerrit.wikimedia.org/r/1019067 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy)
[16:04:49] <wikibugs>	 (03PS4) 10Clément Goubert: ml-serve: Add istio config for mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019061 (https://phabricator.wikimedia.org/T362316)
[16:05:23] <wikibugs>	 (03CR) 10Ahmon Dancy: "Thanks!  Jesse Hathaway is handling this one for me during the puppet window (right now).  I did add https://gerrit.wikimedia.org/r/c/oper" [puppet] - 10https://gerrit.wikimedia.org/r/1018335 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy)
[16:05:47] <wikibugs>	 (03PS1) 10Hashar: Revert "Update mobile search for dark mode, remove unused functions in MobilePage.php" [extensions/MobileFrontend] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018974 (https://phabricator.wikimedia.org/T362297)
[16:06:28] <jhathaway>	 dancy: merged
[16:06:54] <dancy>	 Thanks!  I'll do some testing in 10 minutes.
[16:07:22] <hashar>	 jouncebot: nowandnext
[16:07:23] <jouncebot>	 For the next 0 hour(s) and 52 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1600)
[16:07:23] <jouncebot>	 In 0 hour(s) and 52 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1700)
[16:07:23] <jouncebot>	 In 0 hour(s) and 52 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1700)
[16:07:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:07:59] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1002 using scap backport" [extensions/MobileFrontend] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018974 (https://phabricator.wikimedia.org/T362297) (owner: 10Hashar)
[16:08:50] <wikibugs>	 (03PS1) 10Clément Goubert: ml-staging-codfw: Override mediawiki-app-vs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019074 (https://phabricator.wikimedia.org/T362316)
[16:10:57] <dancy>	 jhathaway: Can you run-puppet-agent on mwdebug1001.eqiad.wmnet ?
[16:11:06] <jhathaway>	 nod
[16:12:50] <jhathaway>	 dancy: done
[16:15:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2110 (re)pooling @ 100%: repool', diff saved to https://phabricator.wikimedia.org/P60441 and previous config saved to /var/cache/conftool/dbconfig/20240411-161522-arnaudb.json
[16:15:37] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 (re)pooling @ 100%: repool', diff saved to https://phabricator.wikimedia.org/P60442 and previous config saved to /var/cache/conftool/dbconfig/20240411-161536-arnaudb.json
[16:19:21] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Migrate db2098 backups to db2198 and upgrade dbprov2002 to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1018276 (https://phabricator.wikimedia.org/T360751)
[16:19:21] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Reenable notifications for db2201 & db2202 [puppet] - 10https://gerrit.wikimedia.org/r/1019077 (https://phabricator.wikimedia.org/T355422)
[16:20:12] <wikibugs>	 (03CR) 10Jcrespo: [C:04-1] mariadb: Reenable notifications for db2201 & db2202 [puppet] - 10https://gerrit.wikimedia.org/r/1019077 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo)
[16:26:05] <wikibugs>	 (03PS7) 10TChin: Add datasets-config helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434)
[16:27:10] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host matomo1003.eqiad.wmnet with OS bookworm
[16:27:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert "Update mobile search for dark mode, remove unused functions in MobilePage.php" [extensions/MobileFrontend] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018974 (https://phabricator.wikimedia.org/T362297) (owner: 10Hashar)
[16:27:33] <jinxer-wm>	 (ProbeDown) resolved: (2) Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:28:01] <wikibugs>	 (03CR) 10Hashar: [V:03+2] "I am submitting this change directly, the sole failure comes from a Selenium test for GrowthExperiments:" [extensions/MobileFrontend] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018974 (https://phabricator.wikimedia.org/T362297) (owner: 10Hashar)
[16:28:28] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:29:00] <logmsgbot>	 !log hashar@deploy1002 Started scap: Backport for [[gerrit:1018974|Revert "Update mobile search for dark mode, remove unused functions in MobilePage.php" (T362297)]]
[16:29:05] <stashbot>	 T362297: [Bug] Mobile watchlist broken - https://phabricator.wikimedia.org/T362297
[16:30:41] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump container to 2024-04-11-122429-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019080
[16:31:41] <logmsgbot>	 !log hashar@deploy1002 hashar: Backport for [[gerrit:1018974|Revert "Update mobile search for dark mode, remove unused functions in MobilePage.php" (T362297)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:33:17] <logmsgbot>	 !log hashar@deploy1002 hashar: Continuing with sync
[16:33:28] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:33:50] <hashar>	 ^ verified via the debug server and using  https://m.mediawiki.org/wiki/Special:Watchlist?debug=1 to nuke the resourceloader cache
[16:34:51] <wikibugs>	 (03CR) 10Jcrespo: [C:04-1] "Was the host reimaged?" [puppet] - 10https://gerrit.wikimedia.org/r/1019077 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo)
[16:39:16] <wikibugs>	 (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2024-04-11-122429-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019080 (owner: 10BryanDavis)
[16:40:22] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump container to 2024-04-11-122429-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019080 (owner: 10BryanDavis)
[16:44:18] <wikibugs>	 10ops-eqiad, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251#9708003 (10VRiley-WMF)
[16:45:38] <wikibugs>	 10ops-eqiad, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): 14titan100[12] ram/ssd upgrade coordination - 14https://phabricator.wikimedia.org/T361251#9708006 (10VRiley-WMF) 05Open→03Resolved 14Worked with @herron and upgraded these servers. They came back properly and making this ti...
[16:45:48] <logmsgbot>	 !log hashar@deploy1002 Finished scap: Backport for [[gerrit:1018974|Revert "Update mobile search for dark mode, remove unused functions in MobilePage.php" (T362297)]] (duration: 16m 47s)
[16:45:53] <stashbot>	 T362297: [Bug] Mobile watchlist broken - https://phabricator.wikimedia.org/T362297
[16:46:27] <wikibugs>	 (03PS1) 10TChin: Add datasets-config helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434)
[16:48:28] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:49:43] <wikibugs>	 (03PS1) 10Dzahn: delete cas-logtash.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1019086
[16:50:29] <wikibugs>	 (03CR) 10Dzahn: "de" [dns] - 10https://gerrit.wikimedia.org/r/1019086 (owner: 10Dzahn)
[16:52:06] <wikibugs>	 (03CR) 10TChin: Add datasets-config helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin)
[16:53:50] <wikibugs>	 (03CR) 10TChin: Add datasets-config helm chart (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin)
[16:55:39] <wikibugs>	 (03PS1) 10Dzahn: delete kibana-next.svc.[eqiad|codfw].wmnet records [dns] - 10https://gerrit.wikimedia.org/r/1019087
[17:00:05] <jouncebot>	 bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1700).
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T1700)
[17:00:05] <jouncebot>	 dancy: A patch you scheduled for MediaWiki infrastructure (UTC late) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:11] <dancy>	 o/
[17:02:58] <hashar>	 I am done with the backport
[17:06:05] <hashar>	 dancy: I don't know who runs those helm deployments :)
[17:06:15] <hashar>	 but once you are done, I will proceed with the train 
[17:06:16] <wikibugs>	 (03PS2) 10Ahmon Dancy: Revert "mediawiki: Route /w/docs/ to /w/static.php" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019068 (https://phabricator.wikimedia.org/T171807)
[17:06:51] <dancy>	 Thanks hashar.  I'll ping you.
[17:07:08] <dancy>	 Or you can hand it off to me if you wish.
[17:07:31] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "LGTM. Thanks for cleaning this up since it won't be used. I'll work with you to get this deployed." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019068 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy)
[17:07:37] <hashar>	 if you don't mind, that would let me have dinner with kids :-]
[17:07:47] <dancy>	 I don't mind.  Enjoy the fam!
[17:07:57] <hashar>	 there was not much showing up, but the train got blocked due to an off-by-one issue in the mobile Special:Watchlist
[17:08:15] <wikibugs>	 (03CR) 10Scott French: [C:03+2] Revert "mediawiki: Route /w/docs/ to /w/static.php" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019068 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy)
[17:08:16] <hashar>	 and we did the train log triage a couple of hours ago; it is all quiet
[17:08:24] <dancy>	 Excellent.
[17:08:24] <hashar>	 cool! thank you Ahmon!
[17:09:33] * hashar heads to dinner
[17:09:51] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mediawiki: Route /w/docs/ to /w/static.php" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019068 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy)
[17:10:15] * bd808 has a developer-portal version bump to roll out
[17:10:47] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:10:53] <wikibugs>	 (03CR) 10Jcrespo: [C:04-1] "Data sources:" [puppet] - 10https://gerrit.wikimedia.org/r/1019077 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo)
[17:11:05] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:11:18] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:11:53] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:12:03] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:12:30] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:16:14] * bd808 is done
[17:17:16] <swfrench-wmf>	 dancy: https://gerrit.wikimedia.org/r/1019068 has been picked up on deploy1002 - I'll get that moving shortly
[17:17:31] <dancy>	 👍🏾
[17:20:00] <logmsgbot>	 !log swfrench@deploy1002 Started scap: (no justification provided)
[17:27:58] <logmsgbot>	 !log swfrench@deploy1002 Finished scap: (no justification provided) (duration: 07m 57s)
[17:32:38] <dancy>	 Rolling the train!
[17:35:12] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019097 (https://phabricator.wikimedia.org/T360158)
[17:35:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.42.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019097 (https://phabricator.wikimedia.org/T360158) (owner: 10TrainBranchBot)
[17:36:00] <wikibugs>	 (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019097 (https://phabricator.wikimedia.org/T360158) (owner: 10TrainBranchBot)
[17:40:42] <wikibugs>	 (03PS29) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[17:41:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking)
[17:50:47] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.26  refs T360158
[17:50:54] <stashbot>	 T360158: 1.42.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T360158
[18:00:25] <jinxer-wm>	 (SystemdUnitFailed) firing: prometheus-wmf-elasticsearch-exporter-9600.service on elastic2090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:04:47] <wikibugs>	 10ops-eqiad, 06SRE, 10decommission-hardware: decommission wdqs1025.eqiad.wmnet - https://phabricator.wikimedia.org/T362122#9708246 (10VRiley-WMF) a:03VRiley-WMF
[18:09:15] <wikibugs>	 10ops-eqiad, 06SRE, 10decommission-hardware: decommission wdqs1025.eqiad.wmnet - https://phabricator.wikimedia.org/T362122#9708248 (10VRiley-WMF)
[18:10:18] <wikibugs>	 10ops-eqiad, 06SRE, 10decommission-hardware: 14decommission wdqs1025.eqiad.wmnet - 14https://phabricator.wikimedia.org/T362122#9708249 (10VRiley-WMF) 05Open→03Resolved 14Removed server and ran decommission script
[18:10:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2010 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:20:40] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2010 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:26:54] <wikibugs>	 (03PS11) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321
[18:30:46] <wikibugs>	 (03CR) 10CDobbins: [C:03+2] purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 (owner: 10CDobbins)
[18:36:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2010 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:39:15] <jinxer-wm>	 (JobrunnerPHPBusyWorkers) firing: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
[18:41:40] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2010 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:49:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T356166)', diff saved to https://phabricator.wikimedia.org/P60443 and previous config saved to /var/cache/conftool/dbconfig/20240411-184951-marostegui.json
[18:49:53] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: 14Requesting access to analytics-privatedata-users for Steph Toyofuku - 14https://phabricator.wikimedia.org/T362113#9708314 (10SToyofuku-WMF) 14Thank you so much!!!
[18:49:57] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[18:50:05] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [dns] - 10https://gerrit.wikimedia.org/r/1019086 (owner: 10Dzahn)
[18:50:32] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [dns] - 10https://gerrit.wikimedia.org/r/1019087 (owner: 10Dzahn)
[18:51:08] <wikibugs>	 06SRE, 06serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9708317 (10jijiki)
[18:54:43] <wikibugs>	 (03CR) 10Dwisehaupt: [V:03+1] "Yes, I have verified that 443 is not in use. I'm ok with doing two puppet runs in this case as we are still spinning up the service. I bel" [puppet] - 10https://gerrit.wikimedia.org/r/1018362 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[18:55:40] <wikibugs>	 (03CR) 10Dwisehaupt: "Thanks. I think we should hold on this until the last step, once everything else is verified as working." [puppet] - 10https://gerrit.wikimedia.org/r/983951 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[19:03:57] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:04:48] <urandom>	 \o
[19:04:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P60445 and previous config saved to /var/cache/conftool/dbconfig/20240411-190459-marostegui.json
[19:07:19] <cwhite>	 urandom: same as yesterday it seems...
[19:07:33] <urandom>	 what is http_jobrunner_ip4 precisely?
[19:08:13] <urandom>	 i.e. what is the relationship there?
[19:08:17] <swfrench-wmf>	 1m loadavg ~ 250 on 4 machines each with 24 physical cores ... apparently that
[19:08:26] <swfrench-wmf>	 's the tipping point or thereabouts
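
The figures quoted above (1m loadavg of about 250 on hosts with 24 physical cores) can be turned into a per-core ratio. The snippet below is only a back-of-the-envelope sketch: the function name is hypothetical, and it assumes the usual Linux meaning of load average (runnable plus uninterruptible tasks), so a ratio well above 1 suggests heavy queuing rather than measuring it exactly.

```python
def load_per_core(loadavg_1m: float, physical_cores: int) -> float:
    """Return the 1-minute load average divided by the physical core count."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return loadavg_1m / physical_cores


# Numbers taken from the chat: loadavg ~250, 24 physical cores per machine.
ratio = load_per_core(250, 24)
print(f"{ratio:.1f} tasks per physical core")  # prints "10.4 tasks per physical core"
```
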
[19:08:27] <mutante>	 it's probably a large video being scaled
[19:08:31] <cwhite>	 it's a blackbox probe job.  prometheus<->blackbox exporter<->jobrunner host
[19:08:51] <cwhite>	 think something like icinga check_http
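
As a rough illustration of the check_http-style probe described above, here is a minimal sketch in Python. It is not the Prometheus blackbox exporter and not the production probe configuration; the function name, the timeout value, and the injectable `opener` parameter are assumptions made for this example. The idea is simply: request a URL, and report the target as up only if it answers with a 2xx status before the timeout.

```python
import urllib.error
import urllib.request


def probe_http(url, timeout=3.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with a 2xx status within `timeout` seconds.

    `opener` is injectable so the probe can be exercised without a network;
    by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, TLS and DNS
        # failures, and timeouts: all of them count as a failed probe.
        return False
```

A host under very high load may still be running but answer too slowly, in which case a probe of this kind times out and reports failure, which is consistent with the ProbeDown alerts flapping in this log.
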
[19:08:57] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:09:00] <mutante>	 since videoscaler and jobrunner share machines
[19:09:03] <cdanis>	 urandom: jobrunners are just a special class of appservers, if that's what you're asking
[19:10:14] <urandom>	 then how does it relate/compare with http_videoscaler_ip4?
[19:10:58] <mutante>	 urandom: https://config-master.wikimedia.org/pybal/eqiad/videoscaler  https://config-master.wikimedia.org/pybal/eqiad/jobrunner
[19:11:06] <mutante>	 ^ same mw machines hosting both services
[19:11:34] <urandom>	 I'm articulating my question wrong, I think
[19:11:46] <mutante>	 they are separate services but run on the same backends
[19:12:08] <mutante>	 at one point we had different weight settings though
[19:12:28] <mutante>	 and some machines were made dedicated to one or the other, to prevent that
[19:12:57] <cdanis>	 https://phabricator.wikimedia.org/T279100
[19:13:02] <cdanis>	 https://phabricator.wikimedia.org/T306860
[19:13:25] <wikibugs>	 (03PS1) 10CDobbins: Revert "purged: add PKI cert handling" [puppet] - 10https://gerrit.wikimedia.org/r/1018977
[19:14:49] <urandom>	 Ok, so near-term it would seem we need to bring the concurrency down again, yes?
[19:14:58] <cwhite>	 yes
[19:15:03] <urandom>	 !incidents
[19:15:03] <sirenbot>	 4584 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[19:15:23] <urandom>	 I'll work on that...
[19:15:27] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:15:31] <cdanis>	 urandom: another thing that has been done in the past is to have some jobrunners that aren't also videoscalers
[19:15:56] <cdanis>	 i think that would be reasonable as well, and it would likely make the probedown for jobrunner stop
[19:16:22] <taavi>	 do the mw-on-k8s jobrunners use a different endpoint?
[19:17:10] <swfrench-wmf>	 jobrunner.discovery.wmnet vs. mw-jobrunner.discovery.wmnet
[19:17:55] <swfrench-wmf>	 actually, given that changeprop-jobqueue only uses the latter now, are there _any_ uses of jobrunner.discovery.wmnet? (as in non-videoscaling hitting these machines)
[19:17:56] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1882/console" [puppet] - 10https://gerrit.wikimedia.org/r/1018977 (owner: 10CDobbins)
[19:18:05] <cdanis>	 swfrench-wmf: great question
[19:18:13] <mutante>	     mw1445.eqiad.wmnet: [apache2, nginx]         # Only pooled as videoscaler
[19:18:16] <mutante>	     mw1446.eqiad.wmnet: [apache2, nginx]         # Only pooled as videoscaler
[19:18:19] <wikibugs>	 (03CR) 10CDobbins: [V:03+1 C:03+2] Revert "purged: add PKI cert handling" [puppet] - 10https://gerrit.wikimedia.org/r/1018977 (owner: 10CDobbins)
[19:18:28] <mutante>	 ^ this is from conftool-data/node/eqiad.yaml
[19:18:54] <cdanis>	 mutante: ok but the weights aren't set like that in etcd
[19:19:17] <swfrench-wmf>	 was just about to say - the comments are a lie :)
[19:19:29] <mutante>	 it seems like serviceops decided to change that back
[19:20:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P60446 and previous config saved to /var/cache/conftool/dbconfig/20240411-192006-marostegui.json
[19:20:11] <mutante>	 I just remember this from before k8s
[19:20:27] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:21:08] <cdanis>	 swfrench-wmf: this is making me hopeful -- https://codesearch.wmcloud.org/search/?q=%5B%5E-%5Djobrunner%5C.discovery%5C.wmnet&files=&excludeFiles=&repos=
[19:22:33] <urandom>	 cdanis: so we could depool a node or two from videoscaler and hopefully restore some headroom for other things?
[19:22:50] <swfrench-wmf>	 cdanis: that looks promising, yeah :)
[19:22:56] <taavi>	 so maybe we can just disable pages for the baremetal jobrunner service?
[19:23:00] <cdanis>	 yeah I think that would be very reasonable urandom
[19:23:10] <cdanis>	 i also think what taavi said is reasonable, although it's got more risk imo
[19:23:27] <urandom>	 cdanis: but we should lower concurrency, no?
[19:23:50] <cdanis>	 (since no one has expressed certainty that the baremetal jobrunner is vestigial)
[19:23:52] <urandom>	 err...also
[19:24:08] <cdanis>	 urandom: yes but also jobs will get retried, so that's not as critical, afaik
[19:24:39] <swfrench-wmf>	 https://phabricator.wikimedia.org/T349796#9562813 <- "All (non-videoscaler) jobs migrated to Kubernetes jobrunners. "
[19:24:55] <cdanis>	 ok
[19:24:59] <cdanis>	 silence that probedown :D
[19:25:46] <urandom>	 :)
[19:25:48] <mutante>	 The suggestion to disable paging could maybe be part of the next "alert review" (i think that's quarterly?)
[19:26:01] <cdanis>	 mutante: i mean, the endpoint should just probably be removed entirely at some point
[19:26:07] <mutante>	 *nod*
[19:26:09] <cdanis>	 wouldn't hurt to start with the prober definition though
[19:27:34] <cdanis>	 did the videoscaler alert get silenced or something?  that was alerting yesterday, was it not?
[19:27:57] <swfrench-wmf>	 it would probably be worth still dropping the concurrency though, in addition (since it's questionable whether these will recover on their own)
[19:29:47] <wikibugs>	 (03PS1) 10Eevans: changeprop-jobqueue: temporarily reduce video transcode concurrency (again) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019106
[19:31:00] <urandom>	 !incidents
[19:31:01] <sirenbot>	 4585 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[19:31:01] <sirenbot>	 4584 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[19:32:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:32:34] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "LGTM. This gets us to roughly where we were right before dropping all the way to 1 / 1 yesterday." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019106 (owner: 10Eevans)
[19:34:30] <urandom>	 Is there a way to create an indefinite silence in the alertmanager interface?
[19:35:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T356166)', diff saved to https://phabricator.wikimedia.org/P60447 and previous config saved to /var/cache/conftool/dbconfig/20240411-193514-marostegui.json
[19:35:17] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1248.eqiad.wmnet with reason: Maintenance
[19:35:27] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[19:35:30] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1248.eqiad.wmnet with reason: Maintenance
[19:35:31] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[19:35:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T356166)', diff saved to https://phabricator.wikimedia.org/P60448 and previous config saved to /var/cache/conftool/dbconfig/20240411-193537-marostegui.json
[19:35:53] <mutante>	 urandom: there is a way to do them via command line with this:  https://wikitech.wikimedia.org/wiki/Alertmanager#Add_a_silence_via_CLI
[19:36:19] <cwhite>	 I don't think that's the right approach.  If it's indeed not worth paging on, then we should disable its paging.
[19:37:33] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:38:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2010 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[19:39:02] <wikibugs>	 (03CR) 10Eevans: [C:03+2] changeprop-jobqueue: temporarily reduce video transcode concurrency (again) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019106 (owner: 10Eevans)
[19:40:50] <logmsgbot>	 !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[19:41:28] <logmsgbot>	 !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[19:42:28] <wikibugs>	 (03PS1) 10Andrea Denisse: Revert "ssl: Delete dummy TLS key for the Prometheus hosts" [labs/private] - 10https://gerrit.wikimedia.org/r/1018978
[19:42:44] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+2 C:03+2] Revert "ssl: Delete dummy TLS key for the Prometheus hosts" [labs/private] - 10https://gerrit.wikimedia.org/r/1018978 (owner: 10Andrea Denisse)
[19:43:00] <wikibugs>	 (03PS30) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[19:43:40] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2010 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[19:43:54] <wikibugs>	 (03PS1) 10CDanis: homedir prompt update [puppet] - 10https://gerrit.wikimedia.org/r/1019108
[19:44:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking)
[19:44:11] <wikibugs>	 (03CR) 10CDanis: [C:03+2] homedir prompt update [puppet] - 10https://gerrit.wikimedia.org/r/1019108 (owner: 10CDanis)
[19:45:57] <jinxer-wm>	 (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:46:18] <wikibugs>	 (03PS1) 10Dzahn: cloud/devtools: switch default puppetmaster from 1001 to 1003 [puppet] - 10https://gerrit.wikimedia.org/r/1019109 (https://phabricator.wikimedia.org/T360470)
[19:46:20] <wikibugs>	 (03PS2) 10David Martin: ext-EventLogging: Add mediawiki.product_metrics.wikifunctions_ui to $wgEventLoggingStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018317 (owner: 10Phuedx)
[19:46:33] <wikibugs>	 (03PS1) 10Cwhite: service catalog: disable paging on jobrunner and videoscaler services [puppet] - 10https://gerrit.wikimedia.org/r/1018420 (https://phabricator.wikimedia.org/T349796)
[19:47:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:48:52] <wikibugs>	 (03CR) 10Gergő Tisza: [C:04-1] logging: default to log any error (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar)
[19:49:00] <mutante>	 if it's not paging and not emailing or creating tickets, does it have value to monitor at all
[19:49:43] <wikibugs>	 (03PS1) 10JHathaway: vrts: create a profile for alias generation [puppet] - 10https://gerrit.wikimedia.org/r/1019110 (https://phabricator.wikimedia.org/T325395)
[19:49:44] <wikibugs>	 (03PS1) 10JHathaway: otrs.conf: rename and tighten up epp type definitions [puppet] - 10https://gerrit.wikimedia.org/r/1019111 (https://phabricator.wikimedia.org/T325395)
[19:50:17] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1019110 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway)
[19:50:43] <wikibugs>	 (03PS6) 10Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414)
[19:50:44] <wikibugs>	 (03CR) 10Andrea Denisse: "PCC results now show the certs are generated by CFSSL: https://puppet-compiler.wmflabs.org/output/1018749/1883/" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[19:50:57] <jinxer-wm>	 (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:51:02] <wikibugs>	 (03CR) 10Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/1018420/1884/" [puppet] - 10https://gerrit.wikimedia.org/r/1018420 (https://phabricator.wikimedia.org/T349796) (owner: 10Cwhite)
[19:51:24] <wikibugs>	 (03CR) 10CDanis: [C:03+1] service catalog: disable paging on jobrunner and videoscaler services [puppet] - 10https://gerrit.wikimedia.org/r/1018420 (https://phabricator.wikimedia.org/T349796) (owner: 10Cwhite)
[19:51:26] <urandom>	 cwhite: are we disabling paging for jobrunner *and* videoscaler?
[19:52:14] <urandom>	 I'd understood the former to be noise, because it was collateral damage, and strictly in service to, the latter
[19:52:33] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:52:38] <cwhite>	 urandom: I offer that up for discussion.  The two are both paging simultaneously.
[19:52:58] <cwhite>	 Just disabling the jobrunner one will still render videoscaler pages.
[19:53:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] vrts: create a profile for alias generation [puppet] - 10https://gerrit.wikimedia.org/r/1019110 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway)
[19:54:07] <urandom>	 I haven't seen any videoscaler pages since https://portal.victorops.com/ui/wikimedia/incident/4578/details (yesterday)
[19:54:28] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] cloud/devtools: switch default puppetmaster from 1001 to 1003 [puppet] - 10https://gerrit.wikimedia.org/r/1019109 (https://phabricator.wikimedia.org/T360470) (owner: 10Dzahn)
[19:54:43] <swfrench-wmf>	 would it be worth silencing for now and giving h.nowlan a chance to review https://gerrit.wikimedia.org/r/1018420 before merging?
[19:54:49] <urandom>	 though they were precipitated by/accompanied with plenty of these http_jobrunner_ip4 pages
[19:55:19] <swfrench-wmf>	 basically, while the right call right now is to silence, I would say that we also took an action (attempt to mitigate by dropping concurrency)
[19:55:34] <mutante>	 duration=7d ?
[19:56:03] <cwhite>	 urandom: Pretty sure that's what the `firing (2)` means.  It means 2xProbeDown alerts are firing, but only one description gets added to the IRC message
[19:56:42] <urandom>	 cwhite: oh, right you are
[19:56:51] <cwhite>	 c.f.: https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fservice&var-module=All&orgId=1
[19:56:59] <swfrench-wmf>	 whether that action is required in order for the videoscalers to come good is a question I don't have the expertise to answer (but judging by h.nowlan's actions in the first part of 4/10, I suspect so)
[19:58:15] <wikibugs>	 (03PS2) 10JHathaway: vrts: create a profile for alias generation [puppet] - 10https://gerrit.wikimedia.org/r/1019110 (https://phabricator.wikimedia.org/T325395)
[19:58:15] <wikibugs>	 (03PS2) 10JHathaway: otrs.conf: rename and tighten up epp type definitions [puppet] - 10https://gerrit.wikimedia.org/r/1019111 (https://phabricator.wikimedia.org/T325395)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240411T2000).
[20:00:05] <jouncebot>	 dmartin-WMF: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:35] <dmartin-WMF>	 Hello folks, I'm here
[20:00:47] <urbanecm>	 I can deploy today 
[20:00:50] <urbanecm>	 How are you?
[20:01:01] <dmartin-WMF>	 Great.  I am fine, very fine.  How are you Martin?
[20:01:33] <urbanecm>	 Good as well!
[20:01:58] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] ext-EventLogging: Add mediawiki.product_metrics.wikifunctions_ui to $wgEventLoggingStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018317 (owner: 10Phuedx)
[20:02:03] <dmartin-WMF>	 What's your location, if I may ask.  I'm in the Bay area, Northern California
[20:02:26] <urbanecm>	 Prague, Czech Republic. Quite far from the Bay area :)
[20:02:40] <dmartin-WMF>	 Wow.  I visited Prague once briefly.  I loved it
[20:02:46] <wikibugs>	 (03Merged) 10jenkins-bot: ext-EventLogging: Add mediawiki.product_metrics.wikifunctions_ui to $wgEventLoggingStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018317 (owner: 10Phuedx)
[20:02:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018317 (owner: 10Phuedx)
[20:03:04] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1018317|ext-EventLogging: Add mediawiki.product_metrics.wikifunctions_ui to $wgEventLoggingStreamNames]]
[20:03:07] <mutante>	 location:  Cross Club
[20:03:43] <jinxer-wm>	 (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[20:03:44] <jinxer-wm>	 (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[20:03:51] <wikibugs>	 (03CR) 10Scott French: "Thanks, Cole. Would it be ok to silence for now, and wait a review cycle to get Hnowlan's thoughts on this too?" [puppet] - 10https://gerrit.wikimedia.org/r/1018420 (https://phabricator.wikimedia.org/T349796) (owner: 10Cwhite)
[20:03:57] <urandom>	 awesome.
[20:04:09] <urbanecm>	 not the best message to see during a window
[20:05:33] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and phuedx: Backport for [[gerrit:1018317|ext-EventLogging: Add mediawiki.product_metrics.wikifunctions_ui to $wgEventLoggingStreamNames]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:05:52] <dmartin-WMF>	 Okay I should do my verification step now right?  Give me a minute please
[20:06:20] <cdanis>	 urbanecm: it looks like there was a spike of traffic right as the window started, that since stopped, I think the alerts will clear in a minute or two and it's fine to proceed
[20:06:25] <wikibugs>	 10ops-eqiad, 06SRE, 10decommission-hardware: 14decommission dumpsdata1002.eqiad.wmnet - 14https://phabricator.wikimedia.org/T362065#9708448 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF 14Unracked server and ran the script for decommission 
[20:06:29] <urbanecm>	 dmartin-WMF: indeed, please check.
[20:06:47] <urbanecm>	 cdanis: ack, thanks for the info. do you want me to wait for the clear just in case?
[20:07:20] <dmartin-WMF>	 Okay, the verification for my patch is completed
[20:07:23] <cdanis>	 urbanecm: you're good to go, the unavailable metrics track error rate, and it's returned to 0 https://grafana.wikimedia.org/d/O_OXJyTVk/home-w-wiki-status?orgId=1&refresh=5m&from=now-30m&to=now&viewPanel=8
[20:07:33] <urbanecm>	 awesome, thanks cdanis
[20:07:41] <urbanecm>	 dmartin-WMF: i take it that the patch works as expected? :)
[20:07:59] <dmartin-WMF>	 Yes it does.  This is in reference to gerrit:1018317
[20:08:13] <urbanecm>	 thanks! proceeding
[20:08:14] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and phuedx: Continuing with sync
[20:08:44] <jinxer-wm>	 (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[20:08:44] <jinxer-wm>	 (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[20:09:25] <wikibugs>	 (03PS1) 10JHathaway: postfix: prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1019115 (https://phabricator.wikimedia.org/T325395)
[20:09:28] <wikibugs>	 (03PS1) 10JHathaway: postfix: prometheus ops config [puppet] - 10https://gerrit.wikimedia.org/r/1019116 (https://phabricator.wikimedia.org/T325395)
[20:09:44] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1019110 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway)
[20:10:30] <urandom>	 cwhite: what happens with `page: false` here?  Does it still show in alertmanager? Do we still get alerts here in IRC (i.e. w/o the #p.age tag)?
[20:11:27] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1019111 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway)
[20:12:21] <cwhite>	 urandom: It still checks and fires alerts, but doesn't get sent to the pager anymore
[20:13:11] <cwhite>	 and the #p.age tag gets dropped from the IRC message
[20:13:18] <wikibugs>	 (03CR) 10Eevans: [C:03+1] "I'd be fine with either approach." [puppet] - 10https://gerrit.wikimedia.org/r/1018420 (https://phabricator.wikimedia.org/T349796) (owner: 10Cwhite)
[20:14:04] * cwhite submitted a silence on the jobrunner module in alertmanager that expires on Monday at 08:00Z
[20:14:16] <urandom>	 that works too
[20:15:20] <urandom>	 !incidents
[20:15:20] <sirenbot>	 4588 (RESOLVED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[20:15:21] <sirenbot>	 4587 (RESOLVED)  VarnishUnavailable global sre (varnish-text thanos-rule)
[20:15:21] <sirenbot>	 4586 (RESOLVED)  ProbeDown sre (10.2.2.26 ip4 jobrunner:443 probes/service http_jobrunner_ip4 eqiad)
[20:15:21] <sirenbot>	 4585 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[20:15:21] <sirenbot>	 4584 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[20:16:01] <cwhite>	 Since we reduced the concurrency, I've updated the doc status back to monitoring
[20:17:11] <urandom>	 it's worth watching for a while longer, but it doesn't seem like we moved the needle much
[20:20:42] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1018317|ext-EventLogging: Add mediawiki.product_metrics.wikifunctions_ui to $wgEventLoggingStreamNames]] (duration: 17m 38s)
[20:22:37] <urbanecm>	 dmartin-WMF: should be deployed by now :)
[20:24:01] <dmartin-WMF>	 urbanecm: That's great.  Yes in fact I was able to verify on deployment just now.  Thank you!
[20:24:20] <urbanecm>	 sounds good
[20:24:25] <dmartin-WMF>	 I mean, verify on production
[20:26:12] <urbanecm>	 i understood :)
[20:26:53] <dmartin-WMF>	 Right :)
[20:32:33] <wikibugs>	 (03PS1) 10Ahmon Dancy: values-traindev.yaml: Update train-dev repo URL in comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019122
[20:34:23] <jinxer-wm>	 (ProbeDown) firing: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:36:06] <wikibugs>	 (03CR) 10Brennen Bearnes: [C:03+1] cloud/devtools: switch default puppetmaster from 1001 to 1003 [puppet] - 10https://gerrit.wikimedia.org/r/1019109 (https://phabricator.wikimedia.org/T360470) (owner: 10Dzahn)
[20:37:36] <wikibugs>	 (03CR) 10Herron: "LGTM please see one minor comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[20:39:23] <jinxer-wm>	 (ProbeDown) resolved: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:42:49] <wikibugs>	 (03PS31) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[20:43:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking)
[20:51:57] <wikibugs>	 (03CR) 10Scott French: [C:03+2] values-traindev.yaml: Update train-dev repo URL in comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019122 (owner: 10Ahmon Dancy)
[20:52:51] <wikibugs>	 (03Merged) 10jenkins-bot: values-traindev.yaml: Update train-dev repo URL in comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019122 (owner: 10Ahmon Dancy)
[20:55:05] <wikibugs>	 (03PS3) 10JHathaway: vrts: create a profile for alias generation [puppet] - 10https://gerrit.wikimedia.org/r/1019110 (https://phabricator.wikimedia.org/T325395)
[20:55:05] <wikibugs>	 (03PS3) 10JHathaway: otrs.conf: rename and tighten up epp type definitions [puppet] - 10https://gerrit.wikimedia.org/r/1019111 (https://phabricator.wikimedia.org/T325395)
[20:55:41] <wikibugs>	 (03PS32) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[20:56:21] <wikibugs>	 (03CR) 10Dzahn: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[20:56:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking)
[20:57:09] <wikibugs>	 (03CR) 10Dzahn: "A fail isn't a NOOP though?" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[20:58:45] <wikibugs>	 (03PS7) 10Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414)
[20:59:19] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1019111 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway)
[20:59:46] <wikibugs>	 (03CR) 10Herron: "> A fail isn't a NOOP though?" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[21:00:22] <wikibugs>	 (03CR) 10Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[21:01:06] <wikibugs>	 (03CR) 10Dzahn: "Ack" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[21:03:13] <wikibugs>	 (03CR) 10Herron: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[21:06:44] <wikibugs>	 (03PS33) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[21:07:08] <wikibugs>	 (03CR) 10Andrea Denisse: "Here are the PCC results with the latest patchset, certificates are now correctly generated by CFSSL. Keep in mind that the Hosts that hav" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[21:07:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking)
[21:56:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[22:00:25] <jinxer-wm>	 (SystemdUnitFailed) firing: prometheus-wmf-elasticsearch-exporter-9600.service on elastic2090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:01:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[22:01:16] <Daimona>	 Hey all! I would like to update a DB row in production for a record that got soft-deleted accidentally. Can I use mysql.php to do so, and are there specific precautions I should take? Context is T362365
[22:01:16] <stashbot>	 T362365: Event registration should not be disabled after marking the event page for translation - https://phabricator.wikimedia.org/T362365
[22:03:35] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T362366 (10phaultfinder) 03NEW
[22:04:57] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:09:57] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:12:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[22:13:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 820.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:14:45] <urandom>	 looks like we had a silence in place for the jobrunner, but not the videoscaler, so I went ahead and created a matching one
[22:16:34] <wikibugs>	 (03PS1) 10JHathaway: Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398)
[22:17:04] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway)
[22:18:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 820.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:21:48] <wikibugs>	 (03PS2) 10JHathaway: Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398)
[22:22:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway)
[22:27:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[22:39:15] <jinxer-wm>	 (JobrunnerPHPBusyWorkers) firing: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
[22:55:38] <jan_drewniak>	 hashar: "jan_drewniak: hi, will you backport the mobilefrontend revert ? :) if we can land it and resume the train, that would be great." -- Sorry, I just saw this message. Thank you for taking care of that earlier today!
[23:06:43] <jinxer-wm>	 (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[23:06:44] <jinxer-wm>	 (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[23:11:20] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[23:11:43] <jinxer-wm>	 (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[23:11:44] <jinxer-wm>	 (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[23:17:04] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[23:25:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[23:35:27] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[23:38:13] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1018421
[23:38:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1018421 (owner: 10TrainBranchBot)
[23:40:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[23:46:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[23:53:28] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable