[00:01:05] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1035867 (owner: TrainBranchBot)
[00:04:54] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:04:58] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:12:31] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:12:35] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:17:10] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:17:14] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:26:32] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:26:36] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:31:32] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:31:36] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:39:44] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:39:48] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:54:43] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:54:48] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:59:12] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:59:16] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:02:20] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:02:25] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:11:43] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:11:47] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:15:32] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:15:36] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:19:15] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:19:19] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:22:45] <icinga-wm_>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 163 probes of 729 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:27:39] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:27:43] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:27:47] <icinga-wm_>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 43 probes of 729 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:36:01] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:36:05] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:47:09] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:47:13] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:51:47] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:51:51] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:02:55] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:02:59] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:08:57] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:09:01] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:11:05] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:11:10] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:18:18] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:18:22] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:31:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:33:06] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:33:10] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:34:54] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:34:58] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:36:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:36:47] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:49:52] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:49:56] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:52:05] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:52:10] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:54:14] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:54:18] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:56:47] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:56:52] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:56:56] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:05:40] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:05:44] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:06:47] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:08:25] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:08:29] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:11:50] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:11:54] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:19:53] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:19:57] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:24:41] <jinxer-wm>	 FIRING: [12x] SystemdUnitFailed: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:26:47] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:29:18] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:29:22] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:35:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on ml-staging-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:36:23] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:36:28] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:40:32] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:40:36] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:42:20] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:42:24] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:49:56] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:50:00] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:53:09] <icinga-wm_>	 PROBLEM - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[03:53:59] <icinga-wm_>	 PROBLEM - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[04:03:09] <icinga-wm_>	 RECOVERY - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[04:04:03] <icinga-wm_>	 RECOVERY - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[04:04:30] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:04:34] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:11:42] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:11:46] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:14:20] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:14:24] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:16:08] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:16:12] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:19:26] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:19:31] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:21:15] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:21:19] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:27:43] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:27:48] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:38:32] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:38:36] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:40:31] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:40:35] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:43:00] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[04:43:13] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[04:44:05] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:44:09] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:47:29] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:47:33] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:51:51] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797#9833744 (Marostegui) Open→Declined The RAID is still in optimal, let's close this for now.
[04:52:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[04:52:36] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[04:52:38] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[04:52:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[04:53:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T364069)', diff saved to https://phabricator.wikimedia.org/P63250 and previous config saved to /var/cache/conftool/dbconfig/20240527-045301-marostegui.json
[04:53:06] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[04:54:38] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:54:42] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:01:29] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:01:33] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:03:27] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:03:31] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:05:44] <wikibugs>	 (PS1) Marostegui: db1243: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1035903
[05:05:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1243', diff saved to https://phabricator.wikimedia.org/P63251 and previous config saved to /var/cache/conftool/dbconfig/20240527-050551-marostegui.json
[05:06:35] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:06:39] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:07:09] <wikibugs>	 (CR) Marostegui: [C:+2] db1243: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1035903 (owner: Marostegui)
[05:07:49] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1243.eqiad.wmnet with OS bookworm
[05:08:33] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:08:38] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:15:52] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:15:56] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:21:58] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:22:02] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:24:06] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:24:10] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:24:34] <logmsgbot>	 !log marostegui@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1243.eqiad.wmnet with OS bookworm
[05:24:59] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1243.eqiad.wmnet with OS bookworm
[05:25:35] <icinga-wm_>	 RECOVERY - snapshot of s2 in eqiad on backupmon1001 is OK: Last snapshot for s2 at eqiad (db1225) taken on 2024-05-27 04:06:28 (1245 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[05:33:37] <wikibugs>	 (CR) KartikMistry: [C:+2] Update cxserver to 2024-05-20-182409-production [deployment-charts] - https://gerrit.wikimedia.org/r/1034211 (https://phabricator.wikimedia.org/T354666) (owner: KartikMistry)
[05:33:57] <kart_>	 Deploying cxserver, minor config changes.
[05:34:04] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:34:08] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:34:47] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2024-05-20-182409-production [deployment-charts] - https://gerrit.wikimedia.org/r/1034211 (https://phabricator.wikimedia.org/T354666) (owner: KartikMistry)
[05:43:08] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:43:12] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:49:56] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:50:01] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:52:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T364069)', diff saved to https://phabricator.wikimedia.org/P63252 and previous config saved to /var/cache/conftool/dbconfig/20240527-055244-marostegui.json
[05:52:49] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[05:53:33] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:53:55] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:58:09] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:58:13] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:59:20] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Datacenter-Switchover: Make mailman3 work in the standby host (lists2001.wikimedia.org) - https://phabricator.wikimedia.org/T283615#9833798 (LSobanski)
[05:59:57] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:00:01] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:01:14] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Datacenter-Switchover: Make mailman3 work in the standby host (lists2001.wikimedia.org) - https://phabricator.wikimedia.org/T283615#9833800 (LSobanski)
[06:01:56] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:02:00] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:03:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:06:31] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:06:35] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:07:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P63253 and previous config saved to /var/cache/conftool/dbconfig/20240527-060752-marostegui.json
[06:08:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:08:40] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:08:44] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:12:31] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[06:12:44] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[06:12:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T364299)', diff saved to https://phabricator.wikimedia.org/P63255 and previous config saved to /var/cache/conftool/dbconfig/20240527-061252-marostegui.json
[06:12:57] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[06:15:22] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:15:53] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:17:14] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:17:21] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:17:26] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:17:49] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:23:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P63256 and previous config saved to /var/cache/conftool/dbconfig/20240527-062301-marostegui.json
[06:25:06] <kart_>	 !log Updated cxserver to 2024-05-20-182409-production (T354666, T365230)
[06:25:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:13] <stashbot>	 T354666: Enable MADLAD-400 in MinT test instance and Production for Wikipedia languages not supported by other services - https://phabricator.wikimedia.org/T354666
[06:25:13] <stashbot>	 T365230: Post-creation work for dtpwiki - https://phabricator.wikimedia.org/T365230
[06:27:54] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:27:58] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:34:23] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:34:27] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:38:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T364069)', diff saved to https://phabricator.wikimedia.org/P63257 and previous config saved to /var/cache/conftool/dbconfig/20240527-063809-marostegui.json
[06:38:12] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1183.eqiad.wmnet with reason: Maintenance
[06:38:14] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[06:38:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1183.eqiad.wmnet with reason: Maintenance
[06:38:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1183 (T364069)', diff saved to https://phabricator.wikimedia.org/P63258 and previous config saved to /var/cache/conftool/dbconfig/20240527-063832-marostegui.json
[06:40:26] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:40:30] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:44:45] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1243.eqiad.wmnet with OS bookworm
[06:47:25] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:47:29] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:50:43] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:50:47] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:53:27] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:53:31] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:55:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T364299)', diff saved to https://phabricator.wikimedia.org/P63259 and previous config saved to /var/cache/conftool/dbconfig/20240527-065518-marostegui.json
[06:55:24] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:02:06] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:02:10] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:05:46] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Deprecate system::role for Blazegraph services [puppet] - https://gerrit.wikimedia.org/r/1035737 (owner: Muehlenhoff)
[07:06:20] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:06:25] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:06:47] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:07:48] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Deprecate system::role for initial set of WMCS roles [puppet] - https://gerrit.wikimedia.org/r/1035739 (owner: Muehlenhoff)
[07:08:42] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:08:46] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:10:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P63260 and previous config saved to /var/cache/conftool/dbconfig/20240527-071026-marostegui.json
[07:12:41] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:12:45] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:16:51] <wikibugs>	 ops-eqiad, DBA, DC-Ops: Upgrade db1243 NICs firmware - https://phabricator.wikimedia.org/T365963 (Marostegui) NEW
[07:16:54] <wikibugs>	 ops-eqiad, DBA, DC-Ops: Upgrade db1243 NICs firmware - https://phabricator.wikimedia.org/T365963#9833903 (Marostegui) p:Triage→High
[07:18:33] <wikibugs>	 SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, DBA, Schema-change-in-production, Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9833905 (Marostegui) There has been no i...
[07:18:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2174.codfw.wmnet
[07:19:29] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:19:33] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:21:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db2174 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036079 (https://phabricator.wikimedia.org/T349619)
[07:21:16] <marostegui>	 !log Deploy schema change on s7 codfw dbmaint T307501
[07:21:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:21] <stashbot>	 T307501: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501
[07:22:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db2174 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036079 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[07:24:41] <jinxer-wm>	 FIRING: [12x] SystemdUnitFailed: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:24:59] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:25:03] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:25:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P63261 and previous config saved to /var/cache/conftool/dbconfig/20240527-072534-marostegui.json
[07:26:47] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:28:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2174.codfw.wmnet
[07:29:06] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops: Upgrade db1243 NICs firmware - https://phabricator.wikimedia.org/T365963#9833916 (10Marostegui)
[07:29:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2176.codfw.wmnet
[07:29:52] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9833915 (10Marostegui)
[07:33:07] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:33:11] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:34:35] <wikibugs>	 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9833926 (10Marostegui)
[07:35:05] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:35:10] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:35:30] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s6 T365783
[07:35:34] <stashbot>	 T365783: Switchover s6 master (db2129 -> db2214) - https://phabricator.wikimedia.org/T365783
[07:35:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on ml-staging-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:35:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2214 with weight 0 T365783', diff saved to https://phabricator.wikimedia.org/P63262 and previous config saved to /var/cache/conftool/dbconfig/20240527-073545-root.json
[07:35:53] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s6 T365783
[07:36:52] <wikibugs>	 (03PS2) 10Gerrit maintenance bot: mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1034939 (https://phabricator.wikimedia.org/T365783)
[07:36:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db2176 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036178 (https://phabricator.wikimedia.org/T349619)
[07:36:54] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:36:58] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:37:09] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1034939 (https://phabricator.wikimedia.org/T365783) (owner: 10Gerrit maintenance bot)
[07:37:10] <wikibugs>	 (03CR) 10Marostegui: [V:03+2 C:03+2] mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1034939 (https://phabricator.wikimedia.org/T365783) (owner: 10Gerrit maintenance bot)
[07:38:15] <wikibugs>	 07Puppet, 06SRE: Add humorous redirect for fox.wikimedia.org - https://phabricator.wikimedia.org/T352870#9833949 (10SMMpanels) SMM panels are a fantastic method to organise your social media marketing activities and offer practical, affordable solutions. They let companies easily manage several platforms a...
[07:40:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T364069)', diff saved to https://phabricator.wikimedia.org/P63263 and previous config saved to /var/cache/conftool/dbconfig/20240527-074009-marostegui.json
[07:40:14] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[07:40:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T364299)', diff saved to https://phabricator.wikimedia.org/P63264 and previous config saved to /var/cache/conftool/dbconfig/20240527-074042-marostegui.json
[07:40:45] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[07:40:47] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[07:40:58] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[07:41:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T364299)', diff saved to https://phabricator.wikimedia.org/P63265 and previous config saved to /var/cache/conftool/dbconfig/20240527-074105-marostegui.json
[07:48:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db2176 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036178 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[07:49:50] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:49:54] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:50:55] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Bump changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035780 (owner: 10Muehlenhoff)
[07:52:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2176.codfw.wmnet
[07:54:43] <marostegui>	 !log Starting s6 codfw failover from db2129 to db2214 - T365783
[07:54:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:48] <stashbot>	 T365783: Switchover s6 master (db2129 -> db2214) - https://phabricator.wikimedia.org/T365783
[07:55:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2214 to s6 primary T365783', diff saved to https://phabricator.wikimedia.org/P63266 and previous config saved to /var/cache/conftool/dbconfig/20240527-075512-marostegui.json
[07:55:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P63267 and previous config saved to /var/cache/conftool/dbconfig/20240527-075524-marostegui.json
[07:55:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove to wmf-laptop and add transition package [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036180
[07:56:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2129 T365783', diff saved to https://phabricator.wikimedia.org/P63268 and previous config saved to /var/cache/conftool/dbconfig/20240527-075602-root.json
[07:57:46] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:57:50] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:58:20] <marostegui>	 !log Deploy schema change on s6 codfw (old master) dbmaint T364299
[07:58:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:25] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[08:00:20] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:00:24] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:00:54] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2129.codfw.wmnet with reason: Long schema change
[08:00:57] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2129.codfw.wmnet with reason: Long schema change
[08:01:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2188.codfw.wmnet
[08:03:34] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db2188 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036182 (https://phabricator.wikimedia.org/T349619)
[08:10:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P63269 and previous config saved to /var/cache/conftool/dbconfig/20240527-081031-marostegui.json
[08:14:11] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Always require users to pick a system for SSH keys. [software/bitu] - 10https://gerrit.wikimedia.org/r/1035765 (owner: 10Slyngshede)
[08:14:42] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:14:46] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:14:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Remove to wmf-laptop and add transition package [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036180 (owner: 10Muehlenhoff)
[08:15:47] <wikibugs>	 (03Merged) 10jenkins-bot: Always require users to pick a system for SSH keys. [software/bitu] - 10https://gerrit.wikimedia.org/r/1035765 (owner: 10Slyngshede)
[08:16:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db2188 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036182 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[08:16:38] <wikibugs>	 (03PS2) 10Muehlenhoff: Rename to wmf-laptop and add transition package [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036180
[08:16:40] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:16:44] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:18:04] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Rename to wmf-laptop and add transition package [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036180 (owner: 10Muehlenhoff)
[08:18:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Very cool!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035829 (https://phabricator.wikimedia.org/T320549) (owner: 10CDanis)
[08:18:38] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:18:42] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:20:32] <wikibugs>	 (03PS1) 10Muehlenhoff: Some more renames for the wmf-laptop-sre -> wmf-laptop rename [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036183
[08:20:37] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:20:40] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:21:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2188.codfw.wmnet
[08:22:35] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:22:39] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:23:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T364299)', diff saved to https://phabricator.wikimedia.org/P63270 and previous config saved to /var/cache/conftool/dbconfig/20240527-082351-marostegui.json
[08:23:58] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[08:25:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T364069)', diff saved to https://phabricator.wikimedia.org/P63271 and previous config saved to /var/cache/conftool/dbconfig/20240527-082539-marostegui.json
[08:25:42] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[08:25:46] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[08:25:55] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[08:26:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T364069)', diff saved to https://phabricator.wikimedia.org/P63272 and previous config saved to /var/cache/conftool/dbconfig/20240527-082603-marostegui.json
[08:30:13] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Some more renames for the wmf-laptop-sre -> wmf-laptop rename [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036183 (owner: 10Muehlenhoff)
[08:32:31] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:32:35] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:34:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Update some docs for wmf-laptop-sre -> wmf-laptop rename [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036185
[08:36:56] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Update some docs for wmf-laptop-sre -> wmf-laptop rename [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036185 (owner: 10Muehlenhoff)
[08:38:24] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:38:28] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:39:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P63273 and previous config saved to /var/cache/conftool/dbconfig/20240527-083859-marostegui.json
[08:40:22] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:40:26] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:40:55] <wikibugs>	 (03CR) 10Aklapper: [C:03+2] Ignore /src/.cache as well [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035822 (owner: 10Pppery)
[08:40:57] <wikibugs>	 (03CR) 10Aklapper: [V:03+2 C:03+2] Ignore /src/.cache as well [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035822 (owner: 10Pppery)
[08:42:21] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:42:25] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:44:58] <wikibugs>	 (03PS1) 10Muehlenhoff: wmf-laptop: Update changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036186
[08:45:19] <wikibugs>	 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9834087 (10Marostegui)
[08:47:53] <wikibugs>	 (03PS1) 10Clément Goubert: httpbb: Fix test following Wikimedia_Technology rename [puppet] - 10https://gerrit.wikimedia.org/r/1036187
[08:48:28] <wikibugs>	 (03PS1) 10Santiago Faci: edit-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036188 (https://phabricator.wikimedia.org/T355407)
[08:48:35] <wikibugs>	 (03PS1) 10Slyngshede: Version bump to 0.0.8 [software/bitu] - 10https://gerrit.wikimedia.org/r/1036189
[08:51:30] <wikibugs>	 (03PS1) 10Fabfur: hiera: use benthos on cp3073 (first esams host) [puppet] - 10https://gerrit.wikimedia.org/r/1036190 (https://phabricator.wikimedia.org/T358109)
[08:52:07] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] httpbb: Fix test following Wikimedia_Technology rename [puppet] - 10https://gerrit.wikimedia.org/r/1036187 (owner: 10Clément Goubert)
[08:52:22] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] httpbb: Fix test following Wikimedia_Technology rename [puppet] - 10https://gerrit.wikimedia.org/r/1036187 (owner: 10Clément Goubert)
[08:53:50] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:53:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1036189 (owner: 10Slyngshede)
[08:53:54] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:54:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P63274 and previous config saved to /var/cache/conftool/dbconfig/20240527-085407-marostegui.json
[08:54:10] <wikibugs>	 (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1036190 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[08:54:13] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Version bump to 0.0.8 [software/bitu] - 10https://gerrit.wikimedia.org/r/1036189 (owner: 10Slyngshede)
[08:54:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P63275 and previous config saved to /var/cache/conftool/dbconfig/20240527-085447-root.json
[08:55:21] <wikibugs>	 (03PS1) 10Marostegui: Revert "db2150: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1035665
[08:55:57] <wikibugs>	 (03Merged) 10jenkins-bot: Version bump to 0.0.8 [software/bitu] - 10https://gerrit.wikimedia.org/r/1036189 (owner: 10Slyngshede)
[08:55:59] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:56:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 1%: Repooling T365797', diff saved to https://phabricator.wikimedia.org/P63276 and previous config saved to /var/cache/conftool/dbconfig/20240527-085602-root.json
[08:56:03] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:56:06] <stashbot>	 T365797: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797
[08:56:08] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db2150: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1035665 (owner: 10Marostegui)
[08:56:30] <wikibugs>	 (03CR) 10Fabfur: [V:03+1 C:04-2] "This depends also on the outcome of I6836cfd828fec602c3d23e98bf38a1a05742c283" [puppet] - 10https://gerrit.wikimedia.org/r/1036190 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[08:59:27] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:59:31] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:00:24] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "> Also, are we going to retain the general k8s-mwdebug?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035361 (https://phabricator.wikimedia.org/T365478) (owner: 10Effie Mouzeli)
[09:01:17] <wikibugs>	 (03PS27) 10Ayounsi: sre.hosts.move-vlan: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152)
[09:01:22] <wikibugs>	 (03CR) 10David Caro: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035148 (https://phabricator.wikimedia.org/T365640) (owner: 10David Caro)
[09:01:27] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:02:39] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:02:39] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:02:39] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:02:39] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:02:39] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:02:40] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:02:40] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:02:41] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:02:41] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:02:42] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:02:42] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:03:39] <wikibugs>	 (03PS1) 10Santiago Faci: editor-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408)
[09:04:23] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:04:26] <jinxer-wm>	 RESOLVED: [12x] SystemdUnitFailed: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:04:27] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:04:45] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] provision datahub-next service records [dns] - 10https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299) (owner: 10Stevemunene)
[09:05:29] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi)
[09:06:11] <wikibugs>	 (03PS1) 10Santiago Faci: geo-analytics deployment: Big AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036192 (https://phabricator.wikimedia.org/T360525)
[09:06:25] <wikibugs>	 (03PS1) 10Zabe: Stop writing to af_user(_text)/afh_user(_text) in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036193 (https://phabricator.wikimedia.org/T337920)
[09:06:57] <wikibugs>	 (03PS3) 10Volans: sre.hosts.reimage: add support for VLAN move [cookbooks] - 10https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152)
[09:07:03] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Add datahub-next missing values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035411 (https://phabricator.wikimedia.org/T365674) (owner: 10Stevemunene)
[09:07:29] <zabe>	 jouncebot: nowandnext
[09:07:29] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 52 minute(s)
[09:07:30] <jouncebot>	 In 0 hour(s) and 52 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1000)
[09:07:52] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Stop writing to af_user(_text)/afh_user(_text) in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036193 (https://phabricator.wikimedia.org/T337920) (owner: 10Zabe)
[09:08:00] <wikibugs>	 (03CR) 10David Caro: [V:03+1 C:03+2] Reapply "openstack::bobcat: apply cloud yaml patch"" [puppet] - 10https://gerrit.wikimedia.org/r/1035148 (https://phabricator.wikimedia.org/T365640) (owner: 10David Caro)
[09:08:01] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:08:05] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:08:29] <wikibugs>	 (03PS2) 10Santiago Faci: edit-analytics deployment: big refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036188 (https://phabricator.wikimedia.org/T355407)
[09:08:39] <wikibugs>	 (03Merged) 10jenkins-bot: Stop writing to af_user(_text)/afh_user(_text) in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036193 (https://phabricator.wikimedia.org/T337920) (owner: 10Zabe)
[09:08:45] <wikibugs>	 (03PS2) 10Santiago Faci: editor-analytics deployment: big refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408)
[09:09:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T364299)', diff saved to https://phabricator.wikimedia.org/P63277 and previous config saved to /var/cache/conftool/dbconfig/20240527-090915-marostegui.json
[09:09:18] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[09:09:18] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:1036193|Stop writing to af_user(_text)/afh_user(_text) in group1 wikis (T337920)]]
[09:09:21] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[09:09:28] <stashbot>	 T337920: Stop writing to af_user(_text)/afh_user(_text) - https://phabricator.wikimedia.org/T337920
[09:09:31] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[09:09:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T364299)', diff saved to https://phabricator.wikimedia.org/P63278 and previous config saved to /var/cache/conftool/dbconfig/20240527-090938-marostegui.json
[09:09:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P63279 and previous config saved to /var/cache/conftool/dbconfig/20240527-090953-root.json
[09:11:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 5%: Repooling T365797', diff saved to https://phabricator.wikimedia.org/P63280 and previous config saved to /var/cache/conftool/dbconfig/20240527-091108-root.json
[09:11:13] <stashbot>	 T365797: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797
[09:11:45] <wikibugs>	 (03PS1) 10Santiago Faci: media-analytics deployment: Big AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036194 (https://phabricator.wikimedia.org/T360526)
[09:14:22] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:14:27] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:14:27] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] maps::tlsproxy: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1035750 (owner: 10Muehlenhoff)
[09:15:02] <wikibugs>	 (03PS4) 10EoghanGaffney: lists: Don't include automation in standby hosts [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804)
[09:19:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T364069)', diff saved to https://phabricator.wikimedia.org/P63282 and previous config saved to /var/cache/conftool/dbconfig/20240527-091935-marostegui.json
[09:19:41] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[09:22:08] <wikibugs>	 (03CR) 10EoghanGaffney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) (owner: 10EoghanGaffney)
[09:22:35] <icinga-wm_>	 RECOVERY - Disk space on backup1011 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup1011&var-datasource=eqiad+prometheus/ops
[09:22:37] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "I checked the full diff, which is quite extensive, but seems legit. AFAICT we're mostly dealing with nftables config files and a prometheu" [puppet] - 10https://gerrit.wikimedia.org/r/1032632 (owner: 10Muehlenhoff)
[09:23:08] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:1036193|Stop writing to af_user(_text)/afh_user(_text) in group1 wikis (T337920)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:23:12] <stashbot>	 T337920: Stop writing to af_user(_text)/afh_user(_text) - https://phabricator.wikimedia.org/T337920
[09:23:22] <logmsgbot>	 !log zabe@deploy1002 zabe: Continuing with sync
[09:24:56] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:24:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P63283 and previous config saved to /var/cache/conftool/dbconfig/20240527-092459-root.json
[09:25:01] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:26:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 10%: Repooling T365797', diff saved to https://phabricator.wikimedia.org/P63284 and previous config saved to /var/cache/conftool/dbconfig/20240527-092614-root.json
[09:26:19] <stashbot>	 T365797: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797
[09:29:55] <Amir1>	 jouncebot: nowandnext
[09:29:55] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 30 minute(s)
[09:29:55] <jouncebot>	 In 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1000)
[09:30:04] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Update tagline and wordmark of Persian Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) (owner: 10Ebrahim)
[09:30:43] <wikibugs>	 (03Merged) 10jenkins-bot: Update tagline and wordmark of Persian Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) (owner: 10Ebrahim)
[09:31:24] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:31:56] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:33:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2129', diff saved to https://phabricator.wikimedia.org/P63285 and previous config saved to /var/cache/conftool/dbconfig/20240527-093306-root.json
[09:33:37] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Add datahub-next missing values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035411 (https://phabricator.wikimedia.org/T365674) (owner: 10Stevemunene)
[09:34:28] <wikibugs>	 (03Merged) 10jenkins-bot: Add datahub-next missing values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035411 (https://phabricator.wikimedia.org/T365674) (owner: 10Stevemunene)
[09:34:37] <wikibugs>	 (03PS2) 10Elukey: services: upgrade tegola in codfw to use the envoy proxy for Swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035743 (https://phabricator.wikimedia.org/T344324)
[09:34:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P63286 and previous config saved to /var/cache/conftool/dbconfig/20240527-093443-marostegui.json
[09:34:49] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on enwiki (last one!) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036197 (https://phabricator.wikimedia.org/T315353)
[09:34:54] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[09:34:56] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[09:35:15] <wikibugs>	 (03PS4) 10Jforrester: Pre-emptively disable DiscussionToolsEnableThanks (no-op) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026511 (owner: 10Esanders)
[09:35:30] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] provision datahub-next service records [dns] - 10https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299) (owner: 10Stevemunene)
[09:36:40] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 64096
[09:37:03] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034962 (https://phabricator.wikimedia.org/T308002) (owner: 10Ayounsi)
[09:37:15] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034963 (owner: 10Ayounsi)
[09:37:48] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 64096
[09:38:33] <wikibugs>	 (03CR) 10Ayounsi: [V:03+2 C:03+2] Add ApereoSocialPipeline [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034962 (https://phabricator.wikimedia.org/T308002) (owner: 10Ayounsi)
[09:38:55] <wikibugs>	 (03CR) 10Ayounsi: [V:03+2 C:03+2] Update requirements [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034963 (owner: 10Ayounsi)
[09:39:02] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1036193|Stop writing to af_user(_text)/afh_user(_text) in group1 wikis (T337920)]] (duration: 29m 43s)
[09:39:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) (owner: 10Ebrahim)
[09:39:07] <stashbot>	 T337920: Stop writing to af_user(_text)/afh_user(_text) - https://phabricator.wikimedia.org/T337920
[09:39:16] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:39:19] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1035852|Update tagline and wordmark of Persian Wikibooks (T365913)]]
[09:39:20] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:39:24] <stashbot>	 T365913: Change the Persian Wikibooks wordmark - https://phabricator.wikimedia.org/T365913
[09:41:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 25%: Repooling T365797', diff saved to https://phabricator.wikimedia.org/P63287 and previous config saved to /var/cache/conftool/dbconfig/20240527-094120-root.json
[09:41:26] <stashbot>	 T365797: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797
[09:41:44] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: add python-social-auth and update wheels - ayounsi@cumin1002 - T308002
[09:41:48] <logmsgbot>	 !log ladsgroup@deploy1002 ebrahim and ladsgroup: Backport for [[gerrit:1035852|Update tagline and wordmark of Persian Wikibooks (T365913)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:41:49] <stashbot>	 T308002: Move Netbox authentication to python-social-auth - https://phabricator.wikimedia.org/T308002
[09:42:01] <wikibugs>	 (03PS1) 10Marostegui: control-mariadb-10.11-bookworm: Initial packaging [software] - 10https://gerrit.wikimedia.org/r/1036199 (https://phabricator.wikimedia.org/T365805)
[09:42:30] <logmsgbot>	 !log ladsgroup@deploy1002 ebrahim and ladsgroup: Continuing with sync
[09:44:00] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:44:04] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:45:25] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[09:45:33] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: add python-social-auth and update wheels - ayounsi@cumin1002 - T308002
[09:46:49] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:47:06] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:49:41] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:49:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P63288 and previous config saved to /var/cache/conftool/dbconfig/20240527-094951-marostegui.json
[09:49:56] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:52:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T364299)', diff saved to https://phabricator.wikimedia.org/P63289 and previous config saved to /var/cache/conftool/dbconfig/20240527-095208-marostegui.json
[09:52:16] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[09:52:20] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:52:24] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:53:36] <wikibugs>	 (03PS2) 10Clément Goubert: miscweb: Update various modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978)
[09:54:29] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:54:33] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:56:19] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1035852|Update tagline and wordmark of Persian Wikibooks (T365913)]] (duration: 16m 59s)
[09:56:23] <stashbot>	 T365913: Change the Persian Wikibooks wordmark - https://phabricator.wikimedia.org/T365913
[09:56:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 50%: Repooling T365797', diff saved to https://phabricator.wikimedia.org/P63290 and previous config saved to /var/cache/conftool/dbconfig/20240527-095626-root.json
[09:56:32] <stashbot>	 T365797: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797
[09:56:36] <wikibugs>	 (03PS1) 10Jelto: external_clouds_vendors: add Vultr cloud [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534)
[09:58:19] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[09:59:22] <wikibugs>	 (03CR) 10Aklapper: [V:03+2 C:03+2] "Tested locally (both applying the patch, as well as changing the line in export.sh, running export.sh, and checking the resulting file pro" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035805 (https://phabricator.wikimedia.org/T351581) (owner: 10Pppery)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1000)
[10:01:58] <wikibugs>	 (03PS1) 10Stevemunene: Add dse range to an-test coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1036202 (https://phabricator.wikimedia.org/T361185)
[10:04:17] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[10:04:22] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:04:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:04:45] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036202 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene)
[10:05:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T364069)', diff saved to https://phabricator.wikimedia.org/P63291 and previous config saved to /var/cache/conftool/dbconfig/20240527-100459-marostegui.json
[10:05:03] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[10:05:05] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[10:05:16] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[10:05:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T364069)', diff saved to https://phabricator.wikimedia.org/P63292 and previous config saved to /var/cache/conftool/dbconfig/20240527-100523-marostegui.json
[10:05:36] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Add dse range to an-test coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1036202 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene)
[10:06:01] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Add dse range to an-test coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1036202 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene)
[10:06:58] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] control-mariadb-10.11-bookworm: Initial packaging [software] - 10https://gerrit.wikimedia.org/r/1036199 (https://phabricator.wikimedia.org/T365805) (owner: 10Marostegui)
[10:07:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P63293 and previous config saved to /var/cache/conftool/dbconfig/20240527-100717-marostegui.json
[10:11:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 75%: Repooling T365797', diff saved to https://phabricator.wikimedia.org/P63294 and previous config saved to /var/cache/conftool/dbconfig/20240527-101133-root.json
[10:11:38] <stashbot>	 T365797: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797
[10:11:47] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:12:06] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[10:12:10] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:13:05] <wikibugs>	 (03CR) 10Volans: "They seem" [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto)
[10:13:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove profile::zookeeper::firewall::srange [puppet] - 10https://gerrit.wikimedia.org/r/1035334 (owner: 10Muehlenhoff)
[10:14:23] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[10:14:51] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] control-mariadb-10.11-bookworm: Initial packaging [software] - 10https://gerrit.wikimedia.org/r/1036199 (https://phabricator.wikimedia.org/T365805) (owner: 10Marostegui)
[10:15:23] <wikibugs>	 (03Merged) 10jenkins-bot: control-mariadb-10.11-bookworm: Initial packaging [software] - 10https://gerrit.wikimedia.org/r/1036199 (https://phabricator.wikimedia.org/T365805) (owner: 10Marostegui)
[10:16:47] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:18:55] <wikibugs>	 (03PS2) 10Jelto: external_clouds_vendors: add Vultr cloud [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534)
[10:19:50] <wikibugs>	 (03CR) 10Jelto: external_clouds_vendors: add Vultr cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto)
[10:20:53] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035351 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[10:22:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P63295 and previous config saved to /var/cache/conftool/dbconfig/20240527-102226-marostegui.json
[10:26:18] <wikibugs>	 (03PS1) 10Gergő Tisza: [POC][beta] Add rewrite rule for sso.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162)
[10:26:23] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[10:26:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 100%: Repooling T365797', diff saved to https://phabricator.wikimedia.org/P63296 and previous config saved to /var/cache/conftool/dbconfig/20240527-102639-root.json
[10:26:44] <stashbot>	 T365797: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797
[10:26:49] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:28:05] <wikibugs>	 (03PS5) 10EoghanGaffney: lists: Don't include automation in standby hosts [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804)
[10:31:18] <wikibugs>	 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki) - https://phabricator.wikimedia.org/T362323#9834404 (10Clement_Goubert) As far as mediawiki calling itself goes (I see it was removed from the task description, but it is te...
[10:31:49] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:32:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] maps: Add option to use PKI [puppet] - 10https://gerrit.wikimedia.org/r/1035351 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[10:37:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T364299)', diff saved to https://phabricator.wikimedia.org/P63297 and previous config saved to /var/cache/conftool/dbconfig/20240527-103734-marostegui.json
[10:37:38] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[10:37:41] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[10:37:51] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[10:37:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T364299)', diff saved to https://phabricator.wikimedia.org/P63298 and previous config saved to /var/cache/conftool/dbconfig/20240527-103759-marostegui.json
[10:42:33] <wikibugs>	 (03PS5) 10Bartosz Dziewoński: Pre-emptively disable DiscussionToolsEnableThanks (no-op) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026511 (owner: 10Esanders)
[10:44:06] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:44:10] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:45:25] <wikibugs>	 (03PS2) 10Muehlenhoff: tlsproxy::localssl: Remove support for OCSP handling [puppet] - 10https://gerrit.wikimedia.org/r/1035362
[10:45:56] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:46:02] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:46:32] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] benthos:cache: switch to rfc5424 format (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur)
[10:49:06] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:49:12] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:49:58] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.285 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:50:04] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:52:11] <slyngs>	 !log Upgrade IDM to Bitu 0.0.8
[10:52:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:36] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "CR is missing localssl.erb (do_ocsp is still referenced there)" [puppet] - 10https://gerrit.wikimedia.org/r/1035362 (owner: 10Muehlenhoff)
[10:55:30] <Amir1>	 !log main s2@codfw (T364985)
[10:55:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:40] <Amir1>	 !log dbmaint s2@codfw (T364985)
[10:55:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T364069)', diff saved to https://phabricator.wikimedia.org/P63299 and previous config saved to /var/cache/conftool/dbconfig/20240527-105728-marostegui.json
[10:57:35] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[10:59:45] <wikibugs>	 (03PS3) 10Muehlenhoff: tlsproxy::localssl: Remove support for OCSP handling [puppet] - 10https://gerrit.wikimedia.org/r/1035362
[11:04:38] <wikibugs>	 (03PS1) 10Clément Goubert: testwiki: Move to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1036235 (https://phabricator.wikimedia.org/T355534)
[11:04:55] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035362 (owner: 10Muehlenhoff)
[11:05:07] <wikibugs>	 (03PS2) 10Clément Goubert: testwiki: Move to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1036235 (https://phabricator.wikimedia.org/T355534)
[11:05:17] <wikibugs>	 06SRE, 10MW-on-K8s, 06Quality-and-Test-Engineering-Team, 06serviceops, 13Patch-For-Review: Move testwiki over to mw-on-k8s - https://phabricator.wikimedia.org/T355534#9834501 (10Clement_Goubert) 05Open→03In progress
[11:05:23] <wikibugs>	 (03CR) 10Fabfur: benthos:cache: switch to rfc5424 format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur)
[11:06:12] <wikibugs>	 (03PS6) 10Fabfur: benthos:cache: switch to rfc5424 format [puppet] - 10https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718)
[11:06:21] <wikibugs>	 (03CR) 10Fabfur: benthos:cache: switch to rfc5424 format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur)
[11:06:48] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:07:41] <wikibugs>	 (03PS1) 10Muehlenhoff: maps: Switch kartotherian on maps2007 to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778)
[11:08:30] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:08:50] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "\o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036197 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński)
[11:10:25] <wikibugs>	 (03PS1) 10Stevemunene: Enable mesh for datahub-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036237 (https://phabricator.wikimedia.org/T361185)
[11:10:43] <wikibugs>	 (03PS1) 10Ayounsi: Add python-jose [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1036238 (https://phabricator.wikimedia.org/T308002)
[11:12:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P63301 and previous config saved to /var/cache/conftool/dbconfig/20240527-111236-marostegui.json
[11:12:58] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "Nit: comment is slightly wrong" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1036238 (https://phabricator.wikimedia.org/T308002) (owner: 10Ayounsi)
[11:13:23] <wikibugs>	 (03PS6) 10EoghanGaffney: lists: Don't include automation in standby hosts [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804)
[11:13:30] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:14:15] <wikibugs>	 (03CR) 10Muehlenhoff: lists: Don't include automation in standby hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) (owner: 10EoghanGaffney)
[11:15:13] <wikibugs>	 (03CR) 10Muehlenhoff: "Good catch, updated" [puppet] - 10https://gerrit.wikimedia.org/r/1035362 (owner: 10Muehlenhoff)
[11:15:17] <wikibugs>	 (03PS7) 10EoghanGaffney: lists: Don't include automation in standby hosts [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804)
[11:16:23] <wikibugs>	 (03PS2) 10Ayounsi: Add python-jose [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1036238 (https://phabricator.wikimedia.org/T308002)
[11:18:13] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[11:18:34] <wikibugs>	 (03PS2) 10Muehlenhoff: maps: Switch kartotherian on maps2007 to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778)
[11:19:20] <wikibugs>	 (03CR) 10EoghanGaffney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) (owner: 10EoghanGaffney)
[11:21:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T364299)', diff saved to https://phabricator.wikimedia.org/P63302 and previous config saved to /var/cache/conftool/dbconfig/20240527-112143-marostegui.json
[11:21:50] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[11:23:15] <wikibugs>	 (03CR) 10Volans: external_clouds_vendors: add Vultr cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto)
[11:23:38] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Enable mesh for datahub-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036237 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene)
[11:24:40] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Enable mesh for datahub-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036237 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene)
[11:25:29] <wikibugs>	 (03Merged) 10jenkins-bot: Enable mesh for datahub-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036237 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene)
[11:26:48] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:27:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P63303 and previous config saved to /var/cache/conftool/dbconfig/20240527-112744-marostegui.json
[11:29:38] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[11:30:12] <wikibugs>	 (03PS3) 10Jelto: external_clouds_vendors: add Vultr cloud [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534)
[11:30:40] <wikibugs>	 (03CR) 10CI reject: [V:04-1] external_clouds_vendors: add Vultr cloud [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto)
[11:33:00] <wikibugs>	 (03PS4) 10Jelto: external_clouds_vendors: add Vultr cloud [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534)
[11:33:10] <moritzm>	 !log installing jinja2 security updates
[11:33:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:56] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1036238 (https://phabricator.wikimedia.org/T308002) (owner: 10Ayounsi)
[11:35:16] <wikibugs>	 (03CR) 10Jelto: external_clouds_vendors: add Vultr cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto)
[11:35:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on ml-staging-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:36:00] <wikibugs>	 (03PS1) 10Muehlenhoff: aptrepo: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1036241
[11:36:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P63304 and previous config saved to /var/cache/conftool/dbconfig/20240527-113651-marostegui.json
[11:39:25] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto)
[11:40:31] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] testwiki: Move to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1036235 (https://phabricator.wikimedia.org/T355534) (owner: 10Clément Goubert)
[11:40:49] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] testwiki: Move to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1036235 (https://phabricator.wikimedia.org/T355534) (owner: 10Clément Goubert)
[11:41:12] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[11:41:55] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036197 (https://phabricator.wikimedia.org/T315353)
[11:42:19] <wikibugs>	 (03PS3) 10Ayounsi: Add python-jose [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1036238 (https://phabricator.wikimedia.org/T308002)
[11:42:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T364069)', diff saved to https://phabricator.wikimedia.org/P63305 and previous config saved to /var/cache/conftool/dbconfig/20240527-114252-marostegui.json
[11:42:55] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[11:42:57] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[11:43:08] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[11:43:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T364069)', diff saved to https://phabricator.wikimedia.org/P63306 and previous config saved to /var/cache/conftool/dbconfig/20240527-114316-marostegui.json
[11:43:34] <wikibugs>	 (03CR) 10Hnowlan: services: add data-gateway service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French)
[11:43:43] <wikibugs>	 (03CR) 10Ayounsi: [V:03+2 C:03+2] Add python-jose [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1036238 (https://phabricator.wikimedia.org/T308002) (owner: 10Ayounsi)
[11:44:24] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: add python-jose and update wheels - ayounsi@cumin1002 - T308002
[11:44:30] <stashbot>	 T308002: Move Netbox authentication to python-social-auth - https://phabricator.wikimedia.org/T308002
[11:44:37] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[11:44:52] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036241 (owner: 10Muehlenhoff)
[11:44:53] <wikibugs>	 06SRE, 10MW-on-K8s, 06Quality-and-Test-Engineering-Team, 06serviceops, 13Patch-For-Review: Move testwiki over to mw-on-k8s - https://phabricator.wikimedia.org/T355534#9834567 (10Clement_Goubert) 05In progress→03Resolved `testwiki` and `testcommonswiki` are now moved over to #mw-on-k8s
[11:45:18] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: add python-jose and update wheels - ayounsi@cumin1002 - T308002
[11:46:27] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982 (10ABran-WMF) 03NEW
[11:47:44] <wikibugs>	 (03PS1) 10Gergő Tisza: [WIP][POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162)
[11:48:08] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9834595 (10ABran-WMF)
[11:48:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) (owner: 10EoghanGaffney)
[11:49:06] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983 (10ABran-WMF) 03NEW
[11:49:12] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet with reason: add CasApereo auth and update wheels - ayounsi@cumin1002 - T308002
[11:50:10] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9834613 (10ABran-WMF)
[11:50:57] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984 (10ABran-WMF) 03NEW
[11:51:13] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet with reason: add CasApereo auth and update wheels - ayounsi@cumin1002 - T308002
[11:51:18] <stashbot>	 T308002: Move Netbox authentication to python-social-auth - https://phabricator.wikimedia.org/T308002
[11:51:45] <wikibugs>	 (03PS1) 10Hashar: Review access change [software] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1036212
[11:51:53] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9834628 (10ABran-WMF)
[11:52:00] <wikibugs>	 (03PS3) 10Klausman: install/partman: Tweak kubelet partition size for ML workers [puppet] - 10https://gerrit.wikimedia.org/r/1036195 (https://phabricator.wikimedia.org/T365971)
[11:52:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P63307 and previous config saved to /var/cache/conftool/dbconfig/20240527-115200-marostegui.json
[11:52:06] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[11:52:06] <wikibugs>	 (03PS2) 10Hashar: Allow SRE to create tags [software] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1036212
[11:52:25] <wikibugs>	 (03CR) 10Hashar: [V:03+2 C:03+2] Allow SRE to create tags [software] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1036212 (owner: 10Hashar)
[11:52:50] <wikibugs>	 (03PS1) 10Muehlenhoff: maps: Don't pass additional server aliases when using PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036247 (https://phabricator.wikimedia.org/T360778)
[11:53:09] <wikibugs>	 (03CR) 10Muehlenhoff: "Also see: https://puppet-compiler.wmflabs.org/output/1036236/3505/maps2007.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1036247 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[11:57:04] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036247 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[11:58:11] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986 (10ABran-WMF) 03NEW
[11:58:49] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987 (10ABran-WMF) 03NEW
[11:58:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9834657 (10ABran-WMF)
[11:59:29] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365988 (10ABran-WMF) 03NEW
[11:59:58] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9834682 (10ABran-WMF)
[12:01:28] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9834684 (10ABran-WMF)
[12:03:49] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[12:05:07] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] api-gateway: add normalise_paths option, enable in api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035481 (https://phabricator.wikimedia.org/T365439) (owner: 10Hnowlan)
[12:05:52] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2001.codfw.wmnet with OS bullseye
[12:06:18] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: add normalise_paths option, enable in api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035481 (https://phabricator.wikimedia.org/T365439) (owner: 10Hnowlan)
[12:06:46] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7f9776417550>
[12:07:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T364299)', diff saved to https://phabricator.wikimedia.org/P63308 and previous config saved to /var/cache/conftool/dbconfig/20240527-120709-marostegui.json
[12:07:11] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[12:07:14] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[12:07:19] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[12:07:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[12:07:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T364299)', diff saved to https://phabricator.wikimedia.org/P63309 and previous config saved to /var/cache/conftool/dbconfig/20240527-120732-marostegui.json
[12:08:49] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[12:10:24] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2001 - ayounsi@cumin1002"
[12:11:17] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2001 - ayounsi@cumin1002"
[12:11:17] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:11:17] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2001.codfw.wmnet 39.16.192.10.in-addr.arpa 9.3.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:11:20] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2001.codfw.wmnet 39.16.192.10.in-addr.arpa 9.3.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:11:21] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2001
[12:11:41] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2001
[12:11:41] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7f9776417550>
[12:14:48] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:14:50] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:17:17] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[12:17:30] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[12:18:07] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[12:18:34] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[12:18:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1036251 (owner: 10L10n-bot)
[12:18:57] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[12:19:19] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[12:20:06] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] tlsproxy::localssl: Remove support for OCSP handling [puppet] - 10https://gerrit.wikimedia.org/r/1035362 (owner: 10Muehlenhoff)
[12:24:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] tlsproxy::localssl: Remove support for OCSP handling [puppet] - 10https://gerrit.wikimedia.org/r/1035362 (owner: 10Muehlenhoff)
[12:24:26] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:24:39] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] benthos:cache: switch to rfc5424 format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur)
[12:28:11] <wikibugs>	 (03PS2) 10Muehlenhoff: maps::tlsproxy: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1035750
[12:28:19] <Lucas_WMDE>	 oop, stashbot left
[12:29:22] <wikibugs>	 (03CR) 10Fabfur: benthos:cache: switch to rfc5424 format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur)
[12:29:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T364069)', diff saved to https://phabricator.wikimedia.org/P63310 and previous config saved to /var/cache/conftool/dbconfig/20240527-122953-marostegui.json
[12:29:59] <icinga-wm_>	 RECOVERY - Disk space on stat1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops
[12:32:29] <wikibugs>	 (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035750 (owner: 10Muehlenhoff)
[12:32:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] maps::tlsproxy: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1035750 (owner: 10Muehlenhoff)
[12:34:26] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] maps::tlsproxy: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1035750 (owner: 10Muehlenhoff)
[12:34:56] <jinxer-wm>	 FIRING: RdfStreamingUpdaterFlinkJobUnstable: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[12:35:21] <wikibugs>	 (03PS2) 10Muehlenhoff: maps: Don't pass additional server aliases when using PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036247 (https://phabricator.wikimedia.org/T360778)
[12:35:36] <wikibugs>	 (03PS1) 10Santiago Faci: device-analytics deployment: Big AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036260 (https://phabricator.wikimedia.org/T360524)
[12:35:58] <Lucas_WMDE>	 marostegui: FYI, stashbot was temporarily gone, in case you want to re-log that one dbctl message (but I’m guessing it’s not super important)
[12:36:13] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] maps: Don't pass additional server aliases when using PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036247 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[12:37:05] <marostegui>	 Lucas_WMDE: no need, there were more like those later, it is an automated process. Thanks for the heads up though
[12:37:11] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] wikilabels::session: Set now-required memcached_user [puppet] - 10https://gerrit.wikimedia.org/r/1035762 (owner: 10Majavah)
[12:37:49] <wikibugs>	 (03PS1) 10Santiago Faci: page-analytics deployment: Big AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523)
[12:39:26] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:40:31] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host wikikube-worker2001.codfw.wmnet with OS bullseye
[12:42:19] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2001.codfw.wmnet with OS bullseye
[12:43:10] <Lucas_WMDE>	 filed T365992 for the stashbot issue FTR
[12:43:11] <stashbot>	 T365992: stashbot occasionally dies and needs manual restart - https://phabricator.wikimedia.org/T365992
[12:43:21] <wikibugs>	 (03PS2) 10NMW03: Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034941 (https://phabricator.wikimedia.org/T365133)
[12:44:26] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:44:56] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterFlinkJobUnstable: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[12:45:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P63311 and previous config saved to /var/cache/conftool/dbconfig/20240527-124500-marostegui.json
[12:45:50] <wikibugs>	 (03PS3) 10Santiago Faci: editor-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408)
[12:47:06] <wikibugs>	 (03CR) 10EoghanGaffney: lists: Don't include automation in standby hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) (owner: 10EoghanGaffney)
[12:47:08] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+2] lists: Don't include automation in standby hosts [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) (owner: 10EoghanGaffney)
[12:47:13] <wikibugs>	 (03PS4) 10Santiago Faci: editor-analytics deployment: big refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408)
[12:48:27] <Nemoralis>	 jouncebot next
[12:48:27] <jouncebot>	 In 0 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1300)
[12:50:17] <wikibugs>	 (03PS1) 10Brouberol: datahub-next: make sure subcharts get the environment default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036263 (https://phabricator.wikimedia.org/T361185)
[12:50:25] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: kube-controller-manager.service on ml-staging-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:50:27] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1036241 (owner: 10Muehlenhoff)
[12:50:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T364299)', diff saved to https://phabricator.wikimedia.org/P63312 and previous config saved to /var/cache/conftool/dbconfig/20240527-125041-marostegui.json
[12:50:47] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[12:51:47] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:52:15] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] datahub-next: make sure subcharts get the environment default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036263 (https://phabricator.wikimedia.org/T361185) (owner: 10Brouberol)
[12:53:02] <wikibugs>	 (03PS2) 10Brouberol: datahub-next: make sure subcharts get the environment default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036263 (https://phabricator.wikimedia.org/T361185)
[12:53:49] <wikibugs>	 (03PS4) 10Ayounsi: sre.hosts.reimage: add support for VLAN move [cookbooks] - 10https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: 10Volans)
[12:53:58] <wikibugs>	 (03CR) 10Elukey: [C:03+2] services: upgrade tegola in codfw to use the envoy proxy for Swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035743 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey)
[12:54:26] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:54:47] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] datahub-next: make sure subcharts get the environment default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036263 (https://phabricator.wikimedia.org/T361185) (owner: 10Brouberol)
[12:56:36] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[12:58:19] <wikibugs>	 (03PS1) 10EoghanGaffney: lists: Fix typing on ensure in mailman::web [puppet] - 10https://gerrit.wikimedia.org/r/1036265
[12:58:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] lists: Fix typing on ensure in mailman::web [puppet] - 10https://gerrit.wikimedia.org/r/1036265 (owner: 10EoghanGaffney)
[12:59:51] <wikibugs>	 (03PS2) 10EoghanGaffney: lists: Fix typing on ensure in mailman::web [puppet] - 10https://gerrit.wikimedia.org/r/1036265
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1300).
[13:00:05] <jouncebot>	 ottomata, _Gerges, MatmaRex, and Nemoralis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P63313 and previous config saved to /var/cache/conftool/dbconfig/20240527-130008-marostegui.json
[13:00:16] <Lucas_WMDE>	 o/
[13:00:23] <Lucas_WMDE>	 I can deploy
[13:00:34] <Gerges>	 Hi
[13:00:42] <MatmaRex>	 hi
[13:00:48] <Nemoralis>	 o7
[13:00:58] <Lucas_WMDE>	 ottomata: do you want to self-service the beacon change?
[13:01:42] <Lucas_WMDE>	 (I’m guessing you have deployment rights ^^)
[13:01:55] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9834830 (10ABran-WMF)
[13:02:05] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9834831 (10ABran-WMF)
[13:02:15] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on wikikube-ctrl2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:02:47] <Lucas_WMDE>	 well, let’s start with Gerges’ change then
[13:03:14] <Gerges>	 Ok
[13:03:19] <wikibugs>	 (03PS3) 10GergesShamon: Revert "arwiki: Disable Extension:ContentTranslation for non-autoreview users" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034884 (https://phabricator.wikimedia.org/T255022)
[13:03:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034884 (https://phabricator.wikimedia.org/T255022) (owner: 10GergesShamon)
[13:03:40] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9834834 (10ABran-WMF)
[13:04:12] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987#9834836 (10ABran-WMF)
[13:04:51] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync
[13:05:23] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync
[13:05:39] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9834838 (10ABran-WMF)
[13:05:44] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9834840 (10ABran-WMF)
[13:05:44] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "arwiki: Disable Extension:ContentTranslation for non-autoreview users" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034884 (https://phabricator.wikimedia.org/T255022) (owner: 10GergesShamon)
[13:05:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P63314 and previous config saved to /var/cache/conftool/dbconfig/20240527-130549-marostegui.json
[13:06:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1034884|Revert "arwiki: Disable Extension:ContentTranslation for non-autoreview users" (T255022)]]
[13:06:04] <stashbot>	 T255022: Disable machine translation in Content Translation Tool for non-autoreview users on Arabic Wikipedia - https://phabricator.wikimedia.org/T255022
[13:06:15] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2647/co" [puppet] - 10https://gerrit.wikimedia.org/r/1036265 (owner: 10EoghanGaffney)
[13:06:32] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9834842 (10ABran-WMF)
[13:07:26] <wikibugs>	 (03CR) 10FNegri: [C:03+2] P:toolforge:redis_sentinel: set redis timeout [puppet] - 10https://gerrit.wikimedia.org/r/1029158 (https://phabricator.wikimedia.org/T363709) (owner: 10FNegri)
[13:07:38] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993 (10ABran-WMF) 03NEW
[13:08:04] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9834861 (10ABran-WMF)
[13:08:09] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[13:08:31] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and gergesshamon: Backport for [[gerrit:1034884|Revert "arwiki: Disable Extension:ContentTranslation for non-autoreview users" (T255022)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:08:38] <Lucas_WMDE>	 Gerges: please test :)
[13:08:44] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Judging by T272783, this will require the following maintenance script post-deployment:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034941 (https://phabricator.wikimedia.org/T365133) (owner: 10NMW03)
[13:09:02] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1036265 (owner: 10EoghanGaffney)
[13:09:07] <Gerges>	 I can't test right now 
[13:09:11] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994 (ABran-WMF) NEW
[13:09:16] <Lucas_WMDE>	 hm, ok
[13:09:26] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:09:27] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9834874 (ABran-WMF)
[13:09:27] <RhinosF1>	 Gerges: why
[13:11:03] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995 (ABran-WMF) NEW
[13:11:23] <Lucas_WMDE>	 I guess this is straightforward enough to just deploy
[13:11:33] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9834886 (ABran-WMF)
[13:11:33] <Lucas_WMDE>	 (RhinosF1 asks a good question but I don’t want to block the other changes on this either)
[13:12:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and gergesshamon: Continuing with sync
[13:13:14] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9834888 (ABran-WMF)
[13:13:45] <icinga-wm_>	 PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS5511/IPv6: Connect - Orange, AS5511/IPv4: Connect - Orange https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:14:49] <icinga-wm_>	 RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 110, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:15:09] <logmsgbot>	 !log hnowlan@cumin1002 conftool action : set/pooled=no; selector: name=parse1002.eqiad.wmnet
[13:15:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T364069)', diff saved to https://phabricator.wikimedia.org/P63315 and previous config saved to /var/cache/conftool/dbconfig/20240527-131516-marostegui.json
[13:15:19] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[13:15:21] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[13:15:32] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[13:15:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1213 (T364069)', diff saved to https://phabricator.wikimedia.org/P63316 and previous config saved to /var/cache/conftool/dbconfig/20240527-131539-marostegui.json
[13:16:01] <vgutierrez>	 !log test fifo-log-demux 0.7.5 on cp4052
[13:16:02] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad  - https://phabricator.wikimedia.org/T365983#9834899 (ABran-WMF)
[13:16:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:13] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad  - https://phabricator.wikimedia.org/T365984#9834905 (ABran-WMF)
[13:16:28] <wikibugs>	 (CR) EoghanGaffney: [C:+2] lists: Fix typing on ensure in mailman::web [puppet] - https://gerrit.wikimedia.org/r/1036265 (owner: EoghanGaffney)
[13:16:40] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9834911 (ABran-WMF) a:cmooney→MatthewVernon
[13:16:49] <wikibugs>	 (PS1) Brouberol: datahub-next: fix the ingress by restoring default gateway host [deployment-charts] - https://gerrit.wikimedia.org/r/1036266 (https://phabricator.wikimedia.org/T361185)
[13:16:57] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987#9834908 (ABran-WMF) a:cmooney→ABran-WMF
[13:17:15] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:17:43] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798#9834913 (MoritzMuehlenhoff) p:Triage→High
[13:18:13] <fabfur>	 !log disabling puppet on A:cp to safely apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1035440 (T365718)
[13:18:15] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:18:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:18] <stashbot>	 T365718: Switch HAProxy/Benthos to rfc5424 - https://phabricator.wikimedia.org/T365718
[13:18:41] <Lucas_WMDE>	 still no sign of ottomata?
[13:18:59] <wikibugs>	 (PS2) Santiago Faci: page-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523)
[13:19:25] <wikibugs>	 (PS3) Santiago Faci: page-analytics deployment: AQS 2.0 refactoring to use new functions and messages added to aqsassist 1.2.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523)
[13:19:37] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996 (ABran-WMF) NEW
[13:19:53] <wikibugs>	 (CR) Hnowlan: [C:+1] page-analytics deployment: AQS 2.0 refactoring to use new functions and messages added to aqsassist 1.2.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523) (owner: Santiago Faci)
[13:19:58] <wikibugs>	 (PS2) Santiago Faci: device-analytics deployment: AQS 2.0 refactoring to use new functions and messages added to aqsassist 1.2.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1036260 (https://phabricator.wikimedia.org/T360524)
[13:20:02] <wikibugs>	 (CR) Hnowlan: [C:+1] "Copied votes on follow-up patch sets have been updated:" [deployment-charts] - https://gerrit.wikimedia.org/r/1036260 (https://phabricator.wikimedia.org/T360524) (owner: Santiago Faci)
[13:20:05] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:20:16] <wikibugs>	 (PS2) Santiago Faci: media-analytics deployment: AQS 2.0 refactoring to use new functions and messages added to aqsassist 1.2.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1036194 (https://phabricator.wikimedia.org/T360526)
[13:20:23] <wikibugs>	 (PS1) Elukey: services: move tegola in eqiad to the Thanos sidecar config for Swift [deployment-charts] - https://gerrit.wikimedia.org/r/1036269 (https://phabricator.wikimedia.org/T344324)
[13:20:57] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:20:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P63317 and previous config saved to /var/cache/conftool/dbconfig/20240527-132057-marostegui.json
[13:21:00] <wikibugs>	 (PS4) Santiago Faci: page-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523)
[13:21:09] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[13:21:12] <wikibugs>	 (PS3) Santiago Faci: media-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036194 (https://phabricator.wikimedia.org/T360526)
[13:21:15] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997 (ABran-WMF) NEW
[13:21:29] <wikibugs>	 (PS3) Santiago Faci: device-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036260 (https://phabricator.wikimedia.org/T360524)
[13:21:37] <wikibugs>	 (Abandoned) Brouberol: datahub-next: fix the ingress by restoring default gateway host [deployment-charts] - https://gerrit.wikimedia.org/r/1036266 (https://phabricator.wikimedia.org/T361185) (owner: Brouberol)
[13:21:44] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9834941 (ABran-WMF)
[13:22:02] <wikibugs>	 (PS3) Santiago Faci: edit-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - https://gerrit.wikimedia.org/r/1036188 (https://phabricator.wikimedia.org/T355407)
[13:22:11] <wikibugs>	 (PS5) Santiago Faci: editor-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408)
[13:22:20] <wikibugs>	 (CR) Hnowlan: [C:+1] "Copied votes on follow-up patch sets have been updated:" [deployment-charts] - https://gerrit.wikimedia.org/r/1036188 (https://phabricator.wikimedia.org/T355407) (owner: Santiago Faci)
[13:22:23] * Lucas_WMDE prepares MatmaRex’ changes for deployment
[13:22:35] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998 (ABran-WMF) NEW
[13:22:37] <MatmaRex>	 👍
[13:22:40] <wikibugs>	 (CR) Hnowlan: [C:+1] media-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036194 (https://phabricator.wikimedia.org/T360526) (owner: Santiago Faci)
[13:22:48] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9834959 (ABran-WMF)
[13:22:48] <Lucas_WMDE>	 (I say we deploy both together, so I’ll rebase one onto the other)
[13:22:54] <wikibugs>	 (PS3) Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1036197 (https://phabricator.wikimedia.org/T315353)
[13:23:04] <wikibugs>	 (PS6) Bartosz Dziewoński: Pre-emptively disable DiscussionToolsEnableThanks (no-op) [mediawiki-config] - https://gerrit.wikimedia.org/r/1026511 (owner: Esanders)
[13:23:16] <wikibugs>	 (CR) Hnowlan: [C:+1] editor-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408) (owner: Santiago Faci)
[13:24:07] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Make Chqaz admin of Wikija-g mailing list - https://phabricator.wikimedia.org/T365933#9834956 (Ladsgroup) a:Ladsgroup > All members of our user group suddenly had their admins removed in March.  Hi, why? Any governance issues or disputes?
[13:24:07] <wikibugs>	 (PS2) Santiago Faci: geo-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036192 (https://phabricator.wikimedia.org/T360525)
[13:24:08] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9834960 (ABran-WMF)
[13:24:20] <wikibugs>	 (CR) Hnowlan: [C:+1] geo-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036192 (https://phabricator.wikimedia.org/T360525) (owner: Santiago Faci)
[13:24:46] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9834967 (Ladsgroup) a:Ladsgroup
[13:24:51] <Gerges>	 RhinosF1: Sorry for the delay in replying, I don't have an arwiki account with which to test (I have an arwiki account with advanced privileges, so my test won't help)
[13:25:05] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:25:24] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9834966 (Ladsgroup) Hi, can you pick a name that's more aligned with our standardization policy? https://meta.wikimedia.org/wiki/Mailing_lists/Standardization if it's not possible, we ne...
[13:26:43] <RhinosF1>	 Gerges: create a legit alt?
[13:26:46] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1034884|Revert "arwiki: Disable Extension:ContentTranslation for non-autoreview users" (T255022)]] (duration: 20m 46s)
[13:26:54] <stashbot>	 T255022: Disable machine translation in Content Translation Tool for non-autoreview users on Arabic Wikipedia - https://phabricator.wikimedia.org/T255022
[13:27:04] <RhinosF1>	 Surely arwiki allows you to have legit alternate accounts
[13:27:20] <wikibugs>	 (PS1) Muehlenhoff: Add new access group to grant root on the wiki replicas [puppet] - https://gerrit.wikimedia.org/r/1036270
[13:27:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1036197 (https://phabricator.wikimedia.org/T315353) (owner: Bartosz Dziewoński)
[13:27:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1026511 (owner: Esanders)
[13:27:39] <wikibugs>	 (CR) Fabfur: [C:+2] benthos:cache: switch to rfc5424 format [puppet] - https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[13:27:51] <Lucas_WMDE>	 MatmaRex: will either of the changes be testable?
[13:27:52] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] services: move tegola in eqiad to the Thanos sidecar config for Swift [deployment-charts] - https://gerrit.wikimedia.org/r/1036269 (https://phabricator.wikimedia.org/T344324) (owner: Elukey)
[13:27:57] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:27:58] <Gerges>	 I have an alternate account, but I don't currently have access to that account 
[13:28:02] <wikibugs>	 (Merged) jenkins-bot: Enable wgDiscussionToolsEnablePermalinksBackend on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1036197 (https://phabricator.wikimedia.org/T315353) (owner: Bartosz Dziewoński)
[13:28:07] <wikibugs>	 (Merged) jenkins-bot: Pre-emptively disable DiscussionToolsEnableThanks (no-op) [mediawiki-config] - https://gerrit.wikimedia.org/r/1026511 (owner: Esanders)
[13:28:09] <wikibugs>	 (CR) CI reject: [V:-1] Add new access group to grant root on the wiki replicas [puppet] - https://gerrit.wikimedia.org/r/1036270 (owner: Muehlenhoff)
[13:28:11] <Lucas_WMDE>	 I’m assuming the second change won’t be testable since the feature isn’t even merged yet
[13:28:13] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:28:13] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.302 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:28:16] <Lucas_WMDE>	 and I’m not sure about the first either
[13:28:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1036197|Enable wgDiscussionToolsEnablePermalinksBackend on enwiki (T315353)]], [[gerrit:1026511|Pre-emptively disable DiscussionToolsEnableThanks (no-op)]]
[13:28:26] <stashbot>	 T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353
[13:28:59] <RhinosF1>	 Gerges: in future, it's probably best to just create another account then imo
[13:29:00] <fabfur>	 !log enabled puppet on cp4037 to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1035440 (T365718)
[13:29:01] <MatmaRex>	 Lucas_WMDE: wgDiscussionToolsEnablePermalinksBackend change should enable https://en.wikipedia.org/wiki/Special:FindComment
[13:29:02] <wikibugs>	 (CR) Elukey: [C:+2] services: move tegola in eqiad to the Thanos sidecar config for Swift [deployment-charts] - https://gerrit.wikimedia.org/r/1036269 (https://phabricator.wikimedia.org/T344324) (owner: Elukey)
[13:29:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:04] <stashbot>	 T365718: Switch HAProxy/Benthos to rfc5424 - https://phabricator.wikimedia.org/T365718
[13:29:06] <MatmaRex>	 Lucas_WMDE: the other one is a no-op
[13:29:13] <Lucas_WMDE>	 alright
[13:29:17] <RhinosF1>	 Gerges: and also say at the start of the window
[13:29:55] <Gerges>	 RhinosF1: ok
[13:30:35] <Lucas_WMDE>	 (+1)
[13:30:51] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 esanders and matmarex and lucaswerkmeister-wmde: Backport for [[gerrit:1036197|Enable wgDiscussionToolsEnablePermalinksBackend on enwiki (T315353)]], [[gerrit:1026511|Pre-emptively disable DiscussionToolsEnableThanks (no-op)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:30:56] <Lucas_WMDE>	 MatmaRex: sounds good, thanks
[13:31:38] <wikibugs>	 (CR) Volans: sre.hosts.reimage: add support for VLAN move (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: Volans)
[13:31:59] * Lucas_WMDE has no idea how to test Special:FindComment
[13:32:15] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on wikikube-ctrl2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:32:23] <MatmaRex>	 i'm testing it
[13:32:29] <Lucas_WMDE>	 ack
[13:32:29] <MatmaRex>	 for example, try this page: https://en.wikipedia.org/wiki/Special:FindComment?idorname=c-Izno-20240417204800-Jon_(WMF)-20240417141100
[13:32:39] <Lucas_WMDE>	 ooh, pasting an HTML id= seems to work
[13:32:40] <MatmaRex>	 on main server it shows no results, on test servers it shows a result
[13:32:42] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync
[13:32:43] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[13:32:45] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync
[13:32:47] <Lucas_WMDE>	 cool cool
[13:32:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 esanders and matmarex and lucaswerkmeister-wmde: Continuing with sync
[13:32:58] <MatmaRex>	 (and it says "not in current revision" because we need to run the script again, to backfill the most recent edits)
[13:33:14] <Lucas_WMDE>	 yup, I’ll do that afterwards
[13:33:27] <MatmaRex>	 yeah, just explaining why it's like that
[13:33:30] <Lucas_WMDE>	 just from the --start printed by the last run, right?
[13:33:49] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:34:07] <MatmaRex>	 let me double check
[13:34:11] <Lucas_WMDE>	 ok
[13:34:14] <wikibugs>	 (CR) Santiago Faci: [C:+2] edit-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - https://gerrit.wikimedia.org/r/1036188 (https://phabricator.wikimedia.org/T355407) (owner: Santiago Faci)
[13:34:58] <wikibugs>	 (PS2) Muehlenhoff: Add new access group to grant root on the wiki replicas [puppet] - https://gerrit.wikimedia.org/r/1036270
[13:34:59] <wikibugs>	 (Merged) jenkins-bot: edit-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - https://gerrit.wikimedia.org/r/1036188 (https://phabricator.wikimedia.org/T355407) (owner: Santiago Faci)
[13:36:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T364299)', diff saved to https://phabricator.wikimedia.org/P63318 and previous config saved to /var/cache/conftool/dbconfig/20240527-133605-marostegui.json
[13:36:09] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[13:36:11] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[13:36:22] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[13:36:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[13:36:29] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[13:36:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T364299)', diff saved to https://phabricator.wikimedia.org/P63319 and previous config saved to /var/cache/conftool/dbconfig/20240527-133636-marostegui.json
[13:38:49] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:40:05] <MatmaRex>	 Lucas_WMDE: looking back to how we did this for the previous wiki, i think that instead of --start, you should run the final one with --touched-after=<start date of the previous run>
[13:40:20] <Lucas_WMDE>	 okay
[13:40:34] <MatmaRex>	 e.g. https://phabricator.wikimedia.org/T315353#9078672
[13:40:36] <Lucas_WMDE>	 and still with --current and --all?
[13:40:51] <wikibugs>	 (CR) Hnowlan: [C:+1] maps: Don't pass additional server aliases when using PKI [puppet] - https://gerrit.wikimedia.org/r/1036247 (https://phabricator.wikimedia.org/T360778) (owner: Muehlenhoff)
[13:40:54] <MatmaRex>	 yes
[13:41:00] <Lucas_WMDE>	 alright
[13:42:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync
[13:42:59] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync
[13:43:59] <wikibugs>	 (PS3) NMW03: Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/1034941 (https://phabricator.wikimedia.org/T365133)
[13:44:47] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[13:45:05] <Lucas_WMDE>	 Nemoralis: do you know how to test the bswikiquote change?
[13:45:20] <Lucas_WMDE>	 (just checking in advance ^^)
[13:45:35] <Nemoralis>	 I know the regular testing process; does it need anything else?
[13:46:04] <Nemoralis>	 Will it work without running the updateCollation script?
[13:46:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync
[13:46:10] <Lucas_WMDE>	 probably not
[13:46:15] <Lucas_WMDE>	 but I was wondering that
[13:46:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1036197|Enable wgDiscussionToolsEnablePermalinksBackend on enwiki (T315353)]], [[gerrit:1026511|Pre-emptively disable DiscussionToolsEnableThanks (no-op)]] (duration: 18m 15s)
[13:46:44] <stashbot>	 T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353
[13:46:45] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1034941 (https://phabricator.wikimedia.org/T365133) (owner: NMW03)
[13:46:50] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync
[13:46:58] <Lucas_WMDE>	 apparently there are ~10k categorylinks rows, so I assume the maintenance script shouldn’t take too long
[13:47:17] <Lucas_WMDE>	 I was thinking more like, what even to look for, where the change would take effect
[13:47:29] <wikibugs>	 (Merged) jenkins-bot: Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/1034941 (https://phabricator.wikimedia.org/T365133) (owner: NMW03)
[13:47:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1034941|Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wikiquote (T365133)]]
[13:47:49] <stashbot>	 T365133: Set $wgCategoryCollation to 'uca-bs-u-kn' on Bosnian Wikiquote and rebuild category sort keys - https://phabricator.wikimedia.org/T365133
[13:48:25] <Lucas_WMDE>	 https://bs.wikiquote.org/wiki/Kategorija:Literatura seems to be the biggest category, but I’m not sure if we would see a difference there
[13:49:11] <Nemoralis>	 https://bs.wikiquote.org/wiki/Kategorija:Pisci looks fine to test
[13:49:32] <Nemoralis>	 it is small enough and has letters from Bosnian language
[13:50:08] <Lucas_WMDE>	 true
[13:50:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and nmw03: Backport for [[gerrit:1034941|Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wikiquote (T365133)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:50:21] <Lucas_WMDE>	 and https://bs.wikipedia.org/wiki/Kategorija:Amerikanci_po_porijeklu (random category) looks like Š is indeed supposed to sort after S
[13:50:52] <Lucas_WMDE>	 not seeing any change with mwdebug yet… probably needs the maintenance script first
[13:51:02] <Nemoralis>	 in the Pisci category, U is supposed to sort after Č
[13:51:16] <Nemoralis>	 per their alphabet
[13:51:23] <Nemoralis>	 https://en.wikipedia.org/wiki/Bosnian_language#Alphabet
[13:51:37] <Lucas_WMDE>	 ok
[13:51:43] <Lucas_WMDE>	 I guess we sync now and test after the maintenance script?
[13:51:49] <Nemoralis>	 yes
[13:51:52] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and nmw03: Continuing with sync
[13:52:57] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1380 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:54:39] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[13:54:52] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[13:55:11] <wikibugs>	 (PS5) Ayounsi: sre.hosts.reimage: add support for VLAN move [cookbooks] - https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: Volans)
[13:56:39] <wikibugs>	 (PS6) Ayounsi: sre.hosts.reimage: add support for VLAN move [cookbooks] - https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: Volans)
[13:57:31] <wikibugs>	 (CR) Ayounsi: sre.hosts.reimage: add support for VLAN move (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: Volans)
[13:58:22] <Lucas_WMDE>	 FTR, I have a meeting in a few minutes, so I might be a bit late to start the maintenance scripts
[13:58:23] <Lucas_WMDE>	 jouncebot: next
[13:58:23] <jouncebot>	 In 1 hour(s) and 31 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1530)
[13:58:32] <Lucas_WMDE>	 but I should get to it before then, I think
[13:58:53] <Nemoralis>	 (y)
[14:00:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T364069)', diff saved to https://phabricator.wikimedia.org/P63320 and previous config saved to /var/cache/conftool/dbconfig/20240527-140007-marostegui.json
[14:00:19] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[14:00:31] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2001.codfw.wmnet with OS bullseye
[14:01:16] <MatmaRex>	 thanks for deploying :)
[14:04:14] <wikibugs>	 (CR) Effie Mouzeli: "Yes, of course! The general k8s-mwdebug remains the primary destination. @James I think the current order is alright, given that testing s" [mediawiki-config] - https://gerrit.wikimedia.org/r/1035361 (https://phabricator.wikimedia.org/T365478) (owner: Effie Mouzeli)
[14:04:26] <wikibugs>	 (PS2) Effie Mouzeli: x-wikimedia-debug: add datacenter options for k8s [mediawiki-config] - https://gerrit.wikimedia.org/r/1035361 (https://phabricator.wikimedia.org/T365478)
[14:05:10] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1034941|Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wikiquote (T365133)]] (duration: 17m 25s)
[14:05:15] <stashbot>	 T365133: Set $wgCategoryCollation to 'uca-bs-u-kn' on Bosnian Wikiquote and rebuild category sort keys - https://phabricator.wikimedia.org/T365133
[14:06:12] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[14:11:39] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript updateCollation.php bswikiquote --previous-collation=uppercase # T365133
[14:11:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:44] <stashbot>	 T365133: Set $wgCategoryCollation to 'uca-bs-u-kn' on Bosnian Wikiquote and rebuild category sort keys - https://phabricator.wikimedia.org/T365133
[14:11:47] <Lucas_WMDE>	 already finished
[14:12:00] <Lucas_WMDE>	 and https://bs.wikiquote.org/wiki/Kategorija:Pisci looks different \o/
[14:12:02] <Lucas_WMDE>	 (cc Nemoralis)
[14:12:12] <Nemoralis>	 yep, works for me!
[14:12:13] <Nemoralis>	 thanks
[14:12:16] <Lucas_WMDE>	 np :)
[14:12:23] <Lucas_WMDE>	 right, let’s do one more maintenance script for MatmaRex then ;)
[14:13:34] <wikibugs>	 (PS1) Fabfur: benthos:cache: fix processing syntax [puppet] - https://gerrit.wikimedia.org/r/1036276 (https://phabricator.wikimedia.org/T365718)
[14:14:11] <Lucas_WMDE>	 !log START lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwiki --current --all --touched-after=20240524120000 2>&1 | tee -a ~/T315510-enwiki-7; date # cc T365974
[14:14:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:20] <stashbot>	 T365974: Deploy talk page permalinks to en.wiki - https://phabricator.wikimedia.org/T365974
[14:15:17] <Lucas_WMDE>	 I really hope the “estimated X rows” here is very inaccurate 😅
[14:15:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P63321 and previous config saved to /var/cache/conftool/dbconfig/20240527-141515-marostegui.json
[14:15:25] <Lucas_WMDE>	 (Processed 300 (updated 287) of 61401202 rows)
[14:16:49] <Lucas_WMDE>	 wow, SELECT COUNT(*) FROM page WHERE page_touched > '20240524120000'; says 3948247
[14:16:56] <Lucas_WMDE>	 almost four million touched pages
[14:17:31] <wikibugs>	 (CR) Vgutierrez: [C:+1] benthos:cache: fix processing syntax [puppet] - https://gerrit.wikimedia.org/r/1036276 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[14:17:50] <Lucas_WMDE>	 its the weekend baby. you know what that means. its time to drink precisely one beer and touch four million enwiki pages
[14:17:57] <wikibugs>	 (CR) Fabfur: [C:+2] benthos:cache: fix processing syntax [puppet] - https://gerrit.wikimedia.org/r/1036276 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[14:18:36] <wikibugs>	 (PS1) Hashar: Merge tag 'v3.8.6' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1036278
[14:18:55] <Lucas_WMDE>	 I’m otherwise done deploying btw
[14:19:28] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1238.eqiad.wmnet with reason: Maintenance
[14:19:41] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1238.eqiad.wmnet with reason: Maintenance
[14:19:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T360332)', diff saved to https://phabricator.wikimedia.org/P63322 and previous config saved to /var/cache/conftool/dbconfig/20240527-141948-arnaudb.json
[14:19:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T364299)', diff saved to https://phabricator.wikimedia.org/P63323 and previous config saved to /var/cache/conftool/dbconfig/20240527-141949-marostegui.json
[14:19:54] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[14:19:59] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[14:21:19] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] x-wikimedia-debug: add datacenter options for k8s [puppet] - https://gerrit.wikimedia.org/r/1034514 (https://phabricator.wikimedia.org/T365478) (owner: Effie Mouzeli)
[14:22:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T360332)', diff saved to https://phabricator.wikimedia.org/P63324 and previous config saved to /var/cache/conftool/dbconfig/20240527-142210-arnaudb.json
[14:22:57] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1380 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:26:48] <wikibugs>	 (PS2) Hashar: Merge tag 'v3.8.6' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1036278 (https://phabricator.wikimedia.org/T365328)
[14:28:13] <wikibugs>	 SRE, Acme-chief, Infrastructure-Foundations, Puppet-Infrastructure, Traffic: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#9835132 (MoritzMuehlenhoff) p:Triage→High
[14:30:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240527-143025-marostegui.json
[14:34:29] <wikibugs>	 (PS1) Fabfur: Revert "benthos:cache: switch to rfc5424 format" [puppet] - https://gerrit.wikimedia.org/r/1036213
[14:34:39] <wikibugs>	 (CR) CI reject: [V:-1] Revert "benthos:cache: switch to rfc5424 format" [puppet] - https://gerrit.wikimedia.org/r/1036213 (owner: Fabfur)
[14:34:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P63326 and previous config saved to /var/cache/conftool/dbconfig/20240527-143457-marostegui.json
[14:35:13] <wikibugs>	 (PS1) Fabfur: Revert "benthos:cache: fix processing syntax" [puppet] - https://gerrit.wikimedia.org/r/1036214
[14:36:48] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P63327 and previous config saved to /var/cache/conftool/dbconfig/20240527-143718-arnaudb.json
[14:37:59] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] coredump.conf: Disable compression [puppet] - https://gerrit.wikimedia.org/r/1029235 (https://phabricator.wikimedia.org/T236253) (owner: Ahmon Dancy)
[14:39:17] <wikibugs>	 (CR) Fabfur: [C:+2] Revert "benthos:cache: fix processing syntax" [puppet] - https://gerrit.wikimedia.org/r/1036214 (owner: Fabfur)
[14:40:57] <wikibugs>	 (CR) Fabfur: [V:+2] Revert "benthos:cache: switch to rfc5424 format" [puppet] - https://gerrit.wikimedia.org/r/1036213 (owner: Fabfur)
[14:41:06] <wikibugs>	 (PS2) Fabfur: Revert "benthos:cache: switch to rfc5424 format" [puppet] - https://gerrit.wikimedia.org/r/1036213
[14:41:09] <wikibugs>	 (CR) Effie Mouzeli: (WIP) memcached: add extstore option (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) (owner: Effie Mouzeli)
[14:45:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T364069)', diff saved to https://phabricator.wikimedia.org/P63328 and previous config saved to /var/cache/conftool/dbconfig/20240527-144538-marostegui.json
[14:45:41] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[14:45:45] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[14:45:45] <wikibugs>	 (PS9) Effie Mouzeli: (WIP) memcached: add extstore option [puppet] - https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885)
[14:45:46] <wikibugs>	 (PS1) Effie Mouzeli: hieradata: enable extstore on mc1049 and mc2049 [puppet] - https://gerrit.wikimedia.org/r/1036281 (https://phabricator.wikimedia.org/T352885)
[14:45:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[14:46:09] <wikibugs>	 (CR) CI reject: [V:-1] hieradata: enable extstore on mc1049 and mc2049 [puppet] - https://gerrit.wikimedia.org/r/1036281 (https://phabricator.wikimedia.org/T352885) (owner: Effie Mouzeli)
[14:46:58] <wikibugs>	 (PS2) Effie Mouzeli: hieradata: enable extstore on mc1049 and mc2049 [puppet] - https://gerrit.wikimedia.org/r/1036281 (https://phabricator.wikimedia.org/T352885)
[14:47:07] <wikibugs>	 (CR) Fabfur: [C:+2] Revert "benthos:cache: switch to rfc5424 format" [puppet] - https://gerrit.wikimedia.org/r/1036213 (owner: Fabfur)
[14:47:16] <wikibugs>	 SRE, Infrastructure-Foundations: Investigate janitor, maintenance emails parser - https://phabricator.wikimedia.org/T230835#9835224 (elukey)
[14:47:26] <wikibugs>	 (CR) Effie Mouzeli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1036281 (https://phabricator.wikimedia.org/T352885) (owner: Effie Mouzeli)
[14:47:37] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, Spicerack: Add a cookbook to safely deploy puppet changes - https://phabricator.wikimedia.org/T341442#9835228 (elukey)
[14:50:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P63329 and previous config saved to /var/cache/conftool/dbconfig/20240527-145004-marostegui.json
[14:52:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P63330 and previous config saved to /var/cache/conftool/dbconfig/20240527-145226-arnaudb.json
[14:55:08] <wikibugs>	 (PS10) Effie Mouzeli: memcached: add extstore option [puppet] - https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885)
[14:56:39] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] memcached: add extstore option [puppet] - https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) (owner: Effie Mouzeli)
[14:56:48] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:01:21] <fabfur>	 !log enable puppet on A:cp (T365718)
[15:01:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:27] <stashbot>	 T365718: Switch HAProxy/Benthos to rfc5424 - https://phabricator.wikimedia.org/T365718
[15:01:27] <wikibugs>	 SRE, Data-Persistence, DBA, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9835264 (ABran-WMF)
[15:01:36] <wikibugs>	 SRE, Data-Persistence, DBA, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9835266 (ABran-WMF)
[15:01:48] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1036247 (https://phabricator.wikimedia.org/T360778) (owner: Muehlenhoff)
[15:01:49] <wikibugs>	 SRE, Data-Persistence, DBA, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9835267 (ABran-WMF)
[15:01:58] <wikibugs>	 SRE, Data-Persistence, DBA, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9835268 (ABran-WMF)
[15:02:07] <wikibugs>	 SRE, Data-Persistence, DBA, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987#9835269 (ABran-WMF)
[15:02:17] <wikibugs>	 SRE, Data-Persistence, DBA, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9835270 (ABran-WMF)
[15:02:24] <wikibugs>	 SRE, Data-Persistence, DBA, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9835271 (ABran-WMF)
[15:02:33] <wikibugs>	 SRE, Data-Persistence, DBA, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9835272 (ABran-WMF)
[15:02:55] <wikibugs>	 SRE, Data-Persistence, DBA, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9835273 (ABran-WMF)
[15:03:02] <wikibugs>	 SRE, Data-Persistence, DBA, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9835274 (ABran-WMF)
[15:03:16] <wikibugs>	 SRE, Data-Persistence, DBA, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9835275 (ABran-WMF)
[15:03:24] <wikibugs>	 SRE, Data-Persistence, DBA, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9835276 (ABran-WMF)
[15:05:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T364299)', diff saved to https://phabricator.wikimedia.org/P63331 and previous config saved to /var/cache/conftool/dbconfig/20240527-150514-marostegui.json
[15:05:17] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[15:05:20] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[15:05:30] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[15:07:09] <wikibugs>	 (CR) Muehlenhoff: [C:+2] maps: Don't pass additional server aliases when using PKI [puppet] - https://gerrit.wikimedia.org/r/1036247 (https://phabricator.wikimedia.org/T360778) (owner: Muehlenhoff)
[15:07:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T360332)', diff saved to https://phabricator.wikimedia.org/P63332 and previous config saved to /var/cache/conftool/dbconfig/20240527-150735-arnaudb.json
[15:07:41] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[15:10:57] <wikibugs>	 (PS3) Effie Mouzeli: hieradata: enable extstore on mc1049 and mc2049 [puppet] - https://gerrit.wikimedia.org/r/1036281 (https://phabricator.wikimedia.org/T352885)
[15:11:21] <wikibugs>	 (PS3) Muehlenhoff: maps: Switch kartotherian on maps2007 to PKI [puppet] - https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778)
[15:12:11] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778) (owner: Muehlenhoff)
[15:13:02] <wikibugs>	 (PS4) Effie Mouzeli: hieradata: enable extstore on mc1049 and mc2049 [puppet] - https://gerrit.wikimedia.org/r/1036281 (https://phabricator.wikimedia.org/T352885)
[15:15:16] <wikibugs>	 (CR) Hashar: [C:+2] Merge tag 'v3.8.6' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1036278 (https://phabricator.wikimedia.org/T365328) (owner: Hashar)
[15:20:00] <wikibugs>	 (PS1) Elukey: Move thanos-fe1002's envoy to CFSSL/PKI [puppet] - https://gerrit.wikimedia.org/r/1036284 (https://phabricator.wikimedia.org/T344324)
[15:20:20] <wikibugs>	 (CR) CI reject: [V:-1] Move thanos-fe1002's envoy to CFSSL/PKI [puppet] - https://gerrit.wikimedia.org/r/1036284 (https://phabricator.wikimedia.org/T344324) (owner: Elukey)
[15:20:31] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[15:22:07] <wikibugs>	 (Merged) jenkins-bot: Merge tag 'v3.8.6' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1036278 (https://phabricator.wikimedia.org/T365328) (owner: Hashar)
[15:22:25] <effie>	 !log disable puppet on mc1049 pending OS upgrade
[15:22:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:06] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] hieradata: enable extstore on mc1049 and mc2049 [puppet] - https://gerrit.wikimedia.org/r/1036281 (https://phabricator.wikimedia.org/T352885) (owner: Effie Mouzeli)
[15:23:31] <wikibugs>	 (PS8) Kamila Součková: [WIP] create a shellbox deployment for videoscalers [deployment-charts] - https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309)
[15:24:23] <wikibugs>	 (PS2) Elukey: Move thanos-fe1002's envoy to CFSSL/PKI [puppet] - https://gerrit.wikimedia.org/r/1036284 (https://phabricator.wikimedia.org/T344324)
[15:26:07] <wikibugs>	 (CR) Kamila Součková: "Is this good to go after I do something halfway sensible about https://gerrit.wikimedia.org/r/1005139 (the timeout patch)?" [deployment-charts] - https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: Kamila Součková)
[15:26:32] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1036284 (https://phabricator.wikimedia.org/T344324) (owner: Elukey)
[15:26:48] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:26:57] <wikibugs>	 (CR) Elukey: Move thanos-fe1002's envoy to CFSSL/PKI [puppet] - https://gerrit.wikimedia.org/r/1036284 (https://phabricator.wikimedia.org/T344324) (owner: Elukey)
[15:28:47] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2049.codfw.wmnet with OS bookworm
[15:28:49] <wikibugs>	 (PS1) Santiago Faci: edit-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - https://gerrit.wikimedia.org/r/1036285 (https://phabricator.wikimedia.org/T355407)
[15:29:03] <wikibugs>	 SRE, SRE-swift-storage, Patch-For-Review: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9835385 (elukey) Tegola is now using envoy (sidecar) to connect to Thanos Swift, so in theory we are good to proceed.  Next step: * Move thanos-fe1002...
[15:29:52] <wikibugs>	 (PS1) Hashar: Upgrade to Gerrit v3.8.6 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1036286 (https://phabricator.wikimedia.org/T365328)
[15:30:04] <jouncebot>	 jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1530).
[15:30:37] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[15:34:29] <wikibugs>	 (PS1) KartikMistry: Section Translation: Enable in newly created Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1036289 (https://phabricator.wikimedia.org/T366003)
[15:44:15] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[15:46:56] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689) (owner: Vgutierrez)
[15:49:10] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 40 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:50:26] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[15:50:39] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[15:53:30] <wikibugs>	 (CR) Elukey: [C:+2] redfish: fix typo in DellSCP's class descr [software/spicerack] - https://gerrit.wikimedia.org/r/1035791 (owner: Elukey)
[15:54:10] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 31 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:54:20] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[15:54:44] <wikibugs>	 (CR) Brouberol: [C:+1] edit-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - https://gerrit.wikimedia.org/r/1036285 (https://phabricator.wikimedia.org/T355407) (owner: Santiago Faci)
[15:56:00] <logmsgbot>	 !log jiji@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mc2049.codfw.wmnet with OS bookworm
[15:56:14] <elukey>	 !log run `apt-get clean` on dse-k8s-worker1001 to free space on the root partition
[15:56:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:11] <wikibugs>	 SRE, serviceops, Data Products (Data Products Sprint 14), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9835532 (WDoranWMF)
[15:57:18] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2049.codfw.wmnet with OS bookworm
[15:58:10] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Datacenter-Switchover: Make mailman3 work in the standby host (lists2001.wikimedia.org) - https://phabricator.wikimedia.org/T283615#9835562 (eoghan)
[15:59:51] <wikibugs>	 (Merged) jenkins-bot: redfish: fix typo in DellSCP's class descr [software/spicerack] - https://gerrit.wikimedia.org/r/1035791 (owner: Elukey)
[16:00:54] <wikibugs>	 (CR) Santiago Faci: [C:+2] edit-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - https://gerrit.wikimedia.org/r/1036285 (https://phabricator.wikimedia.org/T355407) (owner: Santiago Faci)
[16:01:14] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 37 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:01:43] <wikibugs>	 (Merged) jenkins-bot: edit-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - https://gerrit.wikimedia.org/r/1036285 (https://phabricator.wikimedia.org/T355407) (owner: Santiago Faci)
[16:02:47] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[16:03:25] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[16:03:27] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[16:06:26] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 27 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:07:52] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:09:54] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[16:18:55] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[16:19:11] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[16:19:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[16:22:10] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release edit-analytics/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=edit-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[16:29:00] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[16:32:10] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release edit-analytics/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=edit-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[16:36:08] <logmsgbot>	 !log jiji@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mc2049.codfw.wmnet with OS bookworm
[16:37:53] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[16:38:06] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[16:47:56] <wikibugs>	 (PS8) Kamila Součková: shellbox: add PHP + Apache timeout settings [deployment-charts] - https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309)
[16:47:56] <wikibugs>	 (CR) Kamila Součková: "Turns out this works, but it took me a long time to understand what exactly is happening because turns out child processes do not get kill" [deployment-charts] - https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: Kamila Součková)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1700)
[17:00:05] <jouncebot>	 ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1700).
[17:04:40] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:05:24] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:06:16] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51925 bytes in 1.927 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:06:32] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:09:42] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:22:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance
[17:22:36] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance
[17:22:38] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[17:22:51] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[17:22:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2128 (T364069)', diff saved to https://phabricator.wikimedia.org/P63333 and previous config saved to /var/cache/conftool/dbconfig/20240527-172258-marostegui.json
[17:23:03] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[17:24:44] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:25:24] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:27:21] <wikibugs>	 (CR) Pppery: "This change seems to have accidentally clobbered https://gerrit.wikimedia.org/r/c/phabricator/translations/+/1035805" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1036251 (owner: L10n-bot)
[17:30:15] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance
[17:30:28] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance
[17:30:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2127 (T364299)', diff saved to https://phabricator.wikimedia.org/P63334 and previous config saved to /var/cache/conftool/dbconfig/20240527-173035-marostegui.json
[17:30:47] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[17:31:12] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:34:42] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 4.912 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:35:06] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:35:16] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:35:34] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: set command for hf image and remove nllb [deployment-charts] - https://gerrit.wikimedia.org/r/1036297 (https://phabricator.wikimedia.org/T365842)
[17:44:40] <wikibugs>	 (CR) Ssingh: "Looking good but one nit that the commit message needs to be updated: we are doing just cp6001 in drmrs." [puppet] - https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: CDobbins)
[18:16:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T364069)', diff saved to https://phabricator.wikimedia.org/P63335 and previous config saved to /var/cache/conftool/dbconfig/20240527-181607-marostegui.json
[18:16:15] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[18:24:03] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1036284 (https://phabricator.wikimedia.org/T344324) (owner: Elukey)
[18:24:46] <wikibugs>	 (PS1) Gmodena: EventStreamConfig: Add webrequest.frontend.error [mediawiki-config] - https://gerrit.wikimedia.org/r/1036299 (https://phabricator.wikimedia.org/T314956)
[18:25:28] <wikibugs>	 (CR) CI reject: [V:-1] EventStreamConfig: Add webrequest.frontend.error [mediawiki-config] - https://gerrit.wikimedia.org/r/1036299 (https://phabricator.wikimedia.org/T314956) (owner: Gmodena)
[18:31:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P63336 and previous config saved to /var/cache/conftool/dbconfig/20240527-183115-marostegui.json
[18:46:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P63337 and previous config saved to /var/cache/conftool/dbconfig/20240527-184624-marostegui.json
[18:53:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[18:55:12] <wikibugs>	 (PS2) Majavah: wikilabels::session: Set now-required memcached_user [puppet] - https://gerrit.wikimedia.org/r/1035762
[18:56:47] <wikibugs>	 (CR) Majavah: [C:+2] wikilabels::session: Set now-required memcached_user [puppet] - https://gerrit.wikimedia.org/r/1035762 (owner: Majavah)
[18:58:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[19:01:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T364069)', diff saved to https://phabricator.wikimedia.org/P63338 and previous config saved to /var/cache/conftool/dbconfig/20240527-190132-marostegui.json
[19:01:35] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance
[19:01:40] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[19:01:48] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance
[19:01:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T364069)', diff saved to https://phabricator.wikimedia.org/P63339 and previous config saved to /var/cache/conftool/dbconfig/20240527-190155-marostegui.json
[19:06:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T364299)', diff saved to https://phabricator.wikimedia.org/P63340 and previous config saved to /var/cache/conftool/dbconfig/20240527-190634-marostegui.json
[19:06:39] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[19:21:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P63341 and previous config saved to /var/cache/conftool/dbconfig/20240527-192142-marostegui.json
[19:26:48] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:36:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P63342 and previous config saved to /var/cache/conftool/dbconfig/20240527-193650-marostegui.json
[19:51:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T364299)', diff saved to https://phabricator.wikimedia.org/P63343 and previous config saved to /var/cache/conftool/dbconfig/20240527-195158-marostegui.json
[19:52:01] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[19:52:04] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[19:52:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[19:52:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T364299)', diff saved to https://phabricator.wikimedia.org/P63344 and previous config saved to /var/cache/conftool/dbconfig/20240527-195232-marostegui.json
[19:54:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P63345 and previous config saved to /var/cache/conftool/dbconfig/20240527-195404-ladsgroup.json
[20:01:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T364069)', diff saved to https://phabricator.wikimedia.org/P63346 and previous config saved to /var/cache/conftool/dbconfig/20240527-200106-marostegui.json
[20:01:12] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[20:08:07] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[20:09:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P63347 and previous config saved to /var/cache/conftool/dbconfig/20240527-200910-ladsgroup.json
[20:16:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P63348 and previous config saved to /var/cache/conftool/dbconfig/20240527-201614-marostegui.json
[20:24:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P63349 and previous config saved to /var/cache/conftool/dbconfig/20240527-202416-ladsgroup.json
[20:31:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P63350 and previous config saved to /var/cache/conftool/dbconfig/20240527-203122-marostegui.json
[20:39:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P63351 and previous config saved to /var/cache/conftool/dbconfig/20240527-203922-ladsgroup.json
[20:46:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T364069)', diff saved to https://phabricator.wikimedia.org/P63352 and previous config saved to /var/cache/conftool/dbconfig/20240527-204630-marostegui.json
[20:46:33] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance
[20:46:36] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[20:46:46] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance
[20:46:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T364069)', diff saved to https://phabricator.wikimedia.org/P63353 and previous config saved to /var/cache/conftool/dbconfig/20240527-204653-marostegui.json
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: Time to do the Weekly Security deployment window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T2100).
[21:09:42] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:14:51] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958#9836202 (TheDJ) accidentally attached patch to wrong ticket.
[21:15:05] <wikibugs>	 (PS1) Gergő Tisza: [multiversion] Add 'manage-dblist init-labs' subcommand [mediawiki-config] - https://gerrit.wikimedia.org/r/1036313
[21:27:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T364299)', diff saved to https://phabricator.wikimedia.org/P63355 and previous config saved to /var/cache/conftool/dbconfig/20240527-212738-marostegui.json
[21:27:43] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[21:35:23] <wikibugs>	 (CR) Gergő Tisza: beta: Introduce new test2wiki on test.wikipedia.beta.wmcloud.org (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: Pmiazga)
[21:42:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T364069)', diff saved to https://phabricator.wikimedia.org/P63356 and previous config saved to /var/cache/conftool/dbconfig/20240527-214210-marostegui.json
[21:42:17] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[21:42:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P63357 and previous config saved to /var/cache/conftool/dbconfig/20240527-214246-marostegui.json
[21:52:44] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032 (Soda) NEW
[21:53:52] <wikibugs>	 (CR) Gergő Tisza: "Other things that I think need an update:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: Pmiazga)
[21:55:29] <wikibugs>	 (CR) Gergő Tisza: "Other things that might need to be updated:" [puppet] - https://gerrit.wikimedia.org/r/1035752 (https://phabricator.wikimedia.org/T355281) (owner: Pmiazga)
[21:57:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P63358 and previous config saved to /var/cache/conftool/dbconfig/20240527-215719-marostegui.json
[21:57:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P63359 and previous config saved to /var/cache/conftool/dbconfig/20240527-215754-marostegui.json
[22:07:50] <icinga-wm_>	 PROBLEM - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[22:12:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P63360 and previous config saved to /var/cache/conftool/dbconfig/20240527-221227-marostegui.json
[22:13:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T364299)', diff saved to https://phabricator.wikimedia.org/P63361 and previous config saved to /var/cache/conftool/dbconfig/20240527-221302-marostegui.json
[22:13:06] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[22:13:07] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[22:13:19] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[22:13:20] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance
[22:13:23] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance
[22:13:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T364299)', diff saved to https://phabricator.wikimedia.org/P63362 and previous config saved to /var/cache/conftool/dbconfig/20240527-221330-marostegui.json
[22:13:50] <icinga-wm_>	 RECOVERY - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[22:27:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T364069)', diff saved to https://phabricator.wikimedia.org/P63363 and previous config saved to /var/cache/conftool/dbconfig/20240527-222735-marostegui.json
[22:27:38] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance
[22:27:42] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[22:27:52] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance
[22:28:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T364069)', diff saved to https://phabricator.wikimedia.org/P63364 and previous config saved to /var/cache/conftool/dbconfig/20240527-222759-marostegui.json
[23:12:44] <icinga-wm_>	 PROBLEM - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[23:13:52] <icinga-wm_>	 PROBLEM - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[23:20:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T364069)', diff saved to https://phabricator.wikimedia.org/P63365 and previous config saved to /var/cache/conftool/dbconfig/20240527-232025-marostegui.json
[23:20:33] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[23:21:44] <icinga-wm_>	 RECOVERY - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[23:21:53] <icinga-wm_>	 RECOVERY - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[23:26:48] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:35:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P63366 and previous config saved to /var/cache/conftool/dbconfig/20240527-233533-marostegui.json
[23:38:10] <icinga-wm_>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:38:12] <icinga-wm_>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:38:19] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1035870
[23:38:19] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1035870 (owner: TrainBranchBot)
[23:39:10] <icinga-wm_>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:39:12] <icinga-wm_>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:42:52] <icinga-wm_>	 PROBLEM - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[23:43:48] <icinga-wm_>	 PROBLEM - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[23:47:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T364299)', diff saved to https://phabricator.wikimedia.org/P63367 and previous config saved to /var/cache/conftool/dbconfig/20240527-234705-marostegui.json
[23:47:13] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[23:50:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P63368 and previous config saved to /var/cache/conftool/dbconfig/20240527-235041-marostegui.json
[23:51:54] <icinga-wm_>	 RECOVERY - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[23:52:48] <icinga-wm_>	 RECOVERY - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39