[00:01:05] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1035867 (owner: TrainBranchBot) [00:04:54] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:04:58] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:12:31] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:12:35] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:17:10] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:17:14] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:26:32] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:26:36] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:31:32] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:31:36] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:39:44] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:39:48] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:54:43] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:54:48] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:59:12] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:59:16] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:02:20] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:02:25] !log @deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [01:11:43] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:11:47] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:15:32] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:15:36] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:19:15] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:19:19] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:22:45] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 163 probes of 729 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:27:39] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:27:43] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:27:47] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 43 probes of 729 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:36:01] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:36:05] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:47:09] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:47:13] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:51:47] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:51:51] !log 
@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:02:55] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:02:59] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:08:57] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:09:01] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:11:05] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:11:10] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:18:18] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:18:22] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:31:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:33:06] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:33:10] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:34:54] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:34:58] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:36:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:36:47] FIRING: JobUnavailable: 
Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:49:52] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:49:56] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:52:05] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:52:10] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:54:14] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:54:18] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:56:47] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:56:52] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:56:56] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:05:40] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:05:44] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:06:47] FIRING: [4x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:08:25] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:08:29] !log @deploy1002 helmfile [eqiad] 
DONE helmfile.d/services/cirrus-streaming-updater: apply [03:11:50] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:11:54] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:19:53] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:19:57] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:24:41] FIRING: [12x] SystemdUnitFailed: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:26:47] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:29:18] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:29:22] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:35:40] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on ml-staging-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:36:23] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:36:28] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:40:32] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:40:36] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:42:20] !log @deploy1002 helmfile [eqiad] START 
helmfile.d/services/cirrus-streaming-updater: apply [03:42:24] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:49:56] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:50:00] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:53:09] PROBLEM - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [03:53:59] PROBLEM - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [04:03:09] RECOVERY - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [04:04:03] RECOVERY - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [04:04:30] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:04:34] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:11:42] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:11:46] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:14:20] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:14:24] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:16:08] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:16:12] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:19:26] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:19:31] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:21:15] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:21:19] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:27:43] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:27:48] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:38:32] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:38:36] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:40:31] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:40:35] !log @deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [04:43:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [04:43:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [04:44:05] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:44:09] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:47:29] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:47:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:51:51] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797#9833744 (Marostegui) Open→Declined The RAID is still in optimal, let's close this for now. [04:52:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [04:52:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [04:52:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [04:52:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [04:53:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T364069)', diff saved to https://phabricator.wikimedia.org/P63250 and previous config saved to /var/cache/conftool/dbconfig/20240527-045301-marostegui.json [04:53:06] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [04:54:38] !log @deploy1002 helmfile [eqiad] START 
helmfile.d/services/cirrus-streaming-updater: apply [04:54:42] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:01:29] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:01:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:03:27] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:03:31] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:05:44] (PS1) Marostegui: db1243: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1035903 [05:05:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1243', diff saved to https://phabricator.wikimedia.org/P63251 and previous config saved to /var/cache/conftool/dbconfig/20240527-050551-marostegui.json [05:06:35] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:06:39] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:07:09] (CR) Marostegui: [C:+2] db1243: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1035903 (owner: Marostegui) [05:07:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1243.eqiad.wmnet with OS bookworm [05:08:33] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:08:38] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:15:52] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:15:56] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:21:58] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:22:02] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:24:06] !log 
@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:24:10] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:24:34] !log marostegui@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1243.eqiad.wmnet with OS bookworm [05:24:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1243.eqiad.wmnet with OS bookworm [05:25:35] RECOVERY - snapshot of s2 in eqiad on backupmon1001 is OK: Last snapshot for s2 at eqiad (db1225) taken on 2024-05-27 04:06:28 (1245 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [05:33:37] (CR) KartikMistry: [C:+2] Update cxserver to 2024-05-20-182409-production [deployment-charts] - https://gerrit.wikimedia.org/r/1034211 (https://phabricator.wikimedia.org/T354666) (owner: KartikMistry) [05:33:57] Deploying cxserver, minor config changes. [05:34:04] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:34:08] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:34:47] (Merged) jenkins-bot: Update cxserver to 2024-05-20-182409-production [deployment-charts] - https://gerrit.wikimedia.org/r/1034211 (https://phabricator.wikimedia.org/T354666) (owner: KartikMistry) [05:43:08] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:43:12] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:49:56] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:50:01] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:52:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T364069)', diff saved to https://phabricator.wikimedia.org/P63252 and previous config saved to 
/var/cache/conftool/dbconfig/20240527-055244-marostegui.json [05:52:49] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [05:53:33] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:53:55] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:58:09] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:58:13] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:59:20] SRE, Wikimedia-Mailing-lists, Datacenter-Switchover: Make mailman3 work in the standby host (lists2001.wikimedia.org) - https://phabricator.wikimedia.org/T283615#9833798 (LSobanski) [05:59:57] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:00:01] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:01:14] SRE, Wikimedia-Mailing-lists, Datacenter-Switchover: Make mailman3 work in the standby host (lists2001.wikimedia.org) - https://phabricator.wikimedia.org/T283615#9833800 (LSobanski) [06:01:56] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:02:00] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:03:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:06:31] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:06:35] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:07:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1161', diff saved to https://phabricator.wikimedia.org/P63253 and previous config saved to /var/cache/conftool/dbconfig/20240527-060752-marostegui.json [06:08:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:08:40] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:08:44] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:12:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [06:12:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [06:12:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T364299)', diff saved to https://phabricator.wikimedia.org/P63255 and previous config saved to /var/cache/conftool/dbconfig/20240527-061252-marostegui.json [06:12:57] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [06:15:22] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:15:53] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:17:14] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [06:17:21] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:17:26] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:17:49] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:23:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db1161', diff saved to https://phabricator.wikimedia.org/P63256 and previous config saved to /var/cache/conftool/dbconfig/20240527-062301-marostegui.json [06:25:06] !log Updated cxserver to 2024-05-20-182409-production (T354666, T365230) [06:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:13] T354666: Enable MADLAD-400 in MinT test instance and Production for Wikipedia languages not supported by other services - https://phabricator.wikimedia.org/T354666 [06:25:13] T365230: Post-creation work for dtpwiki - https://phabricator.wikimedia.org/T365230 [06:27:54] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:27:58] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:34:23] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:34:27] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:38:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T364069)', diff saved to https://phabricator.wikimedia.org/P63257 and previous config saved to /var/cache/conftool/dbconfig/20240527-063809-marostegui.json [06:38:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1183.eqiad.wmnet with reason: Maintenance [06:38:14] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [06:38:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1183.eqiad.wmnet with reason: Maintenance [06:38:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1183 (T364069)', diff saved to https://phabricator.wikimedia.org/P63258 and previous config saved to /var/cache/conftool/dbconfig/20240527-063832-marostegui.json [06:40:26] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:40:30] !log @deploy1002 
helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:44:45] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1243.eqiad.wmnet with OS bookworm [06:47:25] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:47:29] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:50:43] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:50:47] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:53:27] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:53:31] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:55:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T364299)', diff saved to https://phabricator.wikimedia.org/P63259 and previous config saved to /var/cache/conftool/dbconfig/20240527-065518-marostegui.json [06:55:24] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [07:00:05] Amir1 and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:02:06] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:02:10] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:05:46] (CR) Muehlenhoff: [C:+2] Deprecate system::role for Blazegraph services [puppet] - https://gerrit.wikimedia.org/r/1035737 (owner: Muehlenhoff) [07:06:20] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:06:25] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:06:47] FIRING: [4x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:07:48] (CR) Muehlenhoff: [C:+2] Deprecate system::role for initial set of WMCS roles [puppet] - https://gerrit.wikimedia.org/r/1035739 (owner: Muehlenhoff) [07:08:42] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:08:46] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:10:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P63260 and previous config saved to /var/cache/conftool/dbconfig/20240527-071026-marostegui.json [07:12:41] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:12:45] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:16:51] ops-eqiad, DBA, DC-Ops: Upgrade db1243 NICs firmware - https://phabricator.wikimedia.org/T365963 (Marostegui) NEW [07:16:54] ops-eqiad, DBA, DC-Ops: Upgrade db1243 NICs firmware - https://phabricator.wikimedia.org/T365963#9833903 (Marostegui) p:Triage→High [07:18:33] 
SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, DBA, Schema-change-in-production, Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9833905 (Marostegui) There has been no i... [07:18:42] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2174.codfw.wmnet [07:19:29] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:19:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:21:09] (PS1) Muehlenhoff: Switch db2174 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1036079 (https://phabricator.wikimedia.org/T349619) [07:21:16] !log Deploy schema change on s7 codfw dbmaint T307501 [07:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:21] T307501: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 [07:22:01] (CR) Muehlenhoff: [C:+2] Switch db2174 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1036079 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [07:24:41] FIRING: [12x] SystemdUnitFailed: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:24:59] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:25:03] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:25:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P63261 and previous config saved to /var/cache/conftool/dbconfig/20240527-072534-marostegui.json 
[07:26:47] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:28:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2174.codfw.wmnet [07:29:06] ops-eqiad, DBA, DC-Ops: Upgrade db1243 NICs firmware - https://phabricator.wikimedia.org/T365963#9833916 (Marostegui) [07:29:28] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2176.codfw.wmnet [07:29:52] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9833915 (Marostegui) [07:33:07] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:33:11] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:34:35] SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, DBA, Schema-change-in-production, Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9833926 (Marostegui) [07:35:05] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:35:10] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:35:30] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s6 T365783 [07:35:34] T365783: Switchover s6 master (db2129 -> db2214) - https://phabricator.wikimedia.org/T365783 [07:35:40] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on ml-staging-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:35:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2214 with weight 0 T365783', diff saved to https://phabricator.wikimedia.org/P63262 and previous config saved to /var/cache/conftool/dbconfig/20240527-073545-root.json [07:35:53] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s6 T365783 [07:36:52] (03PS2) 10Gerrit maintenance bot: mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1034939 (https://phabricator.wikimedia.org/T365783) [07:36:54] (03PS1) 10Muehlenhoff: Switch db2176 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036178 (https://phabricator.wikimedia.org/T349619) [07:36:54] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:36:58] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:37:09] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1034939 (https://phabricator.wikimedia.org/T365783) (owner: 10Gerrit maintenance bot) [07:37:10] (03CR) 10Marostegui: [V:03+2 C:03+2] mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1034939 (https://phabricator.wikimedia.org/T365783) (owner: 10Gerrit maintenance bot) [07:38:15] 07Puppet, 06SRE: Add humorous redirect for fox.wikimedia.org - https://phabricator.wikimedia.org/T352870#9833949 (10SMMpanels)
[07:40:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T364069)', diff saved to https://phabricator.wikimedia.org/P63263 and previous config saved to /var/cache/conftool/dbconfig/20240527-074009-marostegui.json [07:40:14] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [07:40:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T364299)', diff saved to https://phabricator.wikimedia.org/P63264 and previous config saved to /var/cache/conftool/dbconfig/20240527-074042-marostegui.json [07:40:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [07:40:47] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [07:40:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [07:41:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T364299)', diff saved to https://phabricator.wikimedia.org/P63265 and previous config saved to /var/cache/conftool/dbconfig/20240527-074105-marostegui.json [07:48:47] (03CR) 10Muehlenhoff: [C:03+2] Switch db2176 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036178 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:49:50] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:49:54] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:50:55] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Bump changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035780 (owner: 10Muehlenhoff) [07:52:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2176.codfw.wmnet [07:54:43] !log Starting s6 codfw failover from db2129 to db2214 - T365783 [07:54:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:48] T365783: Switchover s6 master (db2129 -> db2214) - https://phabricator.wikimedia.org/T365783 [07:55:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2214 to s6 primary T365783', diff saved to https://phabricator.wikimedia.org/P63266 and previous config saved to /var/cache/conftool/dbconfig/20240527-075512-marostegui.json [07:55:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P63267 and previous config saved to /var/cache/conftool/dbconfig/20240527-075524-marostegui.json [07:55:44] (03PS1) 10Muehlenhoff: Remove to wmf-laptop and add transition package [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036180 [07:56:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2129 T365783', diff saved to https://phabricator.wikimedia.org/P63268 and previous config saved to /var/cache/conftool/dbconfig/20240527-075602-root.json [07:57:46] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:57:50] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:58:20] !log Deploy schema change on s6 codfw (old master) dbmaint T364299 [07:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:25] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [08:00:20] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:00:24] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:00:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2129.codfw.wmnet with reason: Long schema change [08:00:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2129.codfw.wmnet with reason: Long schema change [08:01:51] !log 
jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2188.codfw.wmnet [08:03:34] (03PS1) 10Muehlenhoff: Switch db2188 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036182 (https://phabricator.wikimedia.org/T349619) [08:10:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P63269 and previous config saved to /var/cache/conftool/dbconfig/20240527-081031-marostegui.json [08:14:11] (03CR) 10Slyngshede: [C:03+2] Always require users to pick a system for SSH keys. [software/bitu] - 10https://gerrit.wikimedia.org/r/1035765 (owner: 10Slyngshede) [08:14:42] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:14:46] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:14:53] (03CR) 10Filippo Giunchedi: [C:03+1] Remove to wmf-laptop and add transition package [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036180 (owner: 10Muehlenhoff) [08:15:47] (03Merged) 10jenkins-bot: Always require users to pick a system for SSH keys. [software/bitu] - 10https://gerrit.wikimedia.org/r/1035765 (owner: 10Slyngshede) [08:16:04] (03CR) 10Muehlenhoff: [C:03+2] Switch db2188 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036182 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:16:38] (03PS2) 10Muehlenhoff: Rename to wmf-laptop and add transition package [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036180 [08:16:40] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:16:44] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:18:04] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Rename to wmf-laptop and add transition package [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036180 (owner: 10Muehlenhoff) [08:18:11] (03CR) 10Filippo Giunchedi: [C:03+1] "Very cool!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1035829 (https://phabricator.wikimedia.org/T320549) (owner: 10CDanis) [08:18:38] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:18:42] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:20:32] (03PS1) 10Muehlenhoff: Some more renames for the wmf-laptop-sre -> wmf-laptop rename [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036183 [08:20:37] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:20:40] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:21:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2188.codfw.wmnet [08:22:35] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:22:39] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:23:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T364299)', diff saved to https://phabricator.wikimedia.org/P63270 and previous config saved to /var/cache/conftool/dbconfig/20240527-082351-marostegui.json [08:23:58] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [08:25:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T364069)', diff saved to https://phabricator.wikimedia.org/P63271 and previous config saved to /var/cache/conftool/dbconfig/20240527-082539-marostegui.json [08:25:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [08:25:46] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [08:25:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [08:26:04] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T364069)', diff saved to https://phabricator.wikimedia.org/P63272 and previous config saved to /var/cache/conftool/dbconfig/20240527-082603-marostegui.json [08:30:13] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Some more renames for the wmf-laptop-sre -> wmf-laptop rename [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036183 (owner: 10Muehlenhoff) [08:32:31] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:32:35] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:34:08] (03PS1) 10Muehlenhoff: Update some docs for wmf-laptop-sre -> wmf-laptop rename [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036185 [08:36:56] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Update some docs for wmf-laptop-sre -> wmf-laptop rename [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036185 (owner: 10Muehlenhoff) [08:38:24] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:38:28] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:39:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P63273 and previous config saved to /var/cache/conftool/dbconfig/20240527-083859-marostegui.json [08:40:22] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:40:26] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:40:55] (03CR) 10Aklapper: [C:03+2] Ignore /src/.cache as well [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035822 (owner: 10Pppery) [08:40:57] (03CR) 10Aklapper: [V:03+2 C:03+2] Ignore /src/.cache as well [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035822 (owner: 10Pppery) [08:42:21] !log @deploy1002 
helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:42:25] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:44:58] (03PS1) 10Muehlenhoff: wmf-laptop: Update changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1036186 [08:45:19] 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9834087 (10Marostegui) [08:47:53] (03PS1) 10Clément Goubert: httpbb: Fix test following Wikimedia_Technology rename [puppet] - 10https://gerrit.wikimedia.org/r/1036187 [08:48:28] (03PS1) 10Santiago Faci: edit-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036188 (https://phabricator.wikimedia.org/T355407) [08:48:35] (03PS1) 10Slyngshede: Version bump to 0.0.8 [software/bitu] - 10https://gerrit.wikimedia.org/r/1036189 [08:51:30] (03PS1) 10Fabfur: hiera: use benthos on cp3073 (first esams host) [puppet] - 10https://gerrit.wikimedia.org/r/1036190 (https://phabricator.wikimedia.org/T358109) [08:52:07] (03CR) 10Hnowlan: [C:03+1] httpbb: Fix test following Wikimedia_Technology rename [puppet] - 10https://gerrit.wikimedia.org/r/1036187 (owner: 10Clément Goubert) [08:52:22] (03CR) 10Clément Goubert: [C:03+2] httpbb: Fix test following Wikimedia_Technology rename [puppet] - 10https://gerrit.wikimedia.org/r/1036187 (owner: 10Clément Goubert) [08:53:50] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:53:53] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1036189 (owner: 10Slyngshede) [08:53:54] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:54:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved 
to https://phabricator.wikimedia.org/P63274 and previous config saved to /var/cache/conftool/dbconfig/20240527-085407-marostegui.json [08:54:10] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1036190 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [08:54:13] (03CR) 10Slyngshede: [C:03+2] Version bump to 0.0.8 [software/bitu] - 10https://gerrit.wikimedia.org/r/1036189 (owner: 10Slyngshede) [08:54:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P63275 and previous config saved to /var/cache/conftool/dbconfig/20240527-085447-root.json [08:55:21] (03PS1) 10Marostegui: Revert "db2150: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1035665 [08:55:57] (03Merged) 10jenkins-bot: Version bump to 0.0.8 [software/bitu] - 10https://gerrit.wikimedia.org/r/1036189 (owner: 10Slyngshede) [08:55:59] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:56:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 1%: Repooling T365797', diff saved to https://phabricator.wikimedia.org/P63276 and previous config saved to /var/cache/conftool/dbconfig/20240527-085602-root.json [08:56:03] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:56:06] T365797: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797 [08:56:08] (03CR) 10Marostegui: [C:03+2] Revert "db2150: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1035665 (owner: 10Marostegui) [08:56:30] (03CR) 10Fabfur: [V:03+1 C:04-2] "This depends also on the outcome of I6836cfd828fec602c3d23e98bf38a1a05742c283" [puppet] - 10https://gerrit.wikimedia.org/r/1036190 (https://phabricator.wikimedia.org/T358109) (owner: 
10Fabfur) [08:59:27] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:59:31] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:00:24] (03CR) 10Lucas Werkmeister (WMDE): "> Also, are we going to retain the general k8s-mwdebug?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035361 (https://phabricator.wikimedia.org/T365478) (owner: 10Effie Mouzeli) [09:01:17] (03PS27) 10Ayounsi: sre.hosts.move-vlan: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) [09:01:22] (03CR) 10David Caro: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035148 (https://phabricator.wikimedia.org/T365640) (owner: 10David Caro) [09:01:27] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:02:39] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:02:39] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:02:39] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:02:39] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:02:39] RECOVERY - Check unit status of 
httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:02:40] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:02:40] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:02:41] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:02:41] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:02:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:02:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:03:39] (03PS1) 10Santiago Faci: editor-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408) [09:04:23] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:04:26] RESOLVED: [12x] SystemdUnitFailed: httpbb_hourly_appserver.service on cumin1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:04:27] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:04:45] (03CR) 10Brouberol: [C:03+1] provision datahub-next service records [dns] - 10https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299) (owner: 10Stevemunene) [09:05:29] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [09:06:11] (03PS1) 10Santiago Faci: geo-analytics deployment: Big AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036192 (https://phabricator.wikimedia.org/T360525) [09:06:25] (03PS1) 10Zabe: Stop writing to af_user(_text)/afh_user(_text) in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036193 (https://phabricator.wikimedia.org/T337920) [09:06:57] (03PS3) 10Volans: sre.hosts.reimage: add support for VLAN move [cookbooks] - 10https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) [09:07:03] (03CR) 10Brouberol: [C:03+1] Add datahub-next missing values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035411 (https://phabricator.wikimedia.org/T365674) (owner: 10Stevemunene) [09:07:29] jouncebot: nowandnext [09:07:29] No deployments scheduled for the next 0 hour(s) and 52 minute(s) [09:07:30] In 0 hour(s) and 52 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1000) [09:07:52] (03CR) 10Zabe: [C:03+2] Stop writing to af_user(_text)/afh_user(_text) in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036193 (https://phabricator.wikimedia.org/T337920) (owner: 10Zabe) [09:08:00] (03CR) 10David Caro: [V:03+1 C:03+2] Reapply "openstack::bobcat: apply cloud yaml 
patch"" [puppet] - 10https://gerrit.wikimedia.org/r/1035148 (https://phabricator.wikimedia.org/T365640) (owner: 10David Caro) [09:08:01] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:08:05] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:08:29] (03PS2) 10Santiago Faci: edit-analytics deployment: big refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036188 (https://phabricator.wikimedia.org/T355407) [09:08:39] (03Merged) 10jenkins-bot: Stop writing to af_user(_text)/afh_user(_text) in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036193 (https://phabricator.wikimedia.org/T337920) (owner: 10Zabe) [09:08:45] (03PS2) 10Santiago Faci: editor-analytics deployment: big refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408) [09:09:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T364299)', diff saved to https://phabricator.wikimedia.org/P63277 and previous config saved to /var/cache/conftool/dbconfig/20240527-090915-marostegui.json [09:09:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [09:09:18] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1036193|Stop writing to af_user(_text)/afh_user(_text) in group1 wikis (T337920)]] [09:09:21] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [09:09:28] T337920: Stop writing to af_user(_text)/afh_user(_text) - https://phabricator.wikimedia.org/T337920 [09:09:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [09:09:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T364299)', diff saved to 
https://phabricator.wikimedia.org/P63278 and previous config saved to /var/cache/conftool/dbconfig/20240527-090938-marostegui.json [09:09:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P63279 and previous config saved to /var/cache/conftool/dbconfig/20240527-090953-root.json [09:11:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 5%: Repooling T365797', diff saved to https://phabricator.wikimedia.org/P63280 and previous config saved to /var/cache/conftool/dbconfig/20240527-091108-root.json [09:11:13] T365797: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797 [09:11:45] (03PS1) 10Santiago Faci: media-analytics deployment: Big AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036194 (https://phabricator.wikimedia.org/T360526) [09:14:22] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:14:27] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:14:27] (03CR) 10Hnowlan: [C:03+1] maps::tlsproxy: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1035750 (owner: 10Muehlenhoff) [09:15:02] (03PS4) 10EoghanGaffney: lists: Don't include automation in standby hosts [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) [09:19:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T364069)', diff saved to https://phabricator.wikimedia.org/P63282 and previous config saved to /var/cache/conftool/dbconfig/20240527-091935-marostegui.json [09:19:41] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [09:22:08] (03CR) 10EoghanGaffney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) (owner: 10EoghanGaffney) [09:22:35] RECOVERY - Disk space on 
backup1011 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup1011&var-datasource=eqiad+prometheus/ops [09:22:37] (03CR) 10Brouberol: [C:03+1] "I checked the full diff, which is quite extensive, but seems legit. AFAICT we're mostly dealing with nftables config files and a prometheu" [puppet] - 10https://gerrit.wikimedia.org/r/1032632 (owner: 10Muehlenhoff) [09:23:08] !log zabe@deploy1002 zabe: Backport for [[gerrit:1036193|Stop writing to af_user(_text)/afh_user(_text) in group1 wikis (T337920)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:23:12] T337920: Stop writing to af_user(_text)/afh_user(_text) - https://phabricator.wikimedia.org/T337920 [09:23:22] !log zabe@deploy1002 zabe: Continuing with sync [09:24:56] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:24:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P63283 and previous config saved to /var/cache/conftool/dbconfig/20240527-092459-root.json [09:25:01] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:26:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 10%: Repooling T365797', diff saved to https://phabricator.wikimedia.org/P63284 and previous config saved to /var/cache/conftool/dbconfig/20240527-092614-root.json [09:26:19] T365797: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797 [09:29:55] jouncebot: nowandnext [09:29:55] No deployments scheduled for the next 0 hour(s) and 30 minute(s) [09:29:55] In 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1000) [09:30:04] (03CR) 10Ladsgroup: [C:03+2] Update tagline and wordmark of Persian Wikibooks [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) (owner: 10Ebrahim) [09:30:43] (03Merged) 10jenkins-bot: Update tagline and wordmark of Persian Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) (owner: 10Ebrahim) [09:31:24] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:31:56] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:33:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2129', diff saved to https://phabricator.wikimedia.org/P63285 and previous config saved to /var/cache/conftool/dbconfig/20240527-093306-root.json [09:33:37] (03CR) 10Stevemunene: [C:03+2] Add datahub-next missing values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035411 (https://phabricator.wikimedia.org/T365674) (owner: 10Stevemunene) [09:34:28] (03Merged) 10jenkins-bot: Add datahub-next missing values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035411 (https://phabricator.wikimedia.org/T365674) (owner: 10Stevemunene) [09:34:37] (03PS2) 10Elukey: services: upgrade tegola in codfw to use the envoy proxy for Swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035743 (https://phabricator.wikimedia.org/T344324) [09:34:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P63286 and previous config saved to /var/cache/conftool/dbconfig/20240527-093443-marostegui.json [09:34:49] (03PS1) 10Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on enwiki (last one!) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036197 (https://phabricator.wikimedia.org/T315353) [09:34:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [09:34:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [09:35:15] (03PS4) 10Jforrester: Pre-emptively disable DiscussionToolsEnableThanks (no-op) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026511 (owner: 10Esanders) [09:35:30] (03CR) 10Stevemunene: [C:03+2] provision datahub-next service records [dns] - 10https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299) (owner: 10Stevemunene) [09:36:40] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 64096 [09:37:03] (03CR) 10Volans: [C:03+1] "LGTM" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034962 (https://phabricator.wikimedia.org/T308002) (owner: 10Ayounsi) [09:37:15] (03CR) 10Volans: [C:03+1] "LGTM" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034963 (owner: 10Ayounsi) [09:37:48] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 64096 [09:38:33] (03CR) 10Ayounsi: [V:03+2 C:03+2] Add ApereoSocialPipeline [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034962 (https://phabricator.wikimedia.org/T308002) (owner: 10Ayounsi) [09:38:55] (03CR) 10Ayounsi: [V:03+2 C:03+2] Update requirements [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034963 (owner: 10Ayounsi) [09:39:02] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1036193|Stop writing to af_user(_text)/afh_user(_text) in group1 wikis (T337920)]] (duration: 29m 43s) [09:39:05] (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) (owner: 10Ebrahim) [09:39:07] T337920: Stop writing to af_user(_text)/afh_user(_text) - https://phabricator.wikimedia.org/T337920 [09:39:16] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:39:19] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1035852|Update tagline and wordmark of Persian Wikibooks (T365913)]] [09:39:20] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:39:24] T365913: Change the Persian Wikibooks wordmark - https://phabricator.wikimedia.org/T365913 [09:41:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 25%: Repooling T365797', diff saved to https://phabricator.wikimedia.org/P63287 and previous config saved to /var/cache/conftool/dbconfig/20240527-094120-root.json [09:41:26] T365797: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797 [09:41:44] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: add python-social-auth and update wheels - ayounsi@cumin1002 - T308002 [09:41:48] !log ladsgroup@deploy1002 ebrahim and ladsgroup: Backport for [[gerrit:1035852|Update tagline and wordmark of Persian Wikibooks (T365913)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:41:49] T308002: Move Netbox authentication to python-social-auth - https://phabricator.wikimedia.org/T308002 [09:42:01] (03PS1) 10Marostegui: control-mariadb-10.11-bookworm: Initial packaging [software] - 10https://gerrit.wikimedia.org/r/1036199 (https://phabricator.wikimedia.org/T365805) [09:42:30] !log ladsgroup@deploy1002 ebrahim and ladsgroup: Continuing with sync [09:44:00] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:44:04] !log @deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [09:45:25] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [09:45:33] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: add python-social-auth and update wheels - ayounsi@cumin1002 - T308002 [09:46:49] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:47:06] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:49:41] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:49:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P63288 and previous config saved to /var/cache/conftool/dbconfig/20240527-094951-marostegui.json [09:49:56] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:52:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T364299)', diff saved to https://phabricator.wikimedia.org/P63289 and previous config saved to /var/cache/conftool/dbconfig/20240527-095208-marostegui.json [09:52:16] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [09:52:20] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:52:24] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:53:36] (03PS2) 10Clément Goubert: miscweb: Update various modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978) [09:54:29] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:54:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:56:19] !log 
ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1035852|Update tagline and wordmark of Persian Wikibooks (T365913)]] (duration: 16m 59s) [09:56:23] T365913: Change the Persian Wikibooks wordmark - https://phabricator.wikimedia.org/T365913 [09:56:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 50%: Repooling T365797', diff saved to https://phabricator.wikimedia.org/P63290 and previous config saved to /var/cache/conftool/dbconfig/20240527-095626-root.json [09:56:32] T365797: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797 [09:56:36] (03PS1) 10Jelto: external_clouds_vendors: add Vultr cloud [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) [09:58:19] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [09:59:22] (03CR) 10Aklapper: [V:03+2 C:03+2] "Tested locally (both applying the patch, as well as changing the line in export.sh, running export.sh, and checking the resulting file pro" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035805 (https://phabricator.wikimedia.org/T351581) (owner: 10Pppery) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1000) [10:01:58] (03PS1) 10Stevemunene: Add dse range to an-test coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1036202 (https://phabricator.wikimedia.org/T361185) [10:04:17] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:04:22] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:04:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:04:45] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036202 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene) [10:05:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T364069)', diff saved to https://phabricator.wikimedia.org/P63291 and previous config saved to /var/cache/conftool/dbconfig/20240527-100459-marostegui.json [10:05:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [10:05:05] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [10:05:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [10:05:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T364069)', diff saved to https://phabricator.wikimedia.org/P63292 and previous config saved to /var/cache/conftool/dbconfig/20240527-100523-marostegui.json [10:05:36] (03CR) 10Brouberol: [C:03+1] Add dse range to an-test coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1036202 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene) [10:06:01] (03CR) 10Stevemunene: [C:03+2] Add dse range to an-test coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1036202 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene) [10:06:58] (03CR) 10Arnaudb: [C:03+1] control-mariadb-10.11-bookworm: Initial packaging [software] - 10https://gerrit.wikimedia.org/r/1036199 (https://phabricator.wikimedia.org/T365805) (owner: 10Marostegui) [10:07:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P63293 and previous config saved to /var/cache/conftool/dbconfig/20240527-100717-marostegui.json [10:11:33] !log marostegui@cumin1002 dbctl commit (dc=all): 
'db2150 (re)pooling @ 75%: Repooling T365797', diff saved to https://phabricator.wikimedia.org/P63294 and previous config saved to /var/cache/conftool/dbconfig/20240527-101133-root.json [10:11:38] T365797: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797 [10:11:47] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:12:06] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:12:10] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:13:05] (03CR) 10Volans: "They seem" [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto) [10:13:34] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::zookeeper::firewall::srange [puppet] - 10https://gerrit.wikimedia.org/r/1035334 (owner: 10Muehlenhoff) [10:14:23] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [10:14:51] (03CR) 10Marostegui: [C:03+2] control-mariadb-10.11-bookworm: Initial packaging [software] - 10https://gerrit.wikimedia.org/r/1036199 (https://phabricator.wikimedia.org/T365805) (owner: 10Marostegui) [10:15:23] (03Merged) 10jenkins-bot: control-mariadb-10.11-bookworm: Initial packaging [software] - 10https://gerrit.wikimedia.org/r/1036199 (https://phabricator.wikimedia.org/T365805) (owner: 10Marostegui) [10:16:47] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:18:55] (03PS2) 10Jelto: external_clouds_vendors: add Vultr cloud 
[puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) [10:19:50] (03CR) 10Jelto: external_clouds_vendors: add Vultr cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto) [10:20:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035351 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [10:22:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P63295 and previous config saved to /var/cache/conftool/dbconfig/20240527-102226-marostegui.json [10:26:18] (03PS1) 10Gergő Tisza: [POC][beta] Add rewrite rule for sso.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162) [10:26:23] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [10:26:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 100%: Repooling T365797', diff saved to https://phabricator.wikimedia.org/P63296 and previous config saved to /var/cache/conftool/dbconfig/20240527-102639-root.json [10:26:44] T365797: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797 [10:26:49] FIRING: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:28:05] (03PS5) 10EoghanGaffney: lists: Don't include automation in standby hosts [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) [10:31:18] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, 
and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki) - https://phabricator.wikimedia.org/T362323#9834404 (10Clement_Goubert) As far as mediawiki calling itself goes (I see it was removed from the task description, but it is te... [10:31:49] RESOLVED: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:32:29] (03CR) 10Muehlenhoff: [C:03+2] maps: Add option to use PKI [puppet] - 10https://gerrit.wikimedia.org/r/1035351 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [10:37:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T364299)', diff saved to https://phabricator.wikimedia.org/P63297 and previous config saved to /var/cache/conftool/dbconfig/20240527-103734-marostegui.json [10:37:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [10:37:41] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [10:37:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [10:37:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T364299)', diff saved to https://phabricator.wikimedia.org/P63298 and previous config saved to /var/cache/conftool/dbconfig/20240527-103759-marostegui.json [10:42:33] (03PS5) 10Bartosz Dziewoński: Pre-emptively disable DiscussionToolsEnableThanks (no-op) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026511 (owner: 10Esanders) [10:44:06] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:44:10] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:45:25] (03PS2) 10Muehlenhoff: tlsproxy::localssl: Remove support for OCSP handling [puppet] - 10https://gerrit.wikimedia.org/r/1035362 [10:45:56] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:46:02] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:46:32] (03CR) 10Vgutierrez: [C:04-1] benthos:cache: switch to rfc5424 format (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [10:49:06] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:49:12] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:49:58] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.285 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:50:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:52:11] !log Upgrade IDM to Bitu 0.0.8 [10:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:36] (03CR) 10Vgutierrez: [C:04-1] "CR is missing localssl.erb (do_ocsp is still referenced there)" [puppet] - 10https://gerrit.wikimedia.org/r/1035362 (owner: 10Muehlenhoff) [10:55:30] !log main s2@codfw (T364985) [10:55:33] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:40] !log dbmaint s2@codfw (T364985) [10:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T364069)', diff saved to https://phabricator.wikimedia.org/P63299 and previous config saved to /var/cache/conftool/dbconfig/20240527-105728-marostegui.json [10:57:35] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [10:59:45] (03PS3) 10Muehlenhoff: tlsproxy::localssl: Remove support for OCSP handling [puppet] - 10https://gerrit.wikimedia.org/r/1035362 [11:04:38] (03PS1) 10Clément Goubert: testwiki: Move to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1036235 (https://phabricator.wikimedia.org/T355534) [11:04:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035362 (owner: 10Muehlenhoff) [11:05:07] (03PS2) 10Clément Goubert: testwiki: Move to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1036235 (https://phabricator.wikimedia.org/T355534) [11:05:17] 06SRE, 10MW-on-K8s, 06Quality-and-Test-Engineering-Team, 06serviceops, 13Patch-For-Review: Move testwiki over to mw-on-k8s - https://phabricator.wikimedia.org/T355534#9834501 (10Clement_Goubert) 05Open→03In progress [11:05:23] (03CR) 10Fabfur: benthos:cache: switch to rfc5424 format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [11:06:12] (03PS6) 10Fabfur: benthos:cache: switch to rfc5424 format [puppet] - 10https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718) [11:06:21] (03CR) 10Fabfur: benthos:cache: switch to rfc5424 format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [11:06:48] FIRING: [4x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes 
(http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:07:41] (03PS1) 10Muehlenhoff: maps: Switch kartotherian on maps2007 to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778) [11:08:30] FIRING: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:08:50] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "\o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036197 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [11:10:25] (03PS1) 10Stevemunene: Enable mesh for datahub-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036237 (https://phabricator.wikimedia.org/T361185) [11:10:43] (03PS1) 10Ayounsi: Add python-jose [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1036238 (https://phabricator.wikimedia.org/T308002) [11:12:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P63301 and previous config saved to /var/cache/conftool/dbconfig/20240527-111236-marostegui.json [11:12:58] (03CR) 10Slyngshede: [C:03+1] "Nit: comment is slightly wrong" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1036238 (https://phabricator.wikimedia.org/T308002) (owner: 10Ayounsi) [11:13:23] (03PS6) 10EoghanGaffney: lists: Don't include automation in standby hosts [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) [11:13:30] RESOLVED: [2x] ProbeDown: Service wdqs1018:443 has failed probes 
(http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:14:15] (03CR) 10Muehlenhoff: lists: Don't include automation in standby hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) (owner: 10EoghanGaffney) [11:15:13] (03CR) 10Muehlenhoff: "Good catch, updated" [puppet] - 10https://gerrit.wikimedia.org/r/1035362 (owner: 10Muehlenhoff) [11:15:17] (03PS7) 10EoghanGaffney: lists: Don't include automation in standby hosts [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) [11:16:23] (03PS2) 10Ayounsi: Add python-jose [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1036238 (https://phabricator.wikimedia.org/T308002) [11:18:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [11:18:34] (03PS2) 10Muehlenhoff: maps: Switch kartotherian on maps2007 to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778) [11:19:20] (03CR) 10EoghanGaffney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) (owner: 10EoghanGaffney) [11:21:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T364299)', diff saved to https://phabricator.wikimedia.org/P63302 and previous config saved to /var/cache/conftool/dbconfig/20240527-112143-marostegui.json [11:21:50] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [11:23:15] (03CR) 10Volans: external_clouds_vendors: add Vultr cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1036201 
(https://phabricator.wikimedia.org/T303534) (owner: 10Jelto) [11:23:38] (03CR) 10Brouberol: [C:03+1] Enable mesh for datahub-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036237 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene) [11:24:40] (03CR) 10Stevemunene: [C:03+2] Enable mesh for datahub-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036237 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene) [11:25:29] (03Merged) 10jenkins-bot: Enable mesh for datahub-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036237 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene) [11:26:48] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:27:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P63303 and previous config saved to /var/cache/conftool/dbconfig/20240527-112744-marostegui.json [11:29:38] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [11:30:12] (03PS3) 10Jelto: external_clouds_vendors: add Vultr cloud [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) [11:30:40] (03CR) 10CI reject: [V:04-1] external_clouds_vendors: add Vultr cloud [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto) [11:33:00] (03PS4) 10Jelto: external_clouds_vendors: add Vultr cloud [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) [11:33:10] !log installing jinja2 security updates [11:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:56] (03CR) 
10Slyngshede: [C:03+1] "LGTM" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1036238 (https://phabricator.wikimedia.org/T308002) (owner: 10Ayounsi) [11:35:16] (03CR) 10Jelto: external_clouds_vendors: add Vultr cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto) [11:35:40] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on ml-staging-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:36:00] (03PS1) 10Muehlenhoff: aptrepo: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1036241 [11:36:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P63304 and previous config saved to /var/cache/conftool/dbconfig/20240527-113651-marostegui.json [11:39:25] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto) [11:40:31] (03CR) 10Hnowlan: [C:03+1] testwiki: Move to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1036235 (https://phabricator.wikimedia.org/T355534) (owner: 10Clément Goubert) [11:40:49] (03CR) 10Clément Goubert: [C:03+2] testwiki: Move to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1036235 (https://phabricator.wikimedia.org/T355534) (owner: 10Clément Goubert) [11:41:12] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [11:41:55] (03PS2) 10Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036197 (https://phabricator.wikimedia.org/T315353) [11:42:19] (03PS3) 10Ayounsi: Add python-jose [software/netbox-deploy] (wmf-next) - 
10https://gerrit.wikimedia.org/r/1036238 (https://phabricator.wikimedia.org/T308002) [11:42:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T364069)', diff saved to https://phabricator.wikimedia.org/P63305 and previous config saved to /var/cache/conftool/dbconfig/20240527-114252-marostegui.json [11:42:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1210.eqiad.wmnet with reason: Maintenance [11:42:57] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [11:43:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1210.eqiad.wmnet with reason: Maintenance [11:43:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T364069)', diff saved to https://phabricator.wikimedia.org/P63306 and previous config saved to /var/cache/conftool/dbconfig/20240527-114316-marostegui.json [11:43:34] (03CR) 10Hnowlan: services: add data-gateway service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [11:43:43] (03CR) 10Ayounsi: [V:03+2 C:03+2] Add python-jose [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1036238 (https://phabricator.wikimedia.org/T308002) (owner: 10Ayounsi) [11:44:24] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: add python-jose and update wheels - ayounsi@cumin1002 - T308002 [11:44:30] T308002: Move Netbox authentication to python-social-auth - https://phabricator.wikimedia.org/T308002 [11:44:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [11:44:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036241 (owner: 10Muehlenhoff) [11:44:53] 06SRE, 10MW-on-K8s, 
06Quality-and-Test-Engineering-Team, 06serviceops, 13Patch-For-Review: Move testwiki over to mw-on-k8s - https://phabricator.wikimedia.org/T355534#9834567 (10Clement_Goubert) 05In progress→03Resolved `testwiki` and `testcommonswiki` are now moved over to #mw-on-k8s [11:45:18] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: add python-jose and update wheels - ayounsi@cumin1002 - T308002 [11:46:27] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982 (10ABran-WMF) 03NEW [11:47:44] (03PS1) 10Gergő Tisza: [WIP][POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [11:48:08] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9834595 (10ABran-WMF) [11:48:10] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) (owner: 10EoghanGaffney) [11:49:06] 06SRE, 06Infrastructure-Foundations, 10netops: T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983 (10ABran-WMF) 03NEW [11:49:12] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet with reason: add CasApereo auth and update wheels - ayounsi@cumin1002 - T308002 [11:50:10] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9834613 (10ABran-WMF) [11:50:57] 06SRE, 06Infrastructure-Foundations, 10netops: T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984 (10ABran-WMF) 03NEW [11:51:13] !log ayounsi@cumin1002 
END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet with reason: add CasApereo auth and update wheels - ayounsi@cumin1002 - T308002 [11:51:18] T308002: Move Netbox authentication to python-social-auth - https://phabricator.wikimedia.org/T308002 [11:51:45] (03PS1) 10Hashar: Review access change [software] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1036212 [11:51:53] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9834628 (10ABran-WMF) [11:52:00] (03PS3) 10Klausman: install/partman: Tweak kubelet partition size for ML workers [puppet] - 10https://gerrit.wikimedia.org/r/1036195 (https://phabricator.wikimedia.org/T365971) [11:52:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P63307 and previous config saved to /var/cache/conftool/dbconfig/20240527-115200-marostegui.json [11:52:06] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [11:52:06] (03PS2) 10Hashar: Allow SRE to create tags [software] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1036212 [11:52:25] (03CR) 10Hashar: [V:03+2 C:03+2] Allow SRE to create tags [software] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1036212 (owner: 10Hashar) [11:52:50] (03PS1) 10Muehlenhoff: maps: Don't pass additional server aliases when using PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036247 (https://phabricator.wikimedia.org/T360778) [11:53:09] (03CR) 10Muehlenhoff: "Also see: https://puppet-compiler.wmflabs.org/output/1036236/3505/maps2007.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1036247 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [11:57:04] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1036247 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [11:58:11] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986 (10ABran-WMF) 03NEW [11:58:49] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987 (10ABran-WMF) 03NEW [11:58:50] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9834657 (10ABran-WMF) [11:59:29] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365988 (10ABran-WMF) 03NEW [11:59:58] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9834682 (10ABran-WMF) [12:01:28] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9834684 (10ABran-WMF) [12:03:49] FIRING: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:05:07] (03CR) 10Hnowlan: [C:03+2] api-gateway: add normalise_paths option, enable in api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035481 (https://phabricator.wikimedia.org/T365439) (owner: 10Hnowlan) [12:05:52] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2001.codfw.wmnet with OS bullseye [12:06:18] (03Merged) 10jenkins-bot: api-gateway: add 
normalise_paths option, enable in api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035481 (https://phabricator.wikimedia.org/T365439) (owner: 10Hnowlan) [12:06:46] !log ayounsi@cumin1002 START - Cookbook sre.hosts.move-vlan for host [12:07:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T364299)', diff saved to https://phabricator.wikimedia.org/P63308 and previous config saved to /var/cache/conftool/dbconfig/20240527-120709-marostegui.json [12:07:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [12:07:14] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [12:07:19] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [12:07:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [12:07:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T364299)', diff saved to https://phabricator.wikimedia.org/P63309 and previous config saved to /var/cache/conftool/dbconfig/20240527-120732-marostegui.json [12:08:49] RESOLVED: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:10:24] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2001 - ayounsi@cumin1002" [12:11:17] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2001 - ayounsi@cumin1002" 
[12:11:17] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:11:17] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2001.codfw.wmnet 39.16.192.10.in-addr.arpa 9.3.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:11:20] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2001.codfw.wmnet 39.16.192.10.in-addr.arpa 9.3.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:11:21] !log ayounsi@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2001 [12:11:41] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2001 [12:11:41] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [12:14:48] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:14:50] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:17:17] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [12:17:30] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [12:18:07] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [12:18:34] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [12:18:54] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1036251 (owner: 10L10n-bot) [12:18:57] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [12:19:19] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [12:20:06] (03CR) 10Vgutierrez: [C:03+1] tlsproxy::localssl: Remove support for OCSP handling [puppet] - 10https://gerrit.wikimedia.org/r/1035362 (owner: 10Muehlenhoff) [12:24:18] (03CR) 10Muehlenhoff: [C:03+2] tlsproxy::localssl: Remove support for OCSP handling [puppet] - 10https://gerrit.wikimedia.org/r/1035362 (owner: 10Muehlenhoff) [12:24:26] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:24:39] (03CR) 10Vgutierrez: [C:03+1] benthos:cache: switch to rfc5424 format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [12:28:11] (03PS2) 10Muehlenhoff: maps::tlsproxy: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1035750 [12:28:19] oop, stashbot left [12:29:22] (03CR) 10Fabfur: benthos:cache: switch to rfc5424 format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [12:29:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T364069)', diff saved to https://phabricator.wikimedia.org/P63310 and previous config saved to /var/cache/conftool/dbconfig/20240527-122953-marostegui.json [12:29:59] RECOVERY - Disk space on stat1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [12:32:29] (03CR) 10Effie 
Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035750 (owner: 10Muehlenhoff) [12:32:30] (03CR) 10Muehlenhoff: [C:03+2] maps::tlsproxy: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1035750 (owner: 10Muehlenhoff) [12:34:26] (03CR) 10Effie Mouzeli: [C:03+1] maps::tlsproxy: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1035750 (owner: 10Muehlenhoff) [12:34:56] FIRING: RdfStreamingUpdaterFlinkJobUnstable: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:35:21] (03PS2) 10Muehlenhoff: maps: Don't pass additional server aliases when using PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036247 (https://phabricator.wikimedia.org/T360778) [12:35:36] (03PS1) 10Santiago Faci: device-analytics deployment: Big AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036260 (https://phabricator.wikimedia.org/T360524) [12:35:58] marostegui: FYI, stashbot was temporarily gone, in case you want to re-log that one dbctl message (but I’m guessing it’s not super important) [12:36:13] (03CR) 10Effie Mouzeli: [C:03+1] maps: Don't pass additional server aliases when using PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036247 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [12:37:05] Lucas_WMDE: no need, there were more like those later, it is an automated process. 
Thanks for the heads up though [12:37:11] (03CR) 10Effie Mouzeli: [C:03+1] wikilabels::session: Set now-required memcached_user [puppet] - 10https://gerrit.wikimedia.org/r/1035762 (owner: 10Majavah) [12:37:49] (03PS1) 10Santiago Faci: page-analytics deployment: Big AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523) [12:39:26] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:40:31] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host wikikube-worker2001.codfw.wmnet with OS bullseye [12:42:19] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2001.codfw.wmnet with OS bullseye [12:43:10] filed T365992 for the stashbot issue FTR [12:43:11] T365992: stashbot occasionally dies and needs manual restart - https://phabricator.wikimedia.org/T365992 [12:43:21] (03PS2) 10NMW03: Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034941 (https://phabricator.wikimedia.org/T365133) [12:44:26] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:44:56] RESOLVED: RdfStreamingUpdaterFlinkJobUnstable: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:45:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P63311 and previous config saved to /var/cache/conftool/dbconfig/20240527-124500-marostegui.json [12:45:50] (03PS3) 10Santiago Faci: editor-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408) [12:47:06] (03CR) 10EoghanGaffney: lists: Don't include automation in standby hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) (owner: 10EoghanGaffney) [12:47:08] (03CR) 10EoghanGaffney: [C:03+2] lists: Don't include automation in standby hosts [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) (owner: 10EoghanGaffney) [12:47:13] (03PS4) 10Santiago Faci: editor-analytics deployment: big refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408) [12:48:27] jouncebot next [12:48:27] In 0 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1300) [12:50:17] (03PS1) 10Brouberol: datahub-next: make sure subcharts get the environment default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036263 (https://phabricator.wikimedia.org/T361185) [12:50:25] RESOLVED: [2x] SystemdUnitFailed: kube-controller-manager.service on ml-staging-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:50:27] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1036241 (owner: 10Muehlenhoff) [12:50:41] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1198 (T364299)', diff saved to https://phabricator.wikimedia.org/P63312 and previous config saved to /var/cache/conftool/dbconfig/20240527-125041-marostegui.json [12:50:47] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [12:51:47] RESOLVED: [4x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:52:15] (03CR) 10Stevemunene: [C:03+1] datahub-next: make sure subcharts get the environment default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036263 (https://phabricator.wikimedia.org/T361185) (owner: 10Brouberol) [12:53:02] (03PS2) 10Brouberol: datahub-next: make sure subcharts get the environment default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036263 (https://phabricator.wikimedia.org/T361185) [12:53:49] (03PS4) 10Ayounsi: sre.hosts.reimage: add support for VLAN move [cookbooks] - 10https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: 10Volans) [12:53:58] (03CR) 10Elukey: [C:03+2] services: upgrade tegola in codfw to use the envoy proxy for Swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035743 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [12:54:26] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:54:47] (03CR) 10Brouberol: [C:03+2] datahub-next: make sure subcharts get the environment default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036263 (https://phabricator.wikimedia.org/T361185) (owner: 10Brouberol) [12:56:36] !log 
brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[12:58:19] (PS1) EoghanGaffney: lists: Fix typing on ensure in mailman::web [puppet] - https://gerrit.wikimedia.org/r/1036265
[12:58:37] (CR) CI reject: [V:-1] lists: Fix typing on ensure in mailman::web [puppet] - https://gerrit.wikimedia.org/r/1036265 (owner: EoghanGaffney)
[12:59:51] (PS2) EoghanGaffney: lists: Fix typing on ensure in mailman::web [puppet] - https://gerrit.wikimedia.org/r/1036265
[13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1300).
[13:00:05] ottomata, _Gerges, MatmaRex, and Nemoralis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P63313 and previous config saved to /var/cache/conftool/dbconfig/20240527-130008-marostegui.json
[13:00:16] o/
[13:00:23] I can deploy
[13:00:34] Hi
[13:00:42] hi
[13:00:48] o7
[13:00:58] ottomata: do you want to self-service the beacon change?
[13:01:42] (I’m guessing you have deployment rights ^^)
[13:01:55] SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9834830 (ABran-WMF)
[13:02:05] SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9834831 (ABran-WMF)
[13:02:15] PROBLEM - Check whether ferm is active by checking the default input chain on wikikube-ctrl2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:02:47] well, let’s start with Gerges’ change then
[13:03:14] Ok
[13:03:19] (PS3) GergesShamon: Revert "arwiki: Disable Extension:ContentTranslation for non-autoreview users" [mediawiki-config] - https://gerrit.wikimedia.org/r/1034884 (https://phabricator.wikimedia.org/T255022)
[13:03:29] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1034884 (https://phabricator.wikimedia.org/T255022) (owner: GergesShamon)
[13:03:40] SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9834834 (ABran-WMF)
[13:04:12] SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987#9834836 (ABran-WMF)
[13:04:51] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync
[13:05:23] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync
[13:05:39] SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9834838
(ABran-WMF)
[13:05:44] SRE, Infrastructure-Foundations, netops: T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9834840 (ABran-WMF)
[13:05:44] (Merged) jenkins-bot: Revert "arwiki: Disable Extension:ContentTranslation for non-autoreview users" [mediawiki-config] - https://gerrit.wikimedia.org/r/1034884 (https://phabricator.wikimedia.org/T255022) (owner: GergesShamon)
[13:05:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P63314 and previous config saved to /var/cache/conftool/dbconfig/20240527-130549-marostegui.json
[13:06:00] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1034884|Revert "arwiki: Disable Extension:ContentTranslation for non-autoreview users" (T255022)]]
[13:06:04] T255022: Disable machine translation in Content Translation Tool for non-autoreview users on Arabic Wikipedia - https://phabricator.wikimedia.org/T255022
[13:06:15] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2647/co" [puppet] - https://gerrit.wikimedia.org/r/1036265 (owner: EoghanGaffney)
[13:06:32] SRE, Infrastructure-Foundations, netops: T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9834842 (ABran-WMF)
[13:07:26] (CR) FNegri: [C:+2] P:toolforge:redis_sentinel: set redis timeout [puppet] - https://gerrit.wikimedia.org/r/1029158 (https://phabricator.wikimedia.org/T363709) (owner: FNegri)
[13:07:38] SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993 (ABran-WMF) NEW
[13:08:04] SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -
lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9834861 (ABran-WMF)
[13:08:09] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[13:08:31] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and gergesshamon: Backport for [[gerrit:1034884|Revert "arwiki: Disable Extension:ContentTranslation for non-autoreview users" (T255022)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:08:38] Gerges: please test :)
[13:08:44] (CR) Lucas Werkmeister (WMDE): "Judging by T272783, this will require the following maintenance script post-deployment:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1034941 (https://phabricator.wikimedia.org/T365133) (owner: NMW03)
[13:09:02] (CR) Jelto: [V:+1 C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1036265 (owner: EoghanGaffney)
[13:09:07] I can't test right now
[13:09:11] SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994 (ABran-WMF) NEW
[13:09:16] hm, ok
[13:09:26] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:09:27] SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9834874 (ABran-WMF)
[13:09:27] Gerges: why
[13:11:03] SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995 (ABran-WMF) NEW
[13:11:23] I guess this is straightforward enough to just deploy
[13:11:33] SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad
row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9834886 (10ABran-WMF) [13:11:33] (RhinosF1 asks a good question but I don’t want to block the other changes on this either) [13:12:37] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and gergesshamon: Continuing with sync [13:13:14] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9834888 (10ABran-WMF) [13:13:45] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS5511/IPv6: Connect - Orange, AS5511/IPv4: Connect - Orange https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:14:49] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 110, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:15:09] !log hnowlan@cumin1002 conftool action : set/pooled=no; selector: name=parse1002.eqiad.wmnet [13:15:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T364069)', diff saved to https://phabricator.wikimedia.org/P63315 and previous config saved to /var/cache/conftool/dbconfig/20240527-131516-marostegui.json [13:15:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1213.eqiad.wmnet with reason: Maintenance [13:15:21] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [13:15:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1213.eqiad.wmnet with reason: Maintenance [13:15:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1213 (T364069)', diff saved to https://phabricator.wikimedia.org/P63316 and previous config saved to /var/cache/conftool/dbconfig/20240527-131539-marostegui.json [13:16:01] !log test fifo-log-demux 0.7.5 on cp4052 [13:16:02] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - 
https://phabricator.wikimedia.org/T365983#9834899 (10ABran-WMF) [13:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:13] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9834905 (10ABran-WMF) [13:16:28] (03CR) 10EoghanGaffney: [C:03+2] lists: Fix typing on ensure in mailman::web [puppet] - 10https://gerrit.wikimedia.org/r/1036265 (owner: 10EoghanGaffney) [13:16:40] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9834911 (10ABran-WMF) a:05cmooney→03MatthewVernon [13:16:49] (03PS1) 10Brouberol: datahub-next: fix the ingress by restoring default gateway host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036266 (https://phabricator.wikimedia.org/T361185) [13:16:57] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987#9834908 (10ABran-WMF) a:05cmooney→03ABran-WMF [13:17:15] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:17:43] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798#9834913 (10MoritzMuehlenhoff) p:05Triage→03High [13:18:13] !log disabling puppet on A:cp to safely apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1035440 (T365718) [13:18:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:18] T365718: Switch HAProxy/Benthos to rfc5424 - https://phabricator.wikimedia.org/T365718 [13:18:41] still no sign of 
ottomata? [13:18:59] (03PS2) 10Santiago Faci: page-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523) [13:19:25] (03PS3) 10Santiago Faci: page-analytics deployment: AQS 2.0 refactoring to use new functions and messages added to aqsassist 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523) [13:19:37] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996 (10ABran-WMF) 03NEW [13:19:53] (03CR) 10Hnowlan: [C:03+1] page-analytics deployment: AQS 2.0 refactoring to use new functions and messages added to aqsassist 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523) (owner: 10Santiago Faci) [13:19:58] (03PS2) 10Santiago Faci: device-analytics deployment: AQS 2.0 refactoring to use new functions and messages added to aqsassist 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036260 (https://phabricator.wikimedia.org/T360524) [13:20:02] (03CR) 10Hnowlan: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036260 (https://phabricator.wikimedia.org/T360524) (owner: 10Santiago Faci) [13:20:05] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:20:16] (03PS2) 10Santiago Faci: media-analytics deployment: AQS 2.0 refactoring to use new functions and messages added to aqsassist 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036194 (https://phabricator.wikimedia.org/T360526) [13:20:23] (03PS1) 10Elukey: services: move tegola in eqiad to the Thanos sidecar config for Swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036269 
(https://phabricator.wikimedia.org/T344324) [13:20:57] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:20:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P63317 and previous config saved to /var/cache/conftool/dbconfig/20240527-132057-marostegui.json [13:21:00] (03PS4) 10Santiago Faci: page-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523) [13:21:09] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [13:21:12] (03PS3) 10Santiago Faci: media-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036194 (https://phabricator.wikimedia.org/T360526) [13:21:15] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997 (10ABran-WMF) 03NEW [13:21:29] (03PS3) 10Santiago Faci: device-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036260 (https://phabricator.wikimedia.org/T360524) [13:21:37] (03Abandoned) 10Brouberol: datahub-next: fix the ingress by restoring default gateway host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036266 (https://phabricator.wikimedia.org/T361185) (owner: 10Brouberol) [13:21:44] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9834941 (10ABran-WMF) [13:22:02] (03PS3) 10Santiago Faci: edit-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036188 
(https://phabricator.wikimedia.org/T355407) [13:22:11] (03PS5) 10Santiago Faci: editor-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408) [13:22:20] (03CR) 10Hnowlan: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036188 (https://phabricator.wikimedia.org/T355407) (owner: 10Santiago Faci) [13:22:23] * Lucas_WMDE prepares MatmaRex’ changes for deployment [13:22:35] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998 (10ABran-WMF) 03NEW [13:22:37] 👍 [13:22:40] (03CR) 10Hnowlan: [C:03+1] media-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036194 (https://phabricator.wikimedia.org/T360526) (owner: 10Santiago Faci) [13:22:48] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9834959 (10ABran-WMF) [13:22:48] (I say we deploy both together, so I’ll rebase one onto the other) [13:22:54] (03PS3) 10Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036197 (https://phabricator.wikimedia.org/T315353) [13:23:04] (03PS6) 10Bartosz Dziewoński: Pre-emptively disable DiscussionToolsEnableThanks (no-op) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026511 (owner: 10Esanders) [13:23:16] (03CR) 10Hnowlan: [C:03+1] editor-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci) [13:24:07] 06SRE, 10Wikimedia-Mailing-lists: Make Chqaz admin of Wikija-g mailing list - 
https://phabricator.wikimedia.org/T365933#9834956 (Ladsgroup) a:Ladsgroup > All members of our user group suddenly had their admins removed in March. Hi, why? Any governance issues or disputes?
[13:24:07] (PS2) Santiago Faci: geo-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036192 (https://phabricator.wikimedia.org/T360525)
[13:24:08] SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9834960 (ABran-WMF)
[13:24:20] (CR) Hnowlan: [C:+1] geo-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036192 (https://phabricator.wikimedia.org/T360525) (owner: Santiago Faci)
[13:24:46] SRE, Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9834967 (Ladsgroup) a:Ladsgroup
[13:24:51] RhinosF1: Sorry for the delay in replying, I don't have an arwiki account with which to test (I have an arwiki account with advanced privileges, so my test won't help)
[13:25:05] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:25:24] SRE, Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9834966 (Ladsgroup) Hi, can you pick a name that's more aligned with our standardization policy? https://meta.wikimedia.org/wiki/Mailing_lists/Standardization if it's not possible, we ne...
[13:26:43] Gerges: create a legit alt?
[13:26:46] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1034884|Revert "arwiki: Disable Extension:ContentTranslation for non-autoreview users" (T255022)]] (duration: 20m 46s)
[13:26:54] T255022: Disable machine translation in Content Translation Tool for non-autoreview users on Arabic Wikipedia - https://phabricator.wikimedia.org/T255022
[13:27:04] Surely arwiki allows you to have legit alternate accounts
[13:27:20] (PS1) Muehlenhoff: Add new access group to grant root on the wiki replicas [puppet] - https://gerrit.wikimedia.org/r/1036270
[13:27:21] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1036197 (https://phabricator.wikimedia.org/T315353) (owner: Bartosz Dziewoński)
[13:27:21] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1026511 (owner: Esanders)
[13:27:39] (CR) Fabfur: [C:+2] benthos:cache: switch to rfc5424 format [puppet] - https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[13:27:51] MatmaRex: will either of the changes be testable?
[13:27:52] (CR) Effie Mouzeli: [C:+1] services: move tegola in eqiad to the Thanos sidecar config for Swift [deployment-charts] - https://gerrit.wikimedia.org/r/1036269 (https://phabricator.wikimedia.org/T344324) (owner: Elukey)
[13:27:57] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:27:58] I have an alternate account, but I don't currently have access to that account [13:28:02] (03Merged) 10jenkins-bot: Enable wgDiscussionToolsEnablePermalinksBackend on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036197 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [13:28:07] (03Merged) 10jenkins-bot: Pre-emptively disable DiscussionToolsEnableThanks (no-op) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026511 (owner: 10Esanders) [13:28:09] (03CR) 10CI reject: [V:04-1] Add new access group to grant root on the wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1036270 (owner: 10Muehlenhoff) [13:28:11] I’m assuming the second change won’t be testable since the feature isn’t even merged yet [13:28:13] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:28:13] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.302 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:28:16] and I’m not sure about the first either [13:28:22] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1036197|Enable wgDiscussionToolsEnablePermalinksBackend on enwiki (T315353)]], [[gerrit:1026511|Pre-emptively disable DiscussionToolsEnableThanks (no-op)]] [13:28:26] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [13:28:59] Gerges: in future, it's probably best to just create another account then imo [13:29:00] !log enabled puppet on cp4037 to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1035440 (T365718) [13:29:01] Lucas_WMDE: wgDiscussionToolsEnablePermalinksBackend change should enable https://en.wikipedia.org/wiki/Special:FindComment [13:29:02] (03CR) 
10Elukey: [C:03+2] services: move tegola in eqiad to the Thanos sidecar config for Swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036269 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [13:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:04] T365718: Switch HAProxy/Benthos to rfc5424 - https://phabricator.wikimedia.org/T365718 [13:29:06] Lucas_WMDE: the other one is a no-op [13:29:13] alright [13:29:17] Gerges: and also say at the start of the window [13:29:55] RhinosF1: ok [13:30:35] (+1) [13:30:51] !log lucaswerkmeister-wmde@deploy1002 esanders and matmarex and lucaswerkmeister-wmde: Backport for [[gerrit:1036197|Enable wgDiscussionToolsEnablePermalinksBackend on enwiki (T315353)]], [[gerrit:1026511|Pre-emptively disable DiscussionToolsEnableThanks (no-op)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:30:56] MatmaRex: sounds good, thanks [13:31:38] (03CR) 10Volans: sre.hosts.reimage: add support for VLAN move (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: 10Volans) [13:31:59] * Lucas_WMDE has no idea how to test Special:FindComment [13:32:15] RECOVERY - Check whether ferm is active by checking the default input chain on wikikube-ctrl2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:32:23] i'm testing it [13:32:29] ack [13:32:29] for example, try this page: https://en.wikipedia.org/wiki/Special:FindComment?idorname=c-Izno-20240417204800-Jon_(WMF)-20240417141100 [13:32:39] ooh, pasting an HTML id= seems to work [13:32:40] on main server it shows no results, on test servers it shows a result [13:32:42] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [13:32:43] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging 
[13:32:45] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [13:32:47] cool cool [13:32:49] !log lucaswerkmeister-wmde@deploy1002 esanders and matmarex and lucaswerkmeister-wmde: Continuing with sync [13:32:58] (and it says "not in current revision" because we need to run the script again, to backfill the most recent edits) [13:33:14] yup, I’ll do that afterwards [13:33:27] yeah, just explaining why it's like that [13:33:30] just from the --start printed by the last run, right? [13:33:49] FIRING: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:34:07] let me double check [13:34:11] ok [13:34:14] (03CR) 10Santiago Faci: [C:03+2] edit-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036188 (https://phabricator.wikimedia.org/T355407) (owner: 10Santiago Faci) [13:34:58] (03PS2) 10Muehlenhoff: Add new access group to grant root on the wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1036270 [13:34:59] (03Merged) 10jenkins-bot: edit-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036188 (https://phabricator.wikimedia.org/T355407) (owner: 10Santiago Faci) [13:36:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T364299)', diff saved to https://phabricator.wikimedia.org/P63318 and previous config saved to /var/cache/conftool/dbconfig/20240527-133605-marostegui.json [13:36:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1212.eqiad.wmnet with reason: Maintenance [13:36:11] T364299: Make rc_id a 
bigint - https://phabricator.wikimedia.org/T364299 [13:36:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1212.eqiad.wmnet with reason: Maintenance [13:36:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:36:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:36:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T364299)', diff saved to https://phabricator.wikimedia.org/P63319 and previous config saved to /var/cache/conftool/dbconfig/20240527-133636-marostegui.json [13:38:49] RESOLVED: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:40:05] Lucas_WMDE: looking back to how we did this for the previous wiki, i think that instead of --start, you should run the final one with --touched-after= [13:40:20] okay [13:40:34] e.g. https://phabricator.wikimedia.org/T315353#9078672 [13:40:36] and still with --current and --all? 
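(Editor's note: the --touched-after value used later in this log, 20240524120000, is in the 14-digit UTC form YYYYMMDDHHMMSS that MediaWiki uses for fields such as page_touched. A minimal sketch of converting to and from that form follows; the helper names to_mw_timestamp and from_mw_timestamp are hypothetical and not part of MediaWiki or the maintenance script.)

```python
from datetime import datetime, timezone

MW_FORMAT = "%Y%m%d%H%M%S"  # 14-digit form, e.g. 20240524120000


def to_mw_timestamp(moment):
    """Render an aware datetime as a 14-digit UTC string (hypothetical helper)."""
    return moment.astimezone(timezone.utc).strftime(MW_FORMAT)


def from_mw_timestamp(value):
    """Parse a 14-digit UTC string back into an aware datetime (hypothetical helper)."""
    return datetime.strptime(value, MW_FORMAT).replace(tzinfo=timezone.utc)


# The cutoff that appears later in the log, built from its components.
cutoff = datetime(2024, 5, 24, 12, 0, 0, tzinfo=timezone.utc)
stamp = to_mw_timestamp(cutoff)
```

(Because the string is fixed-width and most-significant-first, comparing two such strings lexically gives the same order as comparing the times they represent.)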
[13:40:51] (03CR) 10Hnowlan: [C:03+1] maps: Don't pass additional server aliases when using PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036247 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [13:40:54] yes [13:41:00] alright [13:42:55] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [13:42:59] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [13:43:59] (03PS3) 10NMW03: Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034941 (https://phabricator.wikimedia.org/T365133) [13:44:47] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [13:45:05] Nemoralis: do you know how to test the bswikiquote change? [13:45:20] (just checking in advance ^^) [13:45:35] I know the regular testing process; does it need anything else? [13:46:04] Will it work without running the updateCollation script? 
[13:46:05] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [13:46:10] probably not [13:46:15] but I was wondering that [13:46:37] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1036197|Enable wgDiscussionToolsEnablePermalinksBackend on enwiki (T315353)]], [[gerrit:1026511|Pre-emptively disable DiscussionToolsEnableThanks (no-op)]] (duration: 18m 15s) [13:46:44] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [13:46:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034941 (https://phabricator.wikimedia.org/T365133) (owner: 10NMW03) [13:46:50] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [13:46:58] apparently there are ~10k categorylinks rows, so I assume the maintenance script shouldn’t take too long [13:47:17] I was thinking more like, what even to look for, where the change would take effect [13:47:29] (03Merged) 10jenkins-bot: Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034941 (https://phabricator.wikimedia.org/T365133) (owner: 10NMW03) [13:47:45] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1034941|Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wikiquote (T365133)]] [13:47:49] T365133: Set $wgCategoryCollation to 'uca-bs-u-kn' on Bosnian Wikiquote and rebuild category sort keys - https://phabricator.wikimedia.org/T365133 [13:48:25] https://bs.wikiquote.org/wiki/Kategorija:Literatura seems to be the biggest category, but I’m not sure if we would see a difference there [13:49:11] https://bs.wikiquote.org/wiki/Kategorija:Pisci looks fine to test [13:49:32] it is small enough and has letters from Bosnian language [13:50:08] true [13:50:15] !log 
lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and nmw03: Backport for [[gerrit:1034941|Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wikiquote (T365133)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:50:21] and https://bs.wikipedia.org/wiki/Kategorija:Amerikanci_po_porijeklu (random category) looks like Š is indeed supposed to sort after S [13:50:52] not seeing any change with mwdebug yet… probably needs the maintenance script first [13:51:02] in the Pisci category, U is supposed to sort after Č [13:51:16] per their alphabet [13:51:23] https://en.wikipedia.org/wiki/Bosnian_language#Alphabet [13:51:37] ok [13:51:43] I guess we sync now and test after the maintenance script? [13:51:49] yes [13:51:52] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and nmw03: Continuing with sync [13:52:57] PROBLEM - Check whether ferm is active by checking the default input chain on mw1380 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:54:39] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [13:54:52] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [13:55:11] (03PS5) 10Ayounsi: sre.hosts.reimage: add support for VLAN move [cookbooks] - 10https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: 10Volans) [13:56:39] (03PS6) 10Ayounsi: sre.hosts.reimage: add support for VLAN move [cookbooks] - 10https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: 10Volans) [13:57:31] (03CR) 10Ayounsi: sre.hosts.reimage: add support for VLAN move (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: 10Volans) [13:58:22] FTR, I have a meeting in a few minutes, so I might be a bit late to 
start the maintenance scripts [13:58:23] jouncebot: next [13:58:23] In 1 hour(s) and 31 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1530) [13:58:32] but I should get to it before then, I think [13:58:53] (y) [14:00:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T364069)', diff saved to https://phabricator.wikimedia.org/P63320 and previous config saved to /var/cache/conftool/dbconfig/20240527-140007-marostegui.json [14:00:19] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [14:00:31] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2001.codfw.wmnet with OS bullseye [14:01:16] thanks for deploying :) [14:04:14] (03CR) 10Effie Mouzeli: "Yes, of course! The general k8s-mwdebug remains the primary destination. @James I think the current order is alright, given that testing s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035361 (https://phabricator.wikimedia.org/T365478) (owner: 10Effie Mouzeli) [14:04:26] (03PS2) 10Effie Mouzeli: x-wikimedia-debug: add datacenter options for k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035361 (https://phabricator.wikimedia.org/T365478) [14:05:10] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1034941|Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wikiquote (T365133)]] (duration: 17m 25s) [14:05:15] T365133: Set $wgCategoryCollation to 'uca-bs-u-kn' on Bosnian Wikiquote and rebuild category sort keys - https://phabricator.wikimedia.org/T365133 [14:06:12] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [14:11:39] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript updateCollation.php bswikiquote --previous-collation=uppercase # T365133 [14:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
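(Editor's note: the collation discussion above, Š after S and U after Č, comes down to alphabet order versus code-point order. The sketch below is only an illustration of that difference. It is not how MediaWiki implements 'uca-bs-u-kn', which is ICU-based; the function bosnian_sort_key and the sample titles are hypothetical, and the assumption that the old 'uppercase' collation behaves like a plain uppercased code-point comparison is a simplification.)

```python
# Bosnian Latin alphabet in order, including the digraphs dž, lj and nj.
BOSNIAN_ALPHABET = [
    "a", "b", "c", "č", "ć", "d", "dž", "đ", "e", "f", "g", "h", "i", "j",
    "k", "l", "lj", "m", "n", "nj", "o", "p", "r", "s", "š", "t", "u", "v",
    "z", "ž",
]
RANK = {letter: index for index, letter in enumerate(BOSNIAN_ALPHABET)}
DIGRAPHS = ("dž", "lj", "nj")


def bosnian_sort_key(title):
    """Hypothetical sort key: map each letter (or digraph) to its alphabet rank."""
    text = title.lower()
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            key.append(RANK[pair])
            i += 2
            continue
        ch = text[i]
        # Characters outside the alphabet sort after it, by code point.
        key.append(RANK.get(ch, len(BOSNIAN_ALPHABET) + ord(ch)))
        i += 1
    return key


# Sample titles chosen for illustration; they are not taken from the wiki.
titles = ["Šuma", "Una", "Sarajevo", "Čapljina", "Tuzla", "Ćuprija", "Cazin"]

by_codepoint = sorted(titles, key=str.upper)        # Č, Ć, Š land after U
by_alphabet = sorted(titles, key=bosnian_sort_key)  # Č, Ć follow C; Š follows S
```

(With code-point ordering the accented letters fall after the unaccented ones, which would be consistent with no visible change on mwdebug until updateCollation.php had rebuilt the stored sort keys.)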
[14:11:44] T365133: Set $wgCategoryCollation to 'uca-bs-u-kn' on Bosnian Wikiquote and rebuild category sort keys - https://phabricator.wikimedia.org/T365133 [14:11:47] already finished [14:12:00] and https://bs.wikiquote.org/wiki/Kategorija:Pisci looks different \o/ [14:12:02] (cc Nemoralis) [14:12:12] yep, works for me! [14:12:13] thanks [14:12:16] np :) [14:12:23] right, let’s do one more maintenance script for MatmaRex then ;) [14:13:34] (03PS1) 10Fabfur: benthos:cache: fix processing syntax [puppet] - 10https://gerrit.wikimedia.org/r/1036276 (https://phabricator.wikimedia.org/T365718) [14:14:11] !log START lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwiki --current --all --touched-after=20240524120000 2>&1 | tee -a ~/T315510-enwiki-7; date # cc T365974 [14:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:20] T365974: Deploy talk page permalinks to en.wiki - https://phabricator.wikimedia.org/T365974 [14:15:17] I really hope the “estimated X rows” here is very inaccurate 😅 [14:15:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P63321 and previous config saved to /var/cache/conftool/dbconfig/20240527-141515-marostegui.json [14:15:25] (Processed 300 (updated 287) of 61401202 rows) [14:16:49] wow, SELECT COUNT(*) FROM page WHERE page_touched > '20240524120000'; says 3948247 [14:16:56] almost four million touched pages [14:17:31] (03CR) 10Vgutierrez: [C:03+1] benthos:cache: fix processing syntax [puppet] - 10https://gerrit.wikimedia.org/r/1036276 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [14:17:50] its the weekend baby. youknow what that means. 
its time to drink precisely one beer and touch four million enwiki pages [14:17:57] (03CR) 10Fabfur: [C:03+2] benthos:cache: fix processing syntax [puppet] - 10https://gerrit.wikimedia.org/r/1036276 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [14:18:36] (03PS1) 10Hashar: Merge tag 'v3.8.6' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1036278 [14:18:55] I’m otherwise done deploying btw [14:19:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1238.eqiad.wmnet with reason: Maintenance [14:19:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1238.eqiad.wmnet with reason: Maintenance [14:19:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T360332)', diff saved to https://phabricator.wikimedia.org/P63322 and previous config saved to /var/cache/conftool/dbconfig/20240527-141948-arnaudb.json [14:19:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T364299)', diff saved to https://phabricator.wikimedia.org/P63323 and previous config saved to /var/cache/conftool/dbconfig/20240527-141949-marostegui.json [14:19:54] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [14:19:59] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [14:21:19] (03CR) 10Effie Mouzeli: [C:03+2] x-wikimedia-debug: add datacenter options for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034514 (https://phabricator.wikimedia.org/T365478) (owner: 10Effie Mouzeli) [14:22:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T360332)', diff saved to https://phabricator.wikimedia.org/P63324 and previous config saved to /var/cache/conftool/dbconfig/20240527-142210-arnaudb.json [14:22:57] RECOVERY - Check whether ferm is active by checking the default input chain on mw1380 is OK: OK ferm input default 
policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:26:48] (03PS2) 10Hashar: Merge tag 'v3.8.6' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1036278 (https://phabricator.wikimedia.org/T365328) [14:28:13] 06SRE, 10Acme-chief, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 06Traffic: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#9835132 (10MoritzMuehlenhoff) p:05Triage→03High [14:30:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240527-143025-marostegui.json [14:34:29] (03PS1) 10Fabfur: Revert "benthos:cache: switch to rfc5424 format" [puppet] - 10https://gerrit.wikimedia.org/r/1036213 [14:34:39] (03CR) 10CI reject: [V:04-1] Revert "benthos:cache: switch to rfc5424 format" [puppet] - 10https://gerrit.wikimedia.org/r/1036213 (owner: 10Fabfur) [14:34:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P63326 and previous config saved to /var/cache/conftool/dbconfig/20240527-143457-marostegui.json [14:35:13] (03PS1) 10Fabfur: Revert "benthos:cache: fix processing syntax" [puppet] - 10https://gerrit.wikimedia.org/r/1036214 [14:36:48] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P63327 and previous config saved to /var/cache/conftool/dbconfig/20240527-143718-arnaudb.json [14:37:59] (03CR) 10Effie Mouzeli: [C:03+1] 
coredump.conf: Disable compression [puppet] - 10https://gerrit.wikimedia.org/r/1029235 (https://phabricator.wikimedia.org/T236253) (owner: 10Ahmon Dancy) [14:39:17] (03CR) 10Fabfur: [C:03+2] Revert "benthos:cache: fix processing syntax" [puppet] - 10https://gerrit.wikimedia.org/r/1036214 (owner: 10Fabfur) [14:40:57] (03CR) 10Fabfur: [V:03+2] Revert "benthos:cache: switch to rfc5424 format" [puppet] - 10https://gerrit.wikimedia.org/r/1036213 (owner: 10Fabfur) [14:41:06] (03PS2) 10Fabfur: Revert "benthos:cache: switch to rfc5424 format" [puppet] - 10https://gerrit.wikimedia.org/r/1036213 [14:41:09] (03CR) 10Effie Mouzeli: (WIP) memcached: add extstore option (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) (owner: 10Effie Mouzeli) [14:45:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T364069)', diff saved to https://phabricator.wikimedia.org/P63328 and previous config saved to /var/cache/conftool/dbconfig/20240527-144538-marostegui.json [14:45:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [14:45:45] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [14:45:45] (03PS9) 10Effie Mouzeli: (WIP) memcached: add extstore option [puppet] - 10https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) [14:45:46] (03PS1) 10Effie Mouzeli: hieradata: enable extstore on mc1049 and mc2049 [puppet] - 10https://gerrit.wikimedia.org/r/1036281 (https://phabricator.wikimedia.org/T352885) [14:45:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [14:46:09] (03CR) 10CI reject: [V:04-1] hieradata: enable extstore on mc1049 and mc2049 [puppet] - 10https://gerrit.wikimedia.org/r/1036281 (https://phabricator.wikimedia.org/T352885) (owner: 10Effie Mouzeli) [14:46:58] 
(03PS2) 10Effie Mouzeli: hieradata: enable extstore on mc1049 and mc2049 [puppet] - 10https://gerrit.wikimedia.org/r/1036281 (https://phabricator.wikimedia.org/T352885) [14:47:07] (03CR) 10Fabfur: [C:03+2] Revert "benthos:cache: switch to rfc5424 format" [puppet] - 10https://gerrit.wikimedia.org/r/1036213 (owner: 10Fabfur) [14:47:16] 06SRE, 06Infrastructure-Foundations: Investigate janitor, maintenance emails parser - https://phabricator.wikimedia.org/T230835#9835224 (10elukey) [14:47:26] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036281 (https://phabricator.wikimedia.org/T352885) (owner: 10Effie Mouzeli) [14:47:37] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Core, 10Spicerack: Add a cookbook to safely deploy puppet changes - https://phabricator.wikimedia.org/T341442#9835228 (10elukey) [14:50:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P63329 and previous config saved to /var/cache/conftool/dbconfig/20240527-145004-marostegui.json [14:52:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P63330 and previous config saved to /var/cache/conftool/dbconfig/20240527-145226-arnaudb.json [14:55:08] (03PS10) 10Effie Mouzeli: memcached: add extstore option [puppet] - 10https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) [14:56:39] (03CR) 10Effie Mouzeli: [C:03+2] memcached: add extstore option [puppet] - 10https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) (owner: 10Effie Mouzeli) [14:56:48] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:21] 
!log enable puppet on A:cp (T365718) [15:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:27] T365718: Switch HAProxy/Benthos to rfc5424 - https://phabricator.wikimedia.org/T365718 [15:01:27] 06SRE, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9835264 (10ABran-WMF) [15:01:36] 06SRE, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9835266 (10ABran-WMF) [15:01:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036247 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [15:01:49] 06SRE, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9835267 (10ABran-WMF) [15:01:58] 06SRE, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9835268 (10ABran-WMF) [15:02:07] 06SRE, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987#9835269 (10ABran-WMF) [15:02:17] 06SRE, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9835270 (10ABran-WMF) [15:02:24] 06SRE, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9835271 (10ABran-WMF) [15:02:33] 06SRE, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, 10netops: 
Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9835272 (10ABran-WMF) [15:02:55] 06SRE, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9835273 (10ABran-WMF) [15:03:02] 06SRE, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9835274 (10ABran-WMF) [15:03:16] 06SRE, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9835275 (10ABran-WMF) [15:03:24] 06SRE, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9835276 (10ABran-WMF) [15:05:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T364299)', diff saved to https://phabricator.wikimedia.org/P63331 and previous config saved to /var/cache/conftool/dbconfig/20240527-150514-marostegui.json [15:05:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance [15:05:20] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [15:05:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance [15:07:09] (03CR) 10Muehlenhoff: [C:03+2] maps: Don't pass additional server aliases when using PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036247 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [15:07:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T360332)', diff saved to https://phabricator.wikimedia.org/P63332 
and previous config saved to /var/cache/conftool/dbconfig/20240527-150735-arnaudb.json [15:07:41] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [15:10:57] (03PS3) 10Effie Mouzeli: hieradata: enable extstore on mc1049 and mc2049 [puppet] - 10https://gerrit.wikimedia.org/r/1036281 (https://phabricator.wikimedia.org/T352885) [15:11:21] (03PS3) 10Muehlenhoff: maps: Switch kartotherian on maps2007 to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778) [15:12:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [15:13:02] (03PS4) 10Effie Mouzeli: hieradata: enable extstore on mc1049 and mc2049 [puppet] - 10https://gerrit.wikimedia.org/r/1036281 (https://phabricator.wikimedia.org/T352885) [15:15:16] (03CR) 10Hashar: [C:03+2] Merge tag 'v3.8.6' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1036278 (https://phabricator.wikimedia.org/T365328) (owner: 10Hashar) [15:20:00] (03PS1) 10Elukey: Move thanos-fe1002's envoy to CFSSL/PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036284 (https://phabricator.wikimedia.org/T344324) [15:20:20] (03CR) 10CI reject: [V:04-1] Move thanos-fe1002's envoy to CFSSL/PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036284 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [15:20:31] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [15:22:07] (03Merged) 10jenkins-bot: Merge tag 'v3.8.6' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1036278 (https://phabricator.wikimedia.org/T365328) (owner: 10Hashar) [15:22:25] !log disable puppet on mc1049 pending OS upgrade [15:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:06] (03CR) 10Effie Mouzeli: [C:03+2] hieradata: 
enable extstore on mc1049 and mc2049 [puppet] - 10https://gerrit.wikimedia.org/r/1036281 (https://phabricator.wikimedia.org/T352885) (owner: 10Effie Mouzeli) [15:23:31] (03PS8) 10Kamila Součková: [WIP] create a shellbox deployment for videoscalers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) [15:24:23] (03PS2) 10Elukey: Move thanos-fe1002's envoy to CFSSL/PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036284 (https://phabricator.wikimedia.org/T344324) [15:26:07] (03CR) 10Kamila Součková: "Is this good to go after I do something halfway sensible about https://gerrit.wikimedia.org/r/1005139 (the timeout patch)?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [15:26:32] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1036284 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [15:26:48] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:26:57] (03CR) 10Elukey: Move thanos-fe1002's envoy to CFSSL/PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036284 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [15:28:47] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2049.codfw.wmnet with OS bookworm [15:28:49] (03PS1) 10Santiago Faci: edit-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036285 (https://phabricator.wikimedia.org/T355407) [15:29:03] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Consolidate TLS cert puppetry for ms and thanos swift 
frontends - https://phabricator.wikimedia.org/T356412#9835385 (10elukey) Tegola is now using envoy (sidecar) to connect to Thanos Swift, so in theory we are good to proceed. Next step: * Move thanos-fe1002... [15:29:52] (03PS1) 10Hashar: Upgrade to Gerrit v3.8.6 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1036286 (https://phabricator.wikimedia.org/T365328) [15:30:04] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1530). [15:30:37] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [15:34:29] (03PS1) 10KartikMistry: Section Translation: Enable in newly created Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036289 (https://phabricator.wikimedia.org/T366003) [15:44:15] !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [15:46:56] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689) (owner: 10Vgutierrez) [15:49:10] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 40 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:50:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [15:50:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [15:53:30] (03CR) 10Elukey: [C:03+2] redfish: fix typo in DellSCP's class descr [software/spicerack] - 10https://gerrit.wikimedia.org/r/1035791 (owner: 10Elukey) [15:54:10] RECOVERY - IPv4 ping to eqsin on 
ripe-atlas-eqsin is OK: OK - failed 31 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:54:20] !log brouberol@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [15:54:44] (03CR) 10Brouberol: [C:03+1] edit-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036285 (https://phabricator.wikimedia.org/T355407) (owner: 10Santiago Faci) [15:56:00] !log jiji@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mc2049.codfw.wmnet with OS bookworm [15:56:14] !log run `apt-get clean` on dse-k8s-worker1001 to free space on the root partition [15:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:11] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 14), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9835532 (10WDoranWMF) [15:57:18] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2049.codfw.wmnet with OS bookworm [15:58:10] 06SRE, 10Wikimedia-Mailing-lists, 07Datacenter-Switchover: Make mailman3 work in the standby host (lists2001.wikimedia.org) - https://phabricator.wikimedia.org/T283615#9835562 (10eoghan) [15:59:51] (03Merged) 10jenkins-bot: redfish: fix typo in DellSCP's class descr [software/spicerack] - 10https://gerrit.wikimedia.org/r/1035791 (owner: 10Elukey) [16:00:54] (03CR) 10Santiago Faci: [C:03+2] edit-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036285 (https://phabricator.wikimedia.org/T355407) (owner: 10Santiago Faci) [16:01:14] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 37 probes of 788 (alerts on 35) 
- https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:01:43] (03Merged) 10jenkins-bot: edit-analytics deployment: AQS 2 refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036285 (https://phabricator.wikimedia.org/T355407) (owner: 10Santiago Faci) [16:02:47] !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [16:03:25] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [16:03:27] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [16:06:26] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 27 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:07:52] FIRING: KubernetesCalicoDown: wikikube-worker2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:09:54] !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [16:18:55] !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [16:19:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:19:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:22:10] FIRING: HelmReleaseBadStatus: Helm release edit-analytics/main on k8s-staging@eqiad in state pending-upgrade - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=edit-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:29:00] !log brouberol@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [16:32:10] RESOLVED: HelmReleaseBadStatus: Helm release edit-analytics/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=edit-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:36:08] !log jiji@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mc2049.codfw.wmnet with OS bookworm [16:37:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [16:38:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [16:47:56] (03PS8) 10Kamila Součková: shellbox: add PHP + Apache timeout settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) [16:47:56] (03CR) 10Kamila Součková: "Turns out this works, but it took me a long time to understand what exactly is happening because turns out child processes do not get kill" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1700) [17:00:05] ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T1700). [17:04:40] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:05:24] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:06:16] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51925 bytes in 1.927 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:06:32] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:09:42] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:22:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [17:22:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [17:22:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:22:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:22:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2128 (T364069)', diff saved to https://phabricator.wikimedia.org/P63333 and previous config saved to /var/cache/conftool/dbconfig/20240527-172258-marostegui.json [17:23:03] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [17:24:44] PROBLEM - mailman list 
info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:25:24] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:27:21] (03CR) 10Pppery: "This change seems to have accidentally clobbered https://gerrit.wikimedia.org/r/c/phabricator/translations/+/1035805" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1036251 (owner: 10L10n-bot) [17:30:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [17:30:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [17:30:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2127 (T364299)', diff saved to https://phabricator.wikimedia.org/P63334 and previous config saved to /var/cache/conftool/dbconfig/20240527-173035-marostegui.json [17:30:47] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [17:31:12] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:34:42] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 4.912 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:35:06] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:35:16] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:35:34] (03PS1) 10Ilias Sarantopoulos: ml-services: set command for hf image and remove nllb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036297 (https://phabricator.wikimedia.org/T365842) [17:44:40] (03CR) 10Ssingh: "Looking good but one nit that the commit message needs to be updated: we are doing just cp6001 in drmrs." [puppet] - 10https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [18:16:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T364069)', diff saved to https://phabricator.wikimedia.org/P63335 and previous config saved to /var/cache/conftool/dbconfig/20240527-181607-marostegui.json [18:16:15] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [18:24:03] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1036284 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [18:24:46] (03PS1) 10Gmodena: EventStreamConfig: Add webrequest.frontend.error [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036299 (https://phabricator.wikimedia.org/T314956) [18:25:28] (03CR) 10CI reject: [V:04-1] EventStreamConfig: Add webrequest.frontend.error [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036299 (https://phabricator.wikimedia.org/T314956) (owner: 10Gmodena) [18:31:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P63336 and previous config saved to /var/cache/conftool/dbconfig/20240527-183115-marostegui.json [18:46:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P63337 and previous config saved to /var/cache/conftool/dbconfig/20240527-184624-marostegui.json [18:53:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:55:12] (03PS2) 10Majavah: wikilabels::session: Set now-required memcached_user [puppet] - 10https://gerrit.wikimedia.org/r/1035762 [18:56:47] (03CR) 10Majavah: [C:03+2] wikilabels::session: Set now-required memcached_user [puppet] - 10https://gerrit.wikimedia.org/r/1035762 (owner: 10Majavah) [18:58:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:01:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T364069)', diff saved to https://phabricator.wikimedia.org/P63338 and previous config saved to 
/var/cache/conftool/dbconfig/20240527-190132-marostegui.json [19:01:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [19:01:40] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [19:01:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [19:01:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T364069)', diff saved to https://phabricator.wikimedia.org/P63339 and previous config saved to /var/cache/conftool/dbconfig/20240527-190155-marostegui.json [19:06:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T364299)', diff saved to https://phabricator.wikimedia.org/P63340 and previous config saved to /var/cache/conftool/dbconfig/20240527-190634-marostegui.json [19:06:39] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [19:21:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P63341 and previous config saved to /var/cache/conftool/dbconfig/20240527-192142-marostegui.json [19:26:48] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:36:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P63342 and previous config saved to /var/cache/conftool/dbconfig/20240527-193650-marostegui.json [19:51:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T364299)', diff saved to https://phabricator.wikimedia.org/P63343 and previous config saved to 
/var/cache/conftool/dbconfig/20240527-195158-marostegui.json [19:52:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [19:52:04] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [19:52:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [19:52:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T364299)', diff saved to https://phabricator.wikimedia.org/P63344 and previous config saved to /var/cache/conftool/dbconfig/20240527-195232-marostegui.json [19:54:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P63345 and previous config saved to /var/cache/conftool/dbconfig/20240527-195404-ladsgroup.json [20:01:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T364069)', diff saved to https://phabricator.wikimedia.org/P63346 and previous config saved to /var/cache/conftool/dbconfig/20240527-200106-marostegui.json [20:01:12] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [20:08:07] FIRING: KubernetesCalicoDown: wikikube-worker2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:09:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P63347 and previous config saved to /var/cache/conftool/dbconfig/20240527-200910-ladsgroup.json [20:16:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P63348 and 
previous config saved to /var/cache/conftool/dbconfig/20240527-201614-marostegui.json [20:24:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P63349 and previous config saved to /var/cache/conftool/dbconfig/20240527-202416-ladsgroup.json [20:31:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P63350 and previous config saved to /var/cache/conftool/dbconfig/20240527-203122-marostegui.json [20:39:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P63351 and previous config saved to /var/cache/conftool/dbconfig/20240527-203922-ladsgroup.json [20:46:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T364069)', diff saved to https://phabricator.wikimedia.org/P63352 and previous config saved to /var/cache/conftool/dbconfig/20240527-204630-marostegui.json [20:46:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [20:46:36] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [20:46:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [20:46:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T364069)', diff saved to https://phabricator.wikimedia.org/P63353 and previous config saved to /var/cache/conftool/dbconfig/20240527-204653-marostegui.json [21:00:05] Reedy, sbassett, Maryum, and manfredi: Time to do the Weekly Security deployment window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240527T2100). 
[21:09:42] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:14:51] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958#9836202 (10TheDJ) accidentally attached patch to wrong ticket. [21:15:05] (03PS1) 10Gergő Tisza: [multiversion] Add 'manage-dblist init-labs' subcommand [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036313 [21:27:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T364299)', diff saved to https://phabricator.wikimedia.org/P63355 and previous config saved to /var/cache/conftool/dbconfig/20240527-212738-marostegui.json [21:27:43] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [21:35:23] (03CR) 10Gergő Tisza: beta: Introduce new test2wiki on test.wikipedia.beta.wmcloud.org (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [21:42:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T364069)', diff saved to https://phabricator.wikimedia.org/P63356 and previous config saved to /var/cache/conftool/dbconfig/20240527-214210-marostegui.json [21:42:17] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [21:42:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P63357 and previous config saved to /var/cache/conftool/dbconfig/20240527-214246-marostegui.json [21:52:44] 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032 (10Soda) 03NEW [21:53:52] (03CR) 10Gergő Tisza: "Other things that I 
think need an update:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [21:55:29] (03CR) 10Gergő Tisza: "Other things that might need to be updated:" [puppet] - 10https://gerrit.wikimedia.org/r/1035752 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [21:57:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P63358 and previous config saved to /var/cache/conftool/dbconfig/20240527-215719-marostegui.json [21:57:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P63359 and previous config saved to /var/cache/conftool/dbconfig/20240527-215754-marostegui.json [22:07:50] PROBLEM - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [22:12:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P63360 and previous config saved to /var/cache/conftool/dbconfig/20240527-221227-marostegui.json [22:13:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T364299)', diff saved to https://phabricator.wikimedia.org/P63361 and previous config saved to /var/cache/conftool/dbconfig/20240527-221302-marostegui.json [22:13:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [22:13:07] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [22:13:19] !log 
marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [22:13:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance [22:13:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance [22:13:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T364299)', diff saved to https://phabricator.wikimedia.org/P63362 and previous config saved to /var/cache/conftool/dbconfig/20240527-221330-marostegui.json [22:13:50] RECOVERY - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [22:27:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T364069)', diff saved to https://phabricator.wikimedia.org/P63363 and previous config saved to /var/cache/conftool/dbconfig/20240527-222735-marostegui.json [22:27:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [22:27:42] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [22:27:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [22:28:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T364069)', diff saved to https://phabricator.wikimedia.org/P63364 and previous config saved to /var/cache/conftool/dbconfig/20240527-222759-marostegui.json [23:12:44] PROBLEM - CirrusSearch more_like eqiad 95th percentile 
latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [23:13:52] PROBLEM - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [23:20:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T364069)', diff saved to https://phabricator.wikimedia.org/P63365 and previous config saved to /var/cache/conftool/dbconfig/20240527-232025-marostegui.json [23:20:33] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [23:21:44] RECOVERY - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [23:21:53] RECOVERY - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [23:26:48] FIRING: [2x] 
SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:35:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P63366 and previous config saved to /var/cache/conftool/dbconfig/20240527-233533-marostegui.json [23:38:10] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:38:12] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:38:19] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1035870 [23:38:19] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1035870 (owner: 10TrainBranchBot) [23:39:10] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:39:12] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:42:52] PROBLEM - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [23:43:48] PROBLEM - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the 
critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [23:47:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T364299)', diff saved to https://phabricator.wikimedia.org/P63367 and previous config saved to /var/cache/conftool/dbconfig/20240527-234705-marostegui.json [23:47:13] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [23:50:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P63368 and previous config saved to /var/cache/conftool/dbconfig/20240527-235041-marostegui.json [23:51:54] RECOVERY - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [23:52:48] RECOVERY - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39