[00:05:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[00:09:53] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:10:05] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:10:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[00:15:55] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:32:15] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1030552
[00:32:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1030552 (owner: TrainBranchBot)
[00:52:54] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1030552 (owner: TrainBranchBot)
[01:00:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T352010)', diff saved to https://phabricator.wikimedia.org/P62306 and previous config saved to
/var/cache/conftool/dbconfig/20240513-010055-ladsgroup.json
[01:01:01] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[01:16:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P62307 and previous config saved to /var/cache/conftool/dbconfig/20240513-011605-ladsgroup.json
[01:18:41] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:21:22] ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T364699 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:21:27] ops-eqiad, SRE: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T364699 (ops-monitoring-bot) NEW
[01:23:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:28:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:31:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P62308 and previous config saved to
/var/cache/conftool/dbconfig/20240513-013113-ladsgroup.json
[01:38:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:46:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T352010)', diff saved to https://phabricator.wikimedia.org/P62309 and previous config saved to /var/cache/conftool/dbconfig/20240513-014623-ladsgroup.json
[01:46:27] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[01:48:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:03:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:27:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:32:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:36:29] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:48:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:48:16] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:48:49] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:53:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:53:49] RESOLVED: [2x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:00:14] RESOLVED:
JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:26] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:05:41] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:05:49] FIRING: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:10:41] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:10:49] RESOLVED: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:12:49] FIRING: [2x] ProbeDown: Service phab1004:443 has failed probes
(http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:15:41] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:17:49] FIRING: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:21:50] !log restart apache2 on phab1004
[03:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:22:49] RESOLVED: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:25:41] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:28:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:23:25] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:24:20] (PS3) KartikMistry: Update MinT to 2024-03-28-061726-production [deployment-charts] - https://gerrit.wikimedia.org/r/1015258 (https://phabricator.wikimedia.org/T333969)
[05:01:50] (PS2) Marostegui: db-production.php: Enable writes on es6 and es7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1029109 (https://phabricator.wikimedia.org/T364446)
[05:01:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[05:02:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[05:02:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[05:02:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[05:02:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T364299)', diff saved to https://phabricator.wikimedia.org/P62310 and previous config saved to /var/cache/conftool/dbconfig/20240513-050237-marostegui.json
[05:02:43] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[05:02:52] (PS1) Marostegui: es2039: Remove "to be setup" [puppet] - https://gerrit.wikimedia.org/r/1030621 (https://phabricator.wikimedia.org/T355424)
[05:08:41] (CR) Marostegui: [C:+2] es2039: Remove "to be
setup" [puppet] - https://gerrit.wikimedia.org/r/1030621 (https://phabricator.wikimedia.org/T355424) (owner: Marostegui)
[05:10:15] (CR) Marostegui: [C:+1] mariadb::ferm_idm: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1030099 (owner: Muehlenhoff)
[05:22:32] (PS1) Gerrit maintenance bot: mariadb: Promote db2123 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1030553 (https://phabricator.wikimedia.org/T364703)
[05:22:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s5 T364703
[05:23:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2123 with weight 0 T364703', diff saved to https://phabricator.wikimedia.org/P62311 and previous config saved to /var/cache/conftool/dbconfig/20240513-052304-root.json
[05:23:09] T364703: Switchover s5 master (db2213 -> db2123) - https://phabricator.wikimedia.org/T364703
[05:23:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s5 T364703
[05:24:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove vslow from db2123 T364703', diff saved to https://phabricator.wikimedia.org/P62312 and previous config saved to /var/cache/conftool/dbconfig/20240513-052424-marostegui.json
[05:25:12] (CR) Marostegui: [C:+2] mariadb: Promote db2123 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1030553 (https://phabricator.wikimedia.org/T364703) (owner: Gerrit maintenance bot)
[05:28:34] (PS1) Marostegui: es2040: Remove "to be set up" [puppet] - https://gerrit.wikimedia.org/r/1030622
[05:28:58] (CR) Marostegui: [C:+2] es2040: Remove "to be set up" [puppet] - https://gerrit.wikimedia.org/r/1030622 (owner: Marostegui)
[05:35:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T364299)', diff saved to https://phabricator.wikimedia.org/P62313 and previous config
saved to /var/cache/conftool/dbconfig/20240513-053553-marostegui.json
[05:36:00] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[05:38:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:47:32] !log Starting s5 codfw failover from db2213 to db2123 - T364703
[05:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:36] T364703: Switchover s5 master (db2213 -> db2123) - https://phabricator.wikimedia.org/T364703
[05:48:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2123 to s5 primary T364703', diff saved to https://phabricator.wikimedia.org/P62314 and previous config saved to /var/cache/conftool/dbconfig/20240513-054802-root.json
[05:48:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2213 T364703', diff saved to https://phabricator.wikimedia.org/P62315 and previous config saved to /var/cache/conftool/dbconfig/20240513-054841-root.json
[05:51:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P62316 and previous config saved to /var/cache/conftool/dbconfig/20240513-055102-marostegui.json
[05:51:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2213.codfw.wmnet with reason: Schema change
[05:51:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2213.codfw.wmnet with reason: Schema change
[05:55:05] (PS1) Marostegui: db2213: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1030624
[05:57:18] (CR) Marostegui: [C:+2] db2213: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1030624 (owner: Marostegui)
[06:05:40] (PS1) Marostegui:
db2183: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1030625
[06:05:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2213.codfw.wmnet with reason: Schema change
[06:05:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2213.codfw.wmnet with reason: Schema change
[06:06:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P62317 and previous config saved to /var/cache/conftool/dbconfig/20240513-060610-marostegui.json
[06:06:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2183.codfw.wmnet with reason: Reimage
[06:06:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2183.codfw.wmnet with reason: Reimage
[06:06:40] (CR) Marostegui: [C:+2] db2183: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1030625 (owner: Marostegui)
[06:07:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2183.codfw.wmnet with OS bookworm
[06:09:35] PROBLEM - MariaDB Replica IO: backup1-codfw on db2184 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2183.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2183.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:11:48] expected
[06:12:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2184.codfw.wmnet with reason: Reimage of the master
[06:12:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2184.codfw.wmnet with reason: Reimage of the master
[06:16:54] (PS1) Marostegui: Revert "db2213: Disable notifications" [puppet] -
https://gerrit.wikimedia.org/r/1030323
[06:19:35] (CR) Marostegui: [C:+2] Revert "db2213: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1030323 (owner: Marostegui)
[06:21:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T364299)', diff saved to https://phabricator.wikimedia.org/P62318 and previous config saved to /var/cache/conftool/dbconfig/20240513-062117-marostegui.json
[06:21:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1183.eqiad.wmnet with reason: Maintenance
[06:21:22] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[06:21:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1183.eqiad.wmnet with reason: Maintenance
[06:21:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1183 (T364299)', diff saved to https://phabricator.wikimedia.org/P62319 and previous config saved to /var/cache/conftool/dbconfig/20240513-062129-marostegui.json
[06:22:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62320 and previous config saved to /var/cache/conftool/dbconfig/20240513-062219-root.json
[06:25:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2183.codfw.wmnet with reason: host reimage
[06:28:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2183.codfw.wmnet with reason: host reimage
[06:28:43] (PS5) KartikMistry: ContentTranslation: Update publishing setting for cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1025300 (https://phabricator.wikimedia.org/T353049)
[06:32:32] (PS1) Marostegui: es6 eqiad: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1030627 (https://phabricator.wikimedia.org/T364446)
[06:33:00] (CR) Marostegui: [C:+2] es6 eqiad: Enable
notifications [puppet] - https://gerrit.wikimedia.org/r/1030627 (https://phabricator.wikimedia.org/T364446) (owner: Marostegui)
[06:37:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62321 and previous config saved to /var/cache/conftool/dbconfig/20240513-063724-root.json
[06:43:35] (PS1) Marostegui: Revert "db2183: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1030324
[06:43:39] RECOVERY - MariaDB Replica IO: backup1-codfw on db2184 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:46:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2183.codfw.wmnet with OS bookworm
[06:46:17] (CR) Marostegui: [C:+2] Revert "db2183: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1030324 (owner: Marostegui)
[06:48:12] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:48:39] (PS1) Marostegui: db2183: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1030741 (https://phabricator.wikimedia.org/T364296)
[06:52:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62322 and previous config saved to /var/cache/conftool/dbconfig/20240513-065230-root.json
[06:54:45] (PS1) Marostegui: es6 codfw: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1030742 (https://phabricator.wikimedia.org/T364446)
[06:55:01] (CR) Muehlenhoff: [C:+2] mariadb::ferm_idm: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1030099 (owner: Muehlenhoff)
[06:55:08]
(PS1) Slyngshede: Dockerize [software/bitu] - https://gerrit.wikimedia.org/r/1030743 (https://phabricator.wikimedia.org/T362318)
[06:55:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T364299)', diff saved to https://phabricator.wikimedia.org/P62323 and previous config saved to /var/cache/conftool/dbconfig/20240513-065518-marostegui.json
[06:55:22] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[06:57:33] (CR) Marostegui: [C:+2] es6 codfw: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1030742 (https://phabricator.wikimedia.org/T364446) (owner: Marostegui)
[06:59:17] (CR) Jelto: [C:-1] "one comment in line" [puppet] - https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: EoghanGaffney)
[06:59:43] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudbackup1004.eqiad.wmnet
[07:00:04] Amir1 and Urbanecm: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240513T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:48] (PS1) Muehlenhoff: Switch cloudbackup1004 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1030744 (https://phabricator.wikimedia.org/T349619)
[07:01:02] oh, I didn't get notification by jouncebot :/
[07:01:35] (because, I forgot to put irc nickname!)
[07:03:08] urbanecm: I'm going ahead with, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1025300
[07:03:25] (CR) Brouberol: [C:+2] zookeeper: use datacenter-local aliases for flink ensembles [cookbooks] - https://gerrit.wikimedia.org/r/1028521 (https://phabricator.wikimedia.org/T363975) (owner: Brouberol)
[07:03:33] (CR) Muehlenhoff: [C:+2] Switch cloudbackup1004 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1030744 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:03:39] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1025300 (https://phabricator.wikimedia.org/T353049) (owner: KartikMistry)
[07:04:23] (Merged) jenkins-bot: ContentTranslation: Update publishing setting for cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1025300 (https://phabricator.wikimedia.org/T353049) (owner: KartikMistry)
[07:04:27] Ack!
[07:04:33] (CR) Brouberol: [C:+2] hadoop: make analytics DB password available to analytics-product user [puppet] - https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) (owner: Brouberol)
[07:05:12] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1025300|ContentTranslation: Update publishing setting for cswiki (T353049)]]
[07:05:15] T353049: New settings for ContentTranslation on the cswiki - https://phabricator.wikimedia.org/T353049
[07:07:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62324 and previous config saved to /var/cache/conftool/dbconfig/20240513-070738-root.json
[07:08:21] (PS1) KartikMistry: CX: Add mw.cx.UserPermissionChecker [extensions/ContentTranslation] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1030325 (https://phabricator.wikimedia.org/T349959)
[07:08:34] !log jmm@cumin2002 END (PASS) - Cookbook
sre.puppet.migrate-host (exit_code=0) for host cloudbackup1004.eqiad.wmnet
[07:10:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P62325 and previous config saved to /var/cache/conftool/dbconfig/20240513-071026-marostegui.json
[07:10:33] PROBLEM - Check whether ferm is active by checking the default input chain on mw1422 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:10:42] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::eqiad1::instance_backups
[07:11:32] (CR) Brouberol: [C:+2] aliases: add datacenter-scoped cumin aliases for flink zk ensembles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1028520 (https://phabricator.wikimedia.org/T363975) (owner: Brouberol)
[07:12:59] (CR) Brouberol: [C:+1] Move stats misc_jobs from stat1007 to stat1011 [puppet] - https://gerrit.wikimedia.org/r/1028866 (https://phabricator.wikimedia.org/T353785) (owner: Btullis)
[07:14:03] (PS1) Muehlenhoff: Switch wmcs::openstack::eqiad1::instance_backups to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1030748 (https://phabricator.wikimedia.org/T349619)
[07:15:35] (PS1) Brouberol: cumin: improve how zookeeper-flink-eqiad/codfw aliases are computed [puppet] - https://gerrit.wikimedia.org/r/1030749 (https://phabricator.wikimedia.org/T363975)
[07:16:28] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1030749 (https://phabricator.wikimedia.org/T363975) (owner: Brouberol)
[07:16:40] (CR) Muehlenhoff: [C:+2] Switch wmcs::openstack::eqiad1::instance_backups to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1030748 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:16:52] (CR) Brouberol: [C:+2] aliases: add datacenter-scoped cumin aliases
for flink zk ensembles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1028520 (https://phabricator.wikimedia.org/T363975) (owner: Brouberol)
[07:17:41] (CR) Marostegui: db-production.php: Enable writes on es6 and es7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1029109 (https://phabricator.wikimedia.org/T364446) (owner: Marostegui)
[07:19:40] !log kartik@deploy1002 kartik: Backport for [[gerrit:1025300|ContentTranslation: Update publishing setting for cswiki (T353049)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:19:43] T353049: New settings for ContentTranslation on the cswiki - https://phabricator.wikimedia.org/T353049
[07:21:13] headsup: I'm going to rolling restart the flink-zookeeper-eqiad and flink-zookeeper-codfw zk ensembles as part of T363975, one after the other
[07:22:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::eqiad1::instance_backups
[07:22:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62326 and previous config saved to /var/cache/conftool/dbconfig/20240513-072244-root.json
[07:23:23] !log kartik@deploy1002 kartik: Continuing with sync
[07:23:41] !log brouberol@cumin2002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons.
[07:23:57] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9788897 (MoritzMuehlenhoff)
[07:24:04] marostegui: I'm up now
[07:24:08] \o/
[07:24:11] jouncebot: now
[07:24:11] For the next 0 hour(s) and 35 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240513T0700)
[07:24:28] kart_: can you let me know when you are done?
[07:25:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P62327 and previous config saved to /var/cache/conftool/dbconfig/20240513-072533-marostegui.json
[07:25:48] (PS1) Marostegui: es7 eqiad: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1030753 (https://phabricator.wikimedia.org/T364446)
[07:25:53] (CR) Muehlenhoff: [C:+2] Deprecate system::role for o11y services [puppet] - https://gerrit.wikimedia.org/r/1029924 (owner: Muehlenhoff)
[07:26:47] (CR) Brouberol: [C:+2] cumin: improve how zookeeper-flink-eqiad/codfw aliases are computed [puppet] - https://gerrit.wikimedia.org/r/1030749 (https://phabricator.wikimedia.org/T363975) (owner: Brouberol)
[07:26:55] (CR) Marostegui: [C:+2] es7 eqiad: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1030753 (https://phabricator.wikimedia.org/T364446) (owner: Marostegui)
[07:27:01] (CR) Muehlenhoff: [C:+2] Add alias for full cluster [puppet] - https://gerrit.wikimedia.org/r/1030085 (owner: Muehlenhoff)
[07:27:15] brouberol: ok to merge your changes?
[07:28:33] moritzm: brouberol ok to merge?
[07:28:36] marostegui: mine can also be merged along
[07:28:47] moritzm: thanks
[07:29:11] the patch by Balthazar is a harmless cleanup, should be fine to merge along
[07:29:17] ok
[07:29:23] and I had +1d it a few minutes ago
[07:29:25] merging them
[07:29:30] ack, thx
[07:29:33] thanks
[07:29:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[07:30:06] !log brouberol@cumin2002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons.
[07:30:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [07:30:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:30:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:30:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T352010)', diff saved to https://phabricator.wikimedia.org/P62328 and previous config saved to /var/cache/conftool/dbconfig/20240513-073031-ladsgroup.json [07:30:35] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [07:31:23] (03PS1) 10Ladsgroup: Fix static cache access [extensions/DiscussionTools] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030866 (https://phabricator.wikimedia.org/T364693) [07:31:46] (03CR) 10Ladsgroup: [C:03+2] Fix static cache access [extensions/DiscussionTools] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030866 (https://phabricator.wikimedia.org/T364693) (owner: 10Ladsgroup) [07:32:16] (03CR) 10Muehlenhoff: [C:03+2] New Cumin alias for analytics mariadb nodes [puppet] - 10https://gerrit.wikimedia.org/r/1028526 (owner: 10Muehlenhoff) [07:33:25] (03CR) 10Muehlenhoff: [C:03+2] Configure an-test-druid to use firewall::service compatible firewall settings [puppet] - 10https://gerrit.wikimedia.org/r/1029180 (owner: 10Muehlenhoff) [07:35:22] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 17451 [07:36:29] (03PS1) 10Marostegui: es7 codfw: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1030886 (https://phabricator.wikimedia.org/T364446) [07:36:58] (03CR) 10Marostegui: [C:03+2] es7 codfw: Enable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/1030886 (https://phabricator.wikimedia.org/T364446) (owner: 10Marostegui) [07:37:15] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1025300|ContentTranslation: Update publishing setting for cswiki (T353049)]] (duration: 32m 03s) [07:37:18] T353049: New settings for ContentTranslation on the cswiki - https://phabricator.wikimedia.org/T353049 [07:37:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62329 and previous config saved to /var/cache/conftool/dbconfig/20240513-073750-root.json [07:38:26] !log brouberol@cumin2002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons. [07:39:41] Amir1: let's go for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1029109 ? [07:39:55] (03Merged) 10jenkins-bot: Fix static cache access [extensions/DiscussionTools] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030866 (https://phabricator.wikimedia.org/T364693) (owner: 10Ladsgroup) [07:40:04] let me deploy this first? 
[07:40:06] ^ [07:40:07] sure [07:40:09] go for it [07:40:26] thank you <3 [07:40:33] RECOVERY - Check whether ferm is active by checking the default input chain on mw1422 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:40:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T364299)', diff saved to https://phabricator.wikimedia.org/P62330 and previous config saved to /var/cache/conftool/dbconfig/20240513-074041-marostegui.json [07:40:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance [07:40:45] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [07:40:55] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Hiera settings to allow dropping Python 2 [puppet] - 10https://gerrit.wikimedia.org/r/1028764 (https://phabricator.wikimedia.org/T316876) (owner: 10Muehlenhoff) [07:40:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance [07:41:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T364299)', diff saved to https://phabricator.wikimedia.org/P62331 and previous config saved to /var/cache/conftool/dbconfig/20240513-074103-marostegui.json [07:41:30] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1030866|Fix static cache access (T364693)]] [07:41:35] T364693: DiscussionTools isFeatureEnabled check is taking 5% of all requests - https://phabricator.wikimedia.org/T364693 [07:44:01] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1030866|Fix static cache access (T364693)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:44:40] !log brouberol@cumin2002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons. 
[07:46:17] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [07:46:56] (03CR) 10JMeybohm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030221 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [07:48:47] (03CR) 10JMeybohm: [C:03+1] "nice!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French) [07:51:21] PROBLEM - Check whether ferm is active by checking the default input chain on mw1449 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:52:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62332 and previous config saved to /var/cache/conftool/dbconfig/20240513-075256-root.json [07:53:04] !log installing libgd2 security updates [07:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:32] (03PS1) 10Brouberol: logstash: introduce restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) [07:53:52] (03PS2) 10Brouberol: logstash: introduce restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) [07:54:51] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 17451 [07:54:56] (03PS3) 10Brouberol: logstash: introduce restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) [07:55:45] (03CR) 10Muehlenhoff: "There is already an existing cookbook to restart Logstash under sre.o11y, possibly this can simply be extended by just adding the alias fo" [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) (owner: 10Brouberol) 
[07:58:25] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1030866|Fix static cache access (T364693)]] (duration: 16m 54s) [07:58:29] T364693: DiscussionTools isFeatureEnabled check is taking 5% of all requests - https://phabricator.wikimedia.org/T364693 [07:58:47] (03CR) 10Brouberol: "I thought about that, but the o11y.logstash cookbook seems to restart more services that I actually need (apache, envoyproxy, opensearch-d" [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) (owner: 10Brouberol) [07:59:07] (03CR) 10CI reject: [V:04-1] logstash: introduce restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) (owner: 10Brouberol) [07:59:16] marostegui: I'm done [07:59:45] Amir1: nice, going to deploy then [07:59:57] (03CR) 10Marostegui: [C:03+2] db-production.php: Enable writes on es6 and es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029109 (https://phabricator.wikimedia.org/T364446) (owner: 10Marostegui) [08:00:11] once you reach test servers let me know [08:00:17] !log installing python2.7 security updates [08:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:27] Amir1: wilco [08:00:41] (03Merged) 10jenkins-bot: db-production.php: Enable writes on es6 and es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029109 (https://phabricator.wikimedia.org/T364446) (owner: 10Marostegui) [08:01:01] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:1029109|db-production.php: Enable writes on es6 and es7 (T364446)]] [08:01:06] T364446: Enable writes on es6 and es7 - https://phabricator.wikimedia.org/T364446 [08:03:18] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:1029109|db-production.php: Enable writes on es6 and es7 (T364446)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:03:27] Amir1: Changes synced to the testservers. 
(see https://wikitech.wikimedia.org/wiki/Mwdebug) [08:03:27] Please do any necessary checks before continuing. [08:03:36] awesome [08:05:19] marostegui: do you see writes on enwiki there? [08:05:27] one sec [08:06:17] (03CR) 10Ayounsi: [C:03+2] drmrs: force Free on Arelion [homer/public] - 10https://gerrit.wikimedia.org/r/1030188 (owner: 10Ayounsi) [08:06:28] Amir1: no [08:06:35] Maybe they went to es4 or es5? [08:06:51] (03Merged) 10jenkins-bot: drmrs: force Free on Arelion [homer/public] - 10https://gerrit.wikimedia.org/r/1030188 (owner: 10Ayounsi) [08:06:57] es6 and es7 enwiki tables are empty [08:07:10] hmm, I made like six seven edits [08:07:25] there are no rows on neither of them [08:07:32] (03PS6) 10JMeybohm: Add CertProvider to hot reload TLS certs for gRPC service [software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/1029205 (https://phabricator.wikimedia.org/T362310) [08:07:57] Amir1: Error connecting to es1038 as user wikiuser2023: :real_connect(): (HY000/1044): Access denied for user 'wikiuser2023'@'10.%' to database 'enwiki' [08:07:58] ha [08:08:37] (03CR) 10JMeybohm: "Thanks for the review (and sorry I had all the copy/paste leftovers in there)!" [software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/1029205 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [08:08:40] fun [08:08:45] The grants to each database aren't deployed [08:08:52] I can fix that, one sec [08:09:00] 😭 [08:09:35] can you try again? [08:09:40] (03PS4) 10Brouberol: logstash: introduce restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) [08:10:00] sure, in the mean time take a look at this https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&viewPanel=1 [08:10:28] This is even more impressive https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&viewPanel=1&from=now-24h&to=now [08:11:43] I see edits show up in blobs_cluster31 [08:11:57] yeah! 
[08:12:06] 30 is still empty [08:12:13] it would be nice to try to generate some and see if they arrive well there [08:12:26] yeah, let me edit even more [08:12:35] (03CR) 10Muehlenhoff: "Ack, it makes sense to handle them as discrete cookbooks, then. Although it seems confusing to have a top level sre.logstash.roll-restart-" [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) (owner: 10Brouberol) [08:13:20] nothing is showing up there [08:13:42] yeah, so far only 31 [08:14:07] let's check error logs, etc. [08:14:25] yep [08:14:34] so es2040 is showing errors [08:14:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T364299)', diff saved to https://phabricator.wikimedia.org/P62333 and previous config saved to /var/cache/conftool/dbconfig/20240513-081448-marostegui.json [08:14:54] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [08:15:03] Unable to store text to external storage DB://cluster30 (caught Wikimedia\Rdbms\DBConnectionError exception: Cannot access the database: Access denied for user 'wikiuser2023'@'10.%' to database 'enwiki' (es1038)) [08:15:09] there is also [08:15:15] Error 1146 from ExternalStoreDB::fetchBlob, Table 'enwiki.blobs' doesn't exist SELECT blob_text FROM `blobs` WHERE blob_id = '3' LIMIT 1 es2040 [08:15:30] are you sure that es1038 isn't old? [08:15:32] I fixed that one [08:15:48] you're right [08:15:53] (03PS1) 10Slyngshede: Configuration for disabling signup. [software/bitu] - 10https://gerrit.wikimedia.org/r/1030891 [08:15:54] it's ten minutes old [08:16:01] good [08:16:03] ah I think I know [08:16:06] give me a sec [08:16:13] this one is worrying Error 1146 from ExternalStoreDB::fetchBlob, Table 'enwiki.blobs' doesn't exist SELECT blob_text FROM `blobs` WHERE blob_id = '3' LIMIT 1 es2040 [08:16:58] which by the way doesn't make sense as blobs_cluster31 exists, but blobs doesn't [08:17:01] what is that blobs table? 
[08:17:04] (03CR) 10Slyngshede: "For WMCS labtest signups are disabled in mediawiki, this allows Bitu to work as a management interface, but not create new users." [software/bitu] - 10https://gerrit.wikimedia.org/r/1030891 (owner: 10Slyngshede) [08:18:16] that error means it's probably falling back to default [08:18:31] because the table name is not set properly [08:18:34] but it is: 'cluster30' => [ 'blobs table' => 'blobs_cluster30' ], [08:19:25] But I don't see what is the mistake at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1029109/3/wmf-config/db-production.php [08:19:35] (03CR) 10Jcrespo: [C:03+1] db2183: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1030741 (https://phabricator.wikimedia.org/T364296) (owner: 10Marostegui) [08:19:53] yeah, I checked it like five time by now [08:20:22] XDD [08:21:21] RECOVERY - Check whether ferm is active by checking the default input chain on mw1449 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:24:22] !log installing PHP 7.3 security updates [08:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:21] There definitely not such table on es4 so it must be something with the code/config [08:25:50] marostegui: I'm still looking but why we have grants on parsercache on the users there, for later though [08:26:04] Amir1: Nah, simply cause I copied those over [08:26:12] I will create a task to clean them up [08:26:27] how does MW picks which one of the non-static ES to write to? 
[08:27:21] randomness [08:27:47] marostegui: let me run it with verbose, give me a bit [08:28:33] ack [08:28:42] Amir1: ok [08:29:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P62334 and previous config saved to /var/cache/conftool/dbconfig/20240513-082956-marostegui.json [08:30:13] marostegui: https://logstash.wikimedia.org/goto/95598c3c85d50acf6ada9c761850faee [08:30:34] (03CR) 10Brouberol: [C:04-1] "The helmfile releases need some changes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300) (owner: 10Stevemunene) [08:30:41] https://www.irccloud.com/pastebin/Imx0ItrO/ [08:30:47] the writes are showing up now? [08:31:03] or I should have been looking for a different cluster [08:31:08] in 30 yes [08:31:14] oh and 31! [08:31:22] yeah, we now have 3 rows in 30 and 4 in 31 [08:31:39] what ever [08:31:44] let's roll forward [08:31:49] but what's that es2040 error? [08:32:18] my guess is that it didn't pick up the config that can happen for many reasons (caching e.g.) [08:32:30] ah ok [08:32:33] so, let's go? 
[08:32:48] let's go but monitor [08:32:52] yep [08:32:54] !log marostegui@deploy1002 marostegui: Continuing with sync [08:32:57] ^ [08:34:37] Amir1: errors coming through [08:34:43] With the same blobs table thing [08:34:48] Still deploying though [08:35:15] noted [08:35:20] let me see if they subside [08:38:12] marostegui: writes are coming through to cluster30 from what I'm seeing [08:38:13] Amir1: I find strange that it could be caching, cause the error is specific for a select to a table that doesn't really exist [08:38:29] yeah for both [08:38:34] the rows are growing [08:38:43] PROBLEM - Check whether ferm is active by checking the default input chain on mw1370 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:38:45] PROBLEM - Check whether ferm is active by checking the default input chain on mw1494 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:39:05] that is easily explainable, if it doesn't pick up the config, that table name is the default, it falls back to that [08:39:12] ah ok! [08:39:39] 10ops-eqiad, 06SRE: Degraded RAID on cloudcephosd1031 - https://phabricator.wikimedia.org/T364060#9789137 (10dcaro) From dmesg: ` [Fri May 3 03:23:34 2024] sd 0:0:0:0: [sda] tag#460 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=4s [Fri May 3 03:23:34 2024] sd 0:0:0:0: [sda] tag#460 CDB... 
[08:41:11] (03CR) 10Muehlenhoff: [C:03+2] elasticsearch::tlsproxy: Stop passing certs to tlsproxy::localssl [puppet] - 10https://gerrit.wikimedia.org/r/1029121 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff) [08:41:15] 06SRE, 10SRE-Access-Requests: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715 (10MareikeHeuerWMDE) 03NEW [08:42:54] (03CR) 10Btullis: [C:03+2] Move snapshot1009 to insetup::data_engineering [puppet] - 10https://gerrit.wikimedia.org/r/1029509 (https://phabricator.wikimedia.org/T364456) (owner: 10Btullis) [08:43:16] marostegui: the errors have subsided? [08:43:29] Amir1: not yet [08:43:35] Still doing fpm restarts [08:43:35] https://usercontent.irccloud-cdn.com/file/kO6MgWQp/grafik.png [08:44:02] https://logstash.wikimedia.org/goto/2cee9922d6d1a5b856194f56b14cbdb7 [08:44:22] aaah, these errors are expected [08:44:26] these are fetch [08:44:39] on hosts that doesn't have the patch yet [08:44:40] how can we have expected errors? 
:) [08:44:44] ah ok [08:44:59] so write on a host that has it, but read from a host that doesn't have it yet [08:45:01] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:1029109|db-production.php: Enable writes on es6 and es7 (T364446)]] (duration: 44m 00s) [08:45:02] that's normal [08:45:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P62335 and previous config saved to /var/cache/conftool/dbconfig/20240513-084503-marostegui.json [08:45:05] T364446: Enable writes on es6 and es7 - https://phabricator.wikimedia.org/T364446 [08:45:09] finished deployment [08:46:00] (03PS1) 10Btullis: Remove snapshot1009 from the scap deployment targets [dumps/scap] - 10https://gerrit.wikimedia.org/r/1030894 (https://phabricator.wikimedia.org/T364456) [08:46:33] Rows being inserted [08:47:34] there is no more errors on table not existing yet [08:47:58] yeah! [08:50:29] (03CR) 10Marostegui: [C:03+2] db2183: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1030741 (https://phabricator.wikimedia.org/T364296) (owner: 10Marostegui) [08:51:10] (03CR) 10Filippo Giunchedi: "I'm +1 on what Moritz suggested (and don't feel very strongly about it either!)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) (owner: 10Brouberol) [08:51:52] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts snapshot1009.eqiad.wmnet [08:53:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2184.codfw.wmnet with OS bookworm [08:56:30] (03CR) 10Muehlenhoff: "I'll add this to the agenda of today's SRE IF meeting" [puppet] - 10https://gerrit.wikimedia.org/r/1027052 (https://phabricator.wikimedia.org/T364494) (owner: 10Dzahn) [08:56:55] (03CR) 10Lucas Werkmeister (WMDE): specials: Fix "include templates" query builder for Special:Export (031 comment) [core] (wmf/1.43.0-wmf.4) - 
10https://gerrit.wikimedia.org/r/1029564 (https://phabricator.wikimedia.org/T364554) (owner: 10Umherirrender) [08:56:57] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [08:58:38] (03PS1) 10Jcrespo: dbbackups: Add stats grants for dbprov1006, dbprov2006 at m1:dbbackups [puppet] - 10https://gerrit.wikimedia.org/r/1030896 (https://phabricator.wikimedia.org/T362509) [08:58:57] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: snapshot1009.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [08:59:30] (03CR) 10Jcrespo: "This is causing missing backup alerts, but backups are succeeding." [puppet] - 10https://gerrit.wikimedia.org/r/1030896 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [08:59:56] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:59:58] (03PS2) 10Jcrespo: dbbackups: Add stats grants for dbprov1006, dbprov2006 at m1:dbbackups [puppet] - 10https://gerrit.wikimedia.org/r/1030896 (https://phabricator.wikimedia.org/T362509) [09:00:00] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: snapshot1009.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [09:00:01] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:00:03] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts snapshot1009.eqiad.wmnet [09:00:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T364299)', diff saved to https://phabricator.wikimedia.org/P62336 and previous config saved to /var/cache/conftool/dbconfig/20240513-090011-marostegui.json 
[09:00:13] 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.05.06 - 2024.05.26), 13Patch-For-Review: decommission snapshot1009.eqiad.wmnet - https://phabricator.wikimedia.org/T364456#9789280 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by btullis@cumin1002 for hosts: `snapshot1009.e... [09:00:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance [09:00:25] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [09:00:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance [09:00:34] 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.05.06 - 2024.05.26), 13Patch-For-Review: decommission snapshot1009.eqiad.wmnet - https://phabricator.wikimedia.org/T364456#9789272 (10BTullis) a:05BTullis→03Jclark-ctr [09:00:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T364299)', diff saved to https://phabricator.wikimedia.org/P62337 and previous config saved to /var/cache/conftool/dbconfig/20240513-090035-marostegui.json [09:00:43] 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.05.06 - 2024.05.26), 13Patch-For-Review: decommission snapshot1009.eqiad.wmnet - https://phabricator.wikimedia.org/T364456#9789276 (10BTullis) a:05Jclark-ctr→03None [09:00:43] (03CR) 10Btullis: [V:03+2 C:03+2] Remove snapshot1009 from the scap deployment targets [dumps/scap] - 10https://gerrit.wikimedia.org/r/1030894 (https://phabricator.wikimedia.org/T364456) (owner: 10Btullis) [09:02:33] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2184.codfw.wmnet with OS bookworm [09:03:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:03:28] (03CR) 10Jcrespo: [C:03+2] dbbackups: Add stats grants for dbprov1006, dbprov2006 at m1:dbbackups [puppet] - 10https://gerrit.wikimedia.org/r/1030896 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [09:03:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2184.codfw.wmnet with OS bookworm [09:03:33] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9789300 (10Clement_Goubert) [09:05:57] !log deploy new stat grants at m1:dbbackups T362509 [09:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:01] T362509: Setup new dbprov hosts and decommission the old ones - https://phabricator.wikimedia.org/T362509 [09:08:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:08:44] RECOVERY - Check whether ferm is active by checking the default input chain on mw1370 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:08:46] RECOVERY - Check whether ferm is active by checking the default input chain on mw1494 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:09:56] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:20:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2184.codfw.wmnet with reason: host reimage [09:23:25] !log marostegui@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on db2184.codfw.wmnet with reason: host reimage [09:24:45] (03PS1) 10Marostegui: db2184: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1030902 (https://phabricator.wikimedia.org/T364296) [09:26:45] (03CR) 10Jcrespo: [C:03+1] db2184: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1030902 (https://phabricator.wikimedia.org/T364296) (owner: 10Marostegui) [09:27:06] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync [09:28:10] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [09:28:44] (03CR) 10Btullis: [V:03+1 C:03+2] Move stats misc_jobs from stat1007 to stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1028866 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [09:31:23] (03CR) 10Filippo Giunchedi: [C:03+1] "Great extensive comments!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030290 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [09:32:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T364299)', diff saved to https://phabricator.wikimedia.org/P62338 and previous config saved to /var/cache/conftool/dbconfig/20240513-093200-marostegui.json [09:32:06] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [09:39:01] (03CR) 10Marostegui: [C:03+2] db2184: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1030902 (https://phabricator.wikimedia.org/T364296) (owner: 10Marostegui) [09:39:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2184.codfw.wmnet with OS bookworm [09:41:59] (03CR) 10Ladsgroup: [C:03+1] hieradata: Add arbcom_itwiki to private wikis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) (owner: 10Zabe) [09:46:40] !log hnowlan@deploy1002 helmfile [eqiad] START 
helmfile.d/services/thumbor: sync [09:47:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P62340 and previous config saved to /var/cache/conftool/dbconfig/20240513-094709-marostegui.json [09:47:18] (03CR) 10Brouberol: "In order to not change the already existing o11y cookbook, I'll simply rename mine apifeatureusage.roll-restart-reboot-logstash.py, to mak" [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) (owner: 10Brouberol) [09:47:37] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [09:49:22] (03PS5) 10Brouberol: logstash: introduce restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) [09:50:00] (03PS6) 10Brouberol: apifeatureusage: introduce restart/reboot cookbook for logstash [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) [09:50:05] (03PS1) 10Btullis: Create /srv/analytics-wmde on stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1030903 (https://phabricator.wikimedia.org/T353785) [09:51:49] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2410/co" [puppet] - 10https://gerrit.wikimedia.org/r/1030903 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [09:54:12] (03CR) 10Btullis: [V:03+1 C:03+2] Create /srv/analytics-wmde on stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1030903 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240513T1000) [10:02:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P62341 and previous 
config saved to /var/cache/conftool/dbconfig/20240513-100216-marostegui.json [10:07:47] 06SRE, 06Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#9789580 (10taavi) >>! In T187929#9748100, @cmooney wrote: > The aggregate that is used for the cloud-private allocations should come from IPv6 space not announced to the internet/DFZ, or space that i... [10:09:29] (03PS10) 10EoghanGaffney: lists: Add lists role to list2001 [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) [10:12:34] (03CR) 10CI reject: [V:04-1] lists: Add lists role to list2001 [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [10:17:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T364299)', diff saved to https://phabricator.wikimedia.org/P62342 and previous config saved to /var/cache/conftool/dbconfig/20240513-101724-marostegui.json [10:17:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1210.eqiad.wmnet with reason: Maintenance [10:17:30] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [10:17:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1210.eqiad.wmnet with reason: Maintenance [10:17:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T364299)', diff saved to https://phabricator.wikimedia.org/P62343 and previous config saved to /var/cache/conftool/dbconfig/20240513-101748-marostegui.json [10:19:33] (03PS11) 10EoghanGaffney: lists: Add lists role to list2001 [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) [10:19:54] (03CR) 10CI reject: [V:04-1] lists: Add lists role to list2001 [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [10:19:58] !log 
installing expat security updates [10:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:15] FIRING: PHPFPMTooBusy: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:21:13] (03PS12) 10EoghanGaffney: lists: Add lists role to list2001 [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) [10:21:55] !incidents [10:21:55] 4673 (UNACKED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad) [10:21:55] 4672 (RESOLVED) [2x] ProbeDown sre (phab1004:443 probes/custom eqiad) [10:21:56] 4671 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [10:21:56] 4670 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_collab_ip4 eqiad) [10:22:14] !ack 4673 [10:22:14] 4673 (ACKED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad) [10:22:15] FIRING: MediaWikiLatencyExceeded: Average latency high: eqiad appserver GET/200: 2.8420353390522783s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:22:45] here [10:22:49] Let me Take a look [10:24:24] I'm around if need be [10:24:24] (03PS3) 10Ladsgroup: [WIP] Change static 'A Wikimedia project' icon to new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [10:25:05] (03CR) 10CI reject: [V:04-1] [WIP] Change static 'A Wikimedia project' icon to new one 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [10:25:11] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606#9789660 (10taavi) 05Stalled→03Declined Closing this in favour of the slightly different approach in {T358868} that's likely going t... [10:25:14] FIRING: ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#appservers-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:25:55] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1365.eqiad.wmnet, mw1413.eqiad.wmnet, mw1456.eqiad.wmnet, mw1420.eqiad.wmnet, mw1366.eqiad.wmnet, mw1407.eqiad.wmnet, mw1418.eqiad.wmnet, mw1429.eqiad.wmnet, mw1401.eqiad.wmnet, mw1373.eqiad.wmnet, mw1417.eqiad.wmnet, mw1364.eqiad.wmnet, mw1436.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/ [10:25:57] FIRING: ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#appservers-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:26:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1365.eqiad.wmnet, mw1413.eqiad.wmnet, mw1364.eqiad.wmnet, mw1420.eqiad.wmnet, mw1366.eqiad.wmnet, mw1407.eqiad.wmnet, mw1418.eqiad.wmnet, mw1372.eqiad.wmnet, mw1429.eqiad.wmnet, mw1401.eqiad.wmnet, mw1403.eqiad.wmnet, mw1373.eqiad.wmnet, mw1411.eqiad.wmnet, mw1417.eqiad.wmnet, 
mw1456.eqiad.wmnet, mw1436.eqiad.wmnet are [10:26:13] down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:26:39] not sure if related but the timing lines up - something fairly expensive happened on the jobqueue https://grafana.wikimedia.org/goto/NL2mAuLSR?orgId=1 (wikibase-addUsagesforPage) [10:26:59] (03PS4) 10Ladsgroup: [WIP] Change static 'A Wikimedia project' icon to new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [10:27:14] !incidents [10:27:14] 4673 (ACKED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad) [10:27:14] 4674 (UNACKED) ProbeDown sre (10.2.2.1 ip4 appservers-https:443 probes/service http_appservers-https_ip4 eqiad) [10:27:15] 4672 (RESOLVED) [2x] ProbeDown sre (phab1004:443 probes/custom eqiad) [10:27:15] 4671 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [10:27:15] FIRING: [2x] MediaWikiLatencyExceeded: Average latency high: eqiad appserver GET/200: 6.690172925205767s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceed [10:27:15] 4670 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_collab_ip4 eqiad) [10:27:32] !ack 4674 [10:27:33] 4674 (ACKED) ProbeDown sre (10.2.2.1 ip4 appservers-https:443 probes/service http_appservers-https_ip4 eqiad) [10:27:39] RPS tripled on bare-metal [10:27:41] (03CR) 10CI reject: [V:04-1] [WIP] Change static 'A Wikimedia project' icon to new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [10:28:07] And not on k8s [10:28:32] (03PS5) 10Ladsgroup: [WIP] Change static 'A Wikimedia project' 
icon to new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [10:28:48] so something to do with commons I bet [10:29:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:31:13] There's a spike of requests for one specific thumb [10:31:37] possibly an embed snowball? [10:32:16] commons is indeed very slow to load for me [10:32:30] we are also chatting in -sre [10:32:36] taavi: commons is only served by bare metal, and its fpm workers are saturated [10:32:39] So that makes sense [10:32:48] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw [10:35:14] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:37:15] FIRING: [3x] MediaWikiLatencyExceeded: Average latency high: codfw appserver GET/200: 0.45201377687128713s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:37:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [10:40:14] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:42:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from appservers-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [10:43:01] !incidents [10:43:02] 4673 (ACKED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad) [10:43:02] 4674 (ACKED) ProbeDown sre (10.2.2.1 ip4 appservers-https:443 probes/service http_appservers-https_ip4 eqiad) [10:43:02] 4675 (UNACKED) [3x] ATSBackendErrorsHigh cache_text sre (appservers-ro.discovery.wmnet) [10:43:02] 4672 (RESOLVED) [2x] ProbeDown sre (phab1004:443 probes/custom eqiad) [10:43:02] 4671 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [10:43:03] 4670 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_collab_ip4 eqiad) [10:43:09] !ack 4675 [10:43:09] 4675 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (appservers-ro.discovery.wmnet) [10:45:14] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:46:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T364299)', diff saved to https://phabricator.wikimedia.org/P62345 and previous config saved to /var/cache/conftool/dbconfig/20240513-104627-marostegui.json [10:46:34] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [10:47:19] (03PS6) 10Ladsgroup: Change static footer icons to the new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [10:48:12] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:53:06] !incidents [10:53:06] 4673 (ACKED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad) [10:53:06] 4674 (ACKED) ProbeDown sre (10.2.2.1 ip4 appservers-https:443 probes/service http_appservers-https_ip4 eqiad) [10:53:07] 4675 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (appservers-ro.discovery.wmnet) [10:53:07] 4672 (RESOLVED) [2x] ProbeDown sre (phab1004:443 probes/custom eqiad) [10:53:07] 4671 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [10:53:07] 4670 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_collab_ip4 eqiad) [10:54:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:57:15] FIRING: [3x] MediaWikiLatencyExceeded: Average latency high: codfw appserver GET/200: 0.40758012487165324s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:01:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P62346 and previous config saved to 
/var/cache/conftool/dbconfig/20240513-110137-marostegui.json [11:02:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from appservers-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:03:00] !incidents [11:03:01] 4673 (ACKED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad) [11:03:01] 4674 (ACKED) ProbeDown sre (10.2.2.1 ip4 appservers-https:443 probes/service http_appservers-https_ip4 eqiad) [11:03:01] 4675 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (appservers-ro.discovery.wmnet) [11:03:01] 4672 (RESOLVED) [2x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:03:01] 4671 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:03:01] 4670 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_collab_ip4 eqiad) [11:04:12] !log installing tomcat9 security updates [11:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:24] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad [11:07:51] RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from appservers-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:08:22] !incidents [11:08:23] 4673 (ACKED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad) [11:08:23] 4674 (ACKED) ProbeDown sre (10.2.2.1 ip4 appservers-https:443 probes/service http_appservers-https_ip4 eqiad) [11:08:23] 4675 (RESOLVED) [3x] ATSBackendErrorsHigh cache_text sre (appservers-ro.discovery.wmnet) [11:08:23] 4672 (RESOLVED) [2x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:08:23] 4671 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:08:24] 4670 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 
probes/custom http_phabricator_wikimedia_org_collab_ip4 eqiad) [11:09:01] (03CR) 10Muehlenhoff: [C:03+1] "Few nits inline, LGTM otherwise" [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) (owner: 10Brouberol) [11:09:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [11:09:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [11:09:55] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:10:15] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:10:57] RESOLVED: ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#appservers-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:11:24] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1030891 (owner: 10Slyngshede) [11:11:32] !incidents [11:11:33] 4673 (ACKED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad) [11:11:33] 4674 (RESOLVED) ProbeDown sre (10.2.2.1 ip4 appservers-https:443 probes/service http_appservers-https_ip4 eqiad) [11:11:33] 4675 (RESOLVED) [3x] ATSBackendErrorsHigh cache_text sre (appservers-ro.discovery.wmnet) [11:11:33] 4672 (RESOLVED) [2x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:11:34] 4671 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) 
[11:11:34] 4670 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_collab_ip4 eqiad) [11:15:15] RESOLVED: PHPFPMTooBusy: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:15:23] !incidents [11:15:23] 4673 (RESOLVED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad) [11:15:23] 4674 (RESOLVED) ProbeDown sre (10.2.2.1 ip4 appservers-https:443 probes/service http_appservers-https_ip4 eqiad) [11:15:24] 4675 (RESOLVED) [3x] ATSBackendErrorsHigh cache_text sre (appservers-ro.discovery.wmnet) [11:15:24] 4672 (RESOLVED) [2x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:15:24] 4671 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:15:24] 4670 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_collab_ip4 eqiad) [11:16:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P62347 and previous config saved to /var/cache/conftool/dbconfig/20240513-111644-marostegui.json [11:17:15] RESOLVED: [2x] MediaWikiLatencyExceeded: Average latency high: eqiad appserver GET/200: 2.0175717050383914s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExc [11:20:30] (03CR) 10Muehlenhoff: [C:03+2] Add a class to Cumin hosts which generates a Kafka certificate for frtech [puppet] - 10https://gerrit.wikimedia.org/r/1030018 
(https://phabricator.wikimedia.org/T360779) (owner: 10Muehlenhoff) [11:28:18] (03CR) 10Ayounsi: [C:03+1] "LGTM ! Looks a bit complex at first sight, but I don't have any suggestion on how to improve it :)" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1029231 (https://phabricator.wikimedia.org/T364480) (owner: 10Cathal Mooney) [11:30:37] (03PS1) 10Muehlenhoff: profile::frtech::kafka_certificate: Fix owner [puppet] - 10https://gerrit.wikimedia.org/r/1030917 (https://phabricator.wikimedia.org/T360779) [11:31:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T364299)', diff saved to https://phabricator.wikimedia.org/P62348 and previous config saved to /var/cache/conftool/dbconfig/20240513-113152-marostegui.json [11:31:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1213.eqiad.wmnet with reason: Maintenance [11:31:56] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [11:32:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1213.eqiad.wmnet with reason: Maintenance [11:32:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1213 (T364299)', diff saved to https://phabricator.wikimedia.org/P62349 and previous config saved to /var/cache/conftool/dbconfig/20240513-113215-marostegui.json [11:35:15] (03CR) 10Muehlenhoff: [C:03+2] profile::frtech::kafka_certificate: Fix owner [puppet] - 10https://gerrit.wikimedia.org/r/1030917 (https://phabricator.wikimedia.org/T360779) (owner: 10Muehlenhoff) [11:40:56] (03PS1) 10Marostegui: db-production.php: Make es4 and es5 RO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030918 (https://phabricator.wikimedia.org/T364447) [11:42:38] (03Abandoned) 10Ayounsi: drmrs: remove Tata traffic engineering [homer/public] - 10https://gerrit.wikimedia.org/r/1030177 (owner: 10Ayounsi) [11:43:41] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 
7.0): Review/cleanup content of /srv/private/modules/secret/secrets/ssl in the private repo - https://phabricator.wikimedia.org/T364622#9789846 (10MoritzMuehlenhoff) [11:47:38] (03CR) 10Marostegui: [C:04-2] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030918 (https://phabricator.wikimedia.org/T364447) (owner: 10Marostegui) [11:53:06] 06SRE, 10ChangeProp, 06collaboration-services, 06Infrastructure-Foundations, and 10 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9789858 (10MoritzMuehlenhoff) Redict is now packaged in Debian: https://tracker.debian.org/pkg/redict [11:53:09] (03PS2) 10Jsn.sherman: Add AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026972 (https://phabricator.wikimedia.org/T364034) [11:53:13] (03PS2) 10Jsn.sherman: Deploy AutoModerator to Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026973 (https://phabricator.wikimedia.org/T364034) [11:53:17] (03PS2) 10Jsn.sherman: Add AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026974 (https://phabricator.wikimedia.org/T364034) [11:53:21] (03PS2) 10Jsn.sherman: CommonSettings-labs: Load AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026975 (https://phabricator.wikimedia.org/T364034) [11:53:47] 06SRE, 10ChangeProp, 06collaboration-services, 06Infrastructure-Foundations, and 10 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9789859 (10MoritzMuehlenhoff) [11:53:56] (03CR) 10CI reject: [V:04-1] CommonSettings-labs: Load AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026975 (https://phabricator.wikimedia.org/T364034) (owner: 10Jsn.sherman) [11:58:56] !log Restarted CI Jenkins to update the Parameterized build plugin | T336782 [11:58:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:01] T336782: Jenkins CI parameterized trigger plugin logs warnings - https://phabricator.wikimedia.org/T336782 [12:01:52] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM modulo what Moritz said" [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) (owner: 10Brouberol) [12:02:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T364299)', diff saved to https://phabricator.wikimedia.org/P62350 and previous config saved to /var/cache/conftool/dbconfig/20240513-120229-marostegui.json [12:02:33] (03CR) 10Ecarg: "TY!" [puppet] - 10https://gerrit.wikimedia.org/r/1030291 (https://phabricator.wikimedia.org/T364414) (owner: 10Dzahn) [12:02:33] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [12:02:45] (03CR) 10EoghanGaffney: lists: Add lists role to list2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [12:06:29] (03PS7) 10Brouberol: apifeatureusage: introduce restart/reboot cookbook for logstash [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) [12:08:58] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for Wiktionary admins - https://phabricator.wikimedia.org/T364731#9789879 (10Ladsgroup) a:03Ladsgroup Hi, you mean English Wiktionary admins or admins of all wiktionary projects? If the former, then it should be wiktionary-en-admins@lists.wikimedia.org (see http... [12:09:05] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for Wiktionary admins - https://phabricator.wikimedia.org/T364731#9789893 (10Ladsgroup) Second: Should be public or private? [12:11:15] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for Wiktionary admins - https://phabricator.wikimedia.org/T364731#9789903 (10Vininn126) Hello. 
I mean specifically English Wiktionary admins, but it could be used for English Wiktionary in general, certainly not all Wiktionary projects (hence the en prefix). What... [12:11:55] (03CR) 10Muehlenhoff: [C:03+2] Configure analytics Druid nodes to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1029184 (owner: 10Muehlenhoff) [12:12:22] (03CR) 10Brouberol: apifeatureusage: introduce restart/reboot cookbook for logstash (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) (owner: 10Brouberol) [12:12:37] (03PS2) 10Vgutierrez: hiera: Enable IPIP on upload and upload-https services [puppet] - 10https://gerrit.wikimedia.org/r/1030022 (https://phabricator.wikimedia.org/T357257) [12:12:38] (03PS3) 10Vgutierrez: hiera: Enable IPIP encapsulation on high-traffic2@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1030021 (https://phabricator.wikimedia.org/T357257) [12:12:38] (03PS2) 10Vgutierrez: cache: Enable IPIP encapsulation on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1030051 (https://phabricator.wikimedia.org/T357257) [12:14:23] (03CR) 10Vgutierrez: "targeting ulsfo now, thanks for your feedback" [puppet] - 10https://gerrit.wikimedia.org/r/1030051 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:14:40] (03CR) 10Vgutierrez: hiera: Enable IPIP on upload and upload-https services [puppet] - 10https://gerrit.wikimedia.org/r/1030022 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:14:54] (03CR) 10Vgutierrez: hiera: Enable IPIP encapsulation on high-traffic2@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1030021 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:17:01] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2412/co" [puppet] - 10https://gerrit.wikimedia.org/r/1030021 
(https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:17:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P62351 and previous config saved to /var/cache/conftool/dbconfig/20240513-121737-marostegui.json [12:20:02] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) (owner: 10Brouberol) [12:20:54] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030557 [12:22:16] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for Wiktionary admins - https://phabricator.wikimedia.org/T364731#9789936 (10Ladsgroup) if you want a general mailing list for English Wiktionary then it should be wiktionary-en@. We can have both. Public: Everyone can join and read archives. Private: If someone... [12:22:58] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2413/co" [puppet] - 10https://gerrit.wikimedia.org/r/1030051 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:23:00] (03CR) 10Muehlenhoff: [C:03+2] Switch public Druid nodes to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1029188 (owner: 10Muehlenhoff) [12:25:12] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for Wiktionary admins - https://phabricator.wikimedia.org/T364731#9789948 (10Vininn126) Okay, thank you for the explanation. Considering this mailing list is intended to be used also for social media account creation, I think we should limit it to admins, so I sup... [12:26:35] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for Wiktionary admins - https://phabricator.wikimedia.org/T364731#9789950 (10Vininn126) However a general mailing list might be useful as well, it's just not what we had discussed prior in the threads. 
[12:30:06] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2414/co" [puppet] - 10https://gerrit.wikimedia.org/r/1030051 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:31:03] (03CR) 10Brouberol: [C:03+2] apifeatureusage: introduce restart/reboot cookbook for logstash [cookbooks] - 10https://gerrit.wikimedia.org/r/1030890 (https://phabricator.wikimedia.org/T363975) (owner: 10Brouberol) [12:31:04] (03PS1) 10Muehlenhoff: an-druid: One more setting for enabling firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1030937 [12:32:10] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1030937 (owner: 10Muehlenhoff) [12:32:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P62352 and previous config saved to /var/cache/conftool/dbconfig/20240513-123244-marostegui.json [12:33:22] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1030021 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:36:22] (03PS1) 10Ladsgroup: etcd: Ignore parsercache clusters in externalLoads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030938 (https://phabricator.wikimedia.org/T362786) [12:41:31] !log brouberol@cumin2002 START - Cookbook sre.apifeatureusage.roll-restart-reboot-logstash rolling restart_daemons on A:apifeatureusage [12:42:01] (03PS1) 10Vgutierrez: depool upload@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1030939 (https://phabricator.wikimedia.org/T357257) [12:44:05] !log brouberol@cumin2002 END (PASS) - Cookbook sre.apifeatureusage.roll-restart-reboot-logstash (exit_code=0) rolling 
restart_daemons on A:apifeatureusage [12:44:25] (03CR) 10Brouberol: [C:03+1] Update the email address for data-engineering-alerts [puppet] - 10https://gerrit.wikimedia.org/r/1030172 (https://phabricator.wikimedia.org/T364632) (owner: 10Btullis) [12:45:21] (03CR) 10Brouberol: [C:03+1] Move some of the data-engineering alerts to data-platform [alerts] - 10https://gerrit.wikimedia.org/r/1030189 (https://phabricator.wikimedia.org/T346438) (owner: 10Btullis) [12:47:01] (03CR) 10Btullis: [V:03+1 C:03+2] Update the email address for data-engineering-alerts [puppet] - 10https://gerrit.wikimedia.org/r/1030172 (https://phabricator.wikimedia.org/T364632) (owner: 10Btullis) [12:47:34] (03CR) 10Btullis: [C:03+2] Move some of the data-engineering alerts to data-platform [alerts] - 10https://gerrit.wikimedia.org/r/1030189 (https://phabricator.wikimedia.org/T346438) (owner: 10Btullis) [12:47:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T364299)', diff saved to https://phabricator.wikimedia.org/P62353 and previous config saved to /var/cache/conftool/dbconfig/20240513-124752-marostegui.json [12:47:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance [12:47:58] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [12:48:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance [12:48:59] RECOVERY - snapshot of s7 in eqiad on backupmon1001 is OK: Last snapshot for s7 at eqiad (db1171) taken on 2024-05-13 11:32:05 (1058 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [12:49:06] (03Merged) 10jenkins-bot: Move some of the data-engineering alerts to data-platform [alerts] - 10https://gerrit.wikimedia.org/r/1030189 (https://phabricator.wikimedia.org/T346438) (owner: 10Btullis) [12:51:48] (03CR) 10Brouberol: [C:03+1] "LGTM thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1030937 (owner: 10Muehlenhoff) [12:52:23] (03CR) 10Btullis: [C:03+1] an-druid: One more setting for enabling firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1030937 (owner: 10Muehlenhoff) [12:55:22] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1021893 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [12:57:33] (03CR) 10Jforrester: [C:03+1] admin: add Grace Choi to deployers [puppet] - 10https://gerrit.wikimedia.org/r/1030291 (https://phabricator.wikimedia.org/T364414) (owner: 10Dzahn) [12:57:41] (03CR) 10Hnowlan: [C:03+1] services: move Tegola's Swift config in staging to local envoy proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029544 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [12:58:28] (03CR) 10Ssingh: [C:03+1] depool upload@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1030939 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:59:20] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance [12:59:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance [12:59:36] (03CR) 10Btullis: [C:03+2] Remove the piwik role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1021893 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [12:59:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T360332)', diff saved to https://phabricator.wikimedia.org/P62354 and previous config saved to /var/cache/conftool/dbconfig/20240513-125940-arnaudb.json [12:59:44] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: It is that lovely time of the day again! 
You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240513T1300). [13:00:05] MatmaRex and hnowlan: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:25] (03CR) 10Muehlenhoff: [C:03+2] an-druid: One more setting for enabling firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1030937 (owner: 10Muehlenhoff) [13:00:29] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [13:00:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [13:00:46] hi [13:00:47] o/ [13:00:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T360332)', diff saved to https://phabricator.wikimedia.org/P62355 and previous config saved to /var/cache/conftool/dbconfig/20240513-130049-arnaudb.json [13:01:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2202.codfw.wmnet with reason: Maintenance [13:01:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2202.codfw.wmnet with reason: Maintenance [13:01:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T360332)', diff saved to https://phabricator.wikimedia.org/P62356 and previous config saved to /var/cache/conftool/dbconfig/20240513-130158-arnaudb.json [13:02:22] MatmaRex: I left a comment on your patch, could you address that in a follow-up? 
[13:02:32] (probably doesn’t have to block the backport) [13:02:39] looking [13:03:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T360332)', diff saved to https://phabricator.wikimedia.org/P62357 and previous config saved to /var/cache/conftool/dbconfig/20240513-130329-arnaudb.json [13:03:49] (03CR) 10Elukey: [C:03+2] services: move Tegola's Swift config in staging to local envoy proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029544 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [13:03:56] Lucas_WMDE: yeah, good point, i don't think that blocks the backport though [13:04:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029564 (https://phabricator.wikimedia.org/T364554) (owner: 10Umherirrender) [13:04:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030308 (https://phabricator.wikimedia.org/T364635) (owner: 10Bartosz Dziewoński) [13:04:27] i'll write a follow-up in a moment [13:04:45] alright, thanks [13:05:09] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:05:11] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:06:26] I’m trying out the visual diff issue but can’t reproduce it so far, strange [13:07:26] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:07:29] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:07:50] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [13:07:58] (03CR) 10Ssingh: [C:03+1] team-traffic: Add runbook link to LVSRealserverMSS alert [alerts] - 
10https://gerrit.wikimedia.org/r/1030057 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:08:07] those applies by me are tests btw, nothing actually happened [13:08:11] (03PS1) 10Muehlenhoff: druid: Remove support for using ferm firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/1030947 [13:08:51] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:09:25] (03CR) 10Bartosz Dziewoński: specials: Fix "include templates" query builder for Special:Export (031 comment) [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029564 (https://phabricator.wikimedia.org/T364554) (owner: 10Umherirrender) [13:09:42] can’t reproduce the visual diff issue on testwiki either [13:09:46] (03PS1) 10Filippo Giunchedi: jaeger: update chart to 3.0.7 / f3c883908e576 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030950 (https://phabricator.wikimedia.org/T364477) [13:09:48] (03PS1) 10Filippo Giunchedi: jaeger: update aux values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030951 (https://phabricator.wikimedia.org/T364477) [13:09:49] (03PS1) 10Filippo Giunchedi: jaeger: update bitnami/common to 2.19.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030952 (https://phabricator.wikimedia.org/T364477) [13:10:14] Lucas_WMDE: the visual diff problem only occurs in the editor, not for historical diffs [13:10:19] (03Abandoned) 10Hashar: CommonSettings.php: Fix jobrunner hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021886 (https://phabricator.wikimedia.org/T349796) (owner: 10Clément Goubert) [13:10:23] yes, I’m checking the editor already [13:10:24] Lucas_WMDE: follow-up is https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1030949 [13:10:30] but apparently I also have to be in VE 
mode, not source mode [13:10:35] which nobody mentioned so far (or I missed it) [13:10:39] (03CR) 10Hashar: [C:03+1] Include mw-jobrunner port in host header check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025391 (owner: 10Hnowlan) [13:10:45] ah. i don't think i tested that one [13:10:46] doesn’t help that visual mode is disabled on enwiki’s sandbox either [13:10:53] (03CR) 10CI reject: [V:04-1] jaeger: update aux values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030951 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi) [13:10:53] (03CR) 10CI reject: [V:04-1] jaeger: update bitnami/common to 2.19.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030952 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi) [13:11:01] (03CR) 10CI reject: [V:04-1] jaeger: update chart to 3.0.7 / f3c883908e576 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030950 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi) [13:11:36] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:11:38] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:11:58] MatmaRex: thanks! 
+2ed [13:12:04] (03CR) 10Hashar: [C:03+1] Include mw-jobrunner port in host header check (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025391 (owner: 10Hnowlan) [13:12:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1030947 (owner: 10Muehlenhoff) [13:13:38] i'm a bit behind on gerrit emails and i haven't seen your comment before… i'm currently at 160 unread messages, i blame all the hackathon activity :) [13:14:00] hehe [13:14:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1245.eqiad.wmnet with reason: Maintenance [13:14:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1245.eqiad.wmnet with reason: Maintenance [13:15:39] (03PS2) 10Filippo Giunchedi: jaeger: update chart to 3.0.7 / f3c883908e576 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030950 (https://phabricator.wikimedia.org/T364477) [13:15:39] (03PS2) 10Filippo Giunchedi: jaeger: update aux values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030951 (https://phabricator.wikimedia.org/T364477) [13:15:39] (03PS2) 10Filippo Giunchedi: jaeger: update bitnami/common to 2.19.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030952 (https://phabricator.wikimedia.org/T364477) [13:17:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P62358 and previous config saved to /var/cache/conftool/dbconfig/20240513-131706-arnaudb.json [13:17:14] (03CR) 10CI reject: [V:04-1] jaeger: update chart to 3.0.7 / f3c883908e576 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030950 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi) [13:17:19] (03CR) 10CI reject: [V:04-1] jaeger: update aux values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030951 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi) [13:17:21] 
(03CR) 10CI reject: [V:04-1] jaeger: update bitnami/common to 2.19.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030952 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi) [13:17:54] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [13:18:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P62359 and previous config saved to /var/cache/conftool/dbconfig/20240513-131837-arnaudb.json [13:19:09] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9790179 (10Eevans) That didn't take long: ` /dev/md2: Version : 1.2 Creation Time : Thu May 9 14:23:21 2024 Raid Level : raid10 Array Size : 3701655552 (3530.17 GiB 3790.50... [13:19:26] me: “gosh this backport for a Special:Export issue is taking a while. let me look at logspam-watch in the meantime… oh, a database error, I wonder if that’s already been reported” [13:19:27] (spoiler: it’s been reported at https://phabricator.wikimedia.org/T364554 and you’ll never guess which gerrit change is attached there) [13:20:12] (03PS3) 10Jsn.sherman: extension-list: Add AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026972 (https://phabricator.wikimedia.org/T364034) [13:20:12] (03PS3) 10Jsn.sherman: InitialiseSettings.php: Add wmgUseAutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026973 (https://phabricator.wikimedia.org/T364034) [13:20:12] (03PS3) 10Jsn.sherman: InitialiseSettings-labs.php: Deploy AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026974 (https://phabricator.wikimedia.org/T364034) [13:20:12] :D [13:20:13] (03PS3) 10Jsn.sherman: CommonSettings-labs: Load AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026975 (https://phabricator.wikimedia.org/T364034) [13:21:50] (03PS1) 10JMeybohm: Add kubestagemaster200[45] as 
insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/1030955 (https://phabricator.wikimedia.org/T363310) [13:23:14] (03PS2) 10JMeybohm: Add kubestagemaster200[45] as insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/1030955 (https://phabricator.wikimedia.org/T364740) [13:24:48] (03CR) 10Brouberol: [C:03+1] "LGTM and PCC confirms it's a NOOP." [puppet] - 10https://gerrit.wikimedia.org/r/1030947 (owner: 10Muehlenhoff) [13:25:51] how long does sync to debug servers take these days? maybe i'll go make myself some tea [13:27:07] only a few minutes, I think [13:27:11] once CI goes through, that is [13:27:25] you definitely could’ve made yourself some tea when Zuul still said ETA 15+ minutes ^^ [13:27:29] but now it says it’s all but done [13:29:15] (03PS1) 10JMeybohm: Add kubestagemaster2004 to the etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1030957 (https://phabricator.wikimedia.org/T363307) [13:29:22] (03Merged) 10jenkins-bot: specials: Fix "include templates" query builder for Special:Export [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029564 (https://phabricator.wikimedia.org/T364554) (owner: 10Umherirrender) [13:29:26] (03Merged) 10jenkins-bot: ArticleTarget: Fix return of getVisualDiffGeneratorPromise [extensions/VisualEditor] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030308 (https://phabricator.wikimedia.org/T364635) (owner: 10Bartosz Dziewoński) [13:29:44] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1029564|specials: Fix "include templates" query builder for Special:Export (T364554)]], [[gerrit:1030308|ArticleTarget: Fix return of getVisualDiffGeneratorPromise (T364635)]] [13:29:50] T364554: Wikimedia\Rdbms\DBQueryError: Error 1066: Not unique table/alias: 'templatelinks' when using "Include templates" on Special:Export - https://phabricator.wikimedia.org/T364554 [13:29:51] T364635: Visual Diff not working - https://phabricator.wikimedia.org/T364635 
[13:30:12] (03CR) 10Brouberol: [C:03+1] Add datasets-config and datasets-config-next helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:30:19] (03CR) 10CI reject: [V:04-1] Add kubestagemaster2004 to the etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1030957 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [13:30:52] (03CR) 10Brouberol: "LGTM with the PCC output!" [puppet] - 10https://gerrit.wikimedia.org/r/1029541 (https://phabricator.wikimedia.org/T364542) (owner: 10Btullis) [13:31:42] (03PS1) 10JMeybohm: Add kubestagemaster2004 to staging-codfw conftool [puppet] - 10https://gerrit.wikimedia.org/r/1030958 (https://phabricator.wikimedia.org/T363307) [13:32:09] !log lucaswerkmeister-wmde@deploy1002 umherirrender and lucaswerkmeister-wmde and matmarex: Backport for [[gerrit:1029564|specials: Fix "include templates" query builder for Special:Export (T364554)]], [[gerrit:1030308|ArticleTarget: Fix return of getVisualDiffGeneratorPromise (T364635)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:32:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P62360 and previous config saved to /var/cache/conftool/dbconfig/20240513-133214-arnaudb.json [13:32:20] MatmaRex: now you can test :) [13:32:51] VE issue looks better to me now [13:33:02] Lucas_WMDE: yep. both look good [13:33:25] ok! 
[13:33:27] !log lucaswerkmeister-wmde@deploy1002 umherirrender and lucaswerkmeister-wmde and matmarex: Continuing with sync [13:33:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P62361 and previous config saved to /var/cache/conftool/dbconfig/20240513-133345-arnaudb.json [13:36:04] (03CR) 10Brouberol: [C:03+1] Add the airflow profile to the statistics::explorer role [puppet] - 10https://gerrit.wikimedia.org/r/1029541 (https://phabricator.wikimedia.org/T364542) (owner: 10Btullis) [13:36:45] 10ops-codfw, 06SRE: connected console ports attached to unracked device - https://phabricator.wikimedia.org/T364633#9790290 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm Initiated: 2024-05-13 13:35 Duration: 0 minutes, 1.60 seconds Completed [13:36:54] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9790294 (10Jhancock.wm) [13:37:03] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [13:38:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [13:38:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [13:38:58] Lucas_WMDE: my config patch is a no-op/cleanup, we could skip it and let hnowlan's patches go first. i have to leave at the top of the hour anyway [13:40:42] just fyi the logging fix for my patch is to fix something that's already broken and will only be visible when in prod. 
The testwiki change is fine to test though [13:41:22] MatmaRex: ok [13:41:23] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2026 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:41:53] hnowlan: also, I just saw h.ashar left a comment on the logging fix [13:42:15] (03PS4) 10Hnowlan: Enable async upload-by-URL via jobqueue on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025790 (https://phabricator.wikimedia.org/T295007) [13:42:34] I’ll do the testwiki change first [13:42:59] Lucas_WMDE: I think that's for a future improvement, it comes just after a +1 so I'm happy to revisit in another patch [13:43:21] ack, thanks! [13:43:54] (03CR) 10Muehlenhoff: [C:04-1] "preseed.yaml currently configures all kubestagemaster nodes for a VM partition layout, this will need an update as well to be able to comp" [puppet] - 10https://gerrit.wikimedia.org/r/1030955 (https://phabricator.wikimedia.org/T364740) (owner: 10JMeybohm) [13:45:48] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1029564|specials: Fix "include templates" query builder for Special:Export (T364554)]], [[gerrit:1030308|ArticleTarget: Fix return of getVisualDiffGeneratorPromise (T364635)]] (duration: 16m 04s) [13:45:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025790 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan) [13:46:00] T364554: Wikimedia\Rdbms\DBQueryError: Error 1066: Not unique table/alias: 'templatelinks' when using "Include templates" on Special:Export - https://phabricator.wikimedia.org/T364554 [13:46:01] T364635: Visual Diff not working - https://phabricator.wikimedia.org/T364635 [13:47:00] (03CR) 10Muehlenhoff: [C:03+2] druid: Remove support for using ferm firewall definitions 
[puppet] - 10https://gerrit.wikimedia.org/r/1030947 (owner: 10Muehlenhoff) [13:47:01] (03Merged) 10jenkins-bot: Enable async upload-by-URL via jobqueue on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025790 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan) [13:47:08] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [13:47:20] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1025790|Enable async upload-by-URL via jobqueue on testwiki (T295007)]] [13:47:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T360332)', diff saved to https://phabricator.wikimedia.org/P62362 and previous config saved to /var/cache/conftool/dbconfig/20240513-134721-arnaudb.json [13:47:24] T295007: Upload by URL should use the job queue, possibly chunked with range requests - https://phabricator.wikimedia.org/T295007 [13:47:30] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [13:48:47] (03PS1) 10Muehlenhoff: Failover IDP [dns] - 10https://gerrit.wikimedia.org/r/1030959 [13:48:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T360332)', diff saved to https://phabricator.wikimedia.org/P62363 and previous config saved to /var/cache/conftool/dbconfig/20240513-134852-arnaudb.json [13:49:01] (03CR) 10JMeybohm: "kubestagemaster* nodes will continue to be VMs, so a change there should not be necessary" [puppet] - 10https://gerrit.wikimedia.org/r/1030955 (https://phabricator.wikimedia.org/T364740) (owner: 10JMeybohm) [13:49:35] (03CR) 10Muehlenhoff: [C:03+1] "Ah, ok. LGTM then." 
[puppet] - 10https://gerrit.wikimedia.org/r/1030955 (https://phabricator.wikimedia.org/T364740) (owner: 10JMeybohm) [13:49:46] !log lucaswerkmeister-wmde@deploy1002 hnowlan and lucaswerkmeister-wmde: Backport for [[gerrit:1025790|Enable async upload-by-URL via jobqueue on testwiki (T295007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:50:45] thanks for deploying [13:50:48] np [13:50:52] hnowlan: can you test the testwiki change? [13:50:53] thanks, testing! [13:51:37] “PHP Warning: get_class() expects parameter 1 to be object, unknown given” [13:51:42] the hell is “unknown” o_O [13:53:50] (03CR) 10Slyngshede: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1030959 (owner: 10Muehlenhoff) [13:54:21] 06SRE, 06Infrastructure-Foundations, 06Traffic, 13Patch-For-Review: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024#9790333 (10CDanis) [13:56:23] !log brouberol@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: JVM restart - brouberol@cumin2002 - T363975 [13:56:46] !log brouberol@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: JVM restart - brouberol@cumin2002 - T363975 [13:58:26] Lucas_WMDE: looks like there's an issue with the allowlist for uploads, can't properly test the change :/ [13:58:39] damn :/ [13:59:00] should I try syncing it anyway? 
it’s only testwiki… [13:59:16] OTOH, I guess you might not be able to test it on non-mwdebug testwiki anymore than under mwdebug, so idk how useful syncing it is [13:59:28] I suppose it at least gives you more time to figure out the allowlist issue [13:59:45] !log brouberol@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: JVM restart - brouberol@cumin2002 - T363975 [13:59:58] Lucas_WMDE: yeah it would make things a little easier for me if we synced it anyway, and it's not functionality that will break anyone's workflow most likely [14:00:11] yeah, ok [14:00:12] the change has worked in beta so I suspect it's down to how testwiki reads or doesn't read the list [14:00:13] !log lucaswerkmeister-wmde@deploy1002 hnowlan and lucaswerkmeister-wmde: Continuing with sync [14:00:23] let’s do it then [14:00:28] thanks! [14:00:32] jouncebot: now [14:00:33] No deployments scheduled for the next 1 hour(s) and 29 minute(s) [14:00:39] ok, then we can overrun a bit I think [14:01:57] (03PS1) 10Andrew Bogott: puppetserver-deploy-code.sh: use 'gitpuppet' user to check current branch [puppet] - 10https://gerrit.wikimedia.org/r/1030962 (https://phabricator.wikimedia.org/T364492) [14:02:23] (03Abandoned) 10Andrew Bogott: puppet-git-sync-upstream: run as 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1029198 (https://phabricator.wikimedia.org/T364047) (owner: 10Andrew Bogott) [14:03:44] hnowlan: once this finishes, do you still have time for the logging fix too? [14:03:45] hnowlan: Maybe something to do with upload permissions in wmf-config/core-Permissions.php ? 
[14:03:48] I’d roll that out as well [14:04:05] testwiki doesn´t have any override, and I don't know what the default perms for uploading are [14:04:40] claime: https://test.wikipedia.org/wiki/Special:ListGroupRights suggests all users have upload_by_url right [14:04:43] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Review/cleanup content of /srv/private/modules/secret/secrets/ssl in the private repo - https://phabricator.wikimedia.org/T364622#9790377 (10MoritzMuehlenhoff) p:05Triage→03High [14:04:50] Lucas_WMDE: yep, should be easily done [14:04:50] Ah thanks for the link [14:05:03] claime: I'm sysop on testwiki so I should be able to either way [14:05:10] I am not well versed in special pages x) [14:05:10] (don’t ask me why that one right has underscores instead of hyphens in its identifier ^^) [14:05:12] !log brouberol@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: JVM restart - brouberol@cumin2002 - T363975 [14:05:35] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 10vm-requests, and 2 others: Site: codfw 2 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T364740#9790379 (10MoritzMuehlenhoff) p:05Triage→03Medium LGTM [14:06:03] Jenkins / Zuul are going to be shutdown to switch over the hosts [14:06:49] hnowlan: https://test.wikipedia.org/wiki/MediaWiki:Copyupload-allowed-domains says “Only work if $wgCopyUploadAllowOnWikiDomainConfig is set to true” [14:06:58] and if I read mediawiki-config correctly, commonswiki is the only wiki which has that set [14:07:13] so on all other wikis that page is a (misleading) no-op, I guess [14:07:25] heh, that checks out :( [14:07:32] ah, but you’re the one who created it, so it wasn’t pre-existing misleading ^^ [14:07:35] RECOVERY - snapshot of s4 in eqiad on backupmon1001 is OK: Last snapshot for s4 at eqiad (db1150) taken on 
2024-05-13 12:40:02 (1684 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [14:07:53] although now that I look at it we can use upload.wikimedia.org on testwiki [14:08:05] heh, yeah, that might work [14:08:10] I hear we have some large files there ;) [14:08:11] (03Abandoned) 10Andrew Bogott: puppetserver-deploy-code: add -force to g10k call to invoke purging [puppet] - 10https://gerrit.wikimedia.org/r/1025818 (https://phabricator.wikimedia.org/T364047) (owner: 10Andrew Bogott) [14:08:14] uploadception [14:09:04] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on contint2002.wikimedia.org with reason: T334517 [14:09:08] T334517: upgrade contint servers to bullseye - https://phabricator.wikimedia.org/T334517 [14:09:19] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on contint2002.wikimedia.org with reason: T334517 [14:09:40] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on contint1002.wikimedia.org with reason: T334517 [14:09:55] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on contint1002.wikimedia.org with reason: T334517 [14:10:34] Lucas_WMDE: hnowlan: we are shutting down CI (Zuul/Jenkins) [14:10:38] (03CR) 10Dzahn: [C:03+1] ci: disable zuul merger on contint2002 for migration [puppet] - 10https://gerrit.wikimedia.org/r/1020950 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [14:10:40] ack, thx [14:11:00] (03CR) 10Dzahn: [C:03+2] ci: disable zuul merger on contint2002 for migration [puppet] - 10https://gerrit.wikimedia.org/r/1020950 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [14:11:12] (03PS4) 10Jsn.sherman: CommonSettings-labs: Load AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026975 (https://phabricator.wikimedia.org/T364034) [14:11:23] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2026 is 
OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:11:40] (03CR) 10JMeybohm: [C:03+2] Add kubestagemaster200[45] as insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/1030955 (https://phabricator.wikimedia.org/T364740) (owner: 10JMeybohm) [14:11:51] it’s so weird how php-fpm-restart is at less than 100 hosts total these days :D [14:12:00] where have all the appservers gone? to kubernetes, every one [14:12:30] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1025790|Enable async upload-by-URL via jobqueue on testwiki (T295007)]] (duration: 25m 09s) [14:12:33] T295007: Upload by URL should use the job queue, possibly chunked with range requests - https://phabricator.wikimedia.org/T295007 [14:12:40] !log jayme@cumin1002 START - Cookbook sre.ganeti.makevm for new host kubestagemaster2004.codfw.wmnet [14:12:41] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [14:13:13] alright… waiting for gerrit to come back before deploying the other config change, I guess [14:13:27] (right now CI is still up for me but I assume it’s not already back) [14:13:27] !log jayme@cumin1002 START - Cookbook sre.ganeti.makevm for new host kubestagemaster2005.codfw.wmnet [14:13:58] (03CR) 10CDanis: [C:03+1] external clouds: allow to get prefixes from RIPE (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956955 (https://phabricator.wikimedia.org/T303534) (owner: 10Volans) [14:15:04] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2004.codfw.wmnet - jayme@cumin1002" [14:15:05] !log CI - migration in progress - stopping jenkins and zuul (T334517) [14:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:16] T334517: upgrade contint servers to bullseye - https://phabricator.wikimedia.org/T334517 [14:15:53] !log jayme@cumin1002 END (PASS) - 
Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2004.codfw.wmnet - jayme@cumin1002" [14:15:54] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:15:54] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestagemaster2004.codfw.wmnet on all recursors [14:15:54] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [14:15:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster2004.codfw.wmnet on all recursors [14:16:53] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2004.codfw.wmnet - jayme@cumin1002" [14:16:59] (03PS2) 10Dzahn: switch contint.wikimedia.org from contint2002 to contint1002 [dns] - 10https://gerrit.wikimedia.org/r/1020951 (https://phabricator.wikimedia.org/T334517) [14:17:00] (03PS2) 10Hnowlan: Include mw-jobrunner port in host header check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025391 [14:17:38] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2004.codfw.wmnet - jayme@cumin1002" [14:17:48] (03PS10) 10TChin: Add datasets-config and datasets-config-next helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) [14:18:02] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster2004.codfw.wmnet with OS bullseye [14:18:13] (03CR) 10Dzahn: [C:03+1] switch contint.wikimedia.org from contint2002 to contint1002 [dns] - 10https://gerrit.wikimedia.org/r/1020951 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [14:18:16] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered 
by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2005.codfw.wmnet - jayme@cumin1002" [14:18:16] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9790432 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage... [14:18:19] (03CR) 10Dzahn: [V:03+2 C:03+2] switch contint.wikimedia.org from contint2002 to contint1002 [dns] - 10https://gerrit.wikimedia.org/r/1020951 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [14:19:09] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2005.codfw.wmnet - jayme@cumin1002" [14:19:09] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:19:10] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestagemaster2005.codfw.wmnet on all recursors [14:19:13] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster2005.codfw.wmnet on all recursors [14:19:42] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2005.codfw.wmnet - jayme@cumin1002" [14:22:16] (03CR) 10Dzahn: [C:03+1] ci: switch contint manager_host from 2002 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/1020954 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [14:22:28] (03CR) 10Dzahn: [C:03+2] ci: switch contint manager_host from 2002 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/1020954 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [14:22:33] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM 
kubestagemaster2005.codfw.wmnet - jayme@cumin1002" [14:23:54] (03CR) 10Dzahn: [C:03+2] ci: switch gearman_server IP from contint2002 to contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/1020955 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [14:25:00] Lucas_WMDE: eat all the appservers [14:25:01] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster2005.codfw.wmnet with OS bullseye [14:25:14] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 10vm-requests, and 2 others: Site: codfw 2 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T364740#9790469 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemas... [14:25:46] I’m excited for this chunked upload-by-URL feature to potentially unblock commons-on-k8s ^^ [14:25:56] (03CR) 10Ayounsi: [C:03+1] squid_exporter: Remove some outdated comments [puppet] - 10https://gerrit.wikimedia.org/r/1026910 (owner: 10Muehlenhoff) [14:25:58] So are we :D [14:26:06] yeah, it'll make life a lot easier [14:26:28] The votewiki work looks like it's progressing as well [14:26:34] nice [14:26:47] Good news is, I see the upload-by-url thing creating jobqueue events via testwiki, so the on-wiki part of things is working as expected afaict [14:27:06] Now does it actually upload the things :D [14:27:16] upload ALL the things [14:27:25] claime: a minor technical detail [14:27:56] if it doesn´t work, skill issue tbh [14:30:08] looks like CI is coming back [14:30:48] (03CR) 10Lucas Werkmeister (WMDE): "recheck, CI should be back" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025391 (owner: 10Hnowlan) [14:31:58] (03CR) 10Dzahn: [C:03+2] ci: switch source and destination server for data rsync [puppet] - 10https://gerrit.wikimedia.org/r/1020957 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [14:32:12] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: 
sync [14:32:21] Lucas_WMDE: :) yay [14:32:30] we are doing that right now, good to hear you see it [14:32:53] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2004.codfw.wmnet with reason: host reimage [14:32:59] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [14:33:02] alright, let’s try the logging fix I think [14:33:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025391 (owner: 10Hnowlan) [14:33:48] (03Merged) 10jenkins-bot: Include mw-jobrunner port in host header check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025391 (owner: 10Hnowlan) [14:34:06] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1025391|Include mw-jobrunner port in host header check]] [14:34:17] !log CI - switch over to other contint server finished - T334517 [14:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:22] T334517: upgrade contint servers to bullseye - https://phabricator.wikimedia.org/T334517 [14:34:47] mutante: I hope I wasn’t too hasty there ^^ [14:35:59] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2004.codfw.wmnet with reason: host reimage [14:36:31] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and hnowlan: Backport for [[gerrit:1025391|Include mw-jobrunner port in host header check]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:37:05] hnowlan: IIRC you said this one couldn’t be tested on WikimediaDebug? 
[14:37:18] (I guess mw-jobrunner always hits the non-debug hosts…) [14:37:31] yeah [14:37:39] alright [14:37:42] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and hnowlan: Continuing with sync [14:38:02] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:54] (03CR) 10Bartosz Dziewoński: "Scheduled for tomorrow: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1300" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030532 (https://phabricator.wikimedia.org/T357221) (owner: 10Bartosz Dziewoński) [14:38:57] (03CR) 10Bartosz Dziewoński: "Scheduled for tomorrow: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1300" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030535 (https://phabricator.wikimedia.org/T357221) (owner: 10Bartosz Dziewoński) [14:39:10] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2005.codfw.wmnet with reason: host reimage [14:39:17] (03CR) 10Bartosz Dziewoński: [C:03+1] "We ran out of time in the window. 
Rescheduled for tomorrow: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1300" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029187 (owner: 10Bartosz Dziewoński) [14:39:59] also, not seeing an upload at https://test.wikipedia.org/wiki/Special:ListFiles yet :/ [14:40:35] MatmaRex: I’m still around, if you’re back now we could still sync the wgCdnMaxAge change in a moment… [14:41:01] eh, let's do tomorrow [14:41:01] (we’re severely overrunning the window but I haven’t heard anyone complain that they’re waiting to deploy) [14:41:03] ok :) [14:41:06] heh [14:42:08] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2005.codfw.wmnet with reason: host reimage [14:43:22] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2026 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:43:22] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2018 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:43:46] PROBLEM - Check whether ferm is active by checking the default input chain on mw2267 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:45:22] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2015 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:46:29] not seeing any entries for RunSingleJob in xff.log \o/ [14:46:50] (03PS1) 10JMeybohm: Add stacked kubernetes masters to appropriate aliases [puppet] - 
10https://gerrit.wikimedia.org/r/1030995 (https://phabricator.wikimedia.org/T363307) [14:46:53] thanks Lucas_WMDE <3 [14:47:27] (03PS1) 10JMeybohm: Add kubestagemaster100[345] [puppet] - 10https://gerrit.wikimedia.org/r/1030996 (https://phabricator.wikimedia.org/T364746) [14:47:44] UploadFromURL is now running on jobqueue workers! I am getting automatically booted from uploads for whatever reason though [14:47:47] "A brief description of the abuse rule which your action matched is: 0 copyvios " [14:48:10] re xff.log: \o/ [14:48:19] re upload: hmph [14:48:31] 10ops-codfw, 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: Create (or teach Andrew how to create) private connections+dns entries for new cloudcontrols - https://phabricator.wikimedia.org/T364559#9790593 (10cmooney) 05Open→03Resolved p:05Triage→03Medium >>! In T364559#... [14:48:36] (03CR) 10JMeybohm: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1030957 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [14:48:45] shrug, it's still a good sign even if the copyvios thing is a mystery [14:48:51] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:49:08] (03PS2) 10JMeybohm: Add kubestagemaster2004 to the etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1030957 (https://phabricator.wikimedia.org/T363307) [14:49:18] (03PS2) 10Klausman: admin_ng: Add Cassandra ServiceEntry and VS for LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030993 (https://phabricator.wikimedia.org/T360428) [14:49:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster2004.codfw.wmnet with OS bullseye [14:49:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host 
kubestagemaster2004.codfw.wmnet [14:49:50] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9790605 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemast... [14:50:10] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1025391|Include mw-jobrunner port in host header check]] (duration: 16m 04s) [14:51:37] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 10vm-requests, 07Kubernetes: Site: codfw 2 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T364740#9790613 (10JMeybohm) kubestagemaster2004 is done (I messed up the phab ID in the cumin command, so report ended up in... [14:55:09] !log UTC afternoon backport+config window don [14:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:12] *done, dangit [14:55:16] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for English Wiktionary admins - https://phabricator.wikimedia.org/T364731#9790621 (10Bugreporter) [14:55:31] *the godfather theme starts playing* [14:55:48] thank you! [14:56:02] Lucas_WMDE: no issues with CI, right? [14:56:12] none that I’ve seen so far [14:56:26] great. happy to announce the active CI server is now eqiad and not buster anymore [14:56:29] hnowlan: apparently it happens if you upload a large file and have less than 10 edits https://test.wikipedia.org/wiki/Special:AbuseFilter/162 [14:57:00] https://test.wikipedia.org/wiki/Special:AbuseLog/102460 [14:57:40] (03PS2) 10Tchanders: IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017152 (https://phabricator.wikimedia.org/T361884) [14:59:08] a couple years back, when Wikipedia Zero was a thing, we had issues with people using us to host movies and stuff [14:59:56] that's an interesting fact! 
[15:00:08] ohhh that makes sense [15:00:39] (that is also why Phabricator has the super low file size limit for uploads) [15:00:40] is file_size in bytes? [15:01:34] yes [15:01:35] the file I'm uploading is 496KB - but either way, seems like good news to be hitting a rule like that [15:01:42] (03CR) 10AOkoth: prometheus: puppetise sql_exporter (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [15:02:15] I think the rule is somewhat reasonable, even test wikis need vandalism protection (even more so if they have fewer active patrollers) [15:02:17] beta has the same problem [15:02:36] so I wouldn’t want to disable the rule altogether… probably easiest for hnowlan to just make eight wikitext edits ^^ [15:03:02] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:14] yeah :D [15:03:51] mutante: I'm on my phone, so apologies if I've missed context - you asked if there's (no) issues with CI? It appears the (beta) `deployment-deploy03` Jenkins agent is having some trouble starting. Is that related or just beta being beta? 
[15:03:55] (03CR) 10Muehlenhoff: [C:03+1] Add kubestagemaster100[345] [puppet] - 10https://gerrit.wikimedia.org/r/1030996 (https://phabricator.wikimedia.org/T364746) (owner: 10JMeybohm) [15:04:32] (03CR) 10Muehlenhoff: [C:03+2] squid_exporter: Remove some outdated comments [puppet] - 10https://gerrit.wikimedia.org/r/1026910 (owner: 10Muehlenhoff) [15:12:18] (03PS1) 10Ilias Sarantopoulos: ml-services: test nllb image with torch221-rocm5.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031001 (https://phabricator.wikimedia.org/T362984) [15:13:22] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2026 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:13:22] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2018 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:13:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T352010)', diff saved to https://phabricator.wikimedia.org/P62366 and previous config saved to /var/cache/conftool/dbconfig/20240513-151325-ladsgroup.json [15:13:30] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:13:38] (03CR) 10Elukey: [C:03+1] ml-services: test nllb image with torch221-rocm5.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031001 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [15:13:46] RECOVERY - Check whether ferm is active by checking the default input chain on mw2267 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:13:56] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9790688 (10Jhancock.wm) @Papaul, This was the last screen I got. 
The servers all have the OS installed and it failed at the certificate stage. I think it's caus... [15:15:22] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2015 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:15:35] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: test nllb image with torch221-rocm5.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031001 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [15:16:36] (03Merged) 10jenkins-bot: ml-services: test nllb image with torch221-rocm5.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031001 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [15:17:03] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9790699 (10MoritzMuehlenhoff) All insetup roles default to Puppet 7 these days (as does the kafka-main roler itself), so these should be installed with Puppet 7. [15:17:55] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9790707 (10MoritzMuehlenhoff) I think the reason the installation failed is because there is no entry in site.pp yet. 
[15:18:30] !log brouberol@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: JVM restart - brouberol@cumin2002 - T363975 [15:19:16] (03PS2) 10Muehlenhoff: Failover IDP [dns] - 10https://gerrit.wikimedia.org/r/1030959 [15:19:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance [15:19:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance [15:21:32] I'm getting an error I haven't seen before between git/ssh and gerrit, "Bad server host key: Invalid key length". I'm not sure what to make of it? [15:22:27] (03CR) 10Muehlenhoff: [C:03+2] Failover IDP [dns] - 10https://gerrit.wikimedia.org/r/1030959 (owner: 10Muehlenhoff) [15:22:45] hmm… mutante: ^ the Gerrit SSH server shouldn’t have been affected by your CI work, right? [15:24:28] (03CR) 10Elukey: "Have you tried with protocol TLS in the Service Entry? We may be able to use SNI and avoid harcoding the IPs, but never used it." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030993 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:24:34] 06SRE, 06Infrastructure-Foundations: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9790738 (10Volans) @Jclark-ctr is the main purpose of this gather debug information on the host? If that's the case the simplest solution is to write a cookbook that gathers all that info for... [15:25:26] !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-codfw [15:27:12] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[15:27:23] Jeff_Green: https://phabricator.wikimedia.org/T364217 [15:27:59] paladox: ah cool, I went looking for a task but didn't manage to find this one [15:28:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P62367 and previous config saved to /var/cache/conftool/dbconfig/20240513-152833-ladsgroup.json [15:30:00] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for Cassandra services [puppet] - 10https://gerrit.wikimedia.org/r/1026940 (owner: 10Muehlenhoff) [15:30:05] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240513T1530). [15:30:19] paladox: adding "RequiredRSASize 1024" to a host definition in .ssh/config worked. thanks! [15:30:31] yw [15:30:44] (03PS3) 10Klausman: admin_ng: Add Cassandra ServiceEntry and VS for LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030993 (https://phabricator.wikimedia.org/T360428) [15:32:20] (03PS4) 10Klausman: admin_ng: Add Cassandra ServiceEntry and VS for LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030993 (https://phabricator.wikimedia.org/T360428) [15:32:54] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-04-17-163312 to 2024-05-13-145903 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031004 (https://phabricator.wikimedia.org/T282716) [15:33:02] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-04-18-150843 to 2024-05-13-145650 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031005 (https://phabricator.wikimedia.org/T282716) [15:33:04] (03CR) 10Klausman: "As discussion on IRC, switched protocol to TLS, IPs to hostnames and `resolution` to `DNS`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030993 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:33:41] (03PS6) 10Dreamy Jazz: Remove 
old CampaignEvents DB config (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014625 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [15:33:45] (03PS4) 10Dreamy Jazz: Remove old CampaignEvents DB config (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014626 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [15:34:14] (03PS5) 10Klausman: admin_ng: Add Cassandra ServiceEntry and VS for LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030993 (https://phabricator.wikimedia.org/T360428) [15:34:32] (03Abandoned) 10Kimberly Sarabia: Deploy disabled limited width on main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030200 (https://phabricator.wikimedia.org/T357706) (owner: 10Kimberly Sarabia) [15:34:50] (03CR) 10Elukey: [C:03+1] admin_ng: Add Cassandra ServiceEntry and VS for LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030993 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:38:03] (03CR) 10Klausman: [C:03+2] admin_ng: Add Cassandra ServiceEntry and VS for LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030993 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:38:11] (03PS2) 10JMeybohm: Fix all-etcd, wikikube-master and wikikube-etcd aliases [puppet] - 10https://gerrit.wikimedia.org/r/1030995 (https://phabricator.wikimedia.org/T363307) [15:40:26] (03Merged) 10jenkins-bot: admin_ng: Add Cassandra ServiceEntry and VS for LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030993 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:42:13] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031006 (https://phabricator.wikimedia.org/T128546) [15:43:27] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031006 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) 
[15:43:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P62368 and previous config saved to /var/cache/conftool/dbconfig/20240513-154341-ladsgroup.json [15:45:00] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031006 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:49:01] !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-codfw [15:55:08] 10ops-magru: magru: add PDUs to Netbox - https://phabricator.wikimedia.org/T364628#9790956 (10wiki_willy) a:03RobH [15:56:14] 10ops-eqiad, 06SRE: eqiad: magru transport down - https://phabricator.wikimedia.org/T363117#9790957 (10ayounsi) 05Open→03Resolved All good now. [15:58:00] (03PS3) 10JMeybohm: Fix all-etcd, wikikube-master and wikikube-etcd aliases [puppet] - 10https://gerrit.wikimedia.org/r/1030995 (https://phabricator.wikimedia.org/T363307) [15:58:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T352010)', diff saved to https://phabricator.wikimedia.org/P62369 and previous config saved to /var/cache/conftool/dbconfig/20240513-155849-ladsgroup.json [15:58:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:58:53] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:59:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:59:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T352010)', diff saved to https://phabricator.wikimedia.org/P62370 and previous config saved to /var/cache/conftool/dbconfig/20240513-155911-ladsgroup.json [15:59:26] PROBLEM - prometheus-codfw.wikimedia.org tls expiry on 
prometheus2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:59:36] PROBLEM - SSH on prometheus2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:59:49] (03CR) 10Volans: [C:04-1] Fix all-etcd, wikikube-master and wikikube-etcd aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1030995 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [16:00:16] PROBLEM - prometheus-codfw.wikimedia.org requires authentication on prometheus2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:00:34] (03PS4) 10JMeybohm: Fix all-etcd, wikikube-master and wikikube-etcd aliases [puppet] - 10https://gerrit.wikimedia.org/r/1030995 (https://phabricator.wikimedia.org/T363307) [16:00:37] (03CR) 10Andrew Bogott: [C:03+2] puppetserver-deploy-code.sh: use 'gitpuppet' user to check current branch [puppet] - 10https://gerrit.wikimedia.org/r/1030962 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [16:01:06] RECOVERY - prometheus-codfw.wikimedia.org requires authentication on prometheus2005 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:01:18] RECOVERY - prometheus-codfw.wikimedia.org tls expiry on prometheus2005 is OK: OK - Certificate prometheus.discovery.wmnet will expire on Sun 02 Jun 2024 06:40:00 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:01:26] RECOVERY - SSH on prometheus2005 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:02:37] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1031006| Bumping portals to master (T128546)]] (duration: 14m 23s) [16:02:41] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:09:37] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028927 (owner: 10PipelineBot) [16:09:57] (03CR) 10SBassett: [C:03+2] Implement security.txt standard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010971 (https://phabricator.wikimedia.org/T337949) (owner: 10Mmartorana) [16:10:37] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1030995 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [16:10:59] (03Merged) 10jenkins-bot: Implement security.txt standard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010971 (https://phabricator.wikimedia.org/T337949) (owner: 10Mmartorana) [16:14:35] (03CR) 10JMeybohm: Fix all-etcd, wikikube-master and wikikube-etcd aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1030995 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [16:16:25] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1031006| Bumping portals to master (T128546)]] (duration: 13m 47s) [16:16:29] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:21:25] FIRING: SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic2083:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:22:38] (03CR) 10Jdlrobson: "Kim: You need to backport some kind of change - as since the config exists it will completely override everything inside Vector." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030200 (https://phabricator.wikimedia.org/T357706) (owner: 10Kimberly Sarabia) [16:23:29] (03Abandoned) 10Andrew Bogott: ensure_canary: 0-pad the instance counter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/868803 (owner: 10Andrew Bogott) [16:28:42] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 10vm-requests, 07Kubernetes: Site: codfw 2 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T364740#9791122 (10JMeybohm) kubestagemaster2005 got stuck at: ` [18/60, retrying in 540.00s] Attempt to run 'spicerack.puppe... [16:31:54] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 10vm-requests, 07Kubernetes: Site: codfw 2 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T364740#9791138 (10Dzahn) @JMeybohm I noticed I can't manually run puppet agent on this host. It says I don't have the sudo pr... [16:34:18] !log brouberol@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: JVM restart - brouberol@cumin2002 - T363975 [16:34:36] 06SRE, 10SRE-Access-Requests: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9791145 (10CDanis) a:03Tobi_WMDE_SW Tobi please approve and reassign to me, thanks! [16:38:58] (03PS1) 10Ladsgroup: Enable section-wide circuit breaking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) [16:45:49] (03CR) 10Scott French: [C:03+1] "LGTM. 
I started something similar the patch series starting at [0] (the more interesting part is in the second patch - this only overlaps " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030938 (https://phabricator.wikimedia.org/T362786) (owner: 10Ladsgroup) [16:46:02] !log jayme@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kubestagemaster2005.codfw.wmnet with OS bullseye [16:46:02] !log jayme@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host kubestagemaster2005.codfw.wmnet [16:46:11] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 10vm-requests, 07Kubernetes: Site: codfw 2 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T364740#9791223 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaste... [16:46:32] !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2004.codfw.wmnet to plain [16:47:18] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2004.codfw.wmnet to plain [16:47:31] !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2005.codfw.wmnet to plain [16:48:13] (03CR) 10Ladsgroup: [C:04-1] "Needs some tweaking." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup) [16:49:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2005.codfw.wmnet to plain [16:50:25] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster2005.codfw.wmnet with OS bullseye [16:50:33] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 10vm-requests, 07Kubernetes: Site: codfw 2 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T364740#9791229 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagem... [16:51:17] (03PS2) 10Ladsgroup: Enable section-wide circuit breaking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) [16:51:25] RESOLVED: SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic2083:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:58:40] (03PS1) 10Ebernhardson: cirrus: Deploy updater to eqiad at 25% load [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031027 (https://phabricator.wikimedia.org/T363475) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240513T1700) [17:00:04] ryankemper: Time to do the Wikidata Query Service weekly deploy deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240513T1700). 
[17:02:24] (03PS1) 10Hnowlan: Enable async jobqueue-powered URL uploads on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031028 (https://phabricator.wikimedia.org/T295007) [17:02:35] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2005.codfw.wmnet with reason: host reimage [17:05:50] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2005.codfw.wmnet with reason: host reimage [17:06:40] (03CR) 10Ebernhardson: [C:03+2] cirrus: Deploy updater to eqiad at 25% load [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031027 (https://phabricator.wikimedia.org/T363475) (owner: 10Ebernhardson) [17:06:51] (03PS1) 10Ebernhardson: cirrus: Shift 25% of public wikis writes in eqiad to replacement updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031029 (https://phabricator.wikimedia.org/T363475) [17:07:31] (03CR) 10CI reject: [V:04-1] cirrus: Shift 25% of public wikis writes in eqiad to replacement updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031029 (https://phabricator.wikimedia.org/T363475) (owner: 10Ebernhardson) [17:07:35] (03Merged) 10jenkins-bot: cirrus: Deploy updater to eqiad at 25% load [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031027 (https://phabricator.wikimedia.org/T363475) (owner: 10Ebernhardson) [17:13:02] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:20:33] (03CR) 10Scott French: [C:03+1] "Thanks, Janis!" 
[software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/1029205 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [17:21:30] (03PS1) 10Cathal Mooney: Add includes for netbox-generated PTRs for new spine-core links [dns] - 10https://gerrit.wikimedia.org/r/1031031 (https://phabricator.wikimedia.org/T364095) [17:22:15] (03CR) 10CI reject: [V:04-1] Add includes for netbox-generated PTRs for new spine-core links [dns] - 10https://gerrit.wikimedia.org/r/1031031 (https://phabricator.wikimedia.org/T364095) (owner: 10Cathal Mooney) [17:26:30] (03CR) 10Alexandros Kosiaris: [C:03+1] Enable async jobqueue-powered URL uploads on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031028 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan) [17:27:00] !log ryankemper@cumin2002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad [17:27:52] !log T363973 [Kafka] Restarting `jumbo-eqiad` brokers, followed by mirror maker [17:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:18] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9791374 (10thcipriani) >>! In T364414#9778550, @Dzahn wrote: > @thcipriani please consider for approval (https://wikimedia.namely.com/people/eaebb898-01ba-404e-8... 
[17:29:15] (03CR) 10Thcipriani: [C:03+1] admin: add Grace Choi to deployers [puppet] - 10https://gerrit.wikimedia.org/r/1030291 (https://phabricator.wikimedia.org/T364414) (owner: 10Dzahn) [17:34:07] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9791394 (10Dzahn) a:05thcipriani→03None [17:36:39] (03CR) 10Ssingh: [C:03+1] Add includes for netbox-generated PTRs for new spine-core links [dns] - 10https://gerrit.wikimedia.org/r/1031031 (https://phabricator.wikimedia.org/T364095) (owner: 10Cathal Mooney) [17:37:33] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:38:40] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:38:46] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:39:14] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add records for new linknets on codfw spines - cmooney@cumin1002" [17:40:14] (03CR) 10Cathal Mooney: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1031031 (https://phabricator.wikimedia.org/T364095) (owner: 10Cathal Mooney) [17:40:36] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add records for new linknets on codfw spines - cmooney@cumin1002" [17:40:36] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:41:09] (03CR) 10Cathal Mooney: [C:03+2] Add includes for netbox-generated PTRs for new spine-core links [dns] - 10https://gerrit.wikimedia.org/r/1031031 (https://phabricator.wikimedia.org/T364095) (owner: 10Cathal Mooney) [17:41:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - 
https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [17:44:41] (03PS1) 10Scott French: conftool: prepare dbctl json schemas for parser cahce [puppet] - 10https://gerrit.wikimedia.org/r/1031032 (https://phabricator.wikimedia.org/T362786) [17:44:44] (03PS1) 10Scott French: conftool-data: bootstrap parser-cache sections and instances [puppet] - 10https://gerrit.wikimedia.org/r/1031033 (https://phabricator.wikimedia.org/T362786) [17:46:05] (03PS1) 10Scott French: Import dbconfig and instance schema from puppet [software/conftool] - 10https://gerrit.wikimedia.org/r/1031034 (https://phabricator.wikimedia.org/T362786) [17:46:33] (03CR) 10CDanis: [C:03+1] Import dbconfig and instance schema from puppet [software/conftool] - 10https://gerrit.wikimedia.org/r/1031034 (https://phabricator.wikimedia.org/T362786) (owner: 10Scott French) [17:46:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [17:48:20] PROBLEM - Host ps1-c6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:49:20] (03Restored) 10Kimberly Sarabia: Deploy disabled limited width on main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030200 (https://phabricator.wikimedia.org/T357706) (owner: 10Kimberly Sarabia) [17:49:47] (03PS2) 10Kimberly Sarabia: Deploy disabled limited width on main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030200 
(https://phabricator.wikimedia.org/T357706) [17:52:58] 10ops-eqiad, 06SRE, 10decommission-hardware, 10Data-Platform-SRE (2024.05.06 - 2024.05.26): decommission snapshot1009.eqiad.wmnet - https://phabricator.wikimedia.org/T364456#9791496 (10VRiley-WMF) a:03VRiley-WMF [17:53:13] 10ops-eqiad, 06SRE, 10decommission-hardware, 10Data-Platform-SRE (2024.05.06 - 2024.05.26): decommission snapshot1009.eqiad.wmnet - https://phabricator.wikimedia.org/T364456#9791498 (10VRiley-WMF) 05Open→03In progress [17:59:55] (03CR) 10Scott French: [C:03+2] blubberoid: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028911 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [18:00:48] (03Merged) 10jenkins-bot: blubberoid: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028911 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [18:01:39] (03PS2) 10Msz2001: Set $wgSignatureValidation to 'disallow' on Polish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030978 (https://phabricator.wikimedia.org/T364769) [18:03:34] 10ops-eqiad, 06SRE, 10decommission-hardware, 10Data-Platform-SRE (2024.05.06 - 2024.05.26): decommission snapshot1009.eqiad.wmnet - https://phabricator.wikimedia.org/T364456#9791544 (10VRiley-WMF) 05In progress→03Resolved [18:03:45] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [18:04:10] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [18:04:19] (03CR) 10Dzahn: [C:03+2] admin: add Grace Choi to deployers [puppet] - 10https://gerrit.wikimedia.org/r/1030291 (https://phabricator.wikimedia.org/T364414) (owner: 10Dzahn) [18:05:54] (03PS1) 10Andrew Bogott: Openstack: remove obsolete files/templates/manifests for version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/1031036 [18:07:45] !log swfrench@deploy1002 helmfile [codfw] START 
helmfile.d/services/blubberoid: apply [18:08:29] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply [18:09:43] (03CR) 10Dzahn: [C:03+2] "[deploy1002:~] $ id ecarg" [puppet] - 10https://gerrit.wikimedia.org/r/1030291 (https://phabricator.wikimedia.org/T364414) (owner: 10Dzahn) [18:11:30] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9791598 (10Dzahn) @ecarg Your user is now in the deployment group on the deployment server. Give it about 30 minutes and you should have all the access needed fo... [18:11:53] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9791604 (10Dzahn) [18:12:46] (03PS3) 10Kimberly Sarabia: Deploy disabled limited width on main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030200 (https://phabricator.wikimedia.org/T357706) [18:14:14] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9791615 (10Dzahn) 05In progress→03Resolved a:03Dzahn [18:14:20] (03CR) 10Dzahn: [C:03+1] ci: Enable profile::auto_restarts::service for docker/containerd [puppet] - 10https://gerrit.wikimedia.org/r/1028795 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [18:16:11] (03CR) 10Andrew Bogott: [C:03+2] Openstack: remove obsolete files/templates/manifests for version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/1031036 (owner: 10Andrew Bogott) [18:17:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy 
[18:19:10] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply [18:20:08] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply [18:21:05] (03CR) 10CDanis: [C:03+2] Import dbconfig and instance schema from puppet [software/conftool] - 10https://gerrit.wikimedia.org/r/1031034 (https://phabricator.wikimedia.org/T362786) (owner: 10Scott French) [18:21:30] (03PS1) 10Stoyofuku-wmf: Exclude client errors with undefined stack trace or file url [puppet] - 10https://gerrit.wikimedia.org/r/1031039 (https://phabricator.wikimedia.org/T364517) [18:21:54] (03PS4) 10Dzahn: ci: avoid hardcoded IP in Hiera, lookup contint.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) [18:22:13] (03CR) 10CI reject: [V:04-1] ci: avoid hardcoded IP in Hiera, lookup contint.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [18:22:18] (03CR) 10Dzahn: "Yea! Amended to do this!" [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [18:23:05] (03PS5) 10Dzahn: ci: avoid hardcoded IP in Hiera, lookup contint.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) [18:23:17] (03CR) 10Stoyofuku-wmf: "Please let me know if there's any other information I can provide to be helpful here!" [puppet] - 10https://gerrit.wikimedia.org/r/1031039 (https://phabricator.wikimedia.org/T364517) (owner: 10Stoyofuku-wmf) [18:24:22] (03Merged) 10jenkins-bot: Import dbconfig and instance schema from puppet [software/conftool] - 10https://gerrit.wikimedia.org/r/1031034 (https://phabricator.wikimedia.org/T362786) (owner: 10Scott French) [18:24:28] (03CR) 10Scott French: "Thanks for the review, Chris." 
[puppet] - 10https://gerrit.wikimedia.org/r/1031032 (https://phabricator.wikimedia.org/T362786) (owner: 10Scott French) [18:24:44] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-jumbo-eqiad [18:24:57] !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-eqiad [18:26:24] PROBLEM - MariaDB Replica Lag: pc4 on pc2016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 443.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:26:38] PROBLEM - MariaDB Replica Lag: pc3 on pc2013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 420.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:26:42] (03CR) 10CDanis: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1031032 (https://phabricator.wikimedia.org/T362786) (owner: 10Scott French) [18:27:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 32.14% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:27:46] FIRING: Primary outbound port utilisation over 80% #page: Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [18:27:55] here it comes [18:28:01] !incidents [18:28:02] 4676 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (asw2-c-eqiad.mgmt.eqiad.wmnet) [18:28:02] 4673 (RESOLVED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad) [18:28:02] 4674 (RESOLVED) ProbeDown sre (10.2.2.1 ip4 appservers-https:443 probes/service http_appservers-https_ip4 eqiad) [18:28:02] 4675 (RESOLVED) [3x] ATSBackendErrorsHigh cache_text sre (appservers-ro.discovery.wmnet) [18:28:02] 4672 (RESOLVED) [2x] ProbeDown sre (phab1004:443 
probes/custom eqiad) [18:28:03] 4671 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [18:28:03] 4670 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_collab_ip4 eqiad) [18:28:07] !ack 4676 [18:28:07] 4676 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (asw2-c-eqiad.mgmt.eqiad.wmnet) [18:28:11] here [18:28:38] it's just the management switch? [18:28:46] FIRING: [2x] Primary inbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [18:28:47] mutante: no, librenms polls network devices over the management interface [18:29:07] I see, ack [18:29:09] https://librenms.wikimedia.org/alerts [18:29:24] RECOVERY - MariaDB Replica Lag: pc4 on pc2016 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:29:40] RECOVERY - MariaDB Replica Lag: pc3 on pc2013 is OK: OK slave_sql_lag Replication lag: 0.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:29:55] looking [18:30:38] !log ryankemper@cumin2002 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. 
[18:31:03] (03CR) 10Scott French: [C:03+2] conftool: prepare dbctl json schemas for parser cahce [puppet] - 10https://gerrit.wikimedia.org/r/1031032 (https://phabricator.wikimedia.org/T362786) (owner: 10Scott French) [18:32:01] weird the timing doesn't match [18:32:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 33.71% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:32:46] FIRING: [2x] Primary outbound port utilisation over 80% #page: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [18:33:10] (03PS2) 10Scott French: conftool: prepare dbctl json schemas for parser cache [puppet] - 10https://gerrit.wikimedia.org/r/1031032 (https://phabricator.wikimedia.org/T362786) [18:33:10] (03PS2) 10Scott French: conftool-data: bootstrap parser-cache sections and instances [puppet] - 10https://gerrit.wikimedia.org/r/1031033 (https://phabricator.wikimedia.org/T362786) [18:33:19] ok it does, I see it now [18:33:36] spike in appserver RED dashboard that has already peaked though [18:33:45] I think those are unrelated mutante [18:33:46] RESOLVED: [2x] Primary inbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [18:33:55] it looks like the ports that are maxing out are the ports between the access switches and the CRs in eqiad [18:35:10] !incidents [18:35:10] 4676 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (asw2-c-eqiad.mgmt.eqiad.wmnet) [18:35:11] 4677 (RESOLVED) [2x] Primary inbound port utilisation over 80% (paged) global noc () [18:35:11] 4673 (RESOLVED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad) [18:35:11] 4674 
(RESOLVED) ProbeDown sre (10.2.2.1 ip4 appservers-https:443 probes/service http_appservers-https_ip4 eqiad) [18:35:11] 4675 (RESOLVED) [3x] ATSBackendErrorsHigh cache_text sre (appservers-ro.discovery.wmnet) [18:35:11] 4672 (RESOLVED) [2x] ProbeDown sre (phab1004:443 probes/custom eqiad) [18:35:12] 4671 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [18:35:12] 4670 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_collab_ip4 eqiad) [18:35:29] cdanis: as in, this was driven by internal traffic? [18:35:50] bblack: I don't know what the php-fpm alerts were driven by [18:36:01] but for the port utilization messages, I think possibly [18:37:46] RESOLVED: Primary outbound port utilisation over 80% #page: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [18:37:55] well that's good but I still can't correlate the timings [18:38:07] for the increase in requests, yes. 
for the port utilization no [18:38:59] there was also another IRC alert that only shows up in #traffic (because it's false-alarm-prone) [18:39:02] 18:27 < jinxer-wm> FIRING: LVSHighRX: Excessive RX traffic on lvs6001:9100 (enp175s0f0np0) [18:39:29] so this may have been somehow driven by external->drmrs->eqiad [18:39:34] it wasn't [18:39:55] that timing does at least match in wmf_netflow: https://w.wiki/A4su [18:39:59] or at least -- not in a way that caused that high inbound bps to translate into higher than normal bps over the eqiad<>drmrs link [18:40:02] 185.15.58.224 is text-lb drmrs [18:40:13] sukhe: yeah but you'll need to use internal netflow to see the internal traffic [18:40:31] cdanis: yeah I just can't seem to pinpoint that though [18:41:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:41:58] sukhe: bblack: I think perhaps this https://w.wiki/A4sy [18:42:30] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.57% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:42:31] cdanis: I think I am just too tired but timing doesn't match? [18:42:47] sukhe: it will never match exactly, librenms polls on a five minute cycle [18:43:30] 50010 [18:43:32] weird [18:44:20] 06SRE, 06Infrastructure-Foundations: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9791713 (10Jclark-ctr) @Volans The main purpose is for gathering debug information I would prefer to grep mesg /log files instead of searching throughout entire output. Mdadm commands wou... 
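(editor's note) The 18:42:47 remark that librenms polls on a five minute cycle is why alert and graph timestamps need not line up exactly. A minimal sketch of that reasoning follows; it assumes, as a simplification, that an alert raised at a given poll can reflect anything that started during the preceding poll interval. The function names and the example event time are hypothetical and not part of LibreNMS.

```python
from datetime import datetime, timedelta

# Assumption taken from the chat: the poller runs on a five minute cycle.
POLL_INTERVAL = timedelta(minutes=5)


def earliest_possible_start(alert_time: datetime,
                            poll_interval: timedelta = POLL_INTERVAL) -> datetime:
    """Earliest moment an event could have begun and first be seen at alert_time."""
    return alert_time - poll_interval


def could_match(event_time: datetime, alert_time: datetime,
                poll_interval: timedelta = POLL_INTERVAL) -> bool:
    """True if event_time falls inside the polling window that ends at alert_time."""
    return earliest_possible_start(alert_time, poll_interval) <= event_time <= alert_time


# Alert timestamp from the log; the event times are made up for illustration.
alert = datetime(2024, 5, 13, 18, 27, 46)
inside_window = datetime(2024, 5, 13, 18, 24, 0)
outside_window = datetime(2024, 5, 13, 18, 20, 0)

print(could_match(inside_window, alert))
print(could_match(outside_window, alert))
```

A match here only means the timing is not ruled out by the polling cycle; it says nothing about cause.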
[18:44:35] yeah I don't know what that is, but, 127e9bytes/minute (per turnilo) is 17Gbit/sec [18:44:50] oh [18:44:54] k8s hadoop [18:45:01] port 50010 shows up as datanode-data in hadoop workers [18:45:04] bblack@memex:~/repos/puppet$ git grep 50010 [18:45:04] modules/profile/manifests/kubernetes/deployment_server/global_config.pp: 'port' => 50010, [18:45:15] ok :) [18:45:33] does someone know if network_flows_internal has some lag too? [18:45:35] maybe a really big hadoop query output or something [18:46:13] oh [18:46:15] yeah [18:46:18] ignore everything I said [18:46:28] the data I was looking at was 4 hours old [18:46:34] and I was bamboozled because that is also my UTC offset [18:46:40] lol [18:46:43] there was this: START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. [18:46:57] cdanis: it's ok but I was quite confused :] [18:47:06] and I see " Much of the data here is imported into Hadoop using Gobblin." [18:47:13] sukhe: yeah usually i know better to check https://i.imgur.com/ivfikvS.png [18:47:26] cdanis: have you used network_flows_internal? it doesn't seem to be showing the data for me regardless of time [18:47:35] https://w.wiki/A4sx [18:47:57] I haven't used it extensively, XioNoX is the expert on it [18:47:58] well it doesn't go far enough in time yet, I think [18:48:06] (03CR) 10Scott French: [V:03+2 C:03+2] conftool: prepare dbctl json schemas for parser cache [puppet] - 10https://gerrit.wikimedia.org/r/1031032 (https://phabricator.wikimedia.org/T362786) (owner: 10Scott French) [18:48:11] yeah [18:48:40] (03PS1) 10Jdlrobson: Unbreak link buttons [extensions/GuidedTour] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030983 (https://phabricator.wikimedia.org/T364062) [18:48:57] can the SAL entry be related? 
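(editor's note) The 18:44:35 conversion of 127e9 bytes/minute to roughly 17 Gbit/sec can be checked with a few lines. This is only a sketch of the arithmetic, assuming decimal units (1 Gbit = 1e9 bits); the function name is hypothetical.

```python
def bytes_per_minute_to_gbit_per_sec(bytes_per_minute: float) -> float:
    """Convert a byte count per minute into gigabits per second (decimal units)."""
    bits_per_second = bytes_per_minute * 8 / 60  # 8 bits per byte, 60 seconds per minute
    return bits_per_second / 1e9


# Figure quoted in the chat (per turnilo).
rate = bytes_per_minute_to_gbit_per_sec(127e9)
print(f"{rate:.1f} Gbit/s")
```

The result is about 16.9 Gbit/s, consistent with the rounded 17 Gbit/sec quoted in the channel.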
[18:49:08] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. [18:49:12] at least kafka and hadoop are mentioned on the same wikitech page [18:49:27] and the restarts there [18:49:27] okay, I *do* think this was analytics traffic though, see https://grafana.wikimedia.org/d/pXnJdJ17k/all-clusters-network-traffic-traffic?orgId=1&var-site=All&var-datasource=thanos&var-cluster=All&viewPanel=82 [18:49:39] !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-eqiad [18:50:13] (03CR) 10Pppery: "I don't particularly see why this needs to be backported - it's been broken for over a month already and thus can survive another 3 days o" [extensions/GuidedTour] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030983 (https://phabricator.wikimedia.org/T364062) (owner: 10Jdlrobson) [18:50:20] (03PS1) 10Jdlrobson: Add notheme class to Echo [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030984 (https://phabricator.wikimedia.org/T363779) [18:50:21] (03PS1) 10Ahmon Dancy: Configure Docker builder GC settings for CI [puppet] - 10https://gerrit.wikimedia.org/r/1031045 (https://phabricator.wikimedia.org/T364773) [18:50:30] mutante: yeah and one before that as well [18:51:36] to make things worse, there is a general rate of increase in requests during the same time period as well [18:51:58] which explains the lvs6001 alert at least [18:52:31] cdanis: ok thanks, digging down to see what it can be in analytics [18:52:37] I mean both of these resolved but yeah [18:53:02] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:54:05] (03CR) 10Jdlrobson: Exclude client errors with undefined stack trace or file url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031039 (https://phabricator.wikimedia.org/T364517) (owner: 10Stoyofuku-wmf) [18:54:22] (03PS1) 10BCornwall: hieradata/common: Move shared_acme_certificates to its own file [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) [18:55:31] (03CR) 10Stoyofuku-wmf: Exclude client errors with undefined stack trace or file url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031039 (https://phabricator.wikimedia.org/T364517) (owner: 10Stoyofuku-wmf) [18:56:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.86% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:57:25] (03PS1) 10Jdlrobson: Phase 5: Vector-2022.js should no longer load legacy Vector code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031047 (https://phabricator.wikimedia.org/T301212) [19:00:22] (03CR) 10CDanis: [C:03+2] Parse OTel service names from what's available. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030290 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [19:01:12] (03Merged) 10jenkins-bot: Parse OTel service names from what's available. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1030290 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [19:05:18] the last run of "gobblin" also doesn't really line up with the alert [19:06:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:09:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:12:10] (03CR) 10CI reject: [V:04-1] Add notheme class to Echo [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030984 (https://phabricator.wikimedia.org/T363779) (owner: 10Jdlrobson) [19:14:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:19:02] (03CR) 10Cwhite: [C:03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1031039 (https://phabricator.wikimedia.org/T364517) (owner: 10Stoyofuku-wmf) [19:20:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 34.29% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:20:23] :] [19:23:20] (03PS1) 10Herron: pyrra-filesystem: increase StartLimits and delay notified unit [puppet] - 10https://gerrit.wikimedia.org/r/1031050 (https://phabricator.wikimedia.org/T364645) [19:23:52] (03PS1) 10CDanis: otelcol: re-add mistakenly removed default processors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031051 (https://phabricator.wikimedia.org/T363407) [19:24:21] (03CR) 10CDanis: [C:03+2] otelcol: re-add mistakenly removed default processors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031051 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [19:25:05] (03Merged) 10jenkins-bot: otelcol: re-add mistakenly removed default processors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031051 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [19:25:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 34.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:25:44] RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:25:48] huh [19:26:05] !log cdanis@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [19:26:14] !log cdanis@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [19:27:23] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply [19:30:12] (03PS2) 10BCornwall: hieradata/common: Move 
shared_acme_certificates to its own file [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) [19:33:12] (03CR) 10CI reject: [V:04-1] hieradata/common: Move shared_acme_certificates to its own file [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall) [19:36:51] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9791958 (10Eevans) The failed device (`sdd`) was replaced; This time we're using `sfdisk` to copy the partition table. The first run complained of a 'ddf_raid_member' signature remaining on the device, and... [19:40:24] (03PS2) 10Ebernhardson: cirrus: Shift 25% of public wikis writes in eqiad to replacement updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031029 (https://phabricator.wikimedia.org/T363475) [19:41:02] (03CR) 10CI reject: [V:04-1] cirrus: Shift 25% of public wikis writes in eqiad to replacement updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031029 (https://phabricator.wikimedia.org/T363475) (owner: 10Ebernhardson) [19:42:22] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:42:26] (03PS3) 10Ebernhardson: cirrus: Shift 25% of public wikis writes in eqiad to replacement updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031029 (https://phabricator.wikimedia.org/T363475) [19:43:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 32.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy 
[19:47:36] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply [19:48:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 40% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:51:46] (03CR) 10Ladsgroup: "Which one do you want us to deploy for the first patch? Mine or yours?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030938 (https://phabricator.wikimedia.org/T362786) (owner: 10Ladsgroup) [19:52:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:52:18] (03PS3) 10BCornwall: hieradata/common: Move shared_acme_certificates to its own file [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) [19:53:00] (03CR) 10Ladsgroup: "this is way too complicated. Let's do a simple "switch everything in one patch" and we can simply pull it in mwdebug and check things inst" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030440 (owner: 10Scott French) [19:53:14] (03CR) 10Ladsgroup: "Yes. I'm unhinged." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030440 (owner: 10Scott French) [19:53:43] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2418/console" [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall) [19:55:16] (03CR) 10CI reject: [V:04-1] hieradata/common: Move shared_acme_certificates to its own file [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall) [19:56:59] (03PS4) 10BCornwall: hieradata/common: Move shared_acme_certificates to its own file [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) [19:57:15] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply [19:57:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.36% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:58:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:58:34] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2419/console" [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall) [19:59:56] (03CR) 10CI reject: [V:04-1] 
hieradata/common: Move shared_acme_certificates to its own file [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240513T2000). [20:00:05] Daimona, Tchanders, and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:09] \o [20:00:11] o/ [20:01:27] o/ [20:02:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.36% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:02:49] \o/ [20:03:11] Is anyone else able to deploy? My ssh key is being rejected... 
[20:03:29] (03PS5) 10BCornwall: hieradata/common: Move shared_acme_certificates to its own file [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) [20:05:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:05:33] not a deployer :/ [20:06:28] (03CR) 10CI reject: [V:04-1] hieradata/common: Move shared_acme_certificates to its own file [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall) [20:08:08] i can deploy i suppose [20:09:29] (03CR) 10Ebernhardson: [C:03+2] "backport window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014625 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [20:10:08] (03Merged) 10jenkins-bot: Remove old CampaignEvents DB config (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014625 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [20:10:09] ebernhardson: Ah thanks. Though if it's a pain I can move mine to a future window, it's not urgent [20:11:07] (03CR) 10Jdlrobson: Exclude client errors with undefined stack trace or file url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031039 (https://phabricator.wikimedia.org/T364517) (owner: 10Stoyofuku-wmf) [20:11:24] thanks ebernhardson [20:11:41] hmm, weird. When attempting to backport 1014626 it says: 20:09:58 Related change 1014625 found for 1014626 [20:11:42] 20:09:58 Change '1014595', project 'mediawiki/extensions/CampaignEvents', branch 'master' not found in any deployed wikiversion. Deployed wikiversions: ['1.43.0-wmf.4'] [20:11:54] Daimona: any idea what those related are? 
[20:12:22] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:12:49] ugh [20:13:16] 1014625 is because i +2'd it first, but dunno why it's asking about 1014595 [20:13:23] let me take a look, and thanks for deploying btw! [20:14:15] maybe a rebase might fix that? [20:14:19] (03PS5) 10Dreamy Jazz: Remove old CampaignEvents DB config (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014626 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [20:14:23] hmm, can't hurt to try [20:14:33] try again? no idea what's going on tbh [20:15:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014626 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [20:15:05] Daimona: it' happy and showing normal this time. Weird. [20:15:08] PROBLEM - Check whether ferm is active by checking the default input chain on mw2381 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:15:34] weird. I guess it might have been caused by the parent patch having a depends-on? [20:15:45] oh, yea perhaps that was it. 
[20:15:45] (03Merged) 10jenkins-bot: Remove old CampaignEvents DB config (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014626 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [20:16:59] !log ebernhardson@deploy1002 Started scap: Backport for [[gerrit:1014626|Remove old CampaignEvents DB config (prod) (T348281)]] [20:17:26] T348281: Make the CampaignEvents database configuration use the new DatabaseVirtualDomains config - https://phabricator.wikimedia.org/T348281 [20:17:27] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply [20:18:51] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:19:22] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1031050 (https://phabricator.wikimedia.org/T364645) (owner: 10Herron) [20:19:24] !log ebernhardson@deploy1002 ebernhardson and daimona: Backport for [[gerrit:1014626|Remove old CampaignEvents DB config (prod) (T348281)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:21:21] Daimona: alright its up on test servers [20:22:08] thx, testing now [20:24:08] Looking good to me [20:25:26] awesome [20:25:45] !log ebernhardson@deploy1002 ebernhardson and daimona: Continuing with sync [20:26:06] Tchanders: you'll be up when this is synced [20:26:26] Ok great [20:26:52] amazing, thank you! [20:27:35] (03CR) 10Krinkle: varnish: Copy value of X-Wikimedia-Debug cookie to header (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [20:29:58] (03CR) 10Krinkle: [C:04-1] "I see these metrics as a single metric (i.e. 
wmfstatic_response_total) where there are several mutually exclusive kinds of responses that " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: 10Andrea Denisse) [20:30:00] PROBLEM - Check whether ferm is active by checking the default input chain on parse2005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:30:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.71% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:32:06] PROBLEM - Check whether ferm is active by checking the default input chain on parse1007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:33:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.95% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:37:08] * ebernhardson forgets how slow deploys are now.. [20:37:36] Jdlrobson: unfortunately, i highly doubt we'll make it to your patches [20:38:06] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for English Wiktionary admins - https://phabricator.wikimedia.org/T364731#9792169 (10Ladsgroup) 05Open→03Resolved https://lists.wikimedia.org/postorius/lists/wiktionary-en-admins.lists.wikimedia.org/ Feel to create a separate request for mailing list for... 
[20:38:13] !log ebernhardson@deploy1002 Finished scap: Backport for [[gerrit:1014626|Remove old CampaignEvents DB config (prod) (T348281)]] (duration: 21m 14s) [20:38:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:38:21] T348281: Make the CampaignEvents database configuration use the new DatabaseVirtualDomains config - https://phabricator.wikimedia.org/T348281 [20:38:32] hey ebernhardson apparently https://gerrit.wikimedia.org/r/c/1030983/ is impacting a Wikimedia event and I suspect it will take some time to merge because of CI - should we kick off the merge now? [20:39:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017152 (https://phabricator.wikimedia.org/T361884) (owner: 10Tchanders) [20:39:12] ebernhardson: my understanding is it has been causing a LOT of disruption [20:39:21] (03PS3) 10Tchanders: IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017152 (https://phabricator.wikimedia.org/T361884) [20:39:22] but I've been out on vacation so haven't been able to pick it up until today [20:39:31] Jdlrobson: hmm, yea i suppose [20:39:37] (03CR) 10TrainBranchBot: "Approved by ebernhardson@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017152 (https://phabricator.wikimedia.org/T361884) (owner: 10Tchanders) [20:40:13] (03Merged) 10jenkins-bot: IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017152 (https://phabricator.wikimedia.org/T361884) (owner: 10Tchanders) [20:40:15] it was in the telegram channel 
so I can't link to the conversation but there's a 2 week long bootcamp [20:40:31] !log ebernhardson@deploy1002 Started scap: Backport for [[gerrit:1017152|IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath (T361884)]] [20:40:40] T361884: Remove $wgIPInfoGeoIP2EnterprisePath from production config - https://phabricator.wikimedia.org/T361884 [20:42:53] 06SRE, 10Wikimedia-Mailing-lists: wikimediacz-l does not hold all posts for moderation - https://phabricator.wikimedia.org/T298729#9792197 (10Ladsgroup) Hi @Urbanecm is this resolved or you're still seeing stuff? [20:42:54] !log ebernhardson@deploy1002 ebernhardson and tchanders: Backport for [[gerrit:1017152|IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath (T361884)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:43:09] Tchanders: alright its on the test servers [20:44:04] (03PS1) 10Andrew Bogott: Cloud VMs: update openstack client files to 'bobcat' where possible [puppet] - 10https://gerrit.wikimedia.org/r/1031060 [20:44:35] (03CR) 10Andrew Bogott: [C:03+2] Cloud VMs: update openstack client files to 'bobcat' where possible [puppet] - 10https://gerrit.wikimedia.org/r/1031060 (owner: 10Andrew Bogott) [20:44:43] ebernhardson: It's not terribly testable, but I can't see any obvious regressions. 
Thanks we can go ahead [20:45:08] RECOVERY - Check whether ferm is active by checking the default input chain on mw2381 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:45:37] Tchanders: ok, moving forward [20:45:38] !log ebernhardson@deploy1002 ebernhardson and tchanders: Continuing with sync [20:47:30] (03CR) 10Ebernhardson: [C:03+2] Unbreak link buttons [extensions/GuidedTour] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030983 (https://phabricator.wikimedia.org/T364062) (owner: 10Jdlrobson) [20:48:02] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:48:45] Thanks @ebernhardson I can reschedule the others for tomorrow. Should be fine for those to wait another day. [20:49:56] PROBLEM - Check whether ferm is active by checking the default input chain on mw2318 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:52:00] PROBLEM - Check whether ferm is active by checking the default input chain on mw1476 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:52:16] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1059 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:53:42] (03CR) 10Jdlrobson: "recheck" [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030984 (https://phabricator.wikimedia.org/T363779) (owner: 10Jdlrobson) [20:54:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM 
workers for Mediawiki mw-api-ext at eqiad: 38.14% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:57:54] !log ebernhardson@deploy1002 Finished scap: Backport for [[gerrit:1017152|IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath (T361884)]] (duration: 17m 22s) [20:57:58] T361884: Remove $wgIPInfoGeoIP2EnterprisePath from production config - https://phabricator.wikimedia.org/T361884 [20:58:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1002 using scap backport" [extensions/GuidedTour] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030983 (https://phabricator.wikimedia.org/T364062) (owner: 10Jdlrobson) [20:59:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.14% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:00:00] RECOVERY - Check whether ferm is active by checking the default input chain on parse2005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:00:05] Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240513T2100) [21:02:06] RECOVERY - Check whether ferm is active by checking the default input chain on parse1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:08:02] ebernhardson: can i test it yet? 
[21:08:18] Jdlrobson: it hasn't even merged yet :S [21:09:11] 😭 [21:09:48] (03CR) 10Cwhite: [C:03+2] Exclude client errors with undefined stack trace or file url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031039 (https://phabricator.wikimedia.org/T364517) (owner: 10Stoyofuku-wmf) [21:10:01] Jdlrobson: the timing is going to be a bit rough...i have a hard stop in 30min [21:10:14] (03Merged) 10jenkins-bot: Unbreak link buttons [extensions/GuidedTour] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030983 (https://phabricator.wikimedia.org/T364062) (owner: 10Jdlrobson) [21:10:32] !log ebernhardson@deploy1002 Started scap: Backport for [[gerrit:1030983|Unbreak link buttons (T364062)]] [21:10:37] T364062: GuidedTour external link buttons don't work - https://phabricator.wikimedia.org/T364062 [21:12:58] !log ebernhardson@deploy1002 jdlrobson and ebernhardson: Backport for [[gerrit:1030983|Unbreak link buttons (T364062)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:15:07] (03CR) 10Cwhite: "Thanks for the review!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: 10Andrea Denisse) [21:15:12] Jdlrobson: live on test server now [21:19:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9792290 (10VRiley-WMF) [21:19:56] RECOVERY - Check whether ferm is active by checking the default input chain on mw2318 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:20:22] ebernhardson: please sync [21:20:25] !log ebernhardson@deploy1002 jdlrobson and ebernhardson: Continuing with sync [21:22:00] RECOVERY - Check whether ferm is active by checking the default input chain on mw1476 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:22:16] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1059 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:22:23] !log ryankemper@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on A:datahubsearch [21:22:30] (03PS1) 10C. 
Scott Ananian: Fix the loss of ParserOutput pointer in ContentDOMTransformStages [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031067 (https://phabricator.wikimedia.org/T364597) [21:23:47] (03PS1) 10Ryan Kemper: opensearch/roll-restart-reboot: fix usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1031063 [21:24:27] (03CR) 10Bking: [C:03+1] opensearch/roll-restart-reboot: fix usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1031063 (owner: 10Ryan Kemper) [21:24:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9792304 (10VRiley-WMF) kafka-main1006 Rack: A 3 U 24 CableID: 1881 Port: 36 kafka-main1007 Rack: B 3 U 34 CableID: 5173 Port: 19 kafka-main1008 Rack: C 3 U... [21:24:51] thanks ebernhardson sorry we had to overrun [21:25:33] also didn't manage to get my patch out :P It's ok, there is always tomorrow [21:25:50] (03CR) 10Scott French: "Thanks, Amir!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030440 (owner: 10Scott French) [21:26:06] PROBLEM - Check whether ferm is active by checking the default input chain on mw2435 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:26:40] PROBLEM - Check whether ferm is active by checking the default input chain on mw1379 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:27:04] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1017 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:27:54] PROBLEM - Check whether ferm is active by checking the default input chain on mw1350 is CRITICAL: ERROR ferm input drop default policy 
not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:28:06] PROBLEM - Check whether ferm is active by checking the default input chain on parse1011 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:28:54] (03CR) 10Scott French: [C:03+1] "Thanks, Amir." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030938 (https://phabricator.wikimedia.org/T362786) (owner: 10Ladsgroup) [21:32:32] !log ebernhardson@deploy1002 Finished scap: Backport for [[gerrit:1030983|Unbreak link buttons (T364062)]] (duration: 22m 00s) [21:32:38] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:datahubsearch [21:32:39] T364062: GuidedTour external link buttons don't work - https://phabricator.wikimedia.org/T364062 [21:35:42] (03PS1) 10Jdlrobson: Suppress phan errors caused by UserMerge undeploy [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031068 (https://phabricator.wikimedia.org/T364610) [21:36:00] (03PS2) 10Jdlrobson: Add notheme class to Echo [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030984 (https://phabricator.wikimedia.org/T363779) [21:39:59] !log ryankemper@cumin2002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [21:46:23] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. 
[21:47:35] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T363975 eqiad cluster restart - ryankemper@cumin2002 - T363975 [21:53:03] 06SRE, 06Infrastructure-Foundations: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9792385 (10Volans) >>! In T360356#9791713, @Jclark-ctr wrote: > Mdadm commands would allow us to one day rebuild failed software raids That should be covered by T364540 no? [21:56:06] RECOVERY - Check whether ferm is active by checking the default input chain on mw2435 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:56:40] RECOVERY - Check whether ferm is active by checking the default input chain on mw1379 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:57:04] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1017 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:57:54] RECOVERY - Check whether ferm is active by checking the default input chain on mw1350 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:58:06] RECOVERY - Check whether ferm is active by checking the default input chain on parse1011 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:09:25] FIRING: SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic1097:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:18:39] FIRING: CirrusBackendErrorRateTooHigh: CirrusSearch getting over 0.1% error responses from elasticsearch - TODO - 
https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus-ops&var-origin=appserver&var-origin_instance=All&var-destination=search-https_eqiad - https://alerts.wikimedia.org/?q=alertname%3DCirrusBackendErrorRateTooHigh [22:20:25] curious, looking [22:27:43] !log bking@cumin2002 conftool action : set/weight=10:pooled=no; selector: name=elastic110[5|7]\.eqiad\.wmnet [22:28:39] RESOLVED: CirrusBackendErrorRateTooHigh: CirrusSearch getting over 0.1% error responses from elasticsearch - TODO - https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus-ops&var-origin=appserver&var-origin_instance=All&var-destination=search-https_eqiad - https://alerts.wikimedia.org/?q=alertname%3DCirrusBackendErrorRateTooHigh [22:30:02] !log zabe@mwmaint1002:~$ mwscript cleanupTitles.php itwikivoyage # T298315 [22:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:08] T298315: Deleting Ns:104 in it:voy - https://phabricator.wikimedia.org/T298315 [22:33:39] FIRING: CirrusBackendErrorRateTooHigh: CirrusSearch getting over 0.1% error responses from elasticsearch - TODO - https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus-ops&var-origin=appserver&var-origin_instance=All&var-destination=search-https_eqiad - https://alerts.wikimedia.org/?q=alertname%3DCirrusBackendErrorRateTooHigh [22:36:20] (03CR) 10CI reject: [V:04-1] Suppress phan errors caused by UserMerge undeploy [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031068 (https://phabricator.wikimedia.org/T364610) (owner: 10Jdlrobson) [22:36:51] (03CR) 10Jdlrobson: "recheck" [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031068 (https://phabricator.wikimedia.org/T364610) (owner: 10Jdlrobson) [22:38:39] RESOLVED: CirrusBackendErrorRateTooHigh: CirrusSearch getting over 0.1% error responses from elasticsearch - TODO - 
https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus-ops&var-origin=appserver&var-origin_instance=All&var-destination=search-https_eqiad - https://alerts.wikimedia.org/?q=alertname%3DCirrusBackendErrorRateTooHigh [22:39:25] RESOLVED: SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic1097:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:43:55] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T363975 eqiad cluster restart - ryankemper@cumin2002 - T363975 [22:44:55] FIRING: [2x] SystemdUnitFailed: push_cross_cluster_settings_9600.service on elastic1083:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:45:22] (03PS2) 10Scott French: WIP: etcd.php: ignore pc sections in externalLoads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030496 [22:45:22] (03PS5) 10Scott French: configure parsercache servers via dbconfig in etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030440 [22:47:24] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on elastic1083 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:53:02] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:53:06] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on 
build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:55:19] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=elastic110[5|7]\.eqiad\.wmnet [22:56:59] (03PS2) 10Kimberly Sarabia: Introduce sample overrides to web_ui_actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) [22:57:24] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on elastic1083 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:59:30] (03CR) 10Scott French: "Went ahead and removed the feature flag :) Also pruned the db list from ProductionServices.php to avoid any confusion as to what's authori" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030440 (owner: 10Scott French) [22:59:55] RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9600.service on elastic1083:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:00:15] (03PS6) 10Scott French: configure parsercache servers via dbconfig in etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030440 (https://phabricator.wikimedia.org/T362786) [23:11:31] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801 (10HNordeenWMF) 03NEW [23:12:22] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801#9792668 (10HNordeenWMF) Hi @MMiller_WMF could you kindly comment to approve this request? Thanks! 
[23:24:03] (03CR) 10Clare Ming: [C:03+1] Introduce sample overrides to web_ui_actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia) [23:38:21] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1030559 [23:38:21] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1030559 (owner: 10TrainBranchBot)