[00:01:46] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:55] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:46] FIRING: SystemdUnitFailed: rsyslog-imfile-remedy.service on mw1473:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:21:46] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[00:31:46] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:53:20] PROBLEM - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:53:21] ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T365337 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:53:27] ops-eqiad, SRE: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T365337 (ops-monitoring-bot) NEW
[01:46:34] PROBLEM - snapshot of s6 in codfw on backupmon1001 is CRITICAL: Last snapshot for s6 at codfw (db2197) taken on 2024-05-20 01:02:42 is 473 GiB, but the previous one was 571 GiB, a change of -17.1 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:49:32] RECOVERY - snapshot of s6 in eqiad on backupmon1001 is OK: Last snapshot for s6 at eqiad (db1225) taken on 2024-05-20 01:01:24 (458 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:54:41] ops-eqiad, SRE, Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9811631 (Eevans) >>! In T362033#9798051, @Eevans wrote: > The array has rebuilt, but I could swear I hear it ticking... 💥 `lines=20,name=dmesg [ ... ] [898421.304851] md: super_written gets error=-5 [89...
[02:36:46] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[02:51:46] FIRING: [3x] SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:01:46] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:55] RESOLVED: SystemdUnitFailed: rsyslog-imfile-remedy.service on mw1473:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:16:46] FIRING: [3x] SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:40:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[03:40:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[03:40:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T364299)', diff saved to https://phabricator.wikimedia.org/P62669 and previous config saved to /var/cache/conftool/dbconfig/20240520-034057-marostegui.json
[03:41:02] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[04:06:46] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:21:46] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:17:12] (PS1) Marostegui: es2023: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1033721
[05:17:36] (CR) Marostegui: [C:+2] es2023: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1033721 (owner: Marostegui)
[05:22:25] SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, DBA, Schema-change-in-production, Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9811670 (Marostegui) I've not see any sl...
[05:30:37] (Abandoned) Marostegui: check_flags_per_dc: Add es6 and es7 [software] - https://gerrit.wikimedia.org/r/1032631 (owner: Marostegui)
[05:31:43] SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, DBA, Schema-change-in-production, Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9811671 (Marostegui)
[05:34:22] (PS1) Gerrit maintenance bot: mariadb: Promote db2165 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1033389 (https://phabricator.wikimedia.org/T365339)
[05:35:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s8 T365339
[05:35:11] T365339: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T365339
[05:35:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2165 with weight 0 T365339', diff saved to https://phabricator.wikimedia.org/P62670 and previous config saved to /var/cache/conftool/dbconfig/20240520-053523-root.json
[05:35:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s8 T365339
[05:36:18] (CR) Marostegui: [C:+2] mariadb: Promote db2165 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1033389 (https://phabricator.wikimedia.org/T365339) (owner: Gerrit maintenance bot)
[05:50:21] SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, DBA, Schema-change-in-production, Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9811703 (Marostegui)
[05:53:24] SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, DBA, Schema-change-in-production, Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9811704 (Marostegui) db2150 looking okay...
[05:57:48] !log Starting s8 codfw failover from db2161 to db2165 - T365339
[05:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:52] T365339: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T365339
[05:58:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2165 to s8 primary T365339', diff saved to https://phabricator.wikimedia.org/P62671 and previous config saved to /var/cache/conftool/dbconfig/20240520-055812-root.json
[05:59:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2161 T365339', diff saved to https://phabricator.wikimedia.org/P62672 and previous config saved to /var/cache/conftool/dbconfig/20240520-055908-root.json
[06:02:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 20:00:00 on db2161.codfw.wmnet with reason: Schema change T364299
[06:02:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on db2161.codfw.wmnet with reason: Schema change T364299
[06:02:36] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[07:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240520T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:01:46] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:11:03] stealing the window then
[07:11:41] (PS1) Urbanecm: [Growth] enwiki: Enable AddLink backend [mediawiki-config] - https://gerrit.wikimedia.org/r/1033889 (https://phabricator.wikimedia.org/T308144)
[07:12:08] (PS2) Urbanecm: Update interwiki.php cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1030547 (https://phabricator.wikimedia.org/T363658)
[07:12:11] (CR) Urbanecm: [C:+2] Update interwiki.php cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1030547 (https://phabricator.wikimedia.org/T363658) (owner: Urbanecm)
[07:12:49] (Merged) jenkins-bot: Update interwiki.php cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1030547 (https://phabricator.wikimedia.org/T363658) (owner: Urbanecm)
[07:14:01] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1030547|Update interwiki.php cache (T363658)]]
[07:14:05] T363658: Please run maintenance task "scap update-interwiki-cache" (28 April 2024) - https://phabricator.wikimedia.org/T363658
[07:16:46] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:27:13] 07:22:45 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-05-20-071420-publish (ran as mwdeploy@parse1002.eqiad.wmnet) returned [255]: ssh: connect to host parse1002.eqiad.wmnet port 22: Connection timed out
[07:27:16] reason for concerns?
[07:37:39] eoghan: do you know?
[07:41:05] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1030547|Update interwiki.php cache (T363658)]] (duration: 27m 04s)
[07:41:09] T363658: Please run maintenance task "scap update-interwiki-cache" (28 April 2024) - https://phabricator.wikimedia.org/T363658
[07:43:24] urbanecm: let me have a look
[07:43:39] (CR) Filippo Giunchedi: [C:+2] postgresql: install configuration before starting the server [puppet] - https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) (owner: Filippo Giunchedi)
[07:48:34] (CR) Filippo Giunchedi: [C:+2] zookeeper: fix logging on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1031465 (owner: Filippo Giunchedi)
[07:49:15] thanks taavi!
[07:50:41] ops-eqiad, SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9811797 (taavi) Resolved→Open And it's down again. I ran `sudo puppet node deactivate parse1002.eqiad.wmnet` again to have it removed from the scap mediawiki image pulling list.
[07:58:58] taavi: sounds like i can deploy my second patch then?
[07:59:06] or should i wait for sth else to happen?
[07:59:56] yeah, go ahead. that box seems to have a troubled history so I removed it from the list of servers that hack in scap tries to deploy to until it's fixed again
[08:00:26] thanks!
[08:00:34] (PS2) Urbanecm: [Growth] enwiki: Enable AddLink backend [mediawiki-config] - https://gerrit.wikimedia.org/r/1033889 (https://phabricator.wikimedia.org/T308144)
[08:01:16] (CR) Urbanecm: [C:+2] [Growth] enwiki: Enable AddLink backend [mediawiki-config] - https://gerrit.wikimedia.org/r/1033889 (https://phabricator.wikimedia.org/T308144) (owner: Urbanecm)
[08:01:55] (Merged) jenkins-bot: [Growth] enwiki: Enable AddLink backend [mediawiki-config] - https://gerrit.wikimedia.org/r/1033889 (https://phabricator.wikimedia.org/T308144) (owner: Urbanecm)
[08:02:45] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1033889|[Growth] enwiki: Enable AddLink backend (T308144)]]
[08:02:49] T308144: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144
[08:05:14] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1033889|[Growth] enwiki: Enable AddLink backend (T308144)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:06:34] !log urbanecm@deploy1002 urbanecm: Continuing with sync
[08:06:46] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:16:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1125.eqiad.wmnet with OS bookworm
[08:16:35] (PS1) Marostegui: db1125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1034039
[08:16:57] (CR) Marostegui: [C:+2] db1125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1034039 (owner: Marostegui)
[08:19:52] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1033889|[Growth] enwiki: Enable AddLink backend (T308144)]] (duration: 17m 07s)
[08:19:56] T308144: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144
[08:21:46] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:21:51] seems to have worked with no issues
[08:28:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1125.eqiad.wmnet with reason: host reimage
[08:30:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.103s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:31:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1125.eqiad.wmnet with reason: host reimage
[08:35:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.239s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:38:55] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:47:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1125.eqiad.wmnet with OS bookworm
[08:50:20] (PS1) Marostegui: Revert "db1125: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1032815
[08:53:08] (CR) Marostegui: [C:+2] Revert "db1125: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1032815 (owner: Marostegui)
[08:54:29] (CR) Hnowlan: [C:+2] geo-analytics: use replicas consistent with other analytics services [deployment-charts] - https://gerrit.wikimedia.org/r/1032831 (owner: Hnowlan)
[08:55:32] (Merged) jenkins-bot: geo-analytics: use replicas consistent with other analytics services [deployment-charts] - https://gerrit.wikimedia.org/r/1032831 (owner: Hnowlan)
[08:56:49] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply
[08:57:17] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply
[09:02:14] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply
[09:02:41] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply
[09:03:24] (PS1) Marostegui: control-mariadb-10.6-bookworm: Upgrade version [software] - https://gerrit.wikimedia.org/r/1034040 (https://phabricator.wikimedia.org/T365338)
[09:06:25] (CR) Btullis: [C:+1] "Looks good." [deployment-charts] - https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300) (owner: Stevemunene)
[09:08:37] (CR) Hnowlan: [C:+1] services: add data-gateway service [deployment-charts] - https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: Scott French)
[09:11:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T352010)', diff saved to https://phabricator.wikimedia.org/P62673 and previous config saved to /var/cache/conftool/dbconfig/20240520-091143-ladsgroup.json
[09:11:48] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[09:14:41] (CR) Marostegui: [C:+2] control-mariadb-10.6-bookworm: Upgrade version [software] - https://gerrit.wikimedia.org/r/1034040 (https://phabricator.wikimedia.org/T365338) (owner: Marostegui)
[09:15:08] (Merged) jenkins-bot: control-mariadb-10.6-bookworm: Upgrade version [software] - https://gerrit.wikimedia.org/r/1034040 (https://phabricator.wikimedia.org/T365338) (owner: Marostegui)
[09:15:15] (PS3) Matěj Suchánek: Remove deprecated abuse filter fields [puppet] - https://gerrit.wikimedia.org/r/1032809 (https://phabricator.wikimedia.org/T361996)
[09:16:39] (CR) Effie Mouzeli: [C:+1] trafficserver: move to 15% traffic split for commons [puppet] - https://gerrit.wikimedia.org/r/1032828 (https://phabricator.wikimedia.org/T362323) (owner: Hnowlan)
[09:17:11] (CR) Hnowlan: [C:+2] trafficserver: move to 15% traffic split for commons [puppet] - https://gerrit.wikimedia.org/r/1032828 (https://phabricator.wikimedia.org/T362323) (owner: Hnowlan)
[09:17:52] !log Increasing commons on k8s traffic to 15%
[09:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:18] !log Install 10.6.18 on db1125 and pc1014 T365338
[09:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:21] T365338: MariaDB 10.6.18 released - https://phabricator.wikimedia.org/T365338
[09:18:54] ACKNOWLEDGEMENT - MegaRAID on db1172 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T365346 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:19:03] ops-eqiad, SRE: Degraded RAID on db1172 - https://phabricator.wikimedia.org/T365346 (ops-monitoring-bot) NEW
[09:20:57] ops-eqiad, SRE, DBA: Degraded RAID on db1172 - https://phabricator.wikimedia.org/T365346#9811933 (Marostegui) It looks bad indeed ` [960563.875753] megaraid_sas 0000:18:00.0: 2151 (769510480s/0x0004/CRIT) - Enclosure PD 20(c None/p1) phy bad for slot 4 ` Can we get a replacement disk? Thanks!
[09:21:29] ops-eqiad, SRE, DBA: Degraded RAID on db1172 - https://phabricator.wikimedia.org/T365346#9811936 (Marostegui) p:Triage→Medium
[09:21:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1014.eqiad.wmnet with reason: Testing new mariadb version
[09:21:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1014.eqiad.wmnet with reason: Testing new mariadb version
[09:24:03] (CR) Fabfur: [C:+2] benthos:cache: better parsing for path and query string [puppet] - https://gerrit.wikimedia.org/r/1031818 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[09:26:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P62674 and previous config saved to /var/cache/conftool/dbconfig/20240520-092651-ladsgroup.json
[09:40:04] (PS1) FNegri: Add DNS for ToolsDB replica host [puppet] - https://gerrit.wikimedia.org/r/1034042 (https://phabricator.wikimedia.org/T348407)
[09:40:23] (PS1) Hnowlan: trafficserver: move commons-on-k8s to 30% [puppet] - https://gerrit.wikimedia.org/r/1034043 (https://phabricator.wikimedia.org/T362323)
[09:41:54] (PS2) FNegri: Add DNS for ToolsDB replica host [puppet] - https://gerrit.wikimedia.org/r/1034042 (https://phabricator.wikimedia.org/T348407)
[09:41:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P62675 and previous config saved to /var/cache/conftool/dbconfig/20240520-094159-ladsgroup.json
[09:43:39] (PS1) Marostegui: db2175: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1034044 (https://phabricator.wikimedia.org/T361543)
[09:43:46] SRE, Infrastructure-Foundations, netops: magru network setup - https://phabricator.wikimedia.org/T362421#9811967 (cmooney) >>! In T362421#9808627, @ayounsi wrote: > The Telxius community doesn't seem to be of any effect so far, I'll wait for their reply, maybe they changed or need to be enabled on th...
[09:43:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2175 T361543', diff saved to https://phabricator.wikimedia.org/P62676 and previous config saved to /var/cache/conftool/dbconfig/20240520-094352-marostegui.json
[09:43:57] T361543: Upgrade s2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T361543
[09:44:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2175.codfw.wmnet with reason: Migration to bookworm
[09:44:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2175.codfw.wmnet with reason: Migration to bookworm
[09:44:37] (CR) Marostegui: [C:+2] db2175: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1034044 (https://phabricator.wikimedia.org/T361543) (owner: Marostegui)
[09:45:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2175.codfw.wmnet with OS bookworm
[09:52:37] (PS1) LSobanski: Filter out additional addresses handled by gsuite and postfix that cannot be removed from VRTS [puppet] - https://gerrit.wikimedia.org/r/1034046 (https://phabricator.wikimedia.org/T284145)
[09:55:39] (CR) CI reject: [V:-1] Filter out additional addresses handled by gsuite and postfix that cannot be removed from VRTS [puppet] - https://gerrit.wikimedia.org/r/1034046 (https://phabricator.wikimedia.org/T284145) (owner: LSobanski)
[09:57:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T352010)', diff saved to https://phabricator.wikimedia.org/P62677 and previous config saved to /var/cache/conftool/dbconfig/20240520-095706-ladsgroup.json
[09:57:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[09:57:12] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[09:57:22] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[09:57:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T352010)', diff saved to https://phabricator.wikimedia.org/P62678 and previous config saved to /var/cache/conftool/dbconfig/20240520-095729-ladsgroup.json
[10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240520T1000)
[10:02:16] (PS1) Marostegui: Revert "db2175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1032816
[10:05:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2175.codfw.wmnet with reason: host reimage
[10:08:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2175.codfw.wmnet with reason: host reimage
[10:10:31] (PS1) Filippo Giunchedi: Enable tracing for citoid and cxserver in production [deployment-charts] - https://gerrit.wikimedia.org/r/1034047 (https://phabricator.wikimedia.org/T320563)
[10:18:44] !log bounce prometheus@k8s in eqiad - T343529
[10:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:49] T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529
[10:18:53] (PS2) LSobanski: Filter out additional addresses handled by gsuite and postfix that cannot be removed from VRTS [puppet] - https://gerrit.wikimedia.org/r/1034046 (https://phabricator.wikimedia.org/T284145)
[10:21:58] (CR) CI reject: [V:-1] Filter out additional addresses handled by gsuite and postfix that cannot be removed from VRTS [puppet] - https://gerrit.wikimedia.org/r/1034046 (https://phabricator.wikimedia.org/T284145) (owner: LSobanski)
[10:24:32] (PS3) LSobanski: Filter out addresses that cannot be removed from VRTS [puppet] - https://gerrit.wikimedia.org/r/1034046 (https://phabricator.wikimedia.org/T284145)
[10:30:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P62679 and previous config saved to /var/cache/conftool/dbconfig/20240520-103011-root.json
[10:30:24] (CR) Marostegui: [C:+2] Revert "db2175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1032816 (owner: Marostegui)
[10:31:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2175.codfw.wmnet with OS bookworm
[10:37:05] (PS1) Filippo Giunchedi: pki: add temporary profile for prometheus + k8s [puppet] - https://gerrit.wikimedia.org/r/1034048 (https://phabricator.wikimedia.org/T343529)
[10:42:29] (PS1) Filippo Giunchedi: prometheus: use 'prometheus' profile for k8s certs [puppet] - https://gerrit.wikimedia.org/r/1034050 (https://phabricator.wikimedia.org/T343529)
[10:45:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P62680 and previous config saved to /var/cache/conftool/dbconfig/20240520-104517-root.json
[10:46:07] !log Restarting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration
[10:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:19] (PS1) FNegri: wikireplica_dns: remove toolsdb and redis records [puppet] - https://gerrit.wikimedia.org/r/1034052
[10:49:58] (PS1) Marostegui: check_depooled: Change line for all hosts [software] - https://gerrit.wikimedia.org/r/1034053
[10:50:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.22s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:50:30] (CR) Marostegui: [C:+2] check_depooled: Change line for all hosts [software] - https://gerrit.wikimedia.org/r/1034053 (owner: Marostegui)
[10:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[10:51:36] (CR) Effie Mouzeli: [C:+1] trafficserver: move commons-on-k8s to 30% [puppet] - https://gerrit.wikimedia.org/r/1034043 (https://phabricator.wikimedia.org/T362323) (owner: Hnowlan)
[10:52:05] (CR) Volans: [C:+1] "Looks sane to me, although I didn't test all the commands" [cookbooks] - https://gerrit.wikimedia.org/r/1032477 (https://phabricator.wikimedia.org/T362523) (owner: Ayounsi)
[10:55:15] (CR) EoghanGaffney: [C:+1] vrts: aesthetic code improvements [puppet] - https://gerrit.wikimedia.org/r/1033657 (owner: AOkoth)
[10:55:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.22s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:55:30] (CR) FNegri: [C:-1] "If I understand correctly, svc.eqiad.wmflabs is a legacy domain, so we don't really need to add a record in that domain, and we can add in" [puppet] - https://gerrit.wikimedia.org/r/1034042 (https://phabricator.wikimedia.org/T348407) (owner: FNegri)
[10:57:04] (PS1) Marostegui: db2175: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1034054
[10:57:28] (PS2) FNegri: wikireplica_dns: remove toolsdb and redis records [puppet] - https://gerrit.wikimedia.org/r/1034052
[10:58:18] (CR) Marostegui: [C:+2] db2175: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1034054 (owner: Marostegui)
[11:00:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P62681 and previous config saved to /var/cache/conftool/dbconfig/20240520-110023-root.json
[11:02:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2181 T363792', diff saved to https://phabricator.wikimedia.org/P62682 and previous config saved to /var/cache/conftool/dbconfig/20240520-110217-root.json
[11:02:24] T363792: Upgrade s8 to MariaDB 10.6 - https://phabricator.wikimedia.org/T363792
[11:02:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2181.codfw.wmnet with reason: Migration to bookworm
[11:02:51] (PS1) Marostegui: db2181: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1034056
[11:03:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2181.codfw.wmnet with reason: Migration to bookworm
[11:03:51] (CR) Marostegui: [C:+2] db2181: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1034056 (owner: Marostegui)
[11:14:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.ipmi-password-reset
[11:14:37] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99)
[11:14:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.ipmi-password-reset
[11:15:14] !log marostegui@cumin1002 Updating IPMI password on 1 hosts - marostegui@cumin1002
[11:15:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P62684 and previous config saved to /var/cache/conftool/dbconfig/20240520-111530-root.json
[11:15:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0)
[11:16:46] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:20:27] ops-codfw, DBA: Reset db2181 idrac - https://phabricator.wikimedia.org/T365351 (Marostegui) NEW
[11:20:57] (CR) Santiago Faci: [C:+1] "Looks good!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1017962 (https://phabricator.wikimedia.org/T356228) (owner: David Martin)
[11:21:10] ops-codfw, DBA: Reset db2181 idrac - https://phabricator.wikimedia.org/T365351#9812120 (Marostegui) p:Triage→Medium
[11:22:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Migration to bookworm
[11:22:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Migration to bookworm
[11:26:17] (CR) Volans: "Thanks a lot for starting this! Love to see it split into two and the new one being with class API. I did a first pass, feel free to ping " [cookbooks] - https://gerrit.wikimedia.org/r/1032544 (owner: DCausse)
[11:30:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P62685 and previous config saved to /var/cache/conftool/dbconfig/20240520-113038-root.json
[11:37:49] ops-codfw, SRE, DBA: Reset db2181 idrac - https://phabricator.wikimedia.org/T365351#9812155 (Marostegui) Open→Resolved a:Marostegui It was a password getting out of sync, I restarted it with: ` racadm set iDRAC.Users.2.Password XXXX ` ` $ sudo ipmitool -I lanplus -H db2181.mgmt.codfw...
[11:38:02] (PS16) Fabfur: cache:benthos: test for socket based activation in Benthos [puppet] - https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379)
[11:38:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2181.codfw.wmnet with OS bookworm
[11:40:03] (CR) Hnowlan: [C:+2] trafficserver: move commons-on-k8s to 30% [puppet] - https://gerrit.wikimedia.org/r/1034043 (https://phabricator.wikimedia.org/T362323) (owner: Hnowlan)
[11:40:16] !log migrating 30% of commons traffic to k8s
[11:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:17] (CR) Fabfur: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2511/co" [puppet] - https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379) (owner: Fabfur)
[11:42:09] SRE, Language-Technical Support, serviceops, Wikimedia-Site-requests, Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893#9812183 (Reedy)
[11:42:11] SRE, Language-Technical Support, serviceops, Wikimedia-Site-requests, Russian-Sites:
Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893#9812185 (10Reedy) [11:42:13] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9812184 (10Reedy) [11:42:45] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9812182 (10Reedy) [11:47:31] !log Deploy urgent schema change on s8 eqiad with replication dbmaint T365352 [11:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:35] T365352: Stop referencing rev_id as signed int in revtag table to counter revision id overflow in wikidatawiki - https://phabricator.wikimedia.org/T365352 [11:51:19] (03PS1) 10Hnowlan: trafficserver: move commons-on-k8s to 80% [puppet] - 10https://gerrit.wikimedia.org/r/1034058 (https://phabricator.wikimedia.org/T36232) [11:56:18] !log Deploy schema change on s5 eqiad with replication dbmaint T365352 [11:56:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2181.codfw.wmnet with reason: host reimage [11:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:22] T365352: Stop referencing rev_id as signed int in revtag table to counter revision id overflow in wikidatawiki - https://phabricator.wikimedia.org/T365352 [11:56:37] ooh [11:59:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2181.codfw.wmnet with reason: host reimage [12:01:19] !log Deploy schema change on s4 eqiad with replication dbmaint T365352 [12:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:23] T365352: Stop referencing rev_id as signed int in revtag table to counter revision id overflow in wikidatawiki - 
https://phabricator.wikimedia.org/T365352 [12:05:52] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for JayCano - https://phabricator.wikimedia.org/T365349#9812256 (10JayCano) [12:06:46] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:07:50] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for JayCano - https://phabricator.wikimedia.org/T365349#9812258 (10JayCano) Thank you, @reedy. Updated to include the template. I couldn't find the original request, it was probably 2-3 years ago and might have been filed by one of my directors. Is there any w... [12:09:36] (03CR) 10Alexandros Kosiaris: [C:03+1] trafficserver: move commons-on-k8s to 80% [puppet] - 10https://gerrit.wikimedia.org/r/1034058 (https://phabricator.wikimedia.org/T36232) (owner: 10Hnowlan) [12:10:51] RESOLVED: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [12:12:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.515s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:15:06] (03CR) 10Volans: [C:03+1] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/1032849 (https://phabricator.wikimedia.org/T365123) (owner: 10Scott French) [12:17:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.331s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:21:46] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:22:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2181.codfw.wmnet with OS bookworm [12:24:46] (03PS1) 10Vgutierrez: lvs::realserver::ipip: Disable rp_filter without reboot [puppet] - 10https://gerrit.wikimedia.org/r/1034074 (https://phabricator.wikimedia.org/T365354) [12:27:37] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2512/co" [puppet] - 10https://gerrit.wikimedia.org/r/1034074 (https://phabricator.wikimedia.org/T365354) (owner: 10Vgutierrez) [12:30:32] (03PS17) 10Fabfur: cache:benthos: test for socket based activation in Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379) [12:32:04] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [12:33:41] (03CR) 10Vgutierrez: [C:03+1] "looking good, I'm guessing you'll remove cp4037.yaml and cp4045.yaml before merging (and update the commit message to drop the reference)" [puppet] - 10https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur) [12:34:05] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns for sretest2002 - cmooney@cumin1002" [12:34:57] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: Add dns for sretest2002 - cmooney@cumin1002" [12:34:57] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:35:16] !log Deploy schema change on s3 eqiad with replication dbmaint T365352 [12:36:01] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache sretest2002.wikimedia.org on all recursors [12:36:04] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest2002.wikimedia.org on all recursors [12:36:26] (03CR) 10Fabfur: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1034074 (https://phabricator.wikimedia.org/T365354) (owner: 10Vgutierrez) [12:37:48] (03PS18) 10Fabfur: cache:benthos: test for socket based activation in Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379) [12:39:21] (03PS1) 10Marostegui: Revert "db2181: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1032818 [12:43:49] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [12:44:30] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache sretest2002.mgmt.codfw.wmnet on all recursors [12:44:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest2002.mgmt.codfw.wmnet on all recursors [12:45:57] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change mgmt dns for sretest2002 - cmooney@cumin1002" [12:46:46] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change mgmt dns for sretest2002 - cmooney@cumin1002" [12:46:46] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:47:23] (03CR) 10Marostegui: [C:03+2] Revert "db2181: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1032818 (owner: 10Marostegui) [12:47:51] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'db2181 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62687 and previous config saved to /var/cache/conftool/dbconfig/20240520-124749-root.json [12:48:17] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2002.wikimedia.org with OS bookworm [12:48:30] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9812357 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.... [12:52:04] !log Deploy schema change on s7 (only frwiktionary) eqiad with replication dbmaint T365352 [12:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:08] T365352: Stop referencing rev_id as signed int in revtag table to counter revision id overflow in wikidatawiki - https://phabricator.wikimedia.org/T365352 [12:52:45] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2515/co" [puppet] - 10https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur) [12:53:40] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 104 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:54:08] !log disable puppet on A:ncredir && A:cp-upload_ulsfo before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1034074 - T365354 [12:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:13] T365354: rp_filter should be disabled on puppet apply - https://phabricator.wikimedia.org/T365354 [12:55:15] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - 
https://phabricator.wikimedia.org/T363086#9812382 (10akosiaris) >>! In T363086#9765644, @Jclark-ctr wrote: > this server is out of warranty if it fails again we could look at swapping it with another decom server? @Jclark-ctr, this failed again, s... [12:56:58] (03PS2) 10Vgutierrez: lvs::realserver::ipip: Disable rp_filter without reboot [puppet] - 10https://gerrit.wikimedia.org/r/1034074 (https://phabricator.wikimedia.org/T365354) [12:57:57] (03PS1) 10Marostegui: db2181: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1034077 [12:58:42] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 71 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:59:07] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1006.eqiad.wmnet with OS bullseye [12:59:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9812386 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye [12:59:28] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034074 (https://phabricator.wikimedia.org/T365354) (owner: 10Vgutierrez) [12:59:31] (03CR) 10Marostegui: [C:03+2] db2181: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1034077 (owner: 10Marostegui) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240520T1300). 
[13:00:04] _Gerges and NMW03: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:20] (03CR) 10Fabfur: [C:03+1] lvs::realserver::ipip: Disable rp_filter without reboot [puppet] - 10https://gerrit.wikimedia.org/r/1034074 (https://phabricator.wikimedia.org/T365354) (owner: 10Vgutierrez) [13:00:31] * TheresNoTime can't deploy this window, sorry! [13:01:06] both are just site config [13:02:01] One of which is already deployed [13:02:15] or... the wrong commit linked [13:02:21] (03PS2) 10NMW03: Enable wgMinervaShowCategories for English Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1033656 (https://phabricator.wikimedia.org/T365323) [13:02:23] (03CR) 10Reedy: [C:03+2] Enable wgMinervaShowCategories for English Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1033656 (https://phabricator.wikimedia.org/T365323) (owner: 10NMW03) [13:02:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62688 and previous config saved to /var/cache/conftool/dbconfig/20240520-130257-root.json [13:03:02] (03Merged) 10jenkins-bot: Enable wgMinervaShowCategories for English Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1033656 (https://phabricator.wikimedia.org/T365323) (owner: 10NMW03) [13:03:32] (03PS4) 10GergesShamon: [frwiktionary] Create new namespace "Convention" & associated talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1033113 (https://phabricator.wikimedia.org/T360989) [13:03:35] (03CR) 10Reedy: [C:03+2] [frwiktionary] Create new namespace "Convention" & associated talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1033113 (https://phabricator.wikimedia.org/T360989) (owner: 10GergesShamon) [13:03:44] (03PS19) 10Fabfur: cache:benthos: test for socket based activation in Benthos 
[puppet] - 10https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379) [13:04:24] (03Merged) 10jenkins-bot: [frwiktionary] Create new namespace "Convention" & associated talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1033113 (https://phabricator.wikimedia.org/T360989) (owner: 10GergesShamon) [13:04:40] !log depool, restart swift-proxy, repool moss-fe1001 as ~12% connection failures reported by envoy since late 14th May T360913 [13:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:44] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913 [13:05:42] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 104 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:08:44] (03CR) 10Vgutierrez: [V:03+1 C:03+2] lvs::realserver::ipip: Disable rp_filter without reboot [puppet] - 10https://gerrit.wikimedia.org/r/1034074 (https://phabricator.wikimedia.org/T365354) (owner: 10Vgutierrez) [13:10:24] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2518/co" [puppet] - 10https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur) [13:10:42] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 70 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:10:45] Reedy: are you deploying those two patches? [13:11:10] hi [13:11:46] can anyone access gerrit and toolforge? 
[13:11:53] !log Re-enable puppet on A:ncredir && A:cp-upload_ulsfo - T365354 [13:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:57] T365354: rp_filter should be disabled on puppet apply - https://phabricator.wikimedia.org/T365354 [13:12:03] NMW03: Define anyone [13:12:05] NMW03: yes.. up & running for me ATM [13:12:11] TheresNoTime: No I just +2'd them for fun ;) [13:12:34] Reedy: stranger things have happened :D but ack [13:14:13] (03PS2) 10Vgutierrez: depool upload@eqsin before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1032488 (https://phabricator.wikimedia.org/T357257) [13:14:48] Reedy: "The connection was reset." [13:14:58] I can't access the page [13:15:07] (03CR) 10Effie Mouzeli: [C:03+1] trafficserver: move commons-on-k8s to 80% [puppet] - 10https://gerrit.wikimedia.org/r/1034058 (https://phabricator.wikimedia.org/T36232) (owner: 10Hnowlan) [13:15:32] NMW03: which page exactly? [13:15:44] I saw you +2'ed my patch from phab, but I can't see it lol [13:16:17] nevermind, works now [13:16:55] (03PS5) 10Anzx: knwiki, knwikisource: Lift IP cap on 2024-05-24 for Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032440 (https://phabricator.wikimedia.org/T365221) [13:16:56] (03CR) 10Vgutierrez: [C:03+2] depool upload@eqsin before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1032488 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:17:08] !log depool upload@eqsin before enabling IPIP encapsulation - T357257 [13:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:12] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [13:17:18] (03PS6) 10Anzx: knwiki, knwikisource: Lift IP cap on 2024-05-24 for Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032440 (https://phabricator.wikimedia.org/T365221) [13:18:03] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'db2181 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62689 and previous config saved to /var/cache/conftool/dbconfig/20240520-131803-root.json [13:18:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [13:18:54] TheresNoTime: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1032440 I have a patch for deployment can I add it to calendar [13:19:26] anzx: I'm not able to deploy, but Reedy may be able to [13:19:30] !log adding outbound ACL on irb.2002 on lsw1 switches in codfw to test DHCP function T365204 [13:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:35] T365204: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204 [13:20:07] (03PS5) 10Alexandros Kosiaris: mobileapps: Use mesh modules version enabling IPv6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032779 (https://phabricator.wikimedia.org/T255568) [13:20:10] (03CR) 10Alexandros Kosiaris: mobileapps: Use mesh modules version enabling IPv6 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032779 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [13:21:56] (03CR) 10Hnowlan: [C:03+2] trafficserver: move commons-on-k8s to 80% [puppet] - 10https://gerrit.wikimedia.org/r/1034058 (https://phabricator.wikimedia.org/T36232) (owner: 10Hnowlan) [13:22:17] !log migrating 80% of commons traffic to k8s [13:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:46] (03PS1) 10Elukey: Remove CFSSL k8s-related auth_keys for cloud [puppet] - 10https://gerrit.wikimedia.org/r/1034079 (https://phabricator.wikimedia.org/T363829) [13:23:19] !log reedy@deploy1002 Synchronized wmf-config/: T360989 T365323 (duration: 15m 35s) [13:23:24] T360989: New Namespace for French Wiktionary: Convention - 
https://phabricator.wikimedia.org/T360989 [13:23:24] T365323: Enabling mobile categories by default in the English Wiktionary - https://phabricator.wikimedia.org/T365323 [13:24:18] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2519/console" [puppet] - 10https://gerrit.wikimedia.org/r/1034079 (https://phabricator.wikimedia.org/T363829) (owner: 10Elukey) [13:24:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [13:24:27] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [13:25:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:26:03] jouncebot: nowandnext [13:26:03] For the next 0 hour(s) and 33 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240520T1300) [13:26:04] In 2 hour(s) and 3 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240520T1530) [13:26:12] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS bullseye [13:26:25] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9812443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2006.codfw.wmnet with OS bullseye [13:27:05] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2007.codfw.wmnet with OS bullseye [13:27:13] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9812446 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2007.codfw.wmnet with OS bullseye [13:27:24] !log 
jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [13:27:54] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2008.codfw.wmnet with OS bullseye [13:28:11] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9812448 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2008.codfw.wmnet with OS bullseye [13:28:11] (03CR) 10Reedy: [C:03+2] knwiki, knwikisource: Lift IP cap on 2024-05-24 for Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032440 (https://phabricator.wikimedia.org/T365221) (owner: 10Anzx) [13:28:45] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye [13:28:52] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9812451 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye [13:28:57] (03Merged) 10jenkins-bot: knwiki, knwikisource: Lift IP cap on 2024-05-24 for Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032440 (https://phabricator.wikimedia.org/T365221) (owner: 10Anzx) [13:29:13] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS bullseye [13:29:22] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9812453 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye [13:29:59] (03CR) 10Volans: [C:04-1] "Needs to update also the BMC" [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 
(owner: 10Ayounsi) [13:30:43] (03PS2) 10Vgutierrez: hiera: Enable IPIP on high-traffic2@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1032483 (https://phabricator.wikimedia.org/T357257) [13:31:01] Reedy when will you start scap backport [13:31:23] NMW03: Your patch is deployed [13:32:02] (03PS2) 10Volans: sre.hardware.DellAPI: auto-refresh session [cookbooks] - 10https://gerrit.wikimedia.org/r/1004053 (https://phabricator.wikimedia.org/T357756) [13:32:50] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/1032483 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:32:57] just a heads-up, we're seeing a spike in saturation and backend response times since 13:05 - I dunno if that's too early to be backport deployment related though [13:33:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62691 and previous config saved to /var/cache/conftool/dbconfig/20240520-133309-root.json [13:33:37] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on high-traffic2@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1032483 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:35:15] (03PS2) 10Vgutierrez: hiera: Enable IPIP on upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1032484 (https://phabricator.wikimedia.org/T357257) [13:36:47] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1032484 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:37:56] (03PS19) 10EoghanGaffney: lists: Add lists role to list2001 [puppet] - 10https://gerrit.wikimedia.org/r/1025741 
(https://phabricator.wikimedia.org/T331706) [13:38:15] (03CR) 10CI reject: [V:04-1] lists: Add lists role to list2001 [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [13:38:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [13:39:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [13:39:08] (03PS20) 10EoghanGaffney: lists: Add lists role to list2001 [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) [13:39:26] (03PS4) 10MdsShakil: Allow English Wikiversity custodians to use mass-delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030313 (https://phabricator.wikimedia.org/T360977) [13:39:31] (03CR) 10Volans: [C:03+2] sre.hardware.DellAPI: auto-refresh session [cookbooks] - 10https://gerrit.wikimedia.org/r/1004053 (https://phabricator.wikimedia.org/T357756) (owner: 10Volans) [13:40:18] (03PS5) 10MdsShakil: Allow English Wikiversity custodians to use mass-delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030313 (https://phabricator.wikimedia.org/T360977) [13:40:26] (03PS6) 10MdsShakil: Allow English Wikiversity custodians to use mass-delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030313 (https://phabricator.wikimedia.org/T360977) [13:40:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T364299)', diff saved to https://phabricator.wikimedia.org/P62692 and previous config saved to /var/cache/conftool/dbconfig/20240520-134034-marostegui.json [13:40:39] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [13:42:06] (03CR) 10CDanis: [C:03+1] Remove CFSSL k8s-related auth_keys for cloud [puppet] - 10https://gerrit.wikimedia.org/r/1034079 (https://phabricator.wikimedia.org/T363829) (owner: 
10Elukey) [13:43:01] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1032484 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:45:10] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1006.eqiad.wmnet with OS bullseye [13:45:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9812539 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye execut... [13:45:46] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [13:45:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [13:46:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:46:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:46:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T352010)', diff saved to https://phabricator.wikimedia.org/P62693 and previous config saved to /var/cache/conftool/dbconfig/20240520-134613-ladsgroup.json [13:46:19] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:46:40] (03CR) 10Elukey: [V:03+1 C:03+2] Remove CFSSL k8s-related auth_keys for cloud [puppet] - 10https://gerrit.wikimedia.org/r/1034079 (https://phabricator.wikimedia.org/T363829) (owner: 10Elukey) [13:47:01] !log reedy@deploy1002 Synchronized wmf-config/throttle.php: T365221 (duration: 15m 20s) [13:47:05] T365221: 
Lift IP cap on 2024-05-24 for Editathon for knwiki and knwikisource - https://phabricator.wikimedia.org/T365221 [13:48:05] (03CR) 10Ottomata: [C:03+1] datasets-config: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032862 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:48:13] Reedy: thank you for deployment [13:48:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62694 and previous config saved to /var/cache/conftool/dbconfig/20240520-134815-root.json [13:48:26] np :) [13:49:34] Reedy Can you do another? https://phabricator.wikimedia.org/T360977 [13:51:14] (03CR) 10CDanis: [C:03+1] cache:benthos: test for socket based activation in Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur) [13:52:50] (03CR) 10Reedy: [C:03+2] Allow English Wikiversity custodians to use mass-delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030313 (https://phabricator.wikimedia.org/T360977) (owner: 10MdsShakil) [13:52:53] RECOVERY - Host ml-serve2002 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms [13:52:59] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 445, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:52:59] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 523, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:53:01] RECOVERY - SSH on ml-serve2002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:53:50] (03CR) 10Vgutierrez: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur) [13:53:58] (03Merged) 10jenkins-bot: Allow English Wikiversity custodians to use mass-delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030313 
(https://phabricator.wikimedia.org/T360977) (owner: 10MdsShakil) [13:55:39] (03Merged) 10jenkins-bot: sre.hardware.DellAPI: auto-refresh session [cookbooks] - 10https://gerrit.wikimedia.org/r/1004053 (https://phabricator.wikimedia.org/T357756) (owner: 10Volans) [13:55:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P62695 and previous config saved to /var/cache/conftool/dbconfig/20240520-135542-marostegui.json [13:58:17] (03CR) 10TChin: [C:03+2] datasets-config: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032862 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:58:34] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9812556 (10cmooney) So some interesting findings when testing today. I was able to reproduce the issue with sretest2002, and took... [13:59:15] (03Merged) 10jenkins-bot: datasets-config: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032862 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [14:00:45] !log rolling restart of pybal on lvs5005 and lvs5006 - T357257 [14:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:50] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [14:01:56] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2002.wikimedia.org with OS bookworm [14:02:00] (03CR) 10FNegri: [C:04-1] "The change to maintain-views.yaml looks good. 
Please don't modify filtered_tables at this moment, it should be updated only if/when the co" [puppet] - 10https://gerrit.wikimedia.org/r/1032809 (https://phabricator.wikimedia.org/T361996) (owner: 10Matěj Suchánek) [14:02:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9812566 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2002.wikimedia.or... [14:02:56] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply [14:03:03] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config: apply [14:03:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62696 and previous config saved to /var/cache/conftool/dbconfig/20240520-140321-root.json [14:06:18] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply [14:06:36] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config: apply [14:06:46] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:07:25] (03PS1) 10Vgutierrez: Revert "depool upload@eqsin before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1032820 (https://phabricator.wikimedia.org/T357257) [14:10:17] (03CR) 10CDanis: [C:03+1] Enable tracing for citoid and cxserver in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034047 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:10:50] 10ops-codfw, 06SRE, 06DC-Ops, 
06Machine-Learning-Team: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#9812571 (10Jhancock.wm) I rotated B1 to B2 to see if the error moves with it. After booting, not getting any errors. Can we repeal it to see if the error comes back? If i... [14:10:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P62697 and previous config saved to /var/cache/conftool/dbconfig/20240520-141050-marostegui.json [14:12:08] (03PS1) 10Hnowlan: trafficserver: move commons-on-k8s to 100% [puppet] - 10https://gerrit.wikimedia.org/r/1034088 (https://phabricator.wikimedia.org/T362323) [14:12:40] (03CR) 10Vgutierrez: [C:03+2] Revert "depool upload@eqsin before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1032820 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:12:51] !log repool upload@eqsin with IPIP encapsulation enabled - T357257 [14:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:56] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [14:13:21] eoghan: ^^ [14:13:31] ack, thanks! 
[14:14:10] !log reedy@deploy1002 Synchronized wmf-config/core-Permissions.php: T360977 (duration: 15m 54s) [14:14:14] T360977: Please allow English Wikiversity custodians to use mass-delete (nuke) - https://phabricator.wikimedia.org/T360977 [14:15:17] Reedy Thanks :) [14:16:03] (03CR) 10Effie Mouzeli: "I do not have strong opinions here, sure we can have a go on staging" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1032482 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [14:17:06] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host kafka-main2006.codfw.wmnet with OS bullseye [14:17:30] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host kafka-main2007.codfw.wmnet with OS bullseye [14:18:17] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2008.codfw.wmnet with OS bullseye [14:18:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62698 and previous config saved to /var/cache/conftool/dbconfig/20240520-141828-root.json [14:19:06] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2009.codfw.wmnet with OS bullseye [14:19:09] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9812591 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main2008.codfw.wmnet with OS bullseye exec... [14:19:16] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9812594 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye exec... 
[14:19:27] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2010.codfw.wmnet with OS bullseye [14:19:34] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9812596 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye exec... [14:20:54] (03CR) 10Fabfur: [V:03+1 C:03+2] cache:benthos: test for socket based activation in Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur) [14:22:51] (03CR) 10Eevans: [C:03+1] "Insofar as I understand any of this, it LGTM 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [14:25:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T364299)', diff saved to https://phabricator.wikimedia.org/P62699 and previous config saved to /var/cache/conftool/dbconfig/20240520-142558-marostegui.json [14:26:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [14:26:04] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [14:26:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [14:26:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T364299)', diff saved to https://phabricator.wikimedia.org/P62700 and previous config saved to /var/cache/conftool/dbconfig/20240520-142621-marostegui.json [14:36:46] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:46] (03CR) 10Filippo Giunchedi: [C:03+2] Enable tracing for citoid and cxserver in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034047 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:38:57] !log mforns@deploy1002 Started deploy [analytics/refinery@4d42877]: Deploy Commons Impact Metrics query improvements [analytics/refinery@4d42877e] [14:39:20] (03PS1) 10Vgutierrez: hiera: Enable IPIP on high-traffic2@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1034090 (https://phabricator.wikimedia.org/T357257) [14:39:23] (03PS1) 10Vgutierrez: hiera: Enable IPIP on upload@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1034091 (https://phabricator.wikimedia.org/T357257) [14:39:36] !log filippo@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [14:39:39] !log filippo@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [14:39:45] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [14:39:48] (03CR) 10Scott French: [C:03+2] kubernetes: add data-gateway usernames for deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1032591 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [14:40:29] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [14:40:42] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [14:40:58] (03PS1) 10Vgutierrez: depool upload@codfw before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1034092 (https://phabricator.wikimedia.org/T357257) [14:41:15] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [14:41:22] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [14:41:45] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [14:41:54] !log filippo@deploy1002 helmfile [eqiad] START 
helmfile.d/services/citoid: apply [14:42:25] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [14:43:41] (03PS2) 10Scott French: wmnet: add data-gateway CNAME record for k8s ingress [dns] - 10https://gerrit.wikimedia.org/r/1032590 (https://phabricator.wikimedia.org/T364921) [14:46:44] (03CR) 10Scott French: [C:03+2] wmnet: add data-gateway CNAME record for k8s ingress [dns] - 10https://gerrit.wikimedia.org/r/1032590 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [14:48:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [14:48:26] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest2002'] [14:48:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest2002'] [14:48:51] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest2002'] [14:48:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest2002'] [14:49:04] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 4 CORE_DIFF 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/1034090 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:53:05] !log mforns@deploy1002 Finished deploy [analytics/refinery@4d42877]: Deploy Commons Impact Metrics query improvements [analytics/refinery@4d42877e] (duration: 14m 08s) [14:53:34] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034091 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) 
[14:53:52] (03CR) 10EoghanGaffney: lists: Add lists role to list2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [14:55:21] !log mforns@deploy1002 Started deploy [analytics/refinery@4d42877] (thin): Deploy Commons Impact Metrics query improvements THIN [analytics/refinery@4d42877e] [14:59:22] !log mforns@deploy1002 Finished deploy [analytics/refinery@4d42877] (thin): Deploy Commons Impact Metrics query improvements THIN [analytics/refinery@4d42877e] (duration: 04m 00s) [15:01:04] !log mforns@deploy1002 Started deploy [analytics/refinery@4d42877] (hadoop-test): Deploy Commons Impact Metrics query improvements TEST [analytics/refinery@4d42877e] [15:01:29] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2524/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [15:01:46] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:08] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS bullseye [15:02:19] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9812737 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye [15:03:45] (03PS1) 10Fabfur: hiera: test Benthos socket activation on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1034099 (https://phabricator.wikimedia.org/T364379) [15:03:58] (03PS2) 10Vgutierrez: depool upload@codfw 
before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1034092 (https://phabricator.wikimedia.org/T357257) [15:04:55] !log mforns@deploy1002 Finished deploy [analytics/refinery@4d42877] (hadoop-test): Deploy Commons Impact Metrics query improvements TEST [analytics/refinery@4d42877e] (duration: 03m 50s) [15:05:15] (03CR) 10Vgutierrez: [C:03+2] depool upload@codfw before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1034092 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [15:05:24] !log depool upload@codfw before enabling IPIP encapsulation - T357257 [15:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:28] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [15:08:04] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034099 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur) [15:13:01] (03CR) 10Vgutierrez: hiera: test Benthos socket activation on cp4037 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1034099 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur) [15:13:07] PROBLEM - WDQS SPARQL on wdqs1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:14:30] FIRING: [2x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:15:26] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on high-traffic2@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1034090 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [15:16:46] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:17:22] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on upload@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1034091 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [15:19:31] PROBLEM - Router interfaces on cr2-magru is CRITICAL: CRITICAL: host 195.200.68.129, interfaces up: 48, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:21:04] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest2002'] [15:21:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest2002'] [15:30:05] jan_drewniak: gettimeofday() says it's time for Wikimedia Portals Update. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240520T1530) [15:30:05] !log rolling restart of pybal on lvs2014 and lvs2012 - T357257 [15:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:13] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [15:30:36] 06SRE, 10conftool: confctl: log to SAL even if the selection doesn't match any host - https://phabricator.wikimedia.org/T155705#9812802 (10Volans) Fixing tags and subscribers. [15:31:29] (03CR) 10Scott French: [C:03+1] trafficserver: move commons-on-k8s to 100% [puppet] - 10https://gerrit.wikimedia.org/r/1034088 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [15:32:31] (03CR) 10Hnowlan: [C:03+2] trafficserver: move commons-on-k8s to 100% [puppet] - 10https://gerrit.wikimedia.org/r/1034088 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [15:33:14] !log move 100% of commons traffic to run on k8s [15:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:56] ooooh [15:34:21] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review, 10Puppet (Puppet 7.0): Spicerack puppetserver.destroy() raises an exception when certificate does not exist - https://phabricator.wikimedia.org/T360293#9812818 (10Volans) 05Open→03Resolved This is now live. 
[15:35:28] (03PS1) 10Vgutierrez: Revert "depool upload@codfw before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1032822 (https://phabricator.wikimedia.org/T357257) [15:36:42] (03PS1) 10Zabe: trafficserver: Move test-commons to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034106 [15:37:29] (03CR) 10BBlack: [C:03+1] "Will improve my image latency :P" [dns] - 10https://gerrit.wikimedia.org/r/1032822 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [15:38:08] (03CR) 10Vgutierrez: [C:03+2] Revert "depool upload@codfw before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1032822 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [15:38:16] !log repool upload@codfw with IPIP encapsulation enabled - T357257 [15:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:20] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [15:38:23] (03PS2) 10Zabe: trafficserver: Move test-commons to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034106 [15:38:45] rzl, arnoldokoth: ^^ [15:39:26] 👍🏾 [15:41:13] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye [15:41:23] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9812844 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye [15:41:40] (03CR) 10Scott French: [C:03+2] admin_ng: add namespace for data-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032594 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [15:42:46] (03PS1) 10Jforrester: x-wikimedia-debug: Re-label 'k8s-experimental' as 'k8s-mwdebug' [puppet] - 10https://gerrit.wikimedia.org/r/1034107 
(https://phabricator.wikimedia.org/T362662) [15:42:47] (03PS1) 10Jforrester: x-wikimedia-debug: Drop old 'k8s-experimental' alias label [puppet] - 10https://gerrit.wikimedia.org/r/1034108 (https://phabricator.wikimedia.org/T362662) [15:42:49] (03PS1) 10Jforrester: arclamp: Update description for k8s-mwdebug values [puppet] - 10https://gerrit.wikimedia.org/r/1034109 [15:43:54] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9812866 (10hnowlan) [15:44:44] (03Merged) 10jenkins-bot: admin_ng: add namespace for data-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032594 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [15:48:22] !log swfrench@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:50:20] !log swfrench@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:50:47] !log swfrench@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:51:49] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2010.codfw.wmnet with OS bullseye [15:52:07] !log swfrench@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:52:42] (03CR) 10Effie Mouzeli: [C:03+1] appservers: 6 appservers to insetup before reimaging [puppet] - 10https://gerrit.wikimedia.org/r/1032805 (https://phabricator.wikimedia.org/T353464) (owner: 10Hnowlan) [15:52:47] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:52:50] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9812921 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye exec... 
[15:53:06] (03CR) 10Effie Mouzeli: [C:03+1] blubber: update to use buildkit [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1032481 (owner: 10Elukey) [15:54:56] (03PS4) 10Effie Mouzeli: ipoid: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031105 (https://phabricator.wikimedia.org/T346638) (owner: 10Scott French) [15:55:03] (03CR) 10Elukey: [C:03+2] blubber: update to use buildkit [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1032481 (owner: 10Elukey) [15:55:24] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:56:14] (03Merged) 10jenkins-bot: blubber: update to use buildkit [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1032481 (owner: 10Elukey) [15:56:33] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:58:03] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:59:57] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9812945 (10Dzahn) I don't see any failed login attempt for user "ecarg" on deploy1002 or any of the bastion hosts. So it's very likely about the local SSH config, wrong user name or wr... [16:00:55] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2009.codfw.wmnet with reason: host reimage [16:02:00] (03CR) 10BCornwall: [C:03+2] hiera: Set p::contacts::role_contacts for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/1033705 (owner: 10Vgutierrez) [16:02:31] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9812954 (10Dzahn) a:05Dzahn→03None If you could paste the full command you are running and ideally your local ssh config as well we will be able to debug this more. also see https... 
[16:03:52] (03CR) 10Effie Mouzeli: memcached: add memcache user option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) (owner: 10Effie Mouzeli) [16:04:02] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2009.codfw.wmnet with reason: host reimage [16:04:51] (03PS4) 10Matěj Suchánek: Remove deprecated abuse filter fields [puppet] - 10https://gerrit.wikimedia.org/r/1032809 (https://phabricator.wikimedia.org/T361996) [16:05:02] (03CR) 10Matěj Suchánek: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1032809 (https://phabricator.wikimedia.org/T361996) (owner: 10Matěj Suchánek) [16:05:41] (03CR) 10BCornwall: [C:03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1033705 (owner: 10Vgutierrez) [16:05:51] (03CR) 10Dzahn: "Additionally, even IF we'd say we need this check then the issue would still be "but Icinga doesn't do anything to tell us about it beside" [puppet] - 10https://gerrit.wikimedia.org/r/1032526 (owner: 10Dzahn) [16:06:46] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:08:45] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 37 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:09:53] 10SRE-tools, 10Spicerack: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372 (10Volans) 03NEW p:05Triage→03Medium [16:10:05] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - 
https://phabricator.wikimedia.org/T354410#9813004 (10Volans) [16:10:14] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9813003 (10Jdforrester-WMF) [16:10:27] 06SRE, 10Cumin, 06Infrastructure-Foundations, 10netbox, 13Patch-For-Review: Cumin: add backend for Netbox - https://phabricator.wikimedia.org/T205900#9813005 (10Volans) [16:10:38] (03PS12) 10Effie Mouzeli: memcached: add memcache user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) [16:13:43] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 26 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:14:09] (03CR) 10Alexandros Kosiaris: [C:03+2] x-wikimedia-debug: Re-label 'k8s-experimental' as 'k8s-mwdebug' [puppet] - 10https://gerrit.wikimedia.org/r/1034107 (https://phabricator.wikimedia.org/T362662) (owner: 10Jforrester) [16:14:13] (03PS6) 10Effie Mouzeli: memcached: run as user memcache on mc-gp2003 [puppet] - 10https://gerrit.wikimedia.org/r/1032495 (https://phabricator.wikimedia.org/T273950) [16:14:22] (03PS1) 10Ilias Sarantopoulos: ml-services: increase min replicas for ruwiki-goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034114 (https://phabricator.wikimedia.org/T362503) [16:14:26] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9813010 (10hnowlan) [16:15:13] 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9813020 (10Scott_French) Added k8s secret 
for the data_gateway role to private puppet in 3fcaf85cbd9341e339e2506acbe2cefe880d... [16:15:50] (03PS1) 10JHathaway: add mpic dummy secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1034115 [16:16:26] (03CR) 10Elukey: [C:03+1] ml-services: increase min replicas for ruwiki-goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034114 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [16:16:38] (03CR) 10Kevin Bazira: [C:03+1] ml-services: increase min replicas for ruwiki-goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034114 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [16:16:45] (03CR) 10JHathaway: [C:03+2] add mpic dummy secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1034115 (owner: 10JHathaway) [16:16:49] (03CR) 10JHathaway: [V:03+2 C:03+2] add mpic dummy secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1034115 (owner: 10JHathaway) [16:17:12] (03PS1) 10Alexandros Kosiaris: kafka-main: Switch to reuse [puppet] - 10https://gerrit.wikimedia.org/r/1034116 (https://phabricator.wikimedia.org/T363212) [16:19:41] (03CR) 10Alexandros Kosiaris: [C:04-1] mobileapps: Use mesh modules version enabling IPv6 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032779 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [16:21:19] (03CR) 10Scott French: [C:03+2] service: add data-gateway service (k8s ingress) [puppet] - 10https://gerrit.wikimedia.org/r/1032592 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [16:21:23] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2009.codfw.wmnet with OS bullseye [16:21:34] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9813059 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host 
kafka-main2009.codfw.wmnet with OS bullseye comp... [16:21:47] (03PS13) 10Effie Mouzeli: memcached: add memcache user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) [16:22:14] (03PS7) 10Effie Mouzeli: memcached: run as user memcache on mc-gp2003 [puppet] - 10https://gerrit.wikimedia.org/r/1032495 (https://phabricator.wikimedia.org/T273950) [16:24:17] (03PS4) 10Jdlrobson: Disable wgParserEnableLegacyMediaDOM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031610 (https://phabricator.wikimedia.org/T363597) [16:24:55] (03PS2) 10Alexandros Kosiaris: kafka-main: Switch to reuse [puppet] - 10https://gerrit.wikimedia.org/r/1034116 (https://phabricator.wikimedia.org/T363212) [16:25:07] (03PS14) 10Effie Mouzeli: memcached: add memcache user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) [16:25:30] (03PS8) 10Effie Mouzeli: memcached: run as user memcache on mc-gp2003 [puppet] - 10https://gerrit.wikimedia.org/r/1032495 (https://phabricator.wikimedia.org/T273950) [16:26:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T352010)', diff saved to https://phabricator.wikimedia.org/P62702 and previous config saved to /var/cache/conftool/dbconfig/20240520-162640-ladsgroup.json [16:26:45] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:26:55] (03PS3) 10Alexandros Kosiaris: kafka-main: Switch to reuse [puppet] - 10https://gerrit.wikimedia.org/r/1034116 (https://phabricator.wikimedia.org/T363212) [16:28:02] (03PS1) 10Jforrester: x-wikimedia-debug: Update k8s-mwdebug label, move to front [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034118 (https://phabricator.wikimedia.org/T362662) [16:28:17] (03PS2) 10Jdlrobson: Disable last remaining projects using share user scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031458 (https://phabricator.wikimedia.org/T301212) 
[16:28:20] (03PS2) 10Jforrester: x-wikimedia-debug: Drop old 'k8s-experimental' alias label [puppet] - 10https://gerrit.wikimedia.org/r/1034108 (https://phabricator.wikimedia.org/T362662) [16:28:23] (03PS2) 10Jdlrobson: Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) [16:28:34] (03CR) 10CI reject: [V:04-1] Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [16:28:37] (03CR) 10CI reject: [V:04-1] memcached: add memcache user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) (owner: 10Effie Mouzeli) [16:29:47] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:31:54] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sretest2002 to codfw - jhancock@cumin2002" [16:32:19] (03CR) 10Alexandros Kosiaris: [C:03+1] "Removing my own -1, this was actually fixed in I9a80af4b7c283e08606764b11d8d6885e90d716c" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032779 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [16:32:36] (03CR) 10Alexandros Kosiaris: [C:03+2] mobileapps: Use mesh modules version enabling IPv6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032779 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [16:32:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sretest2002 to codfw - jhancock@cumin2002" [16:32:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:32:57] (03CR) 10Alexandros Kosiaris: [C:03+2] kafka-main: Switch to reuse [puppet] - 10https://gerrit.wikimedia.org/r/1034116 (https://phabricator.wikimedia.org/T363212) (owner: 
10Alexandros Kosiaris) [16:33:42] (03Merged) 10jenkins-bot: mobileapps: Use mesh modules version enabling IPv6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032779 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [16:33:55] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:33:57] (03CR) 10JHathaway: [C:03+1] "looks good, one small suggestion" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1032482 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [16:35:16] (03CR) 10Hnowlan: appservers: 6 appservers to insetup before reimaging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1032805 (https://phabricator.wikimedia.org/T353464) (owner: 10Hnowlan) [16:35:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [16:35:52] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [16:36:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [16:36:46] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [16:37:14] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1006.eqiad.wmnet with OS bullseye [16:37:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9813144 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main1006.eqiad.wmn... 
[16:37:39] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1007.eqiad.wmnet with OS bullseye [16:38:11] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1008.eqiad.wmnet with OS bullseye [16:38:16] (03PS15) 10Effie Mouzeli: memcached: add memcache user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) [16:38:57] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye [16:39:08] 10ops-magru: magru: add PDUs to Netbox - https://phabricator.wikimedia.org/T364628#9813146 (10RobH) Added https://netbox.wikimedia.org/dcim/device-types/289/ to netbox. Checking the elevation doc, the asset tag and serials are missing for the PDUs: https://docs.google.com/spreadsheets/d/1FiRfGo9wMXTvIcT5tIclQ2R... [16:39:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [16:39:31] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [16:40:22] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS bullseye [16:40:56] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2007.codfw.wmnet with OS bullseye [16:41:43] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2008.codfw.wmnet with OS bullseye [16:41:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P62703 and previous config saved to /var/cache/conftool/dbconfig/20240520-164148-ladsgroup.json [16:42:04] (03PS16) 10Effie Mouzeli: memcached: add memcache user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) [16:42:17] !log akosiaris@cumin1002 START - Cookbook 
sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS bullseye [16:44:08] (03PS9) 10Effie Mouzeli: memcached: run as user memcache on mc-gp2003 [puppet] - 10https://gerrit.wikimedia.org/r/1032495 (https://phabricator.wikimedia.org/T273950) [16:46:58] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:47:51] (03PS2) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-04-17-163312 to 2024-05-13-145903 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031004 (https://phabricator.wikimedia.org/T282716) [16:48:01] (03CR) 10Effie Mouzeli: "PCC looks ok https://puppet-compiler.wmflabs.org/output/1032495/2532/ (run against 1032495 which is the next patch of this branch)" [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) (owner: 10Effie Mouzeli) [16:48:17] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-04-17-163312 to 2024-05-13-145903 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031004 (https://phabricator.wikimedia.org/T282716) (owner: 10Jforrester) [16:48:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:49:06] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-04-17-163312 to 2024-05-13-145903 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031004 (https://phabricator.wikimedia.org/T282716) (owner: 10Jforrester) [16:50:09] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:50:54] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:51:33] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:52:03] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [16:52:11] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [16:52:46] !log jforrester@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/wikifunctions: apply [16:52:55] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:53:05] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [16:53:42] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [16:53:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [16:53:59] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [16:54:04] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:54:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [16:54:30] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [16:54:37] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1006.eqiad.wmnet with reason: host reimage [16:54:42] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1007.eqiad.wmnet with reason: host reimage [16:54:42] (03PS2) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-04-18-150843 to 2024-05-13-145650 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031005 (https://phabricator.wikimedia.org/T282716) [16:55:07] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2024-04-18-150843 to 2024-05-13-145650 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031005 (https://phabricator.wikimedia.org/T282716) (owner: 10Jforrester) [16:55:09] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1008.eqiad.wmnet with reason: host reimage [16:55:44] !log robh@cumin2002 START - Cookbook sre.dns.netbox [16:56:31] (03Merged) 
10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-04-18-150843 to 2024-05-13-145650 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031005 (https://phabricator.wikimedia.org/T282716) (owner: 10Jforrester) [16:56:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P62704 and previous config saved to /var/cache/conftool/dbconfig/20240520-165656-ladsgroup.json [16:56:59] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [16:57:07] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:57:13] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:57:29] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [16:57:58] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1006.eqiad.wmnet with reason: host reimage [16:58:16] !log taavi@cumin1002 START - Cookbook sre.wikireplicas.update-views [16:58:46] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:59:12] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240520T1700) [17:00:05] ryankemper: Time to do the Wikidata Query Service weekly deploy deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240520T1700). 
[17:00:10] (03PS1) 10Ladsgroup: mariadb: Bump alerting for replag [puppet] - 10https://gerrit.wikimedia.org/r/1034124 [17:00:23] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2006.codfw.wmnet with reason: host reimage [17:00:29] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2007.codfw.wmnet with reason: host reimage [17:00:46] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1008.eqiad.wmnet with reason: host reimage [17:01:25] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2008.codfw.wmnet with reason: host reimage [17:01:31] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [17:01:51] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [17:02:06] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2010.codfw.wmnet with reason: host reimage [17:03:40] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2007.codfw.wmnet with reason: host reimage [17:03:48] (03CR) 10Ladsgroup: "Migrating this to alertmanager should be actually easy. Future-me problem though." 
[puppet] - 10https://gerrit.wikimedia.org/r/1034124 (owner: 10Ladsgroup) [17:03:57] !log taavi@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [17:04:00] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [17:06:51] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1007.eqiad.wmnet with reason: host reimage [17:07:54] (03PS5) 10Matěj Suchánek: Remove deprecated abuse filter fields [puppet] - 10https://gerrit.wikimedia.org/r/1032809 (https://phabricator.wikimedia.org/T361996) [17:08:07] (03CR) 10Ladsgroup: [C:03+2] Remove deprecated abuse filter fields [puppet] - 10https://gerrit.wikimedia.org/r/1032809 (https://phabricator.wikimedia.org/T361996) (owner: 10Matěj Suchánek) [17:08:10] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Remove deprecated abuse filter fields [puppet] - 10https://gerrit.wikimedia.org/r/1032809 (https://phabricator.wikimedia.org/T361996) (owner: 10Matěj Suchánek) [17:09:26] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2010.codfw.wmnet with reason: host reimage [17:11:41] (03CR) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [17:11:55] (03PS25) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) [17:12:05] (03PS26) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) [17:12:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T352010)', diff saved to 
https://phabricator.wikimedia.org/P62705 and previous config saved to /var/cache/conftool/dbconfig/20240520-171204-ladsgroup.json [17:12:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [17:12:09] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:12:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [17:12:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1183 (T352010)', diff saved to https://phabricator.wikimedia.org/P62706 and previous config saved to /var/cache/conftool/dbconfig/20240520-171228-ladsgroup.json [17:12:41] (03CR) 10CI reject: [V:04-1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [17:13:01] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2008.codfw.wmnet with reason: host reimage [17:14:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1006.eqiad.wmnet with OS bullseye [17:16:20] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2006.codfw.wmnet with reason: host reimage [17:16:53] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - akosiaris@cumin1002" [17:21:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2007.codfw.wmnet with OS bullseye [17:23:12] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - akosiaris@cumin1002" 
[17:25:28] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2010.codfw.wmnet with OS bullseye [17:25:53] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1032872/2533/lists1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1032872 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [17:30:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2008.codfw.wmnet with OS bullseye [17:31:03] (03CR) 10Fabfur: [V:03+1] hiera: test Benthos socket activation on cp4037 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1034099 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur) [17:33:46] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2006.codfw.wmnet with OS bullseye [17:33:52] (03CR) 10Dzahn: [V:03+1 C:03+2] lists: add timer to sync data from stewards hosts [puppet] - 10https://gerrit.wikimedia.org/r/1032872 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [17:34:08] (03CR) 10Marostegui: [C:03+1] mariadb: Bump alerting for replag [puppet] - 10https://gerrit.wikimedia.org/r/1034124 (owner: 10Ladsgroup) [17:35:54] (03CR) 10Ladsgroup: "I think there is a bit of misunderstanding here. If I206f51c6e3cb92c2b gets merged. 
We can fully override the mw footer icon and avoid dou" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [17:38:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62707 and previous config saved to /var/cache/conftool/dbconfig/20240520-173831-root.json [17:39:12] (03PS2) 10Ladsgroup: mariadb: Bump alerting for replag [puppet] - 10https://gerrit.wikimedia.org/r/1034124 [17:39:18] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Bump alerting for replag [puppet] - 10https://gerrit.wikimedia.org/r/1034124 (owner: 10Ladsgroup) [17:40:48] Amir1: yours is already merged now [17:41:01] awesome. thanks! [17:42:12] (03CR) 10Dzahn: [V:03+1 C:03+2] "[lists1001:~] $ sudo systemctl start stewards_subscriber_data_sync" [puppet] - 10https://gerrit.wikimedia.org/r/1032872 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [17:42:30] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@b977332]: (no justification provided) [17:42:57] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@b977332]: (no justification provided) (duration: 00m 27s) [17:47:56] (03PS1) 10Jsn.sherman: CommonSettings: Load AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034131 (https://phabricator.wikimedia.org/T361643) [17:47:57] (03PS1) 10Jsn.sherman: InitializeSettings: testwiki enable AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034132 (https://phabricator.wikimedia.org/T364034) [17:50:17] (03PS3) 10Scott French: services: add data-gateway service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) [17:51:07] (03PS2) 10Jsn.sherman: InitializeSettings: testwiki enable AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034132 (https://phabricator.wikimedia.org/T361643) [17:52:09] (03CR) 
10Scott French: "Thank you both for the review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [17:53:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62708 and previous config saved to /var/cache/conftool/dbconfig/20240520-175337-root.json [17:58:41] (03CR) 10Krinkle: arclamp: Update description for k8s-mwdebug values (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1034109 (owner: 10Jforrester) [17:59:13] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1010.eqiad.wmnet with OS bullseye [17:59:15] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - akosiaris@cumin1002" [17:59:16] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1007.eqiad.wmnet with OS bullseye [17:59:30] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - akosiaris@cumin1002" [17:59:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1008.eqiad.wmnet with OS bullseye [18:00:10] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye [18:02:27] (03PS2) 10Jforrester: arclamp: Update description for k8s-mwdebug values [puppet] - 10https://gerrit.wikimedia.org/r/1034109 [18:03:05] (03PS3) 10Jforrester: arclamp: Update description for k8s values [puppet] - 10https://gerrit.wikimedia.org/r/1034109 [18:03:11] (03PS4) 10Jforrester: arclamp: Update description for k8s values [puppet] - 10https://gerrit.wikimedia.org/r/1034109 [18:03:17] (03CR) 10Jforrester: arclamp: Update 
description for k8s values (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1034109 (owner: 10Jforrester) [18:04:10] (03CR) 10Scott French: [C:03+2] citoid: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030191 (https://phabricator.wikimedia.org/T346638) (owner: 10Scott French) [18:05:00] (03Merged) 10jenkins-bot: citoid: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030191 (https://phabricator.wikimedia.org/T346638) (owner: 10Scott French) [18:06:46] FIRING: HelmReleaseBadStatus: Helm release datasets-config-next/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datasets-config-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:06:46] (03CR) 10Jforrester: "> I think there is a bit of misunderstanding here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [18:08:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62709 and previous config saved to /var/cache/conftool/dbconfig/20240520-180844-root.json [18:09:48] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [18:10:28] (03CR) 10Andrea Denisse: "Thanks for taking a look. I tried it with centrallog1002 as it has a RAID array." 
[puppet] - 10https://gerrit.wikimedia.org/r/1032608 (https://phabricator.wikimedia.org/T267664) (owner: 10Andrea Denisse) [18:11:37] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [18:15:42] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [18:16:46] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [18:17:17] (03PS8) 10Kimberly Sarabia: Remove readability survey tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) [18:19:28] (03PS1) 10Dzahn: lists/stewards: add timer to run mailman syncmembers for stewards-l [puppet] - 10https://gerrit.wikimedia.org/r/1034137 (https://phabricator.wikimedia.org/T351202) [18:20:41] (03CR) 10AOkoth: [C:03+2] vrts: aesthetic code improvements [puppet] - 10https://gerrit.wikimedia.org/r/1033657 (owner: 10AOkoth) [18:20:57] (03CR) 10Dzahn: "That weird "command => @(CMD/L)," syntax is heredoc in Puppet to have line breaks within a command line and still avoid the ugly way t" [puppet] - 10https://gerrit.wikimedia.org/r/1034137 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [18:23:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62710 and previous config saved to /var/cache/conftool/dbconfig/20240520-182350-root.json [18:24:46] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:24:46] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:24:54] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:25:02] PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv6: Connect - Telxius, AS12956/IPv4: 
Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:26:57] (03CR) 10Jeena Huneidi: [C:03+2] Remove broken deploy.sh script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032802 (https://phabricator.wikimedia.org/T305033) (owner: 10Hashar) [18:27:29] (03Merged) 10jenkins-bot: Remove broken deploy.sh script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032802 (https://phabricator.wikimedia.org/T305033) (owner: 10Hashar) [18:28:56] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [18:29:55] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [18:31:46] RECOVERY - BFD status on cr2-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:31:46] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:31:56] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:32:02] RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:38:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62711 and previous config saved to /var/cache/conftool/dbconfig/20240520-183856-root.json [18:43:31] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:43:35] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:51:03] (03PS1) 10Matthias Mullie: Fix automatic numbering of copied titles [extensions/UploadWizard] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032823 (https://phabricator.wikimedia.org/T365107) [18:51:24] (03PS1) 10Matthias Mullie: Remove complicated 
synchronization of caption/description inputs [extensions/UploadWizard] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032824 (https://phabricator.wikimedia.org/T365119) [18:54:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62712 and previous config saved to /var/cache/conftool/dbconfig/20240520-185402-root.json [19:02:08] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:02:14] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:03:55] 10ops-codfw, 06SRE, 06DC-Ops: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T365379#9813863 (10phaultfinder) [19:04:43] (03PS1) 10Ebernhardson: cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034140 [19:08:15] (03CR) 10Ebernhardson: [C:03+2] cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034140 (owner: 10Ebernhardson) [19:08:26] (03PS1) 10Dzahn: admin: create user ceec (cstone) and add to fr-tech-devs [puppet] - 10https://gerrit.wikimedia.org/r/1034141 (https://phabricator.wikimedia.org/T365214) [19:09:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62713 and previous config saved to /var/cache/conftool/dbconfig/20240520-190908-root.json [19:09:12] (03CR) 10CI reject: [V:04-1] admin: create user ceec (cstone) and add to fr-tech-devs [puppet] - 10https://gerrit.wikimedia.org/r/1034141 (https://phabricator.wikimedia.org/T365214) (owner: 10Dzahn) [19:09:13] (03Merged) 10jenkins-bot: cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034140 (owner: 10Ebernhardson) [19:10:43] (03PS2) 10Dzahn: admin: create user ceec (cstone) and add to fr-tech-devs [puppet] - 10https://gerrit.wikimedia.org/r/1034141 
(https://phabricator.wikimedia.org/T365214) [19:11:30] (03CR) 10CI reject: [V:04-1] admin: create user ceec (cstone) and add to fr-tech-devs [puppet] - 10https://gerrit.wikimedia.org/r/1034141 (https://phabricator.wikimedia.org/T365214) (owner: 10Dzahn) [19:12:37] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9813878 (10Dzahn) a:03KOfori Please let me know if it's approved and assign back to me. Thanks! [19:12:58] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9813881 (10Dzahn) 05Open→03In progress p:05Triage→03High [19:14:37] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for sg912 - https://phabricator.wikimedia.org/T365118#9813907 (10Dzahn) a:05Eevans→03Dzahn Taking over as this week's clinic duty. We are still waiting for both approvals, please. [19:14:45] FIRING: [2x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:14:57] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for sg912 - https://phabricator.wikimedia.org/T365118#9813909 (10Dzahn) 05Open→03In progress p:05Triage→03High [19:16:46] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:20:25] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1010.eqiad.wmnet with OS bullseye [19:23:49] !log @deploy1002 helmfile [eqiad] 
START helmfile.d/services/cirrus-streaming-updater: apply [19:23:53] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:25:54] (03CR) 10Scott French: [C:03+2] cxserver: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030195 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [19:26:48] (03Merged) 10jenkins-bot: cxserver: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030195 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [19:31:06] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [19:31:28] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [19:32:36] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [19:33:14] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [19:44:28] RECOVERY - WDQS SPARQL on wdqs1014 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:45:33] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [19:46:14] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [19:46:15] 06SRE, 06Infrastructure-Foundations, 10Mail: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395 (10jhathaway) 03NEW [19:49:30] RESOLVED: [2x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:50:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming 
Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:51:52] 10ops-codfw, 06SRE, 06DC-Ops: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T365379#9814067 (10phaultfinder) [19:52:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T352010)', diff saved to https://phabricator.wikimedia.org/P62714 and previous config saved to /var/cache/conftool/dbconfig/20240520-195224-ladsgroup.json [19:52:30] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240520T2000). [20:00:05] kimberly_sarabia and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:44] i can deploy today [20:00:51] Thanks. I'm here [20:01:27] (03PS6) 10Kimberly Sarabia: Introduce sample overrides to web_ui_actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) [20:01:32] (03CR) 10Urbanecm: [C:03+2] Introduce sample overrides to web_ui_actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia) [20:01:36] (03CR) 10Scott French: "Thanks for the review, Riccardo." [software/conftool] - 10https://gerrit.wikimedia.org/r/1032849 (https://phabricator.wikimedia.org/T365123) (owner: 10Scott French) [20:01:38] hi kimberly_sarabia :) [20:01:48] Jdlrobson: hi, are you around? 
:) [20:02:08] (03Merged) 10jenkins-bot: Introduce sample overrides to web_ui_actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia) [20:02:13] urbanecm: hello! I told jon I can test for him [20:02:32] kimberly_sarabia: oh, okay. so you're going to test all four patches then? [20:03:30] Yup. unless he chimes in now. [20:03:58] ack [20:04:01] urbanecm: yep [20:04:05] (03PS5) 10Jdlrobson: Disable wgParserEnableLegacyMediaDOM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031610 (https://phabricator.wikimedia.org/T363597) [20:04:10] (03CR) 10Urbanecm: [C:03+2] Disable wgParserEnableLegacyMediaDOM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031610 (https://phabricator.wikimedia.org/T363597) (owner: 10Jdlrobson) [20:04:33] (03PS3) 10Jdlrobson: Disable last remaining projects using share user scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031458 (https://phabricator.wikimedia.org/T301212) [20:04:36] (03CR) 10Urbanecm: [C:03+2] Disable last remaining projects using share user scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031458 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [20:04:50] (03Merged) 10jenkins-bot: Disable wgParserEnableLegacyMediaDOM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031610 (https://phabricator.wikimedia.org/T363597) (owner: 10Jdlrobson) [20:04:53] i love how wikibugs respects the impersonate-on-rebase [20:05:28] (03PS4) 10Jdlrobson: Disable last remaining projects using share user scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031458 (https://phabricator.wikimedia.org/T301212) [20:05:42] (03CR) 10Urbanecm: Disable last remaining projects using share user scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031458 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [20:05:45] (03CR) 10Urbanecm: [C:03+2] Disable last remaining projects using share 
user scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031458 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [20:06:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031458 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [20:06:30] (03Merged) 10jenkins-bot: Disable last remaining projects using share user scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031458 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [20:06:48] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1024813|Introduce sample overrides to web_ui_actions (T361962)]], [[gerrit:1031610|Disable wgParserEnableLegacyMediaDOM (T363597)]], [[gerrit:1031458|Disable last remaining projects using share user scripts (T301212)]] [20:07:00] T361962: Update Sample Rates for Metrics Platform Events - https://phabricator.wikimedia.org/T361962 [20:07:00] T363597: Change the heading markup for 3rd party and Minerva skins - https://phabricator.wikimedia.org/T363597 [20:07:01] T301212: Vector-2022.js should no longer load legacy Vector site and user scripts/styles - https://phabricator.wikimedia.org/T301212 [20:07:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P62715 and previous config saved to /var/cache/conftool/dbconfig/20240520-200732-ladsgroup.json [20:07:44] hi - is there someone already deploying? i got logged out of irc and just re-joining [20:08:03] cjming: yes urbanecm is [20:08:15] hi cjming ! 
[20:08:19] * cjming bows to urbanecm [20:08:43] * urbanecm bows back to cjming [20:09:25] !log urbanecm@deploy1002 urbanecm and jdlrobson and ksarabia: Backport for [[gerrit:1024813|Introduce sample overrides to web_ui_actions (T361962)]], [[gerrit:1031610|Disable wgParserEnableLegacyMediaDOM (T363597)]], [[gerrit:1031458|Disable last remaining projects using share user scripts (T301212)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:09:40] kimberly_sarabia: please test [20:09:47] ok [20:11:27] urbanecm: LGTM [20:11:31] !log urbanecm@deploy1002 urbanecm and jdlrobson and ksarabia: Continuing with sync [20:11:34] proceeding [20:13:56] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9814160 (10stjn) >>! In T275319#9813756, @Ladsgroup wrote: > While non-Latin characters take twice as space, since Arabic and Hebr... [20:14:09] (03PS9) 10Kimberly Sarabia: Remove readability survey tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) [20:15:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:16:19] Hang on... Disable last remaining projects using share user scripts may need a little longer (but not end of world if it's syncing..) it's just not working as expected for me [20:17:05] Jdlrobson: it's midway through already [20:17:12] should i revert it? [20:17:55] no keep going [20:18:01] i might need a follow up if it's not working though [20:18:37] urbanecm: shoot yeh it's broken [20:18:40] i see what happened [20:18:46] broken how? 
[20:19:17] basically all Vector.js scripts are going to load on Vector 2022... which should be fine provided we backport quickly (as that's the way it's been for several years up until this point) [20:19:55] :/ [20:20:11] so...are we keeping it that way? or reverting? [20:20:12] not sure now [20:20:34] (03PS1) 10Jdlrobson: wgVectorShareUserScripts should be false now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034149 (https://phabricator.wikimedia.org/T301212) [20:20:42] oh [20:20:45] ^ urbanecm basically i made a mistake here [20:20:47] there is a diff property :) [20:21:12] it should be okay if the following state is in production for a little bit until that merges [20:21:23] i'll keep an eye on logstash [20:22:25] i'll add the above to wikitech:Deployments ? [20:22:40] (03CR) 10Urbanecm: [C:03+2] Remove readability survey tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) (owner: 10Kimberly Sarabia) [20:22:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P62716 and previous config saved to /var/cache/conftool/dbconfig/20240520-202240-ladsgroup.json [20:22:48] (03PS2) 10Jdlrobson: wgVectorShareUserScripts should be false now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034149 (https://phabricator.wikimedia.org/T301212) [20:22:50] (03CR) 10Urbanecm: [C:03+2] wgVectorShareUserScripts should be false now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034149 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [20:22:56] Jdlrobson: please do [20:22:58] i'll deploy [20:23:04] ...once i get shell again :) [20:23:14] Done! 
[20:23:19] (03Merged) 10jenkins-bot: Remove readability survey tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) (owner: 10Kimberly Sarabia) [20:23:28] ty [20:23:30] (03Merged) 10jenkins-bot: wgVectorShareUserScripts should be false now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034149 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [20:24:21] oh, i can run queue a new scap backport in. interesting! [20:25:06] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1024813|Introduce sample overrides to web_ui_actions (T361962)]], [[gerrit:1031610|Disable wgParserEnableLegacyMediaDOM (T363597)]], [[gerrit:1031458|Disable last remaining projects using share user scripts (T301212)]] (duration: 18m 18s) [20:25:15] T361962: Update Sample Rates for Metrics Platform Events - https://phabricator.wikimedia.org/T361962 [20:25:18] T363597: Change the heading markup for 3rd party and Minerva skins - https://phabricator.wikimedia.org/T363597 [20:25:19] T301212: Vector-2022.js should no longer load legacy Vector site and user scripts/styles - https://phabricator.wikimedia.org/T301212 [20:25:27] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:980063|Remove readability survey tool (T349337)]], [[gerrit:1034149|wgVectorShareUserScripts should be false now (T301212)]] [20:25:48] T349337: Remove community readability survey tool code from production - https://phabricator.wikimedia.org/T349337 [20:26:08] (03PS3) 10Dzahn: admin: create user ceec (cstone) and add to fr-tech-devs [puppet] - 10https://gerrit.wikimedia.org/r/1034141 (https://phabricator.wikimedia.org/T365214) [20:28:03] !log urbanecm@deploy1002 ksarabia and jdlrobson and urbanecm: Backport for [[gerrit:980063|Remove readability survey tool (T349337)]], [[gerrit:1034149|wgVectorShareUserScripts should be false now (T301212)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:28:16] Jdlrobson: 
please take a look :) [20:28:19] and kimberly_sarabia ^^ [20:29:00] lookiinngg [20:29:06] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801#9814186 (10Dzahn) [20:30:00] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801#9814188 (10Dzahn) 05Open→03In progress p:05Triage→03High confirmed per https://app.betterworks.com/app/#/profile/330968 [20:30:30] yep that one is doing what i expected! please sync urbanecm ! [20:30:35] !log urbanecm@deploy1002 ksarabia and jdlrobson and urbanecm: Continuing with sync [20:30:37] proceeding [20:34:01] 06SRE, 06Infrastructure-Foundations, 10Mail: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9814198 (10jhathaway) p:05Triage→03Medium [20:35:10] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801#9814201 (10Dzahn) @odimitrijevic or @ahoelzl This needs an approval from the group owner (unless that is skipped when there is no shell access, but I did... 
[20:36:46] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:37:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T352010)', diff saved to https://phabricator.wikimedia.org/P62717 and previous config saved to /var/cache/conftool/dbconfig/20240520-203748-ladsgroup.json [20:37:50] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [20:37:53] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:38:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [20:38:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T352010)', diff saved to https://phabricator.wikimedia.org/P62718 and previous config saved to /var/cache/conftool/dbconfig/20240520-203811-ladsgroup.json [20:38:45] 06SRE, 10SRE-Access-Requests: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9814209 (10Dzahn) @MareikeHeuerWMDE Could you please send an email to Katie Francis (https://meta.wikimedia.org/wiki/User:KFrancis_(WMF))? She will... [20:39:50] 06SRE, 10SRE-Access-Requests: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9814218 (10Dzahn) p:05Triage→03High [20:40:13] thanks urbanecm and sorry for the slight hiccup! 
[20:40:19] np [20:40:30] 06SRE, 10SRE-Access-Requests: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9814215 (10Dzahn) 05Open→03In progress a:05CDanis→03Dzahn Taking over as this week's clinic duty. [20:40:50] should be finished in a few [20:44:02] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:980063|Remove readability survey tool (T349337)]], [[gerrit:1034149|wgVectorShareUserScripts should be false now (T301212)]] (duration: 18m 34s) [20:44:13] T349337: Remove community readability survey tool code from production - https://phabricator.wikimedia.org/T349337 [20:44:14] T301212: Vector-2022.js should no longer load legacy Vector site and user scripts/styles - https://phabricator.wikimedia.org/T301212 [20:44:15] and we're live [20:44:21] anything else Jdlrobson kimberly_sarabia ? [20:44:38] urbanecm: none from me. thanks [20:44:43] np [20:47:52] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9814253 (10Dzahn) Hi! If possible could you please specify which of the following you are requesting? a) analytics-privatedata-users (no kerberos, no ssh) b) analytics-pr... [20:49:35] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9814259 (10Dzahn) @odimitrijevic or @Ahoelzl This is another request for approval for group owners of analytics-privatedata-users. Thanks! 
[20:51:30] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:51:32] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:53:10] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9814269 (10Dzahn) 05Open→03In progress p:05Triage→03High [20:55:46] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for JayCano - https://phabricator.wikimedia.org/T365349#9814286 (10Dzahn) I can't find a trace of the previous ticket / access. But I can confirm employee status per https://app.betterworks.com/app/#/profile/374585 [20:56:22] (03CR) 10Dzahn: [C:03+2] admin: create user ceec (cstone) and add to fr-tech-devs [puppet] - 10https://gerrit.wikimedia.org/r/1034141 (https://phabricator.wikimedia.org/T365214) (owner: 10Dzahn) [20:57:04] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:57:10] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:00:04] Reedy, sbassett, Maryum, and manfredi: OwO what's this, a deployment window?? Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240520T2100). nyaa~ [21:02:51] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9814314 (10ecarg) Thank you, I just sent a Slack message with the details! [21:06:06] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to crm for cstone - https://phabricator.wikimedia.org/T365214#9814320 (10Dzahn) 05In progress→03Resolved This access should work now. I ran puppet on the bastion hosts and confirm the user exists on `crm2001.codfw.wmnet` now: ` [cr... 
[21:06:20] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9814334 (10derenrich) I only need access to superset so I think regarding that document I only need analytics-privatedata-users (no kerberos, no ssh) [21:17:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T352010)', diff saved to https://phabricator.wikimedia.org/P62719 and previous config saved to /var/cache/conftool/dbconfig/20240520-211721-ladsgroup.json [21:17:27] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [21:20:05] (03PS1) 10JHathaway: gitlab: Move outbound email to mx-out{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1034158 (https://phabricator.wikimedia.org/T365395) [21:20:27] (03PS1) 10Dzahn: admin: replace SSH key for ecarg (Grace Choi) [puppet] - 10https://gerrit.wikimedia.org/r/1034159 (https://phabricator.wikimedia.org/T365308) [21:22:36] !log bking@cumin2002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons. [21:23:57] 10ops-codfw, 06SRE, 06DC-Ops: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T365379#9814362 (10phaultfinder) [21:26:04] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9814363 (10Dzahn) Turns out there was still an old RSA key from the past in production. In this ticket there was no new key but we need the one that was listed in... [21:27:41] (03CR) 10Ecarg: [C:03+1] "TY!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1034159 (https://phabricator.wikimedia.org/T365308) (owner: 10Dzahn) [21:28:15] (03CR) 10Dzahn: [C:03+2] admin: replace SSH key for ecarg (Grace Choi) [puppet] - 10https://gerrit.wikimedia.org/r/1034159 (https://phabricator.wikimedia.org/T365308) (owner: 10Dzahn) [21:29:01] !log bking@cumin2002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons. [21:31:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [21:31:56] FIRING: RdfStreamingUpdaterFlinkJobUnstable: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [21:32:06] !log bking@cumin2002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons. [21:32:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P62720 and previous config saved to /var/cache/conftool/dbconfig/20240520-213230-ladsgroup.json [21:35:51] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9814384 (10cscott) This discussion risks going in circles. 
As I wrote previously in T275319#6884320: > zhwiki for example should h... [21:36:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [21:36:56] RESOLVED: RdfStreamingUpdaterFlinkJobUnstable: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [21:38:20] !log bking@cumin2002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons. [21:40:16] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9814388 (10stjn) FWIW, I’ve read the comment and I disagree that my point above should be disregarded just because ‘it scales with... 
[21:45:03] 06SRE, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9814406 (10jhathaway) [21:47:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P62721 and previous config saved to /var/cache/conftool/dbconfig/20240520-214739-ladsgroup.json [21:49:49] (03PS1) 10Ryan Kemper: zookeeper: enable 4lw cmds in zk 3.4.10 [puppet] - 10https://gerrit.wikimedia.org/r/1034162 (https://phabricator.wikimedia.org/T365400) [21:52:50] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1034162 (https://phabricator.wikimedia.org/T365400) (owner: 10Ryan Kemper) [21:53:27] (03PS10) 10Urbanecm: Add account_conversion event streams. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989216 (https://phabricator.wikimedia.org/T363815) (owner: 10Cyndywikime) [21:53:30] (03PS2) 10Ryan Kemper: zookeeper: enable 4lw cmds in zk 3.4.10 [puppet] - 10https://gerrit.wikimedia.org/r/1034162 (https://phabricator.wikimedia.org/T365400) [21:53:32] (03PS11) 10Urbanecm: Add account_conversion event streams. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989216 (https://phabricator.wikimedia.org/T363815) (owner: 10Cyndywikime) [21:53:56] (03PS12) 10Urbanecm: Add account_conversion event streams. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989216 (https://phabricator.wikimedia.org/T363815) (owner: 10Cyndywikime) [21:54:02] 06SRE, 10SRE-Access-Requests: Requesting update to SSH key for access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T365308#9814422 (10Dzahn) We had to replace the production key with the one from this ticket. Fixed. 
[21:54:35] (03PS3) 10Ryan Kemper: zookeeper: enable 4lw cmds in bookworm or later [puppet] - 10https://gerrit.wikimedia.org/r/1034162 (https://phabricator.wikimedia.org/T365400) [21:54:53] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1034162 (https://phabricator.wikimedia.org/T365400) (owner: 10Ryan Kemper) [21:55:42] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9814425 (10Dzahn) 05In progress→03Resolved This has been fixed by updating the SSH key. We could confirm Grace can now connect to bastion host and deployment host. [21:56:24] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9814428 (10Dzahn) ACK, thanks @derenrich ! Will do that. If it turns out differently we will need that as a basis either way. [21:57:25] (03PS13) 10Urbanecm: Add account_conversion event streams. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989216 (https://phabricator.wikimedia.org/T363815) (owner: 10Cyndywikime) [21:58:21] (03CR) 10Urbanecm: [C:03+2] Add account_conversion event streams. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989216 (https://phabricator.wikimedia.org/T363815) (owner: 10Cyndywikime) [21:58:59] (03Merged) 10jenkins-bot: Add account_conversion event streams. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989216 (https://phabricator.wikimedia.org/T363815) (owner: 10Cyndywikime) [22:00:00] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:989216|Add account_conversion event streams. 
(T363815)]] [22:00:06] 10ops-codfw, 06SRE, 06DC-Ops: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T365379#9814431 (10phaultfinder) [22:00:13] T363815: Enable instrumentation for Temporary accounts <-> registered accounts flow - https://phabricator.wikimedia.org/T363815 [22:00:20] (03CR) 10Bking: [C:03+2] zookeeper: enable 4lw cmds in bookworm or later [puppet] - 10https://gerrit.wikimedia.org/r/1034162 (https://phabricator.wikimedia.org/T365400) (owner: 10Ryan Kemper) [22:01:12] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9814438 (10Fuzzy) In the case of Israeli laws, their length consistently falls below the Page limit. But we use complex templates... [22:01:24] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9814439 (10cscott) @stjn you are correct that this particular issue is a mix of social and technical factors as I pointed out in T... 
[22:02:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T352010)', diff saved to https://phabricator.wikimedia.org/P62722 and previous config saved to /var/cache/conftool/dbconfig/20240520-220247-ladsgroup.json [22:02:50] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance [22:02:53] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [22:03:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance [22:03:55] (03PS1) 10Scott French: dbctl: break up test_check_config test case [software/conftool] - 10https://gerrit.wikimedia.org/r/1034163 (https://phabricator.wikimedia.org/T365123) [22:06:46] FIRING: HelmReleaseBadStatus: Helm release datasets-config-next/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datasets-config-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:07:41] (03CR) 10Scott French: dbctl: extend dbconfig checks to external sections (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/1032849 (https://phabricator.wikimedia.org/T365123) (owner: 10Scott French) [22:07:46] (03PS1) 10Ebernhardson: cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034165 [22:10:57] (03CR) 10Ebernhardson: [C:03+2] cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034165 (owner: 10Ebernhardson) [22:11:47] (03Merged) 10jenkins-bot: cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034165 (owner: 10Ebernhardson) [22:14:42] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - 
https://phabricator.wikimedia.org/T362033#9814463 (10Papaul) @Eevans like you mentioned on IRC "it's the same slot(s) that are having issues" I think we need to replace the main board and see. We have 4 decom PowerEdge R440's. I will pin... [22:16:19] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:989216|Add account_conversion event streams. (T363815)]] (duration: 16m 18s) [22:16:24] T363815: Enable instrumentation for Temporary accounts <-> registered accounts flow - https://phabricator.wikimedia.org/T363815 [22:17:46] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:17:50] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:05:38] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons. [23:11:55] 10ops-codfw, 06SRE, 06DC-Ops: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T365379#9814629 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm manually reset the idrac ip of the offending server. alert cleared. 
[23:12:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED
[23:13:04] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED
[23:13:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P62724 and previous config saved to /var/cache/conftool/dbconfig/20240520-231350-ladsgroup.json
[23:16:46] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:17:10] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9814633 (Jhancock.wm) @Papaul still getting an error on provisioning of the new server. 100.0% (1/1) success ratio (>= 100.0%...
[23:18:28] (PS1) RLazarus: cumin: Remove etcd::v3::kubernetes::staging from A:wikikube-staging-etcd [puppet] - https://gerrit.wikimedia.org/r/1034193 (https://phabricator.wikimedia.org/T363307)
[23:23:52] (PS1) Jdlrobson: Decouple MFUseDesktopSpecialWatchlistPage from EditWatchlist page [extensions/MobileFrontend] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1034168
[23:24:27] (PS2) Jdlrobson: Enable desktop watchlist HTML on mobile [mediawiki-config] - https://gerrit.wikimedia.org/r/1032833 (https://phabricator.wikimedia.org/T109277)
[23:26:17] !log LDAP - added jaycano to wmf group (T365349)
[23:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:21] T365349: Grant Access to wmf for JayCano - https://phabricator.wikimedia.org/T365349
[23:28:13] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmf for JayCano - https://phabricator.wikimedia.org/T365349#9814641 (Dzahn) Open→Resolved a:Dzahn @JayCano You have been added to the 'wmf' LDAP group as requested. Feel free to try logins now.
[23:28:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P62725 and previous config saved to /var/cache/conftool/dbconfig/20240520-232858-ladsgroup.json
[23:38:01] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1033392
[23:38:01] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1033392 (owner: TrainBranchBot)
[23:39:16] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9814670 (Papaul) @Jhancock.wm it looks like we have another sretest2002 setup in b7 the switch has that configuration already so...
[23:44:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T352010)', diff saved to https://phabricator.wikimedia.org/P62726 and previous config saved to /var/cache/conftool/dbconfig/20240520-234406-ladsgroup.json
[23:44:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[23:44:13] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[23:44:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[23:44:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T352010)', diff saved to https://phabricator.wikimedia.org/P62727 and previous config saved to /var/cache/conftool/dbconfig/20240520-234431-ladsgroup.json
[23:52:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED
[23:53:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED
[23:58:33] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1033392 (owner: TrainBranchBot)